Tutorial: How to Build a Professional Social Network Scraper with Python (With a Code Example)

Professional Social Network is the most popular social media platform for people to meet one another in a professional setting. I recommend reading our articles, How To Automatically Track Professional Social Network Job Changes and How Salesforge.ai Integrates Rich Prospecting Data With ChatGPT To Automatically Personalize Emails, which illustrate why Professional Social Network automation matters for scaling up your business.

In this tutorial, I will guide you through code that gets Professional Social Network profile details from a list of Professional Social Network profile URLs. This tutorial focuses only on getting the profile details; how you process the data afterwards is up to you.

This tutorial will show you two ways of doing it: sequential and asynchronous.

Setting up prerequisites

Python 3
A Proxycurl API key

For the sequential method:

requests

For the asynchronous method:

aiohttp
asyncio (part of the Python standard library)

You can create a free API key for the Proxycurl API on Proxycurl's website. The trial is limited to 10 credits, and each successful profile request costs 1 credit.

If you need more credits, the price is only $0.01 per profile. Read the introduction to the Proxycurl API here, and send an email to hello@nubela.co with any inquiries.
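To get a feel for the pricing, here is a minimal sketch of the arithmetic. It assumes that the 10 trial credits are used up first and that every profile request succeeds; the function name estimate_cost is made up for this illustration and is not part of the Proxycurl API.

```python
from decimal import Decimal

TRIAL_CREDITS = 10                   # free trial credits mentioned above
PRICE_PER_CREDIT = Decimal("0.01")   # USD per successful profile request


def estimate_cost(num_profiles, free_credits=TRIAL_CREDITS):
    """Estimate the cost in USD of fetching num_profiles profiles.

    Assumption: 1 credit per profile, and free credits are spent first.
    """
    billable = max(num_profiles - free_credits, 0)
    return billable * PRICE_PER_CREDIT


print(estimate_cost(5))      # fits inside the trial credits
print(estimate_cost(1010))   # 1000 billable profiles
```

Decimal is used instead of float so that the cents add up exactly.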

Sequential vs Asynchronous

The asynchronous method usually finishes sooner than the sequential method because it sends multiple requests at once, while the sequential method sends one request at a time, with each request waiting for the previous one to return a response. The actual gain depends on how long each request takes, which varies with the latency of the server. This tutorial provides both methods so that you can compare the code.
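The difference can be seen without calling any API. The sketch below replaces the real request with asyncio.sleep, so fake_request and the 0.1 second delay are stand-ins invented for this demonstration, not real measurements.

```python
import asyncio
import time

DELAY = 0.1  # pretend latency of one request, in seconds (made up)


async def fake_request(i):
    """Stand-in for one API call: wait, then return a value."""
    await asyncio.sleep(DELAY)
    return i


async def run_sequential(n):
    # Each request starts only after the previous one has finished
    return [await fake_request(i) for i in range(n)]


async def run_concurrent(n):
    # All requests are started together and awaited as a group
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


def timed(coro_fn, n):
    """Run coro_fn(n) to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


seq_result, seq_time = timed(run_sequential, 5)
con_result, con_time = timed(run_concurrent, 5)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

With five simulated requests, the sequential run waits roughly five times the delay, while the concurrent run waits roughly one delay in total.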

Let's start by looking at the base code from Proxycurl's documentation:


import requests

api_endpoint = 'https://nubela.co/proxycurl/api/Professional Social Network'
profile_url = 'https://www.professionalsocialnetwork.com/in/williamhgates'
api_key = '********-****-****-****-************'  # your api_key
header_dic = {'Authorization': 'Bearer ' + api_key}

response = requests.get(api_endpoint,
                        params={'url': profile_url},
                        headers=header_dic)

# response.content is raw bytes, so parse the JSON body into a dictionary first
profile = response.json()
print(profile)  # To get all profile details
print(profile["first_name"])  # To get profile first name

You can change "first_name" to any other response key listed in the Proxycurl documentation.
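If you want several keys at once, one option is a small helper like the sketch below. The sample_profile dictionary is invented for illustration: only "first_name" and "public_identifier" appear in this tutorial, while "headline" and all the values are assumptions, so check the Proxycurl documentation for the real key names.

```python
# Made-up stand-in for a parsed API response (not real data)
sample_profile = {
    "public_identifier": "jane-doe",
    "first_name": "Jane",
    "headline": "Example headline",  # hypothetical key
}


def pick_fields(profile, keys):
    """Return only the requested keys; missing keys become None."""
    return {key: profile.get(key) for key in keys}


selected = pick_fields(sample_profile, ["first_name", "public_identifier"])
print(selected)
```

Using dict.get means a key that is absent from the response yields None instead of raising a KeyError.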

1. Sequential Method


import json

import requests
from requests.exceptions import HTTPError

api_endpoint = 'https://nubela.co/proxycurl/api/Professional Social Network'
api_key = '********-****-****-****-************'  # your api_key
header_dic = {'Authorization': 'Bearer ' + api_key}

profile_list = [
    'https://www.professionalsocialnetwork.com/in/williamhgates',
    'https://www.professionalsocialnetwork.com/in/melindagates',
    'https://www.professionalsocialnetwork.com/in/owinfrey',
]

def get_profile_details(profile_url, session):
    """Fetch one profile and return it as a dictionary, or None on failure."""
    url = api_endpoint
    try:
        response = session.get(url, params={'url': profile_url}, headers=header_dic)
        response.raise_for_status()
        # print(f"Response status ({url}): {response.status_code}")
    except HTTPError as http_err:
        print(f"HTTP error occurred: {http_err}")
        return None
    except Exception as err:
        print(f"An error occurred: {err}")
        return None
    # Parse the JSON body into a Python dictionary
    return json.loads(response.content)

with requests.Session() as session:
    for profile_url in profile_list:
        try:
            response = get_profile_details(profile_url, session)
            if response is None:
                continue  # the request failed and the error was already printed
            print(response["first_name"])  # To get first_name key
            # print(response)  # To get all profile details
            # print()
        except Exception as err:
            print(f"Exception occurred: {err}")

Let's break down the code.

As usual, we import the required libraries.

Then we reuse the variables defined in the base code, except for the single profile URL. Instead, we create a list to hold the URLs of the profiles we want to scrape. For now, we use 3 Professional Social Network profiles for demonstration purposes.

Next, we define the get_profile_details function to make a GET request to Proxycurl's Professional Social Network API. We pass the profile URL as a query parameter of the request, along with the Authorization header. The JSON response is then parsed into a Python dictionary using json.loads().

Uncomment the line with print(f"Response status ({url}): {response.status_code}") to print the response status. A 200 response code indicates a successful request, which consumes 1 credit.

The code block under with requests.Session() as session iterates through the list of profile URLs and prints the first_name of each profile.

Uncomment the line with print(response) to print all profile details.

2. Asynchronous Method


import asyncio
import json

from aiohttp import ClientSession, ClientResponseError

api_endpoint = 'https://nubela.co/proxycurl/api/Professional Social Network'
api_key = '********-****-****-****-************'  # your api_key
header_dic = {'Authorization': 'Bearer ' + api_key}

profile_list = [
    'https://www.professionalsocialnetwork.com/in/williamhgates',
    'https://www.professionalsocialnetwork.com/in/melindagates',
    'https://www.professionalsocialnetwork.com/in/owinfrey',
]

async def get_profile_details_async(profile_url, session):
    """Fetch one profile and return it as a dictionary, or None on failure."""
    url = api_endpoint
    try:
        response = await session.request(method='GET', url=url, params={'url': profile_url}, headers=header_dic)
        response.raise_for_status()
        print(f"Response status ({url}): {response.status}")
        # Read the raw body from the stream reader, then parse it as JSON
        response_json = await response.content.read()
        return json.loads(response_json)
    except ClientResponseError as http_err:
        # aiohttp raises ClientResponseError for 4xx/5xx responses
        print(f"HTTP error occurred: {http_err}")
    except Exception as err:
        print(f"An error occurred: {err}")
    return None

async def run_program(profile_url, session):
    try:
        response = await get_profile_details_async(profile_url, session)
        if response is None:
            return  # the request failed and the error was already printed
        print(response["public_identifier"])
        print(response)
        print()
    except Exception as err:
        print(f"Exception occurred: {err}")

async def run_async():
    async with ClientSession() as session:
        await asyncio.gather(*[run_program(profile_url, session) for profile_url in profile_list])

def main():
    # On Python 3.7+, asyncio.run(run_async()) can replace these three lines
    loop = asyncio.get_event_loop()
    loop.run_until_complete(run_async())
    loop.close()

if __name__ == '__main__':
    main()

Note: if you get this error on macOS:

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed

In the terminal, try to run:

pip install --upgrade certifi

If that doesn't work, try running the following (adjust the Python version in the path to match your installation):

open /Applications/Python\ 3.6/Install\ Certificates.command

Let's break down the code.

The async keyword that prepends the function signature tells Python that this function is a coroutine.

The await keyword, as in response = await session.request(...) and response_json = await response.content.read(), tells the coroutine to suspend execution and give control back to the event loop while the awaited operation finishes.

A coroutine is similar to a Python generator, except that it consumes values instead of producing them. It pauses its execution while waiting for new data.

In our case, the execution of get_profile_details_async is suspended while the request is being performed: await session.request(...). It is suspended again while the response is being read by the stream reader: await response.content.read(). The bytes that come back are then parsed with json.loads(response_json).

Then, we have the run_program coroutine as a wrapper around the pipeline of getting a response from the API, parsing it to JSON, and printing the results on the screen. It awaits the execution of the get_profile_details_async coroutine.

After that, we use asyncio.gather to schedule all the coroutines in the list we provide. This is what allows the requests to run concurrently.
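One useful property of asyncio.gather is that it returns results in the order the coroutines were passed in, not the order in which they finish. The sketch below shows this with a stand-in coroutine; fetch and its delays are invented for the demonstration and do not call the API.

```python
import asyncio


async def fetch(name, delay):
    """Stand-in for a request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return name


async def demo():
    # "slow" finishes last but is still first in the result list
    return await asyncio.gather(fetch("slow", 0.05), fetch("fast", 0.01))


results = asyncio.run(demo())
print(results)  # ['slow', 'fast']
```

This means that if you return values from the gathered coroutines, you can match each result to its profile URL by position.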

Lastly, we define a main function that runs run_async using asyncio.get_event_loop(), and we call it under the if __name__ == '__main__' block.
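On Python 3.7 and later, asyncio.run offers a shorter way to do the same thing: it creates an event loop, runs the coroutine until it completes, and closes the loop. The sketch below is an alternative to the main function above, not the version used in this tutorial, and its run_async is a placeholder that does no real work.

```python
import asyncio


async def run_async():
    """Placeholder for the real run_async coroutine."""
    await asyncio.sleep(0)
    return "done"


def main():
    # asyncio.run handles loop creation and cleanup for us
    return asyncio.run(run_async())


result = main()
print(result)  # done
```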

3. Challenges

There are still many things you can do beyond this tutorial once you have retrieved the data you need. You can either store all the data in a file of whatever type you want and process it afterwards, or filter the data by adding functions to the code above and saving only the information you need, to build your own personalized Professional Social Network automation tool!
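As one possible starting point, the sketch below keeps only a few fields from each profile and writes them out as CSV. The profiles list is made-up sample data rather than real API output, and profiles_to_csv is a hypothetical helper; it writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io

# Made-up stand-ins for parsed API responses (not real data)
profiles = [
    {"public_identifier": "jane-doe", "first_name": "Jane", "extra": 1},
    {"public_identifier": "john-roe", "first_name": "John"},
]

FIELDS = ["public_identifier", "first_name"]  # the columns we want to keep


def profiles_to_csv(profiles, fields):
    """Return CSV text containing only the chosen fields of each profile."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for profile in profiles:
        # Missing keys become empty cells; keys we did not ask for are dropped
        writer.writerow({field: profile.get(field, "") for field in fields})
    return buffer.getvalue()


csv_text = profiles_to_csv(profiles, FIELDS)
print(csv_text)
```

To save to disk instead, the same writer can be pointed at a file opened with open(path, "w", newline="").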
