Tutorial: How to crawl Professional Social Network with Python and `requests` library with code examples (Part 2)

This is a two-part series on crawling Professional Social Network at scale. In an earlier article, we studied why Professional Social Network is a hard target to crawl. In this follow-up, I will walk through a technical tutorial, with demo code, on how you can crawl Professional Social Network at scale.

Update 17th June 2020: Proxycurl has released an API for crawling Professional Social Network profiles for $0.01 per profile. I highly recommend you take a look at their API: https://nubela.co/proxycurl/Professional Social Network

In this tutorial, I will walk you through the code needed to extract a person's full name from their Professional Social Network profile. While this tutorial focuses on only 1 profile, the same method can be scaled out to as many asynchronous nodes as you want.

Setting up prerequisites

Python 3
requests
beautifulsoup4 (imported as bs4 in step 2)
A Proxycurl credential (username and password)

How to get a Proxycurl credential

You can request a free trial Proxycurl credential at Proxycurl's website. However, with the trial credential, you are rate limited to 1 request every minute.

If you require a credential with higher rate limits, please send an email to hello@nubela.co. You will be required to pay a trial fee for a trial key with higher rate limits.

Let's start with a Professional Social Network profile, say Bill Gates's Professional Social Network profile: https://www.professionalsocialnetwork.com/in/williamhgates/

We will use Proxycurl's browser crawl because Professional Social Network's page requires JavaScript to render. Let's go into the Python code:

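import json

import requests
from requests.auth import HTTPBasicAuth

# Replace with the Proxycurl endpoint and credential you were given.
API_HOSTNAME = 'https://replace.me.with.proxycurl.hostname.com/some_endpoint'
payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.professionalsocialnetwork.com/in/williamhgates/',
    'type': 'browser',  # browser crawl, so the page's JavaScript is executed
    'headers': {'LANG': 'en'},  # ask for the page in English
}
r = requests.post(API_HOSTNAME, auth=HTTPBasicAuth('USER', 'PASSWD'), data=json.dumps(payload))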

Let's break this down

In the code snippet above, you are making a Proxycurl request of type browser, which means it is a browser crawl request. A browser crawl request simulates opening the page in a real browser, with real user sessions.

The headers parameter in the payload dictionary ensures that the page is always returned in English, because not all our nodes are located in English-speaking countries. It overwrites the LANG header in the request with the value en.

Now that we have crafted the payload, we send the request off by calling requests.post(). This makes an API request with the HTTP POST method to Proxycurl's servers, and Proxycurl forwards the request to a randomly selected node.

All you have to do now is wait for a response.

Not all nodes are logged into Professional Social Network. Please retry a few times until you get a positive result.
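
The retry advice above can be sketched as a small helper. This is a minimal sketch, not part of Proxycurl's API: crawl_with_retries, send_request and looks_good are hypothetical names. send_request stands for whatever function performs one request (for example, a wrapper around the requests.post() call above), and looks_good is your own check for a "positive result", which this tutorial does not define precisely.

```python
import time


def crawl_with_retries(send_request, looks_good, max_attempts=5, wait_seconds=60.0):
    """Call send_request() until looks_good(result) is true or attempts run out.

    Both arguments are hypothetical callables that you supply:
    send_request performs one crawl request and returns its result,
    looks_good decides whether that result is usable.
    """
    for attempt in range(1, max_attempts + 1):
        result = send_request()
        if looks_good(result):
            return result
        if attempt < max_attempts:
            # The trial credential is limited to 1 request per minute,
            # so pause before trying again.
            time.sleep(wait_seconds)
    raise RuntimeError(f"no usable result after {max_attempts} attempts")
```

The default wait of 60 seconds matches the trial rate limit mentioned earlier. With a credential that has a higher limit, you can pass a smaller wait_seconds.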

The page loads, but the page is not rendered

On slower computers or internet connections, the AJAX calls that the page's JavaScript makes when the page loads take longer to complete. By default, the page has only 500 ms (half a second) to:

Make AJAX requests to populate the page
Render the UI elements from those AJAX requests

If that is not enough time, you should expect incomplete results. To solve this problem, increase the value of dom_read_delay_ms from its default of 500 (ms) to 30000 (ms). This asks the browser to wait 30 seconds after the page has loaded (similar to the point at which jQuery's $(document).ready() fires).

Modifying payload to include dom_read_delay_ms parameter

payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.professionalsocialnetwork.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},
    'dom_read_delay_ms': 30000,  # wait 30 seconds after page load before reading the DOM
}

(Adding dom_read_delay_ms to the payload)

2. Use BeautifulSoup to extract full name from the HTML

from bs4 import BeautifulSoup

response_dic = r.json()
soup = BeautifulSoup(response_dic['data'], 'html.parser')
h1 = soup.find_all("h1", class_="pv-top-card-section__name")[0]
print(h1.text)

Let's break down the code.

In line 1, we import the BeautifulSoup class. We use BeautifulSoup to parse the HTML document retrieved from the Proxycurl request, and to navigate the DOM elements to extract relevant data.

In line 3, we parse the JSON body of the response into a dictionary. The HTML document is contained in the data key of that dictionary, so we use it to initialize the BeautifulSoup object in line 4.

With the BeautifulSoup object initialized, in line 5 we search the HTML document for an h1 element with the class pv-top-card-section__name. Because the .find_all() method returns a list, we assign the first result in the list to the h1 variable. (There should only be one.)

Then, in line 6, the full name of Bill Gates is printed.

3. Scaling it up (an exercise for the reader)

In steps 1 and 2, we have built a prototype to extract a user's full name from their Professional Social Network profile. But there is a lot more to do before you have a full-fledged crawler that scales. Here are some suggestions:

Consider using asyncio to launch multiple requests
Notice that each Proxycurl request takes quite a bit of time, especially after you increase dom_read_delay_ms to 30 seconds, which means each request takes at least 30 seconds. You do not want to keep waiting for responses to return. Instead, you can have Proxycurl call back to a web endpoint that you have set up, with the result. This is what the id in the payload is for. See the asynchronous browser crawl documentation page for more information.
Check for errors and retry! Possible errors include, but are not limited to:
Page isn't rendered completely or properly
Professional Social Network is not logged in
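
As a starting point for the asyncio suggestion, here is a minimal sketch of launching several crawl requests concurrently. It assumes Python 3.9 or later (for asyncio.to_thread). The names crawl_many and crawl_one are hypothetical: crawl_one stands for a blocking function you write yourself, such as a wrapper around the requests.post() call from step 1. Because requests is a blocking library, each call is run in a worker thread, and a semaphore caps how many run at once.

```python
import asyncio


async def crawl_many(urls, crawl_one, max_concurrent=5):
    """Run the blocking crawl_one(url) for every URL, a few at a time.

    crawl_one is a hypothetical function that you supply. The return value
    maps each URL to its result, or to the exception that crawl_one raised.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url):
        async with semaphore:
            # Run the blocking call in a thread so the event loop stays free.
            return await asyncio.to_thread(crawl_one, url)

    results = await asyncio.gather(
        *(worker(url) for url in urls),
        return_exceptions=True,  # collect failures instead of aborting the batch
    )
    return dict(zip(urls, results))
```

You would call it with something like asyncio.run(crawl_many(profile_urls, crawl_one)). URLs that map to an exception are candidates for a retry, which ties in with the error checks listed above.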