The definitive guide to building your own Professional Social Network Profile Scraper for 1M profiles (2022)

Having built the early prototype for [Proxycurl API](https://nubela.co/proxycurl/Professional Social Network), which turns Professional Social Network profiles into JSON, I learnt a little about how one might scrape public Professional Social Network profiles at scale. In this tutorial, I will share my experience building a Professional Social Network profile scraper that works in 2022, and I hope you will find it useful.

PS: You can turn Professional Social Network profiles into JSON with [Proxycurl API](https://nubela.co/proxycurl/Professional Social Network).

To put this tutorial in context, we will frame it around one problem:

How to scrape 1 million Professional Social Network profiles, and then parse the HTML content into structured data?

Breaking down the problem:

How to crawl a million Professional Social Network profiles and fetch their on-page HTML content
How to parse the HTML content from a public Professional Social Network profile to structured data

Part 1: How to scrape 1M public Professional Social Network profiles for HTML code

Before we embark on the quest to scrape a million profiles, let's start with crawling ten profiles. There are only two ways to crawl ten Professional Social Network profiles for scraping:

As a user logged into Professional Social Network (a "logged-in user").
Or as a user who is not logged into Professional Social Network (an "anonymous user").

1A: Accessing Professional Social Network profiles as an anonymous user

It requires luck to access a Professional Social Network profile without being logged into Professional Social Network.

In my experience, you might be able to access the first profile as an anonymous user if you have not recently clicked into any Professional Social Network profiles.

Even if you succeed in viewing a public profile anonymously on your first attempt, more likely than not you will be greeted with the dreaded Authwall on your second profile visit.

What is the Authwall and how do you circumvent it?

The Authwall exists to block web scraping from users who are not logged into Professional Social Network.

If you visit a public profile from a non-residential IP address, such as from a data center IP address, you will get the Authwall.
If you visit a public profile without any cookies in your browser session (aka incognito mode), you will get the Authwall.
If you are visiting a public profile from a non-major browser, you will get the Authwall.
If you are visiting a public profile multiple times, you will get the Authwall.

There are many reasons that you will be greeted with the Authwall when you are crawling anonymously. But there is one way you can reliably bypass it -- crawl Professional Social Network as Googlebot. If you can access a Professional Social Network public profile page from an IP address that belongs to Google, you can consistently fetch an available Professional Social Network profile without the Authwall.

What does an IP address from Google mean?

It is an IP address whose reverse DNS lookup resolves to a hostname under *.googlebot.com. See this Google support page for a clear definition. And no, IP addresses from Google Cloud instances do not work.
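To make the definition concrete, here is a minimal sketch of how one might check whether an IP address resolves back to *.googlebot.com. The function name `is_googlebot_ip` and the injectable `reverse_lookup` / `forward_lookup` parameters are my own illustration, not part of any Google tooling. The forward lookup that confirms the hostname maps back to the same IP address is an extra safeguard I am assuming is worthwhile, since reverse DNS records alone can be spoofed.

```python
import socket
from typing import Callable, List


def _reverse_dns(ip: str) -> str:
    # socket.gethostbyaddr returns (hostname, aliases, ip_addresses).
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(hostname: str) -> List[str]:
    # socket.gethostbyname_ex returns (hostname, aliases, ipv4_addresses).
    return socket.gethostbyname_ex(hostname)[2]


def is_googlebot_ip(
    ip: str,
    reverse_lookup: Callable[[str], str] = _reverse_dns,
    forward_lookup: Callable[[str], List[str]] = _forward_dns,
) -> bool:
    """Return True if `ip` reverse-resolves to *.googlebot.com and back again."""
    try:
        hostname = reverse_lookup(ip)
    except OSError:
        # No reverse DNS record (socket.herror is a subclass of OSError).
        return False

    if not hostname.rstrip(".").lower().endswith(".googlebot.com"):
        return False

    try:
        # Forward-confirm: the hostname must map back to the same address.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

Calling `is_googlebot_ip("203.0.113.7")` with the default lookups performs real DNS queries. Passing your own lookup functions lets you exercise the logic offline.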

But, there is one page on Professional Social Network that you can crawl without restrictions

Put yourself in the shoes of a Professional Social Network executive. What makes you money? Profile data. Which is why the Authwall is used to lock up profile data.

What else makes Professional Social Network money? Jobs! Professional Social Network makes money when companies list jobs on Professional Social Network. These companies will return to Professional Social Network again and again if Professional Social Network succeeds at matching great candidates to their job postings.

To maximize page views, job postings on Professional Social Network are not blocked by the Authwall.

1B: Accessing Professional Social Network profiles logged into Professional Social Network

You and I are probably not Googlers, which means we do not have access to the range of addresses belonging to Googlebot. But there is respite.

You can log into Professional Social Network to reliably access Professional Social Network profiles. However, as tempting as it may be, I highly recommend that you not use your personal Professional Social Network profile to perform a bulk profile crawl for scraping purposes. You do not want your personal Professional Social Network profile to be blocked.

And it will be blocked should you scrape past a certain threshold or when Professional Social Network detects abnormal (automated) behavior in your account.

But yes, log into your Professional Social Network profile, and you can crawl ten profiles with no problems. And that brings me to the next section -- getting from 10 profiles to 1M profiles.

Can I crawl 1M Professional Social Network profiles to scrape by creating many Professional Social Network accounts?

It is only natural to veer towards the belief that you can build a Professional Social Network scraper if you manage a pool of disposable Professional Social Network accounts. You are not wrong. Building a pool of workers with disposable Professional Social Network accounts is indeed a feasible method if and only if humans meticulously manage each Professional Social Network account.

Once you begin automated crawls on any Professional Social Network account, you will start encountering random reCAPTCHA challenges that keep the account locked until they are solved.

Each Professional Social Network account in your scraping pool will also require a unique residential IP address.

The short answer is yes. You can crawl 1M Professional Social Network profiles with many Professional Social Network accounts, as long as each account sits on its own residential IP address.
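The bookkeeping for such a pool can be sketched as below. This is only an illustration under my own assumptions: `Worker`, `WorkerPool`, and the `daily_limit` parameter are hypothetical names, and the actual threshold at which an account gets blocked is not something I can state. The sketch enforces the two constraints described above: every account sits on a unique residential IP address, and an account locked behind a challenge is skipped until a human unlocks it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    account_id: str
    residential_ip: str
    locked: bool = False       # set to True when a challenge needs a human
    profiles_fetched: int = 0  # crawls performed in the current period


class WorkerPool:
    def __init__(self, workers: List[Worker], daily_limit: int) -> None:
        ips = [w.residential_ip for w in workers]
        if len(set(ips)) != len(ips):
            raise ValueError("each account needs its own residential IP address")
        self.workers = workers
        self.daily_limit = daily_limit  # assumed per-account budget

    def next_worker(self) -> Optional[Worker]:
        """Pick the least-used worker that is unlocked and under its budget."""
        available = [
            w for w in self.workers
            if not w.locked and w.profiles_fetched < self.daily_limit
        ]
        if not available:
            return None  # every account is locked or exhausted
        worker = min(available, key=lambda w: w.profiles_fetched)
        worker.profiles_fetched += 1
        return worker
```

Picking the least-used worker spreads the load evenly, so no single account races towards the threshold ahead of the others.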

Recap: What you need to do to crawl 1M profiles

The first step to scraping is to get the HTML code of profiles at scale. In this article, we put a number to "scale": one million profiles. There are only a few ways to crawl 1M Professional Social Network profiles, and they are:

Access Professional Social Network from an IP address that resolves as Googlebot
Manage a large pool of workers logged in as individual Professional Social Network accounts, with each account sitting on its own residential IP address
Use Proxycurl API -- see the next section.

Using Proxycurl API to enrich 1M Professional Social Network profiles

Proxycurl is an offering we built that provides a managed service to turn Professional Social Network profile URLs into structured JSON data.

If you ask me which is the best way to scrape Professional Social Network profiles, I will tell you, in a very biased way, to use Proxycurl's API. Specifically, the Person Profile Endpoint. Our Person Profile Endpoint takes a Professional Social Network profile URL and returns the structured data of the public profile.

Part 2: I have HTML code of a profile page, how do I scrape content off it?

Now that you have 1M profiles, it is time to get the content out of the HTML code and into structured data. Converting HTML pages to structured data is what I call "parsing." Crawling profiles gets you a bunch of pages as HTML code. Parsing turns pages of HTML code into machine-readable structured data, like this:

{
'accomplishment_courses': [],
'accomplishment_honors_awards': [{'description': 'Nanyang Scholarship '
                                                'recognizes students who '
                                                'excel academically, '
                                                'demonstrate strong '
                                                'leadership potential, and '
                                                'possess outstanding '
                                                'co-curricular records.\n',
                                 'issued_on': {'day': None,
                                               'month': None,
                                               'year': 2015},
                                 'issuer': 'Nanyang Technological University',
                                 'title': 'NANYANG Scholarship'},
                                {'description': 'Awarded to students with '
                                                'exceptional results in '
                                                'Physics and Mathematics',
                                 'issued_on': {'day': None,
                                               'month': None,
                                               'year': 2015},
                                 'issuer': 'Defence Science & Technology '
                                           'Agency',
                                 'title': 'Young Defence Scientist Programme '
                                          '(YDSP) Academic Award'},
                                {'description': 'An annual competition to '
                                                'encourage the study and '
                                                'appreciation of Physics as '
                                                'well as highlight Physics '
                                                'talent.',
                                 'issued_on': {'day': None,
                                               'month': None,
                                               'year': 2012},
                                 'issuer': 'Institute of Physics Singapore',
                                 'title': 'Singapore Junior Physics Olympiad '
                                          '(Main Category) Honourable '
                                          'Mention'},
                                {'description': 'Certificate awarded to '
                                                'student who topped the '
                                                'cohort in all aspects of '
                                                'Science.',
                                 'issued_on': {'day': None,
                                               'month': None,
                                               'year': 2010},
                                 'issuer': 'Xinmin Secondary School',
                                 'title': 'Certificate of Excellence - Top '
                                          'in Science'},
                                {'description': None,
                                 'issued_on': {'day': 1,
                                               'month': 9,
                                               'year': 2018},
                                 'issuer': 'Nanyang Technological University',
                                 'title': "Dean's List FY17/18"},
...
'volunteer_work': []}

Two ways to parse content from HTML code

There are two ways to scrape content from the HTML page, and the approach to take depends entirely on how the page is crawled.

Two factors decide which is the best method to use:

Is on-page JavaScript rendered before the HTML code of the profile page is collected?
Is the profile viewed as an anonymous user or as a user logged into Professional Social Network?

Method matrix for your reference

| | Anonymous user | Logged into Professional Social Network |
|---|---|---|
| JavaScript not rendered | DOM scraping | Code Chunk Scraping |
| JavaScript is rendered | DOM scraping | DOM scraping |
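The matrix boils down to a single rule, which the short sketch below encodes. The function name `choose_parsing_method` is my own; it simply mirrors the four cells above.

```python
def choose_parsing_method(javascript_rendered: bool, logged_in: bool) -> str:
    """Return the parsing method suggested by the method matrix."""
    # Code Chunk Scraping only applies to pages fetched while logged in
    # and before on-page JavaScript has been rendered.
    if logged_in and not javascript_rendered:
        return "Code Chunk Scraping"
    return "DOM scraping"
```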

DOM parsing

DOM parsing is the standard method that most developers use for web scraping. You can find the data within fixed HTML tags on a page that is loaded and rendered. You can fetch most of the content of a profile page by traversing HTML tags, either via selectors or XPath.

The problem is that the layout of the HTML pages changes often. The layout also varies by locale: a profile loaded in an Arabic locale will differ in layout from a profile loaded in English. Every time something changes, expect your scraper to break. DOM scraping is easy to implement but high maintenance.
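As an illustration of the DOM approach, here is a minimal sketch that pulls text out of elements by CSS class, using only Python's standard-library `html.parser` as a stand-in for a fuller DOM library. The sample markup and the class names `name` and `job-title` are invented for this example and do not reflect the real page layout. That is exactly the fragility described above: the moment the class names change, the scraper returns nothing.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<section class="profile">
  <h1 class="name">Jane Doe</h1>
  <ul>
    <li class="job-title">Research Engineer</li>
    <li class="job-title">Software <b>Intern</b></li>
  </ul>
</section>
"""


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a target CSS class."""

    def __init__(self, target_classes: List[str]) -> None:
        super().__init__()
        self.results: Dict[str, List[str]] = {c: [] for c in target_classes}
        # One entry per open tag: None, or (class name, index into results).
        self._stack: List[Optional[Tuple[str, int]]] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.results), None)
        if match is None:
            self._stack.append(None)
        else:
            self.results[match].append("")
            self._stack.append((match, len(self.results[match]) - 1))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attach text to the innermost enclosing target element, if any.
        for entry in reversed(self._stack):
            if entry is not None:
                name, index = entry
                self.results[name][index] += data
                return


def scrape_profile(html: str) -> Dict[str, List[str]]:
    parser = ClassTextExtractor(["name", "job-title"])
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    return {k: [" ".join(t.split()) for t in v] for k, v in parser.results.items()}
```

This sketch assumes reasonably well-formed HTML. A dedicated library handles malformed markup, selectors, and XPath far more robustly.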

Code Chunk Scraping

Code Chunk Scraping is a superior method reserved for profile pages fetched as a logged-in user, before JavaScript is rendered. It is a better method because it does not depend on the HTML DOM structure, which means that page layout changes on Professional Social Network will not break it. Instead, it looks at the data placed in-page within <code></code> tags. These blobs of JSON data are used by Professional Social Network's JavaScript code to populate the page's DOM elements. With the Code Chunk Scraping method, you traverse JSON objects instead of DOM elements.

Because the JSON blob data is already structured, we do not have to tokenize strings to restructure it; we can return the data as it is. That means you do not need to parse "12th March 2020" into a machine-readable Date object.
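The first half of this method, pulling the JSON blobs out of the page, might look like the sketch below. It uses only the standard library. The names `CodeChunkExtractor` and `extract_json_chunks` are mine, and I am assuming that the JSON inside the tags is HTML-escaped (for example `&quot;` for quotes), which the parser unescapes for us.

```python
import json
from html.parser import HTMLParser
from typing import Any, List


class CodeChunkExtractor(HTMLParser):
    """Collect the raw text found inside each <code>...</code> element."""

    def __init__(self) -> None:
        # convert_charrefs defaults to True, so entities such as &quot;
        # arrive in handle_data already unescaped.
        super().__init__()
        self._inside_code = False
        self._buffer: List[str] = []
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._inside_code = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "code" and self._inside_code:
            self._inside_code = False
            self.chunks.append("".join(self._buffer))

    def handle_data(self, data):
        if self._inside_code:
            self._buffer.append(data)


def extract_json_chunks(html: str) -> List[Any]:
    """Return every <code> block in `html` that parses as JSON."""
    parser = CodeChunkExtractor()
    parser.feed(html)
    parser.close()
    blobs = []
    for chunk in parser.chunks:
        try:
            blobs.append(json.loads(chunk))
        except json.JSONDecodeError:
            continue  # not every <code> tag has to hold JSON
    return blobs
```

Because no DOM needs to be built or rendered, this step works on the raw HTML as fetched.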

To recap: the Code Chunk scraping method

is faster to crawl because you can skip JavaScript rendering
breaks less often when the on-page layout changes
but requires you to be logged into Professional Social Network when fetching profiles

Here is an example of data traversal with the Code Chunk Scraping method that returns the patents listed on a user profile:

    def get_patents(data):
        """Build Patent objects from the Patent rows in the page's JSON blob."""
        patent_lis = []
        for dic in Person._type_in_include_rows(
                data,
                'com.Professional Social Network.voyager.dash.identity.profile.Patent'):
            # `issuedOn` is already split into day/month/year, so no date
            # string needs to be parsed.
            issued_on = None
            issued_on_dic = dic.get('issuedOn', {})
            if issued_on_dic:
                issued_on = Date(month=issued_on_dic.get('month'),
                                 day=issued_on_dic.get('day'),
                                 year=issued_on_dic.get('year'))
            patent_lis.append(Patent(description=dic.get('description'),
                                     application_number=dic.get('applicationNumber'),
                                     issuer=dic.get('issuer'),
                                     issued_on=issued_on,
                                     patent_number=dic.get('patentNumber'),
                                     title=dic.get('title'),
                                     url=dic.get('url')))
        return patent_lis
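The helper `Person._type_in_include_rows` is not shown in this article. A hypothetical stand-in might look like the sketch below, assuming the JSON blob keeps its records in a top-level `included` list and tags each record with a `$type` key. Both key names, and the type name `example.Patent` used in the sample data, are assumptions for illustration.

```python
from typing import Any, Dict, Iterator


def type_in_include_rows(data: Dict[str, Any], type_name: str) -> Iterator[Dict[str, Any]]:
    """Yield every record in the blob whose type tag equals `type_name`."""
    # Assumed layout: {"included": [{"$type": "...", ...}, ...]}
    for row in data.get("included", []):
        if isinstance(row, dict) and row.get("$type") == type_name:
            yield row


# Hypothetical blob, shaped like the assumption above.
sample_blob = {
    "included": [
        {"$type": "example.Patent", "title": "Widget", "issuedOn": {"year": 2020}},
        {"$type": "example.Course", "name": "Physics"},
        {"$type": "example.Patent", "title": "Gadget"},
    ]
}

patent_titles = [row["title"] for row in type_in_include_rows(sample_blob, "example.Patent")]
```

Filtering by type tag rather than by position is what makes this approach independent of page layout.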

So you want to build your own Professional Social Network Profile Scraper

In this article, I explained that scraping Professional Social Network profiles is a two-step process.

The first step is to crawl Professional Social Network profiles and save the HTML code for further processing in the second step. The second step is to process the HTML code and turn raw HTML code into structured data that you can use in your application.

There are only two methods to crawl Professional Social Network profiles at scale -- anonymously as Googlebot, or via a pool of workers logged into Professional Social Network with unique residential IP addresses. It is not impossible: you can get yourself 1M HTML files if you work around these limitations.

The next step is to process these 1M HTML files and turn them into structured data for your application. If you crawled the page without rendering JavaScript but with an account logged into Professional Social Network, you should use the Code Chunk Scraping method, which is superior because it breaks far less often. Otherwise, you can scrape the page with your favorite DOM traversal library using the DOM parsing method. (I recommend beautifulsoup4 if you are using Python.)

Even if you are a well-funded startup, it is not trivial to crawl Professional Social Network data at scale. You need a secret weapon.

Proxycurl is a managed enrichment service for Professional Social Network profile URLs.

Just as you chose AWS instead of building and colocating your own server farms, dataset acquisition is a menial task best left to a managed service. I can only write this article in such detail because of the combined expertise of our entire development team and the experience we have accumulated over the years.

Why crawl Professional Social Network, when you can purchase an exhaustive Professional Social Network (public) profile dataset loaded with data of Professional Social Network profiles in the US?

Why manage a Professional Social Network profile scraper when you can use [our API and get a Professional Social Network Profile in structured data for $0.01 per profile](https://nubela.co/proxycurl/Professional Social Network)?

I would love to help your business integrate data at the core of your product. Send an email to [email protected] and let me know how I can help you with your data needs! Let Proxycurl be your secret weapon.

The tutorial is not complete without code samples.

In this article, I shared at a high level how you might be able to scrape Professional Social Network profiles at scale. In the follow-up article, I will be releasing fully working code samples to complement this article. Please subscribe to Proxycurl's mailing list here to be notified of the next article with code samples!