Here's a tutorial on how to crawl Professional Social Network using Python and the requests library, complete with code examples.

Tutorial: How to crawl LinkedIn with Python and `requests` library with code examples (Part 2)

This is the second part of a two-part series on crawling LinkedIn at scale. In the earlier article, we studied why LinkedIn is a hard target to crawl. In this follow-up, I dive into a technical tutorial, with demo code, on how you can crawl LinkedIn at scale.

In this tutorial, I will walk you through the code needed to extract the full name from a person's LinkedIn profile. While this tutorial focuses on a single profile, the same method can be scaled out to as many asynchronous nodes as you want.

Setting up prerequisites

  1. Python 3
  2. requests
  3. A Proxycurl credential (username and password)

How to get a Proxycurl credential

You can request a free trial Proxycurl credential at Proxycurl's website. However, this article is now primarily of historical value because Proxycurl has been sunset.

If your goal today is to get structured B2B data in roughly the same shape, but without scraping LinkedIn and with none of the legal liability, skip ahead to the NinjaPear section appended at the end of this tutorial.

1. Start with LinkedIn profile and make a Proxycurl request

Let's start with a LinkedIn profile, say Bill Gates' LinkedIn profile: https://www.linkedin.com/in/williamhgates/

We will use Proxycurl's browser crawl because LinkedIn's page requires JavaScript for the page to be rendered. Let's go into the Python code:

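import json
import requests
from requests.auth import HTTPBasicAuth

# Placeholder: replace with the Proxycurl endpoint you were given
API_HOSTNAME = 'https://replace.me.with.proxycurl.hostname.com/some_endpoint'
payload = {
    'id': 'bill-gates-crawl-id',  # your own identifier for this crawl
    'url': 'https://www.linkedin.com/in/williamhgates/',
    'type': 'browser',  # browser crawl, so JavaScript gets rendered
    'headers': {'LANG': 'en'},  # ask for the English version of the page
}
# Send the crawl request, authenticating with your Proxycurl credential
r = requests.post(
    API_HOSTNAME,
    auth=HTTPBasicAuth('USER', 'PASSWD'),
    data=json.dumps(payload),
)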

Let's break this down

In the code snippet above, you are making a Proxycurl request of type browser, which means it is a browser crawl request. A browser crawl request simulates opening the page in a real browser, with real user sessions.

The headers parameter in the payload dictionary is there to ensure that the returned language is always in English because not all nodes are located in English-speaking countries. This ensures that the LANG header in the request is overwritten with the value en.

Now that we have crafted the payload, we send the request off by calling requests.post(). This makes an API request with the HTTP POST method to Proxycurl's servers, and Proxycurl then forwards the request to a randomly selected node.

All you have to do now is wait for a response.

Proxycurl status update

Proxycurl has been sunset. I am keeping this section intact because the tutorial is still useful as a historical explanation of how browser-based LinkedIn crawling worked. The founder behind Proxycurl now runs NinjaPear instead. NinjaPear does not scrape LinkedIn, and that is very intentional. LinkedIn sued Proxycurl in 2025, and the product was shut down rather than ask customers to sit on top of that risk. If what you really need is person, company, employee, customer, competitor, and monitoring data in production, the safer modern path is NinjaPear, which I cover at the end of this article.

I tried this, but the response is not a proper LinkedIn profile page

Not all nodes are logged into LinkedIn. Please retry a few times until you get a positive result.
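
The retry advice above can be sketched as a small loop. This is only a sketch under assumptions, not documented Proxycurl behavior: fetch is a hypothetical zero-argument callable that wraps the requests.post() call from step 1 and returns the page HTML, and looks_like_profile is a hypothetical heuristic that checks for the pv-top-card-section__name class this tutorial extracts in step 2.

```python
import time


def looks_like_profile(html):
    """Assumed heuristic: a rendered profile contains the name heading class."""
    return 'pv-top-card-section__name' in html


def fetch_with_retries(fetch, max_attempts=5, pause_seconds=1.0):
    """Call fetch() until it returns a profile page or attempts run out.

    fetch is any zero-argument callable returning the page HTML as a string.
    Returns the HTML on success, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        html = fetch()
        if looks_like_profile(html):
            return html
        if attempt < max_attempts:
            time.sleep(pause_seconds)  # brief pause before trying again
    return None
```

Capping the attempts matters here: if no node is logged in at the moment, an unbounded loop would keep spending requests without ever getting a profile page.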

The page loads, but the page is not rendered

On slower computers or internet connections, the AJAX calls and JavaScript triggered when the page loads take longer to complete. If the page has only 500 ms, or half a second, to:

  • make AJAX requests to populate the page
  • render the UI elements from those AJAX requests

you should expect incomplete results.

To solve this problem, we have to increase the value of dom_read_delay_ms from its default of 500 ms to 30000 ms. What this does is ask the browser to wait 30 seconds after the page has loaded, much like jQuery's $(document).ready() plus an additional wait.

Modify payload to include the dom_read_delay_ms parameter:

payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.linkedin.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},
    'dom_read_delay_ms': 30000,
}

(Adding dom_read_delay_ms to the payload)
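
One consequence of the longer delay: if you set a client-side timeout on requests.post(), it has to be longer than dom_read_delay_ms, or the client gives up before the browser has finished waiting. Here is a minimal sketch of keeping the two in step. The helpers build_payload and client_timeout_seconds are hypothetical names of mine, and the 15-second overhead margin is an assumption, not a Proxycurl recommendation.

```python
def build_payload(crawl_id, url, dom_read_delay_ms=500):
    """Build the browser-crawl payload used in this tutorial."""
    return {
        'id': crawl_id,
        'url': url,
        'type': 'browser',
        'headers': {'LANG': 'en'},
        'dom_read_delay_ms': dom_read_delay_ms,
    }


def client_timeout_seconds(payload, overhead_seconds=15):
    """Timeout for the HTTP client: the render delay plus an assumed margin."""
    return payload['dom_read_delay_ms'] / 1000 + overhead_seconds


payload = build_payload(
    'bill-gates-crawl-id',
    'https://www.linkedin.com/in/williamhgates/',
    dom_read_delay_ms=30000,
)
# Pass this to the request, e.g. requests.post(..., timeout=timeout)
timeout = client_timeout_seconds(payload)
```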

2. Use BeautifulSoup to extract full name from the HTML

from bs4 import BeautifulSoup

response_dic = r.json()
soup = BeautifulSoup(response_dic['data'], 'html.parser')
h1 = soup.find_all('h1', class_='pv-top-card-section__name')[0]
print(h1.text)

Let's break down the code.

In line 1, we import the BeautifulSoup module. We use BeautifulSoup to parse the HTML document retrieved from the Proxycurl request, and also to navigate the DOM elements to extract relevant data.

In lines 3 and 4, we unpack the response from requests as JSON into a dictionary. The HTML document is contained in the data key of the dictionary, so we pass that value to BeautifulSoup to initialize the parser object.

With the BeautifulSoup object initialized, in line 5, we search the HTML document for h1 elements with the class pv-top-card-section__name. Because the .find_all() method returns a list, we assign the first result in the list to the h1 variable. There should only be one.

Then, the full name of Bill Gates will be printed in line 6.

A practical note from 2026: CSS selectors like this are brittle. They break when LinkedIn changes the DOM, and LinkedIn changes the DOM often enough that you should assume maintenance overhead as part of the job, not an exception.
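
One way to soften that brittleness is to try a list of known class names in order and report which one matched, so you notice when the page has drifted. The sketch below is an illustration under assumptions: it uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs, and the second class name in the usage example (top-card__name) is hypothetical, not a real LinkedIn selector.

```python
from html.parser import HTMLParser


class _H1ByClass(HTMLParser):
    """Collects the text of the first <h1> carrying a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False  # True while inside the matching <h1>
        self.done = False    # True once the first match has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != 'h1' or self.inside or self.done:
            return
        classes = (dict(attrs).get('class') or '').split()
        if self.wanted_class in classes:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == 'h1' and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def extract_name(html, candidate_classes):
    """Try each class name in order; return (name, matched_class) or (None, None)."""
    for css_class in candidate_classes:
        parser = _H1ByClass(css_class)
        parser.feed(html)
        parser.close()  # flush any buffered text
        name = ''.join(parser.chunks).strip()
        if name:
            return name, css_class
    return None, None
```

For example, extract_name(html, ['pv-top-card-section__name', 'top-card__name']) tries the selector from this tutorial first, then the hypothetical fallback. Logging the matched class tells you when the first choice has stopped working.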

3. Scaling it up, an exercise for the reader

In steps 1 and 2, we built a prototype that extracts a user's full name from their LinkedIn profile. There is a lot more you can do once you have a full-fledged crawler that scales. Here are some suggestions:

  • Consider using asyncio to launch multiple requests.
  • Notice that each Proxycurl request takes quite a bit of time: after increasing dom_read_delay_ms to 30 seconds, every request takes at least 30 seconds. You do not want to block while waiting for responses. Instead, you can have Proxycurl call back to a web endpoint that you have set up to receive the result. This is what the id in the payload is for. See the asynchronous browser crawl documentation for more information.
  • Check for errors and retry. Possible errors include, but are not limited to:
      • Page is not rendered completely or properly
      • LinkedIn is not logged in
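
As a starting point for the asyncio suggestion above, here is a minimal sketch that runs a blocking crawl function for many profile URLs with bounded concurrency. It rests on assumptions: crawl_one is a hypothetical blocking callable (for example, a wrapper around the requests.post() call from step 1), crawl_all is a name of mine, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio


async def crawl_all(profile_urls, crawl_one, max_concurrency=5):
    """Run the blocking crawl_one(url) for each URL, a few at a time.

    Returns a dict mapping each URL to its result, or to the exception
    it raised so the caller can decide what to retry.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            try:
                # Run the blocking call in a thread so the event loop stays free
                return url, await asyncio.to_thread(crawl_one, url)
            except Exception as exc:  # keep the failure for a later retry
                return url, exc

    pairs = await asyncio.gather(*(worker(url) for url in profile_urls))
    return dict(pairs)
```

You would call it with something like results = asyncio.run(crawl_all(urls, crawl_one)). The semaphore is the important part: with 30-second requests it is tempting to launch everything at once, but a cap keeps the load on both sides predictable.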

If I were doing this for real, I would also add:

  • a retry budget per profile ID
  • selector versioning, because DOM drift is constant
  • response fingerprinting, to quickly detect login walls, bot checks, and partial renders
  • queue-based fanout instead of raw fire-and-forget concurrency

That is where these projects get expensive. Fast.
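
Two of the items above, response fingerprinting and a per-profile retry budget, can be sketched in a few lines. Treat this as an illustration only: the marker strings are my assumptions and real login walls and bot checks will look different, while fingerprint and RetryBudget are hypothetical names.

```python
from collections import Counter

# Assumed marker strings; real login walls and bot checks will differ.
LOGIN_MARKERS = ('authwall', 'sign in to linkedin')
BOT_MARKERS = ('captcha', 'unusual activity')
NAME_CLASS = 'pv-top-card-section__name'  # selector used in step 2


def fingerprint(html):
    """Classify a response as 'login_wall', 'bot_check', 'partial' or 'ok'."""
    lowered = html.lower()
    if any(marker in lowered for marker in LOGIN_MARKERS):
        return 'login_wall'
    if any(marker in lowered for marker in BOT_MARKERS):
        return 'bot_check'
    if NAME_CLASS not in html:
        return 'partial'
    return 'ok'


class RetryBudget:
    """Allow at most max_retries retries per profile ID."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.used = Counter()

    def try_spend(self, profile_id):
        """Return True and record a retry if budget remains, else False."""
        if self.used[profile_id] >= self.max_retries:
            return False
        self.used[profile_id] += 1
        return True
```

A worker would fingerprint each response, and on anything other than 'ok' call try_spend() with the profile ID before requeueing, so one stubborn profile cannot eat the whole crawl.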

A modern alternative to the old Proxycurl workflow

If your real goal is not "I specifically want to scrape LinkedIn HTML," but rather "I need person and company data in my product," then the better answer in 2026 is not to revive Proxycurl. It is to stop depending on LinkedIn as your canonical source.

That is why NinjaPear exists.

I am appending this section instead of rewriting the whole article because the original tutorial still teaches the browser-crawl mechanics correctly. But for production systems, I would not advise building a new product around LinkedIn crawling unless you are very clear about the maintenance cost and legal exposure.

For the legal side, read this: none of the legal liability.

What changed after Proxycurl

Proxycurl is shut down. The founder behind Proxycurl now runs NinjaPear.

NinjaPear does not scrape LinkedIn. Instead, it builds company and employee intelligence from public web sources, and in practice that means you can still solve many of the same business problems:

  • enrich a person profile
  • enrich a company profile
  • find work emails
  • get employee counts
  • monitor company updates
  • find customers and competitors

The difference is that the canonical object is usually a company website or a public-web identity, not a LinkedIn URL.

NinjaPear endpoints that replace the old use case

If you were using Proxycurl's old LinkedIn-shaped flows, these are the closest NinjaPear replacements:

Old need           | NinjaPear endpoint         | What you pass in                              | What you get back
-------------------|----------------------------|-----------------------------------------------|-------------------------------------------------------
Person enrichment  | Person Profile Endpoint    | Work email, name + company, or role + company | Work history, education, location, public profile data
Company enrichment | Company Details Endpoint   | Company website                               | Industry, executives, offices, employee count
Employee count     | Employee Count Endpoint    | Company website                               | Fresh headcount
Work email lookup  | Work Email Lookup Endpoint | Name + company domain                         | Verified work email
Company monitoring | Company Updates Endpoint   | Company website                               | Blog posts and X updates
Customer discovery | Customer Listing Endpoint  | Company website                               | Customers, partners, investors

A few real numbers from current NinjaPear pricing matter here:

  • Person Profile: 3 credits / call
  • Company Details: 3 credits / call, up to 6 with extra flags
  • Employee Count: 2 credits / call
  • Company Updates: 2 credits / call
  • Work Email Lookup: 2 credits when found, 0.5 credit on miss
  • Free trial: 3 days, 10 credits, no card required

That is a much cleaner production story than keeping a brittle LinkedIn DOM parser alive.
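
To budget against the per-call prices listed above, a back-of-the-envelope estimator is enough. This is a sketch of the arithmetic only: the dictionary keys and the estimate_credits helper are hypothetical names of mine, not NinjaPear identifiers, and Company Details is counted at its base price of 3 credits.

```python
# Credit prices as listed above; Company Details uses the base price
# (extra flags can raise it to as much as 6 credits per call).
CREDITS_PER_CALL = {
    'person_profile': 3,
    'company_details': 3,
    'employee_count': 2,
    'company_updates': 2,
}
EMAIL_FOUND_CREDITS = 2
EMAIL_MISS_CREDITS = 0.5


def estimate_credits(calls, emails_found=0, emails_missed=0):
    """Estimate total credits for a workload.

    calls maps an endpoint key in CREDITS_PER_CALL to a number of calls.
    """
    total = sum(CREDITS_PER_CALL[name] * count for name, count in calls.items())
    total += emails_found * EMAIL_FOUND_CREDITS
    total += emails_missed * EMAIL_MISS_CREDITS
    return total


# 100 person profiles plus 100 email lookups, of which 80 succeed
print(estimate_credits({'person_profile': 100}, emails_found=80, emails_missed=20))
```

The same arithmetic shows what the free trial covers: 10 credits is enough for three Person Profile calls at 3 credits each, with 1 credit left over.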

Example: enrich a person without scraping LinkedIn

If what you wanted from the original article was a structured professional profile, this is the modern shape of the problem.

import requests

API_KEY = 'YOUR_API_KEY'
url = 'https://nubela.co/api/v1/employee/profile'
headers = {
    'Authorization': f'Bearer {API_KEY}',
}
params = {
    'email': '[email protected]',
}

response = requests.get(url, headers=headers, params=params)
print(response.json())

You are no longer waiting for a browser session to render a LinkedIn page, then parsing brittle HTML. You pass in a stable identifier and get structured JSON back.

That is a better engineering trade in most cases.

Example: enrich a company from its website

import requests

API_KEY = 'YOUR_API_KEY'
url = 'https://nubela.co/api/v1/company/details'
headers = {
    'Authorization': f'Bearer {API_KEY}',
}
params = {
    'website': 'https://stripe.com',
}

response = requests.get(url, headers=headers, params=params)
print(response.json())

If your downstream workflow is account scoring, routing, prospecting, or CRM enrichment, this usually gets you to value faster than crawling LinkedIn pages ever did.

When this original tutorial is still useful

This tutorial is still useful if you are:

  • studying how browser-based crawling pipelines worked
  • maintaining legacy infrastructure that still uses LinkedIn page rendering
  • debugging older scraping code that relied on HTML selectors

It is not the path I would pick for a new build in 2026.

Get started

If you are maintaining a legacy LinkedIn crawler, the code above still explains the core mechanics.

If you are building a new product, I would skip the crawler and start with NinjaPear instead. Begin with the company website or a work email, use the structured endpoints, and avoid turning LinkedIn's DOM changes into your team's part-time job.

Steven Goh | CEO
World's laziest CEO. CEO of NinjaPear. Ex-Founder of Proxycurl (10+M), Steven founded 5 other startups: Gom VPN, Kloudsec, SilvrBullet, NuMoney, and SharedHere.
