Tutorial: How to crawl LinkedIn with Python and `requests` library with code examples (Part 2)
This is the second article in a two-part series on crawling LinkedIn at scale. In the first article, we studied why LinkedIn is a hard target to crawl. In this follow-up, I dive into a technical tutorial, with demo code, on how you can crawl LinkedIn at scale.
In this tutorial, I will walk you through code that extracts a person's full name from their LinkedIn profile. While the tutorial focuses on a single profile, the same method can be scaled to as many asynchronous nodes as you want.
Setting up prerequisites
- Python 3
- `requests`
- A Proxycurl credential (username and password)
How to get a Proxycurl credential
Proxycurl used to offer free trial credentials on its website. That is no longer possible: Proxycurl has been sunset, so this article is now primarily of historical value.
If your goal today is to get structured B2B data in roughly the same shape, but without scraping LinkedIn and with none of the legal liability, skip ahead to the NinjaPear section appended at the end of this tutorial.
1. Start with LinkedIn profile and make a Proxycurl request
Let's start with a LinkedIn profile, say Bill Gates' LinkedIn profile: https://www.linkedin.com/in/williamhgates/
We will use Proxycurl's browser crawl because LinkedIn's page requires JavaScript for the page to be rendered. Let's go into the Python code:
import json
import requests
from requests.auth import HTTPBasicAuth
API_HOSTNAME = 'https://replace.me.with.proxycurl.hostname.com/some_endpoint'
payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.linkedin.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},
}
r = requests.post(
    API_HOSTNAME,
    auth=HTTPBasicAuth('USER', 'PASSWD'),
    data=json.dumps(payload),
)
Let's break this down
In the code snippet above, you are making a Proxycurl request of type browser, which means it is a browser crawl request. A browser crawl request simulates opening the page in a real browser, with real user sessions.
The headers parameter in the payload dictionary ensures that the page is always returned in English, because not all nodes are located in English-speaking countries. It overwrites the LANG header in the request with the value en.
Now that we have crafted the payload, we send the request by calling requests.post(). This makes an HTTP POST request to Proxycurl's servers, which then forward it to a randomly selected node.
All you have to do now is wait for a response.
Proxycurl status update
Proxycurl has been sunset. I am keeping this section intact because the tutorial is still useful as a historical explanation of how browser-based LinkedIn crawling worked. The founder behind Proxycurl now runs NinjaPear instead. NinjaPear does not scrape LinkedIn, and that is very intentional. LinkedIn sued Proxycurl in 2025, and the product was shut down rather than ask customers to sit on top of that risk. If what you really need is person, company, employee, customer, competitor, and monitoring data in production, the safer modern path is NinjaPear, which I cover at the end of this article.
I tried this, but the response is not a proper LinkedIn profile page
Not all nodes are logged into LinkedIn. Please retry a few times until you get a positive result.
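Here is a minimal sketch of that retry loop. The helper names (`looks_like_profile`, `fetch_until_profile`) are my own, and the check for a profile page is only a heuristic based on the heading class used in step 2. The network call is passed in as a function so the logic can be tried without a Proxycurl credential.

```python
import time


def looks_like_profile(html):
    """Heuristic (an assumption): a logged-in profile page contains the name heading class."""
    return 'pv-top-card-section__name' in html


def fetch_until_profile(fetch, max_attempts=5, pause_seconds=0.0):
    """Call fetch() until it returns HTML that looks like a profile page.

    fetch is any zero-argument callable returning an HTML string, for
    example a wrapper around the requests.post() call from step 1.
    Returns the HTML, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        html = fetch()
        if looks_like_profile(html):
            return html
        if attempt < max_attempts:
            time.sleep(pause_seconds)  # short pause before asking for another node
    return None


# Demo with a fake fetch: the first two "nodes" are not logged in.
pages = iter([
    '<html>Sign in to LinkedIn</html>',
    '<html>Join now</html>',
    '<h1 class="pv-top-card-section__name">Bill Gates</h1>',
])
html = fetch_until_profile(lambda: next(pages))
print(html)
```

Capping the number of attempts matters: if a profile is private or deleted, no amount of retrying will produce a positive result.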
The page loads, but the page is not rendered
On slower computers or internet connections, the AJAX calls and JavaScript scripts triggered when the page loads will take longer to complete. When the page has only 500 ms, or half a second, to:
- make AJAX requests to populate the page
- render the UI elements from those AJAX requests
you should expect incomplete results.
To solve this problem, we have to increase the value of dom_read_delay_ms from its default of 500 ms to 30000 ms. What this does is ask the browser to wait 30 seconds after the page has loaded, much like jQuery's $(document).ready() plus an additional wait.
Modify payload to include the dom_read_delay_ms parameter:
payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.linkedin.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},
    'dom_read_delay_ms': 30000,
}
(Adding dom_read_delay_ms to the payload)
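One side effect worth planning for: the node now waits 30 seconds before reading the DOM, so if you pass a `timeout` to `requests.post()`, the read timeout has to be longer than that delay. A small sketch follows. The helper name `read_timeout_seconds` and the 15-second margin are my own assumptions, not Proxycurl recommendations.

```python
def read_timeout_seconds(payload, margin_seconds=15.0, default_delay_ms=500):
    """Return a read timeout (in seconds) that outlasts the node's DOM read delay."""
    delay_ms = payload.get('dom_read_delay_ms', default_delay_ms)
    return delay_ms / 1000 + margin_seconds


payload = {
    'id': 'bill-gates-crawl-id',
    'url': 'https://www.linkedin.com/in/williamhgates/',
    'type': 'browser',
    'headers': {'LANG': 'en'},
    'dom_read_delay_ms': 30000,
}

# requests accepts a (connect, read) tuple for its timeout argument.
timeout = (5.0, read_timeout_seconds(payload))
print(timeout)  # (5.0, 45.0)

# Then pass it along, e.g. requests.post(API_HOSTNAME, ..., timeout=timeout)
```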
2. Use BeautifulSoup to extract full name from the HTML
from bs4 import BeautifulSoup
response_dic = r.json()
soup = BeautifulSoup(response_dic['data'], 'html.parser')
h1 = soup.find_all('h1', class_='pv-top-card-section__name')[0]
print(h1.text)
Let's break down the code.
First, we import the BeautifulSoup module. We use BeautifulSoup to parse the HTML document retrieved from the Proxycurl request, and also to navigate the DOM elements to extract relevant data.
Next, r.json() unpacks the response from requests as JSON into a dictionary. The HTML document is stored under the data key of that dictionary, so we pass that value to BeautifulSoup to initialize the parser.
With the BeautifulSoup object initialized, we search the HTML document for an h1 element with the class pv-top-card-section__name. Because the .find_all() method returns a list, we assign the first result in the list to the h1 variable. There should be only one.
Finally, print(h1.text) prints Bill Gates' full name.
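The snippet above assumes the request succeeded and that the data key is present. A slightly more defensive version could look like the sketch below. The function name `html_from_response` is my own, and treating anything other than HTTP 200 as a failure is an assumption, not documented Proxycurl behaviour. It takes the status code and body text as plain arguments so it can be tried without a live request.

```python
import json


def html_from_response(status_code, body_text):
    """Return the HTML stored under 'data', or raise with a readable reason."""
    if status_code != 200:  # assumption: only 200 carries a usable crawl result
        raise RuntimeError(f'crawl failed with HTTP {status_code}')
    body = json.loads(body_text)
    html = body.get('data')
    if not html:
        raise ValueError('response JSON has no "data" key, or it is empty')
    return html


# Demo with a hand-written body instead of a live response.
fake_body = json.dumps(
    {'data': '<h1 class="pv-top-card-section__name">Bill Gates</h1>'}
)
html = html_from_response(200, fake_body)
print(html)

# With a real response object from requests:
#   html = html_from_response(r.status_code, r.text)
```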
A practical note from 2026: CSS selectors like this are brittle. They break when LinkedIn changes the DOM, and LinkedIn changes the DOM often enough that you should assume maintenance overhead as part of the job, not an exception.
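One way to soften that brittleness is to keep a list of known class names and try them in order, so the extractor fails loudly instead of raising an IndexError when the DOM drifts. The sketch below uses Python's built-in `html.parser` rather than BeautifulSoup so that it runs without extra installs; the same idea carries over to `soup.find()`. The second class name in `SELECTORS` is made up for the demo.

```python
from html.parser import HTMLParser


class FirstH1ByClass(HTMLParser):
    """Collect the text of the first <h1> that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.text = None     # final result, set when the matching </h1> closes
        self._parts = None   # text chunks collected while inside the matching <h1>

    def handle_starttag(self, tag, attrs):
        if tag == 'h1' and self.text is None and self._parts is None:
            classes = (dict(attrs).get('class') or '').split()
            if self.css_class in classes:
                self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'h1' and self._parts is not None:
            self.text = ''.join(self._parts).strip()
            self._parts = None


def extract_name(html, candidate_classes):
    """Try each class name in order; return (name, matched_class) or (None, None)."""
    for css_class in candidate_classes:
        parser = FirstH1ByClass(css_class)
        parser.feed(html)
        parser.close()
        if parser.text:
            return parser.text, css_class
    return None, None


# Tried in order; 'top-card__title-v2' is a hypothetical future class name.
SELECTORS = ['pv-top-card-section__name', 'top-card__title-v2']

old_dom = '<h1 class="pv-top-card-section__name inline">Bill Gates</h1>'
new_dom = '<div><h1 class="top-card__title-v2"> Bill <span>Gates</span></h1></div>'
print(extract_name(old_dom, SELECTORS))
print(extract_name(new_dom, SELECTORS))
print(extract_name('<p>Sign in</p>', SELECTORS))
```

Returning the matched class alongside the name lets you log which selector version is still working, which is a cheap early warning that the page layout has changed.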
3. Scaling it up, an exercise for the reader
In steps 1 and 2, we built a prototype to extract a user's full name from their LinkedIn profile. But there is a lot more you can do once you have a full-fledged crawler that scales. Here are some suggestions:
- Consider using `asyncio` to launch multiple requests.
- Notice that each Proxycurl request takes quite a bit of time, especially after increasing `dom_read_delay_ms` to 30 seconds, which means each request takes at least 30 seconds. You do not want to keep waiting for responses to return. Instead, you can have Proxycurl call back to a web endpoint that you have set up to receive the result. This is what the `id` in the `payload` is for. See the asynchronous browser crawl documentation for more information.
- Check for errors and retry. Possible errors include, but are not limited to:
  - Page is not rendered completely or properly
  - LinkedIn is not logged in
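To make the `asyncio` suggestion concrete, here is a rough sketch that keeps a fixed number of crawls in flight at once. `crawl_profile` is a stand-in that only sleeps; the function names, the example URLs, and the limit of 3 are all assumptions for illustration.

```python
import asyncio


async def crawl_profile(url):
    """Stand-in for one crawl request; replace the sleep with a real HTTP call."""
    await asyncio.sleep(0.01)  # simulates network and render time
    return {'url': url, 'status': 'ok'}


async def crawl_all(urls, max_in_flight=3):
    """Crawl every URL while keeping at most max_in_flight requests open."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(url):
        async with semaphore:
            return await crawl_profile(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f'https://www.linkedin.com/in/example-{i}/' for i in range(10)]
results = asyncio.run(crawl_all(urls))
print(len(results), results[0])
```

Note that `requests` is a blocking library. To reuse the `requests.post()` call from step 1 inside `crawl_profile`, one option is to run it in a worker thread with `asyncio.to_thread()` (Python 3.9+); another is to switch to an async HTTP client.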
If I were doing this for real, I would also add:
- a retry budget per profile ID
- selector versioning, because DOM drift is constant
- response fingerprinting, to quickly detect login walls, bot checks, and partial renders
- queue-based fanout instead of raw fire-and-forget concurrency
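Two of those items, the retry budget and response fingerprinting, can be sketched in a few lines. Everything here is illustrative: the marker strings are guesses rather than verified LinkedIn markup, and `fingerprint` and `RetryBudget` are names I made up.

```python
from collections import Counter

# Marker strings are illustrative guesses, not verified LinkedIn markup.
BOT_MARKERS = ('security verification', 'captcha')
LOGIN_MARKERS = ('sign in', 'join now')
PROFILE_MARKER = 'pv-top-card-section__name'


def fingerprint(html):
    """Classify a crawl result so the caller can decide whether to retry."""
    lowered = html.lower()
    if any(marker in lowered for marker in BOT_MARKERS):
        return 'bot_check'
    if PROFILE_MARKER in lowered:
        return 'ok'
    if any(marker in lowered for marker in LOGIN_MARKERS):
        return 'login_wall'
    return 'partial_render'


class RetryBudget:
    """Allow at most max_attempts crawl attempts per profile ID."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.attempts = Counter()

    def try_acquire(self, profile_id):
        """Record one attempt and return True, or return False if the budget is spent."""
        if self.attempts[profile_id] >= self.max_attempts:
            return False
        self.attempts[profile_id] += 1
        return True


budget = RetryBudget(max_attempts=2)
print([budget.try_acquire('bill-gates-crawl-id') for _ in range(3)])
print(fingerprint('<html>Sign in to LinkedIn</html>'))
print(fingerprint('<h1 class="pv-top-card-section__name">Bill Gates</h1>'))
print(fingerprint('<html><body></body></html>'))
```

The point of separating the two is that the fingerprint tells you why a crawl failed, while the budget decides whether it is still worth trying again.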
That is where these projects get expensive. Fast.
A modern alternative to the old Proxycurl workflow
If your real goal is not "I specifically want to scrape LinkedIn HTML," but rather "I need person and company data in my product," then the better answer in 2026 is not to revive Proxycurl. It is to stop depending on LinkedIn as your canonical source.
That is why NinjaPear exists.
I am appending this section instead of rewriting the whole article because the original tutorial still teaches the browser-crawl mechanics correctly. But for production systems, I would not advise building a new product around LinkedIn crawling unless you are very clear about the maintenance cost and legal exposure.
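On the legal side, the point is the one made near the start of this article: NinjaPear gives you the data with none of the legal liability of scraping LinkedIn.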
What changed after Proxycurl
Proxycurl is shut down. The founder behind Proxycurl now runs NinjaPear.
NinjaPear does not scrape LinkedIn. Instead, it builds company and employee intelligence from public web sources, and in practice that means you can still solve many of the same business problems:
- enrich a person profile
- enrich a company profile
- find work emails
- get employee counts
- monitor company updates
- find customers and competitors
The difference is that the canonical object is usually a company website or a public-web identity, not a LinkedIn URL.
NinjaPear endpoints that replace the old use case
If you were using Proxycurl's old LinkedIn-shaped flows, these are the closest NinjaPear replacements:
| Old need | NinjaPear endpoint | What you pass in | What you get back |
|---|---|---|---|
| Person enrichment | Person Profile Endpoint | Work email, name + company, or role + company | Work history, education, location, public profile data |
| Company enrichment | Company Details Endpoint | Company website | Industry, executives, offices, employee count |
| Employee count | Employee Count Endpoint | Company website | Fresh headcount |
| Work email lookup | Work Email Lookup Endpoint | Name + company domain | Verified work email |
| Company monitoring | Company Updates Endpoint | Company website | Blog posts and X updates |
| Customer discovery | Customer Listing Endpoint | Company website | Customers, partners, investors |
A few real numbers from current NinjaPear pricing matter here:
- Person Profile: 3 credits / call
- Company Details: 3 credits / call, up to 6 with extra flags
- Employee Count: 2 credits / call
- Company Updates: 2 credits / call
- Work Email Lookup: 2 credits when found, 0.5 credit on miss
- Free trial: 3 days, 10 credits, no card required
That is a much cleaner production story than keeping a brittle LinkedIn DOM parser alive.
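To get a feel for what those prices mean for a workload, here is a back-of-the-envelope estimator. The credit values are copied from the list above; the function name, the example volumes, and the 50% email hit rate are assumptions for illustration. Company Details is counted at its base price of 3 credits, without the extra flags.

```python
# Credits per call, copied from the pricing list above.
CREDITS = {
    'person_profile': 3,
    'company_details': 3,   # base price; extra flags can raise this to 6
    'employee_count': 2,
    'company_updates': 2,
}
EMAIL_FOUND_CREDITS = 2
EMAIL_MISS_CREDITS = 0.5


def estimate_credits(calls, email_lookups=0, email_hit_rate=0.0):
    """Estimate total credits for a batch of calls plus work email lookups."""
    total = sum(CREDITS[name] * count for name, count in calls.items())
    hits = email_lookups * email_hit_rate
    misses = email_lookups - hits
    return total + hits * EMAIL_FOUND_CREDITS + misses * EMAIL_MISS_CREDITS


estimate = estimate_credits(
    {'person_profile': 1000, 'company_details': 200},
    email_lookups=400,
    email_hit_rate=0.5,  # assumed hit rate, not a published figure
)
print(estimate)  # 4100.0
```

That works out as 3,000 credits for the person profiles, 600 for the company details, 400 for the 200 emails found, and 100 for the 200 misses.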
Example: enrich a person without scraping LinkedIn
If what you wanted from the original article was a structured professional profile, this is the modern shape of the problem.
import requests
API_KEY = 'YOUR_API_KEY'
url = 'https://nubela.co/api/v1/employee/profile'
headers = {
    'Authorization': f'Bearer {API_KEY}',
}
params = {
    'email': 'first.last@example.com',  # placeholder; use a real work email
}
response = requests.get(url, headers=headers, params=params)
print(response.json())
You are no longer waiting for a browser session to render a LinkedIn page, then parsing brittle HTML. You pass in a stable identifier and get structured JSON back.
That is a better engineering trade in most cases.
Example: enrich a company from its website
import requests
API_KEY = 'YOUR_API_KEY'
url = 'https://nubela.co/api/v1/company/details'
headers = {
    'Authorization': f'Bearer {API_KEY}',
}
params = {
    'website': 'https://stripe.com',
}
response = requests.get(url, headers=headers, params=params)
print(response.json())
If your downstream workflow is account scoring, routing, prospecting, or CRM enrichment, this usually gets you to value faster than crawling LinkedIn pages ever did.
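Since both examples share the same boilerplate, you may want a small helper that builds the arguments for `requests.get()` in one place. This is only a sketch: the helper name `ninjapear_get_kwargs` and the 30-second timeout are my own choices, and the base URL is simply the common prefix of the two endpoint URLs shown above.

```python
API_BASE = 'https://nubela.co/api/v1'  # common prefix of the two URLs used above


def ninjapear_get_kwargs(path, params, api_key, timeout=30):
    """Build keyword arguments for requests.get() without sending anything."""
    return {
        'url': f'{API_BASE}/{path}',
        'headers': {'Authorization': f'Bearer {api_key}'},
        'params': params,
        'timeout': timeout,  # assumed value; tune it for your workload
    }


kwargs = ninjapear_get_kwargs(
    'company/details',
    {'website': 'https://stripe.com'},
    'YOUR_API_KEY',
)
print(kwargs['url'])

# To send it:
#   response = requests.get(**kwargs)
#   response.raise_for_status()
#   print(response.json())
```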
When this original tutorial is still useful
This tutorial is still useful if you are:
- studying how browser-based crawling pipelines worked
- maintaining legacy infrastructure that still uses LinkedIn page rendering
- debugging older scraping code that relied on HTML selectors
It is not the path I would pick for a new build in 2026.
Get started
If you are maintaining a legacy LinkedIn crawler, the code above still explains the core mechanics.
If you are building a new product, I would skip the crawler and start with NinjaPear instead. Begin with the company website or a work email, use the structured endpoints, and avoid turning LinkedIn's DOM changes into your team's part-time job.