This is a two-part series on crawling LinkedIn at scale. In this first part, we look at why LinkedIn is a hard target to crawl. In part two, I will walk through a technical tutorial, with demo code, on how you can crawl LinkedIn at scale.
Everybody wants a piece of LinkedIn's data, all the more so because LinkedIn guards it tightly. Companies such as hiQ Labs have ended up in court with LinkedIn over circumventing its anti-crawling measures, yet in that case the courts sided with the crawler and ruled that crawling publicly accessible profiles is legal.
Before we move on to a technical guide on how to crawl LinkedIn, let's understand why it is hard to crawl LinkedIn at scale:
1. You need to be logged into LinkedIn to gain access to content
To view any profile on LinkedIn, you have to be logged in.
3. LinkedIn blocks your IP when you crawl too much or too fast
An IP address is rate limited once it sends too many requests.
4. LinkedIn blocks your account when you crawl too much or too fast
A LinkedIn account is likewise rate limited once it is used for too many requests.
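To make the last two points concrete, here is a minimal sketch of how a crawler might track its own request budget per IP address or per account so that it backs off before it gets blocked. The class name `RequestBudget` and the numbers used are my own illustration. LinkedIn does not publish its thresholds, so the limits below are assumptions, not real values.

```python
import time
from collections import defaultdict, deque


class RequestBudget:
    """Sliding-window request counter, kept separately per key.

    A key can be an IP address or an account name. The limits are
    hypothetical and would have to be tuned by observation.
    """

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._history = defaultdict(deque)

    def try_acquire(self, key):
        """Record a request for `key` and return True if it is under budget."""
        now = self.clock()
        history = self._history[key]
        # Forget requests that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_requests:
            return False  # over budget: the caller should wait or switch key
        history.append(now)
        return True


if __name__ == "__main__":
    budget = RequestBudget(max_requests=2, window_seconds=60)
    for attempt in range(3):
        print(attempt, budget.try_acquire("203.0.113.7"))
```

Because the budget is tracked per key, the same object can be used once for IP addresses and once for accounts, which mirrors the two separate limits described above.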
Use Proxycurl to crawl LinkedIn profiles at scale
To crawl LinkedIn at scale, you need the following conditions to be satisfied:
- You will need many (residential) IP addresses
- You will need many LinkedIn accounts (logged in)
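As a rough illustration of why both pools matter, the sketch below spreads a list of profile URLs across (IP address, account) pairs in round-robin order, so that no single IP or account carries all of the traffic. The function names, the placeholder values, and the choice to pin each account to one IP are my own assumptions for this example. It is not a description of how Proxycurl works internally.

```python
from itertools import cycle


def build_identities(proxies, accounts):
    """Pair each account with one proxy IP (an assumed design choice)."""
    if len(proxies) < len(accounts):
        raise ValueError("need at least one proxy per account")
    return list(zip(proxies, accounts))


def assign(urls, identities):
    """Distribute URLs across identities in round-robin order."""
    if not identities:
        raise ValueError("no identities available")
    return [
        (url, proxy, account)
        for url, (proxy, account) in zip(urls, cycle(identities))
    ]


if __name__ == "__main__":
    # Placeholder values only; real proxies and accounts are not shown here.
    identities = build_identities(
        ["198.51.100.1", "198.51.100.2"],
        ["account-a", "account-b"],
    )
    for job in assign(["profile-1", "profile-2", "profile-3"], identities):
        print(job)
```

With more identities in the pool, each one handles a smaller share of the requests, which is the reason the list above asks for many of each.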
Proxycurl's private network of browser-based crawler nodes satisfies these requirements in full. Our proxy network is made up of tens of thousands of residential users around the world who lend us their computers, each running an up-to-date browser, to assist in our crawling efforts.
In part 2, I will provide a technical tutorial on how you can do this.