Since we launched LinkDB, we have received a barrage of requests for a company profile dataset. We understand that pairing people with companies will help our customers answer questions like:

  1. How many employees does a company have?
  2. What is the makeup of roles in a company?

It was only natural that we made crawling company profiles exhaustively a priority. But why do the work if you can get it for free? So let's talk about the elephant in the room.

PeopleDataLabs (PDL) offers a "free company dataset."

It is true. PDL does offer a free company dataset, and I was curious:

  • Why is PDL offering this dataset for free?
  • How many companies do they have?
  • What fields does this dataset have?
  • Is this dataset any good?

I put my spy hat on, went over to their website, gave away personal information, including my phone number, and received this email shortly after.

The email PDL sent me after I gave them loads of personal information

How many companies does the free Company dataset have?

I picked the CSV dataset dump in the email, and a file named free_company_dataset.csv.zip began to download. I unpacked it, and I ran the following wc command to find out how many lines of companies there are:

$ wc -l free_company_dataset.csv
 12258431 free_company_dataset.csv
Total lines of data in the CSV file

There we have it. The file has 12,258,431 lines, one of which is the header row, so PDL's company profile dataset has roughly 12.25M company profiles.
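Note that wc -l counts the header line as well, and it would over-count if any quoted field contained a line break. As a minimal sketch, assuming the dump is a standard comma-separated file, the records can be counted with Python's csv module instead. The function name count_companies and the tiny in-memory sample are illustrative only.

```python
import csv
import io


def count_companies(csv_file):
    """Count data records in an open CSV file, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one
    return sum(1 for _ in reader)


# Tiny in-memory stand-in for free_company_dataset.csv (rows copied from the dump).
sample = io.StringIO(
    "name,domain,year_founded\n"
    "nearfox.com,nearfox.com,2015\n"
    '"mullin landscape associates, llc",,2007\n'
)
print(count_companies(sample))  # prints 2
```

For the real file, you would pass something like open("free_company_dataset.csv", newline="", encoding="utf-8") instead of the in-memory sample. The encoding here is an assumption.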

Next, I wanted to find out what fields this dataset has:

$ head free_company_dataset.csv
name,domain,year_founded,industry,size_range,locality,country,linkedin_url,current_company_employee_estimate,total_employee_estimate
(le) poisson rouge,lprnyc.com,,entertainment,51-200,,,linkedin.com/company/-le-poisson-rouge,42,224
nearfox.com,nearfox.com,2015,internet,11-50,,,linkedin.com/company/zip-news,4,43
"mullin landscape associates, llc",,2007,construction,1-10,,,linkedin.com/company/mullin-landscape-associates-llc-,20,27
armatile,armatilearchitectural.com,1975,design,1-10,,,linkedin.com/company/armatile-limited,13,23
chameleon venues,,,marketing and advertising,1-10,,,linkedin.com/company/chameleon-venues,2,5
wagner kirkman blaine klomparens & youmans llp,wkblaw.com,1976,law practice,51-200,,,linkedin.com/company/wagner-kirkman-blaine-klomparens-&-youmans-llp,40,139
skilled engineering limited,,,insurance,1-10,,,linkedin.com/company/skilled-engineering-limited,1,31
gillette management llc,,,consumer goods,1-10,,,linkedin.com/company/gillette-management-llc,0,7
choice wood company,choicecompanies.com,1983,architecture & planning,1-10,,,linkedin.com/company/choice-wood-company,9,37
Peeking into the first few lines of the CSV file. The first line usually contains the column headers.

The column labels of the CSV file are:

  • name
  • domain
  • year_founded
  • industry
  • size_range
  • locality
  • country
  • linkedin_url
  • current_company_employee_estimate
  • total_employee_estimate

Not bad. It does have the most important fields, except for a timestamp showing when each record was last updated.
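The column labels can also be read programmatically rather than by eye. The sketch below parses the header line copied from the head output above; the helper name read_columns is hypothetical.

```python
import csv
import io


def read_columns(csv_file):
    """Return the column labels from the first row of an open CSV file."""
    return next(csv.reader(csv_file), [])


# Header line copied verbatim from the dump shown above.
header = io.StringIO(
    "name,domain,year_founded,industry,size_range,locality,country,"
    "linkedin_url,current_company_employee_estimate,total_employee_estimate\n"
)
columns = read_columns(header)
print(len(columns))  # prints 10
print(columns)
```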

How old is PDL's company profile dataset?

Since the dataset carries no last-updated timestamp, I decided to sample some profiles, and make statistical inferences.

I extracted the first 999 companies from the dataset, and threw them into a bulk LinkedIn company scraping script that I open-sourced here. This script uses Proxycurl's LinkedIn Company Profile API endpoint to scrape and enrich a LinkedIn company profile URL if it is valid.
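The extraction step can be sketched as follows. This is not the open-sourced script itself, only an illustration of pulling the linkedin_url column from the first rows of the dump; the function name first_linkedin_urls is hypothetical, and the sample rows are copied from the head output earlier in this post.

```python
import csv
import io
from itertools import islice


def first_linkedin_urls(csv_file, limit=999):
    """Return the linkedin_url value of the first `limit` data rows."""
    reader = csv.DictReader(csv_file)
    return [row["linkedin_url"] for row in islice(reader, limit)]


# In-memory stand-in for the dump, trimmed to the columns needed here.
sample = io.StringIO(
    "name,domain,linkedin_url\n"
    "(le) poisson rouge,lprnyc.com,linkedin.com/company/-le-poisson-rouge\n"
    "nearfox.com,nearfox.com,linkedin.com/company/zip-news\n"
    '"mullin landscape associates, llc",,'
    "linkedin.com/company/mullin-landscape-associates-llc-\n"
)
for url in first_linkedin_urls(sample, limit=2):
    print(url)
```

Taking the head of the file is the simplest option. A random sample would be less sensitive to any ordering in the dump.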

Out of 999 companies, there were only results for 835 companies.

16.4%, or 164 out of 999 companies provided in the dataset, are not valid on Linkedin.

Extrapolating that 16.4% to all 12,258,431 lines, roughly 2,010,382 companies in PDL's free company dataset are dead.
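The arithmetic behind these numbers can be reproduced in a few lines. The sketch assumes the extrapolated figure comes from applying the rounded 16.4% rate to the full wc -l line count, which matches the number quoted; the function name extrapolate_dead is illustrative.

```python
def extrapolate_dead(sampled, live, total):
    """Return (dead_in_sample, rate_percent, estimated_dead_total).

    The rate is rounded to one decimal place of a percent before
    scaling up, using integer arithmetic to avoid float drift.
    """
    dead = sampled - live
    per_mille = round(dead * 1000 / sampled)  # 164 per 1,000 = 16.4%
    estimate = total * per_mille // 1000
    return dead, per_mille / 10, estimate


# 999 sampled, 835 returned results, 12,258,431 lines in the CSV.
print(extrapolate_dead(999, 835, 12_258_431))  # (164, 16.4, 2010382)
```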

I conclude that this dataset is super old.

Why is PDL offering you an outdated Company Profile Dataset for free?

Because anyone interested in big datasets is an ideal customer, and the free download lets them collect personal and contact information about you so they can upsell you later.

Our turn - 17M companies in Proxycurl's LinkDB, our profile database

What about our dataset?

In January, we commissioned a crawl of all public LinkedIn company profiles. I am happy to share that we now have 17M+ company profiles available in LinkDB. We used Proxycurl's LinkedIn Company Profile API endpoint to accomplish this feat.

These company profiles were updated just a few days ago and are up-to-date at the point of writing. And they will stay up to date because we will not stop refreshing them.

Fields in Proxycurl's Company Profile Dataset

The following fields represent companies in our dataset:

  1. linkedin_internal_id
  2. description
  3. website
  4. industry
  5. company_size
  6. company_size_on_linkedin
  7. HQ
  8. company_type
  9. founded_year
  10. specialties
  11. locations
  12. name
  13. tagline
  14. universal_name_id
  15. funding_data
  16. search_id
  17. similar_companies
  18. follower_count

Yes, our dataset has a lot more fields.

In summary: Proxycurl VS PeopleDataLabs - Company Profile Dataset

Proxycurl Company Profile Dataset | PDL Company Profile Dataset
17M profiles | 12.25M profiles
Last updated on 25th January 2021 | Last updated many years ago
Standard fields + description, headquarters location, company type, specialties, locations, profile picture, similar companies, LinkedIn follower count | Standard fields
0% DEAD profiles | 16.4% DEAD profiles
Monthly data updates | No updates

Proxycurl's Global Company Profile Dataset is available now.

  1. Please don't take my word for it. Try it yourself. If you register and log into Proxycurl, you will get access to LinkDB, our PostgreSQL server, which contains Proxycurl's Global Company dataset. Make a few queries and sample the data for yourself :)
  2. Yes, we do sell a snapshot of our global company dataset. Keen? Please send me an email at [email protected].