Welcome to (or back to) our "Ultimate Guide to" series, in which we discuss different aspects of scraping data from LinkedIn! This time, our topic is: Company Data APIs. There are many reasons you might need a Company Data API, but at its core, a Company Data API is an enrichment service. This means that you have some identifier of a company, and the API will (1) identify what company you're asking for and then (2) give you a lot of data about this company.
In case you want to jump right to one of the top-level sections, here's a list:
- What should my Company Data API be able to do?
- Comparing Company Data APIs
- Python examples
- Conclusion
As you can see, there's quite a bit to cover this time, so let's get right into it.
What should my Company Data API be able to do?
Completeness of data
The first thing to consider is the completeness of the data. This quality is important because:
- Even if you think one solution has every field you need, additional use cases calling for other fields may arise in the future.
- You only want to use one product, so the first one you use should be complete.
Here is a list of company data points that you should consider when evaluating Company Data APIs:
- Employee data
- Funding data
- Industry (note: on LinkedIn, this is a specific term)
- Employee count
- Location (country, region, city)
- Key people
- Contact information
- News
- Links to various social media profiles
- Job listings
If a particular solution is missing several of these, you should seriously think twice before committing to it.
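To make this evaluation concrete, you could run a sample response from each candidate API through a quick coverage check. The field names below are hypothetical placeholders, not any particular provider's schema; map them onto whatever each API actually returns:

```python
# A rough coverage check: feed a sample response from each candidate API
# into this and see which data points it covers. The field names here are
# hypothetical placeholders, NOT any particular provider's schema.
CHECKLIST = [
    "employees", "funding_data", "industry", "employee_count",
    "location", "key_people", "contact_info", "news",
    "social_links", "job_listings",
]

def coverage_report(sample_response: dict) -> dict:
    """Map each checklist field to True if it is present and non-empty."""
    return {field: bool(sample_response.get(field)) for field in CHECKLIST}

# Example with a made-up partial response:
sample = {"industry": "Software Development", "employee_count": 500, "news": []}
report = coverage_report(sample)
missing = [field for field, present in report.items() if not present]
print(f"Covers {len(CHECKLIST) - len(missing)}/{len(CHECKLIST)} fields; missing: {missing}")
```

It's a blunt instrument, but running it against real responses from two or three providers surfaces gaps much faster than reading marketing pages.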
How can you resolve a particular company?
Next, consider how you'll resolve a particular company. A Company Data API is an enrichment API, so the set of data you already have about the company you're looking up may vary between use cases. For example, one day, you may have a set of company URLs and need to pull contact emails; another day, you may have a set of contact emails and need to pull location, industry, and funding data to see if the companies behind them meet your ICP (ideal customer profile).
Here are some possible methods to look up a company that you might want your Company Data API of choice to support:
- Search functionality (i.e. you supply keywords and then receive a list of matching companies).
- Nearest-match resolution based on a parameter like company name or website (for example, see the Proxycurl Company Lookup API).
- LinkedIn company profile URL resolution into complete JSON data.
Data freshness
How fresh is the data that you're getting? You will want relatively fresh data if you're looking for something like job postings. On the other hand, if you're trying to resolve a company URL to a company name, using cached data is probably fine.
Before committing to a provider, ensure you understand their data freshness guarantee. Read their documentation to see whether they allow you to control whether you're reading from cache or not with a query parameter. If so, then ensure you're setting this appropriately.
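For instance, Proxycurl's Company Profile Endpoint (used in the code samples later in this article) accepts a `use_cache` parameter with the values `if-present` and `if-recent`. A minimal sketch of toggling it per use case (it assumes your API key is in the `PROXYCURL_API_KEY` environment variable, matching the samples below):

```python
import os
import requests

API_ENDPOINT = 'https://nubela.co/proxycurl/api/linkedin/company'

def cache_params(linkedin_url: str, need_fresh: bool) -> dict:
    """Pick the cache behavior per use case: 'if-recent' asks for
    reasonably fresh data (e.g. job postings), while 'if-present'
    accepts any cached copy (e.g. URL-to-name resolution)."""
    return {
        'url': linkedin_url,
        'use_cache': 'if-recent' if need_fresh else 'if-present',
    }

def fetch_company(linkedin_url: str, need_fresh: bool = False) -> dict:
    headers = {'Authorization': 'Bearer ' + os.environ['PROXYCURL_API_KEY']}
    response = requests.get(API_ENDPOINT,
                            params=cache_params(linkedin_url, need_fresh),
                            headers=headers)
    response.raise_for_status()
    return response.json()
```

Centralizing the cache decision in one helper keeps you from accidentally paying for live scrapes on lookups where cached data is perfectly fine.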
Legal compliance
This one is, in a sense, the least interesting to talk about, so we saved it for last, but it's also the most important: make sure that the provider you're using takes legal compliance seriously. We've written a number of articles on the topic, and as of this publication, our most recent article on the legality of scraping LinkedIn was updated just a couple of weeks ago.
In a nutshell:
- Yes, you can scrape LinkedIn using public profiles only.
- No, you cannot scrape LinkedIn with fake logged-in accounts.
Please do not sign up for any service that doesn't tell you what it does to ensure legal compliance. You wouldn't want a codebase that depends entirely on one provider only to find out that this provider was in court against LinkedIn because it didn't do due diligence.
Comparing Company Data APIs
With all of the above factors in mind, let's now compare a few Company Data APIs:
- Coresignal
- TheCompanies API
- Proxycurl Company API
- People Data Labs
- UpLead Company Information API
Coresignal
We published a full review of Coresignal last year, and you can see our full thoughts there. With Coresignal, you can get data from several platforms, and the most significant point in its favor is its historical job postings data set, which few other platforms invest in maintaining. If your use case requires such a thing, you may want to choose Coresignal. However, their data, in general, is not fresh (3-4 months old), and they don't provide contact information, which is much more likely to be relevant.
Pros
- Historical job posting data set (if you need this, otherwise irrelevant).
- Data from multiple platforms.
Cons
- No Search API.
- Stale data (reportedly can be 3-4 months out of date).
- No contact information.
TheCompanies API
When you land on TheCompanies API's homepage, you're greeted by a cheerful, busy page that tells you it's "The Enrichment API to Search 54 Million Companies & Their Employees." You can also try it for free, no credit card required. With such a specialized offering, you'd think the first part of this claim would be their appeal, but it's actually the second part that makes them stand out: if you don't need bulk volumes, they have a generous free tier.
Pros
- Generous free tier that resets monthly. If you don't want to spend any money, you can go with them.
- Has a Search API, and it's not Elasticsearch.
Cons
- Their docs are not developer-friendly. If you aren't a developer, this might even be a pro, as they're written with non-developers in mind, but I'm writing this article from the POV of a developer, and I found their docs very hard to use. They take every opportunity to handhold non-developers while obscuring things developers need, such as how to include your API key.
- Companies are indexed by domain name rather than something more stable, like a LinkedIn profile URL. So any time a company changes its domain name, its identifier in TheCompanies API changes too.
- No funding, news, or job posting data.
- Data returned can be quite stale - four months or more.
Proxycurl
Proxycurl is a B2B platform targeting primarily large enterprise customers. It focuses on having fresh data scraped from LinkedIn and enriched from other sources, including Crunchbase, while remaining fully compliant. If you aren't a developer or don't have a dev team working for you, Proxycurl probably shouldn't be your choice, but if you do, you can't get better than a product built with other devs in mind.
Pros
- Rich data set, including every data attribute we've listed in this article.
- Built and documented by developers, with developers as the intended audience.
- If you are looking for volume, it costs as little as $0.009 per profile.
Cons
- Search is available as a REST API, but it launched only recently (as of this article's posting). You may have to wait a bit for some features to be released, and performance improvements are still being rolled out.
- No historical job posting data, so you may need to look elsewhere if you need this.
People Data Labs
We also have a full review of People Data Labs. Please note that that review is almost a year old as of this article's publication, and Proxycurl does in fact now have a Search API.
Pros
- Has a Search API (unfortunately, it's Elasticsearch-based; good luck learning that).
Cons
- Serves stale profile data.
- Expensive - prices vary widely from $0.03 per Company profile all the way up to approximately $0.20 if you need to look up people within a company. With Proxycurl, if you're on a volume plan, both company and profile results start at $0.009 per result returned.
UpLead Company Information API
UpLead's primary offering isn't their API; it's a GUI tool that's like a more specialized version of the LinkedIn Sales Navigator. If you're here only for the API, you're paying for many services you're not using. And you're paying a lot - this service is really expensive. That said, we'll look at their Company Data API offering in case you're interested.
Pros
- High rate limit of 500 requests per minute - although this is not a rate limit you'll want to reach unless you have a very large budget.
- If you're not in the market for an API but rather a GUI tool, you might be interested in this option.
Cons
- No job postings or news.
- Even on an annual plan, profiles are extremely expensive: if you commit to volume, you'll pay $0.30 per profile returned, and once you run out of credits (12,000), the price rises to $0.40 per result. This is not a service you use if you want scale. And if you don't want scale? Then you're starting at over $0.43 per profile, rising to $0.60 per profile once you run out of credits on that plan.
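To put these numbers in perspective, here's some quick back-of-the-envelope arithmetic at volume, using only the per-profile prices quoted in this article. Treat it as illustrative; real pricing is tiered and changes, so verify against each provider's current pricing page:

```python
# Back-of-the-envelope monthly cost at volume, using the per-profile
# prices quoted in this article. Verify against each provider's current
# pricing page before relying on these figures.
PRICE_PER_PROFILE = {
    'Proxycurl (volume plan)': 0.009,
    'People Data Labs (company profile)': 0.03,
    'UpLead (annual plan, within credits)': 0.30,
    'UpLead (annual plan, overage)': 0.40,
}

def monthly_cost(provider: str, profiles: int) -> float:
    return PRICE_PER_PROFILE[provider] * profiles

for provider in PRICE_PER_PROFILE:
    print(f"{provider}: ${monthly_cost(provider, 100_000):,.2f} for 100,000 profiles")
```

At 100,000 profiles a month, the gap between the cheapest and most expensive options here is more than an order of magnitude.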
Python examples
Phew, that was a lot of text! You've maybe formed some opinions by now based on your particular use case, but most likely, you just want to start writing code. In this section, we'll go over some working code samples for several use cases with the Proxycurl API.
In case you want to skip directly to the use case that most interests you, here's a mini table of contents:
- Retrieve employee data
- Enrich a company profile from a LinkedIn URL
- Look up a company based on incomplete information
- Search for companies from parameters
- Monitor employee count
- Access historical employee count trends
- Receive company updates
- Identify key people within a company
You may also be interested in a link directly to the Proxycurl Company API documentation.
Retrieve employee data
We will use the Employee Listing Endpoint to query a list of current US employees, sorted by recently joined. Other options for this endpoint include:
- Setting `sort_by` to `recently-left`, or leaving it blank (do not sort).
- Setting `employment_status` to `past` or `all`.
- Restricting our results with a `role_search` (we will discuss this later on).
Here's the code sample:
import json, os, requests
api_key = os.environ['PROXYCURL_API_KEY']
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company/employees/'
params = {
'url': 'https://www.linkedin.com/company/stripe',
'country': 'us',
'employment_status': 'current',
'sort_by': 'recently-joined',
}
response = requests.get(api_endpoint, params=params, headers=headers)
print(json.dumps(response.json())) # json.dumps is only for formatting
Here is the output. Please note that for this and all outputs containing people, we are redacting any actual people's profiles and replacing them with celebrities' profiles.
{
  "employees": [
    { "profile_url": "https://www.linkedin.com/in/satyanadella", "profile": null },
    { "profile_url": "https://www.linkedin.com/in/williamhgates", "profile": null }
  ],
  "next_page": "https://nubela.co/proxycurl/api/linkedin/company/employees/?url=SOMELONGSTRINGHERE"
}
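Note the `next_page` field: results are paginated. Here's a hedged sketch of walking every page by following `next_page` until it's absent (we're assuming each `next_page` URL embeds the full query, as the response above suggests):

```python
import requests

def all_employees(first_page_url: str, params: dict, headers: dict) -> list:
    """Accumulate employees across pages by following 'next_page' links.
    Assumption: each 'next_page' URL embeds the full query, so no extra
    params are sent after the first request."""
    employees = []
    url, page_params = first_page_url, params
    while url:
        data = requests.get(url, params=page_params, headers=headers).json()
        employees.extend(data.get('employees', []))
        url = data.get('next_page')  # absent/None on the last page
        page_params = None
    return employees
```

With the endpoint, params, and headers from the sample above, `all_employees(api_endpoint, params, headers)` would collect every page into a single list; for a large company, mind your credit usage before doing this.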
Enrich a company profile from a LinkedIn URL
For this task, the most appropriate endpoint is the Company Profile Endpoint. However, it's worth noting that several other endpoints, such as the Company Lookup Endpoint (which we'll go over in the next section), support a parameter called `enrich_profile`. When set to `enrich_profile=enrich`, this parameter automatically enriches each resulting profile with a default set of available data, saving you some coding and round trips to the server.
For now, though, let's look at the Company Profile Endpoint and a sample response.
The default field set from this endpoint contains data from LinkedIn; beyond that, you can enrich it further with:
- `categories` - these are retrieved from Crunchbase and are freeform, as opposed to LinkedIn's industry data, which is strictly curated as an enumeration of 757 values.
- `funding_data` - self-explanatory.
- `extra` - details such as Facebook account, Twitter account, IPO status, etc.
- `exit_data` - a list of investment portfolio exits.
- `acquisitions` - further enriched data on acquisitions made by this company, from external sources.
Each of these parameters costs an extra credit to obtain, so only add them to your request if you need them, but they can be an invaluable source of information when you do.
Here's sample code to enrich any data you might have about Google:
import json, os, requests
api_key = os.environ['PROXYCURL_API_KEY']
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company'
params = {
'url': 'https://www.linkedin.com/company/google/',
'resolve_numeric_id': 'true',
'categories': 'include',
'funding_data': 'include',
'extra': 'include',
'exit_data': 'include',
'acquisitions': 'include',
'use_cache': 'if-present',
}
response = requests.get(api_endpoint, params=params, headers=headers)
print(json.dumps(response.json())) # json.dumps is only for formatting
In this sample response, I've deleted every item past the first in a list for brevity (so, for example, there is only one "similar company" instead of 10). To give you an idea of how much data I trimmed to keep this article readable: this response is currently 124 lines long. It started as 1120 lines of data!
{
"linkedin_internal_id": "1441",
"description": "A problem isn't truly solved until it's solved for all. Googlers build products that help create opportunities for everyone, whether down the street or across the globe. Bring your insight, imagination and a healthy disregard for the impossible. Bring everything that makes you unique. Together, we can build for everyone.\n\nCheck out our career opportunities at careers.google.com.",
"website": "https://goo.gle/3m1IN7m",
"industry": "Software Development",
"company_size": [10001, null],
"company_size_on_linkedin": 328170,
"hq": {
"country": "US",
"city": "Mountain View",
"postal_code": "94043",
"line_1": "1600 Amphitheatre Parkway",
"is_hq": true,
"state": "CA"
},
"company_type": "PUBLIC_COMPANY",
"founded_year": null,
"specialities": ["search"],
"locations": [
{
"country": "US",
"city": "Mountain View",
"postal_code": "94043",
"line_1": "1600 Amphitheatre Parkway",
"is_hq": true,
"state": "CA"
}
],
"name": "Google",
"tagline": null,
"universal_name_id": "google",
"profile_pic_url": "https://s3.us-west-000.backblazeb2.com/proxycurl/company/google/profile?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0004d7f56a0400b0000000001%2F20230414%2Fus-west-000%2Fs3%2Faws4_request&X-Amz-Date=20230414T060849Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=a05c79199c57cc68e2a68cc1b5a44db3ed42518c157d17826b9abea587b5175d",
"background_cover_image_url": "https://s3.us-west-000.backblazeb2.com/proxycurl/company/google/cover?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=0004d7f56a0400b0000000001%2F20230414%2Fus-west-000%2Fs3%2Faws4_request&X-Amz-Date=20230414T060849Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=a594af628f299fdd13cf76ef3fae3aea24da2f5cf099ceef74c0fae7baa5f468",
"search_id": "1441",
"similar_companies": [
{
"name": "Amazon",
"link": "https://www.linkedin.com/company/amazon",
"industry": "Software Development",
"location": "Seattle, WA"
}
],
"affiliated_companies": [
{
"name": "YouTube",
"link": "https://www.linkedin.com/company/youtube",
"industry": "Software Development",
"location": "San Bruno, CA"
}
],
"updates": [
{
"article_link": null,
"image": "https://media.licdn.com/dms/image/D4E22AQEJZiJ65JRHtA/feedshare-shrink_2048_1536/0/1680546281929?e=1683763200&v=beta&t=TvO1LlJrYf1YRl6DroU6zQwonLjTx6oESZ1fmoXNGRY",
"posted_on": { "day": 4, "month": 4, "year": 2023 },
"text": "Mike Darling, an audience development editor, loves a challenge and recently put himself and the Google Pixel Watch to the test by running the Tokyo Marathon! \n\nMike concluded a decade-long goal to run all six Abbot World Marathon Majors. During the height of his training, he brought the Google Pixel Watch along to track 250,000 steps, and in the process, learned a bit about the importance of rest. Learn more about Mike and the Google Pixel Watch #LifeAtGoogle \u2192 https://goo.gle/3lWrXwV",
"total_likes": 2284
}
],
"follower_count": 28315301,
"social_networking_services": [
{ "service": "facebook", "canonical_url": "facebook.com/google", "internal_id": null },
{ "service": "twitter", "canonical_url": "twitter.com/google", "internal_id": null }
],
"acquisitions": {
"acquired": [
{
"linkedin_profile_url": "https://www.linkedin.com/company/4794728",
"crunchbase_profile_url": "https://www.crunchbase.com/organization/appsheet",
"announced_date": { "day": 15, "month": 1, "year": 2020 },
"price": null
}
],
"acquired_by": null
},
"exit_data": [
{
"linkedin_profile_url": "https://www.linkedin.com/company/23andme",
"crunchbase_profile_url": "https://www.crunchbase.com/organization/23andme",
"name": "23andMe"
}
],
"extra": {
"ipo_status": "Private",
"crunchbase_rank": 950422,
"founding_date": { "day": 1, "month": 1, "year": 2000 },
"operating_status": "Closed",
"company_type": "For Profit",
"contact_email": null,
"phone_number": null,
"facebook_id": "google",
"twitter_id": null,
"number_of_funding_rounds": 0,
"total_funding_amount": null,
"stock_symbol": null,
"ipo_date": null,
"number_of_lead_investors": 0,
"number_of_investors": 0,
"total_fund_raised": 0,
"number_of_investments": null,
"number_of_lead_investments": 0,
"number_of_exits": null,
"number_of_acquisitions": null
},
"funding_data": [
{
"funding_type": "Angel Round",
"money_raised": 1000000,
"announced_date": { "day": 1, "month": 11, "year": 1998 },
"number_of_investor": 4,
"investor_list": [
{ "linkedin_profile_url": null, "name": "Andy Bechtolsheim", "type": "person" },
{ "linkedin_profile_url": null, "name": "David Cheriton", "type": "person" }
]
}
],
"categories": [
"advertising-6cb6",
"collaboration",
"enterprise-software",
"information-technology-dbca",
"search-engine-0d39"
]
}
Look up a company based on incomplete information
There is an entire endpoint dedicated to this: the Company Lookup Endpoint. Perhaps you were expecting this, since we mentioned it in the previous section.
In this example, we'll try to resolve Accenture in two ways, first based on their URL and then on their name. Hopefully, we'll get the same response both times!
Here's sample code when we specify `company_domain`:
import json, os, requests
api_key = os.environ['PROXYCURL_API_KEY']
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company/resolve'
params = {
'company_domain': 'accenture.com',
}
response = requests.get(api_endpoint, params=params, headers=headers)
print(json.dumps(response.json())) # json.dumps is only for formatting
And the response is:
{"url": "https://www.linkedin.com/company/accenture"}
And if we change our `params` dict to this:
params = {
'company_name': 'Accenture',
}
Then we should get the same thing. Let's run it, and...
{"url": "https://www.linkedin.com/company/accenture"}
Yep!
Now, if we wanted to, we could specify this:
params = {
'company_domain': 'accenture.com',
'enrich_profile': 'enrich',
}
We didn't because we figured you'd had enough of hundred-line-long JSONs for this article. But it's totally an option, and if you want to enrich your data at the same time as you fetch it, you absolutely can.
Search for companies from parameters
We recently released a brand-new Search API! You can check out that entire article for a deep dive on Search, but we'll go over a brief example of the Company Search Endpoint from that article here as well.
Say you're searching for companies that match your ICP (ideal customer profile). In this case, we'll target privately held Medical Equipment Manufacturing companies founded in 2018 or later, with at most 1250 employees, located in the US.
Here's the Python code:
import json, os, requests
api_key = os.environ['PROXYCURL_API_KEY']
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/search/company'
params = {
'type': 'PRIVATELY_HELD',
'founded_after_year': '2018',
'employee_count_max': '1250',
'industry': 'Medical Equipment Manufacturing',
'country': 'US',
}
response = requests.get(api_endpoint, params=params, headers=headers)
print(json.dumps(response.json())) # json.dumps is only for formatting
And here's a sample response. Keep in mind that Search responses aren't sorted or deterministic, so if you run this yourself, you may get a different result.
{
"results": [
{ "linkedin_profile_url": "https://www.linkedin.com/company/medcorp-global-ltd" },
{ "linkedin_profile_url": "https://www.linkedin.com/company/hygienic-labs-llc" }
],
"next_page": "https://nubela.co/proxycurl/api/search/SOMELONGSTRINGHERE"
}
Monitor employee count
We will use the aptly-named Employee Count Endpoint. This endpoint does quite a bit for something so specific. You can configure:
- Whether to include the `linkedin_employee_count` - if set to `include`, you will get not only the cached `linkdb_employee_count` (a count of the employees present in Proxycurl's LinkDB database) but also the live number scraped from LinkedIn.
- The `employment_status` - we want the count of `current` employees for this use case. But maybe you're interested in the number of `past` employees, or `all` employees who have ever worked at this company? Either way, this endpoint has your back.
Here's Python code that you can run using a task scheduling service to monitor a company's employee count.
import json, os, requests
api_key = os.environ['PROXYCURL_API_KEY']
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company/employees/count'
params = {
'url': 'https://www.linkedin.com/company/apple/',
'use_cache': 'if-present',
'linkedin_employee_count': 'include',
'employment_status': 'current',
}
response = requests.get(api_endpoint, params=params, headers=headers)
print(json.dumps(response.json())) # json.dumps is only for formatting
And here's the response:
{
"linkedin_employee_count": 255413,
"linkdb_employee_count": 96700
}
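To turn this into real monitoring, persist each reading and compute the change between runs. In this minimal sketch, the JSON state file is our own storage choice, not part of the API:

```python
import json
import os
from typing import Optional

STATE_FILE = 'employee_counts.json'  # our own local state, not part of the API

def record_count(company_url: str, count: int,
                 state_file: str = STATE_FILE) -> Optional[int]:
    """Store the latest employee count for a company and return the delta
    versus the previous reading (None on the first reading)."""
    history = {}
    if os.path.exists(state_file):
        with open(state_file) as f:
            history = json.load(f)
    previous = history.get(company_url)
    history[company_url] = count
    with open(state_file, 'w') as f:
        json.dump(history, f)
    return None if previous is None else count - previous
```

On each scheduled run, call `record_count(url, response.json()['linkedin_employee_count'])` and alert whenever the returned delta crosses a threshold of your choosing.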
Access historical employee count trends
This example is a bit too complicated to fit into a single section. In fact, we published an entire article on accessing historical employee count trends! If you want to get straight to the code, you can go directly to the repository. Still, I'd recommend checking out that entire article; it's pretty interesting.
That article also walks through each of the endpoints it uses.
Receive company updates
If you've browsed the Proxycurl API docs, you might have noticed there is no endpoint called "Company Update Endpoint," so you might be thinking: How can we possibly receive company updates? However, if you paid close attention to the response from the Company Profile Endpoint, you will have noticed that therein lies the answer!
The Company Profile Endpoint contains a field called `updates` that collates updates from various sources, including LinkedIn and the company's own website, if available. Let's look at a query and result for Google.
For this query, we'll certainly want to use the `use_cache=if-recent` parameter, and we don't need any of the extras, so we've adjusted our earlier Company Profile example a bit.
Here's the code (notice that we're printing only the `updates` field):
import json, os, requests
api_key = os.environ['PROXYCURL_API_KEY']
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company'
params = {
'url': 'https://www.linkedin.com/company/google/',
'use_cache': 'if-recent',
}
response = requests.get(api_endpoint, params=params, headers=headers)
print(json.dumps(response.json()['updates'])) # json.dumps is only for formatting
And here's (the first three entries of) the result:
[
{
"article_link": null,
"image": "https://media.licdn.com/dms/image/D4E22AQEJZiJ65JRHtA/feedshare-shrink_2048_1536/0/1680546281929?e=1683763200&v=beta&t=TvO1LlJrYf1YRl6DroU6zQwonLjTx6oESZ1fmoXNGRY",
"posted_on": { "day": 4, "month": 4, "year": 2023 },
"text": "Mike Darling, an audience development editor, loves a challenge and recently put himself and the Google Pixel Watch to the test by running the Tokyo Marathon! \n\nMike concluded a decade-long goal to run all six Abbot World Marathon Majors. During the height of his training, he brought the Google Pixel Watch along to track 250,000 steps, and in the process, learned a bit about the importance of rest. Learn more about Mike and the Google Pixel Watch #LifeAtGoogle \u2192 https://goo.gle/3lWrXwV",
"total_likes": 2284
},
{
"article_link": "https://blog.google/technology/ai/try-bard/",
"image": null,
"posted_on": { "day": 23, "month": 3, "year": 2023 },
"text": "Today we're starting to open up access to Bard, our early experiment that lets you collaborate with generative AI. You can use Bard to boost your productivity, accelerate your ideas and fuel your curiosity. We're beginning with the U.S. + U.K. and expanding over time to more countries and languages. Learn more and sign up. https://goo.gle/3naF4e3",
"total_likes": 16772
},
{
"article_link": "https://blog.google/technology/ai/ai-developers-google-cloud-workspace/",
"image": null,
"posted_on": { "day": 16, "month": 3, "year": 2023 },
"text": "We\u2019re bringing the power of generative AI to more people, developers and businesses, helping you create and collaborate in Google Workspace, build with our AI models on our open cloud platform Google Cloud, and more. Read the Keyword post from Thomas Kurian, CEO of Google Cloud: https://goo.gle/4087LXk",
"total_likes": 3972
}
]
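If you poll this endpoint on a schedule, you'll typically want only the updates posted since your last check. A small sketch that filters the `updates` list by its `posted_on` field:

```python
from datetime import date

def updates_since(updates: list, cutoff: date) -> list:
    """Keep only the updates posted strictly after the cutoff date."""
    def posted_on(update: dict) -> date:
        p = update['posted_on']
        return date(p['year'], p['month'], p['day'])
    return [u for u in updates if posted_on(u) > cutoff]
```

Against the three sample entries above, `updates_since(response.json()['updates'], date(2023, 3, 20))` would keep the April 4 and March 23 posts and drop the March 16 one.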
Identify key people within a company
Once again, we'll use the Employee Listing Endpoint. This query is similar to the employee data retrieval from earlier, except now we'll make use of the `role_search` parameter.
Here is the code sample:
import json, os, requests
api_key = os.environ['PROXYCURL_API_KEY']
headers = {'Authorization': 'Bearer ' + api_key}
api_endpoint = 'https://nubela.co/proxycurl/api/linkedin/company/employees/'
params = {
'url': 'https://www.linkedin.com/company/discord',
'country': 'us',
'employment_status': 'current',
'role_search': '\\bdirector\\b'
}
response = requests.get(api_endpoint, params=params, headers=headers)
print(json.dumps(response.json())) # json.dumps is only for formatting
Here is the output. Remember, these profiles have been anonymized to be celebrity accounts. If you run the query yourself, you will get different data.
{
  "employees": [
    { "profile_url": "https://www.linkedin.com/in/satyanadella", "profile": null },
    { "profile_url": "https://www.linkedin.com/in/williamhgates", "profile": null }
  ],
  "next_page": "https://nubela.co/proxycurl/api/linkedin/company/employees/?url=SOMELONGSTRINGHERE"
}
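The sample above passes the word-boundary pattern `\bdirector\b` as `role_search`, which suggests the parameter accepts a regular expression (confirm the exact matching rules, such as case sensitivity, in the docs). Here's a small helper for building an alternation over several seniority titles:

```python
import re

# The word-boundary pattern \bdirector\b in the sample above suggests
# role_search accepts a regular expression; confirm exact server-side
# matching rules (e.g. case sensitivity) in the API docs.
def role_pattern(*titles: str) -> str:
    """Build a word-boundary alternation, e.g. '\\b(director|vp)\\b'."""
    return r'\b(' + '|'.join(re.escape(t) for t in titles) + r')\b'

pattern = role_pattern('director', 'vp', 'chief')
print(pattern)
```

You could then pass the resulting pattern as the `role_search` value to match any of several key-people titles in a single query.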
Conclusion
As you can see, there's no silver bullet when it comes to a "Company Data API." Depending on your specific use case, any one of several different solutions might fit best. Most of them have critical flaws, such as high cost, stale data, or the lack of an important feature you might need, like the ability to enrich with contact information. That's because they target hyper-specific use cases and are unlikely to cater to your needs if you're searching for a generalized solution.
Fortunately, Proxycurl understands the needs of developers and has packaged everything you could possibly want, with fresh data from multiple sources, including LinkedIn company profiles, Crunchbase profiles, and more. Even better, it provides them to you in a single, easy-to-use API - look at our code samples above! Have questions? Reach out to us by emailing [email protected] and we will get back to you! Can't wait to get started? You'll receive free credits upon registering for a Proxycurl account so start now!