Using Proxycurl's Historic LinkedIn Employee Count Tool for Investment Research

What is Proxycurl's Historic LinkedIn Employee Count Tool, and why is it useful?

Proxycurl's Historic LinkedIn Employee Count Tool reconstructs a company's employee count at monthly historical snapshots, producing a trend line that dates as far back as the user wants (with the caveat that it's only accurate as long as there are no major statistical confounding factors, such as the company being in stealth). This might already sound awesome to you! But let's talk about why you might want to use this tool.

First of all, we have to understand that LinkedIn employee data is a good proxy for total headcount: we're mostly interested in the tech sector in the English-speaking world, where LinkedIn is the ubiquitous social media platform for sharing employment status, and we won't be looking back more than a few years in time, so LinkedIn's status as such won't be in question. We'll also apply common sense to the results: if our trend line drops to zero while news stories say the company existed but was in stealth, we'll believe the news stories, not our trend line.

Example: HireEZ

Let's now look at an example. Here's a graph of HireEZ's extrapolated employee counts. What can we learn from it?

A graph of HireEZ's employee counts between March 2020 & February 2023

A couple of things stand out:

  • HireEZ had a huge period of growth in headcount from about December 2020 to February 2021, continuing even through July 2021.
  • HireEZ had another burst of hiring around February 2022.
  • The recent tech layoffs have not left them alone, and they've had a downturn in headcount since October 2022.

Without this data, what methods might we use to gauge the health of a company? If we were investors in HireEZ, we would have access to investor reports, but we're outsiders. Well then, we can ask: what are the investors doing? Let's look at HireEZ's funding rounds. In August 2020, HireEZ (then known as Hiretual) raised a $13 million Series B round, and in February 2022, they raised another $26 million.

These dates don't line up exactly with our data - it looks like hiring started a bit after the first funding round - but the correspondence is close, and it looks like they used these injections of capital to fund hiring.

A quick search doesn't pull up results for "hireez layoffs," but we can make inferences. Is this company doing well now? Of course, the question must be put in the context of the market-wide downturn happening right now, but these graphs give us some intuition.

Example: Backblaze

In contrast to the privately-held HireEZ, Backblaze is public - they IPO'd in November 2021. Public companies have an indicator of their health that's unavailable for private companies: their stock price. But how does it correlate with employee count? And which is the better signal of company health?

Here's a screenshot of Backblaze's stock since their IPO, courtesy of Google Finance:

Backblaze's stock since their IPO

And here's their employee count, which has risen fairly steadily, with a very small reduction in the past couple of months.

Backblaze's employee count between March 2020 & February 2023

What do you think about their current health? What does this graph tell you that the stock price didn't? Have you learned something new?

I'm in! How can we make graphs like this?

Obviously, we have to scrape LinkedIn in some way, as LinkedIn doesn't provide a public API for this data. If you're a regular reader of this blog or already use the Proxycurl API, it's great to have you back! If not, you may want to read this post about scraping LinkedIn for structured data.

In this particular case, though, there's actually another option that we could theoretically use, which you might be familiar with: LinkedIn's in-house Premium Business Insights. However, if you want to analyze this at scale (which we do!), then scraping data using that tool would be a TOS violation, and you'd risk your LinkedIn account being banned. So we are back to the Proxycurl API, built for developers looking to scrape LinkedIn.

There is no single API endpoint where we can call requests.get(some_endpoint, params={"url": company_url}) and have it return what we want. But that's okay - that's why the rest of this post will be a lot of fun! We have built a tool that uses three endpoints to perform a calculation that approximates the result. In the rest of this post, I will introduce you to the Proxycurl Historic Employee Count Tool.

Want to run it yourself? It's available as a Docker container. To download it, run:

docker pull ghcr.io/nubelaco/historic-employee-count-tool:master

And to execute it, run:

docker run -it ghcr.io/nubelaco/historic-employee-count-tool:master PROXYCURL_API_KEY TARGET_COMPANY_LI_URL > employee_count_history.csv

You can also clone the repository yourself if you want to edit the Python code or you're not comfortable with Docker, and there's some additional documentation available in that repo's README that's not covered here.

How to build a historic LinkedIn employee count tool

High-level overview of the tool

  1. Grab the current month's total employee count from LinkedIn. This is very easy.
  2. For each of the past N months, grab a snapshot of the number of employees with public profiles on LinkedIn (since this is all that's available to us). This is a little trickier, but still doable.
  3. Use steps 1 and 2 to calculate X for each month in the proportion: that month's snapshot count : X = current month's snapshot count : current month's total reported by LinkedIn. In other words, X = that month's snapshot count × (LinkedIn's current total ÷ current month's snapshot count). A quick worked example follows.
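
To make the proportion concrete with made-up numbers: suppose LinkedIn reports 1,000 total employees today, and our current-month snapshot finds 800 public profiles. If the snapshot for some past month finds 400 public profiles, we estimate that month's true headcount as X = 400 × 1,000 / 800 = 500.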

Proxycurl endpoints we will use

  • Employee Listing Endpoint - One of the company endpoints, this lists every employee of a company and returns their profile URLs. In the Proxycurl API, a LinkedIn URL is always the unique identifier of an entity, be it company, person, job, or anything else.
  • Employee Count Endpoint - Another one of the company endpoints, this does exactly what it says: it gives us a count of the employees employed by the company. It can return a cached count from Proxycurl's LinkDB as linkdb_employee_count, covering either past, current, or all employees. For this tool, though, we're more interested in the linkedin_employee_count, which is scraped directly from LinkedIn and includes private profiles.
  • Person Profile Endpoint - This endpoint is optional; it serves as a performance enhancement and a cache invalidator. We could, if we wanted, use the first endpoint with the enrich_profiles=enrich option instead, which would enrich our first query with person profile results. But for performance, we can batch our queries here & use the async Proxycurl Python client library to run the script a bit faster - and with the use_cache=if-recent flag, we can ensure our data is never more than 29 days out of date. (A minimal request sketch follows this list.)
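
To make the request shape concrete, here's a minimal sketch of calling the Employee Count Endpoint with plain requests. The endpoint path and the example company URL are assumptions based on Proxycurl's public documentation at the time of writing - verify them against the current docs:

    import os
    import requests

    API_KEY = os.environ["PROXYCURL_API_KEY"]

    # Employee Count Endpoint; path taken from Proxycurl's docs - verify before use.
    response = requests.get(
        "https://nubela.co/proxycurl/api/linkedin/company/employees/count",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={
            "url": "https://www.linkedin.com/company/hireez",  # example target company
            "employment_status": "all",  # count past and current employees
        },
    )
    data = response.json()
    # linkedin_employee_count: scraped from LinkedIn, includes private profiles.
    # linkdb_employee_count: Proxycurl's cached count of public profiles.
    print(data["linkedin_employee_count"], data["linkdb_employee_count"])

The other two endpoints authenticate the same way, with a Bearer token in the Authorization header.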

How to implement the high-level overview

  1. Query data from these endpoints.
  2. Create an array of datetime.date intervals using timedelta and calendar.monthrange (honestly, this would probably have been the hardest part of the entire project had I not done something nearly identical several years ago). A sketch of this step follows the list.
  3. For every employee profile we fetched, check which of the month intervals we created in step 2 intersect the profile's experiences.start and experiences.end ranges.
  4. Progressively accumulate these intersections into an array of total_employee_ranges.
  5. Calculate, for each month, X in our ratio from the high-level overview:
    @staticmethod
    def get_adjusted_employee_counts(past_employee_counts: List[int], current_employee_count: int) -> List[int]:
        # Slot 0 holds the current (partial) month's snapshot count: the number
        # of public profiles we actually observed for the current month.
        current_employee_estimate = past_employee_counts[0]
        adjusted_employee_counts = []
        for item in past_employee_counts:
            # Avoid division by zero if no public profiles were observed.
            if current_employee_estimate == 0:
                adjusted_employee_counts.append(0)
                continue
            # Scale each month's snapshot by (LinkedIn's reported total /
            # our current snapshot count), per the proportion above.
            adjusted_employee_counts.append(int(item * current_employee_count / current_employee_estimate))
        return adjusted_employee_counts
    
  6. Construct a CSV to print to the user.
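
Here's a minimal sketch of steps 2 through 4 - building the month intervals and accumulating experience-date intersections. The helper names and the shape of the experience tuples are mine for illustration; the actual tool structures this differently:

    import calendar
    from datetime import date, timedelta
    from typing import List, Optional, Tuple

    def build_month_ranges(n_months: int, today: Optional[date] = None) -> List[dict]:
        """Build start/end date intervals for the current month and the past n_months."""
        today = today or date.today()
        ranges = []
        cursor = today
        for _ in range(n_months + 1):  # slot 0 holds the current (partial) month
            first_day = cursor.replace(day=1)
            last_day = cursor.replace(day=calendar.monthrange(cursor.year, cursor.month)[1])
            ranges.append({"start": first_day, "end": last_day})
            cursor = first_day - timedelta(days=1)  # step back into the previous month
        return ranges

    def count_employees_per_month(
        month_ranges: List[dict],
        experiences: List[Tuple[date, Optional[date]]],  # (start, end); end=None means ongoing
    ) -> List[int]:
        """For each month interval, count how many experiences overlap it."""
        counts = [0] * len(month_ranges)
        for start, end in experiences:
            for i, month in enumerate(month_ranges):
                # An experience overlaps a month if it starts on or before the month's
                # end and hasn't ended before the month's start.
                if start <= month["end"] and (end is None or end >= month["start"]):
                    counts[i] += 1
        return counts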

Gotchas

There are a few gotchas we have to address:

  • The proxycurl-py library raises and logs an exception on profiles that 404, but in this particular case we are expecting some profiles not to exist, so we need to catch & silence this particular exception (a rough sketch follows this list).
  • In some cases, users will link to an internationalized URL of the company, for example one under https://pl.linkedin.com. We match against only the last path segment of the URL (keeping in mind there could be a trailing slash), or:
    @staticmethod
    def identifier(url: str) -> str:
        # Take the last path segment of the URL, skipping a trailing slash if present.
        arr = url.split('/')
        return arr[-2] if arr[-1] == '' else arr[-1]
    
  • Off-by-one: What to do about the current month? We don't want to print it to the user, since it's going to be incomplete. But in our proportion of previous snapshot : X = current month's snapshot count : current month's total reported by LinkedIn, the entire RHS (right-hand side) must refer to the CURRENT month. So we do have to include the current month in our data set. The choice we made, therefore, was to include the current month in our months object, but then not print it to the user at the end:
    for i, month in enumerate(self.month_ranges):
        if i == 0:
            # Recall we used the first slot for current information, and not for a full month of data.
            continue
        o.append(f"{month['end'].strftime('%Y-%m-%d')},{adjusted_employee_counts[i]}")
    
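
Here's that 404-silencing pattern as a rough sketch. The client call shape follows proxycurl-py's documented async usage, and the broad except clause is deliberate - narrow it to whatever exception class your version of the library actually raises:

    async def fetch_profile_or_none(proxycurl, url: str):
        """Fetch a person profile, returning None when the profile no longer exists."""
        try:
            # use_cache='if-recent' ensures data is at most 29 days old.
            return await proxycurl.linkedin.person.get(url=url, use_cache='if-recent')
        except Exception:
            # Narrow this to the specific exception proxycurl-py raises on a 404
            # (check the library's source); some profiles are expected to be gone,
            # so we silence the error instead of letting it abort the whole run.
            return None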

Do you have to query the entire company to get meaningful results?

Probably not! We tested on Stripe, which LinkedIn lists as having 8003 employees, and the linkdb_employee_count from the Employee Count Endpoint gives 8766 when employment_status is set to all. If we query only 3000 of these, or about a third, we can get a pretty accurate picture of the trend line, and better yet, our script takes only about 6 minutes to run. Here's a graph with various limits:

Querying Stripe with different methods & limits

In the graph above, the solid blue line shows the trend if we skipped the Person Profile Endpoint entirely and simply used the Employee Listing Endpoint with enrich_profiles=enrich. As you can see, the trend is still mostly visible; however, this method is significantly slower than the async Person Profile Endpoint method, and we don't recommend it.

The orange heavy dashed line represents the most accurate data; it was generated using all company data. The next two lines show the query limited to 3000 and 1000 employees, respectively, and finally we show a query limited to 500. We chose 3000 as the default, but you can use this chart to guide your decision based on the size of the company you're querying and how precise a result you want.

You might wonder: is there any bias in the ordering of the sample data? Proxycurl orders results by LinkedIn ID, so hopefully employees aren't leaving and joining based on alphabetical order! But just in case they are, the sample is chosen randomly:

    # Assumes `from random import shuffle` and `from typing import List` at module level.
    @staticmethod
    def get_limited_sample_of_urls(past_employee_urls: List[str], limit: int) -> List[str]:
        # limit == -1 means "no limit"; also skip sampling if we already have few enough URLs.
        if limit == -1 or limit >= len(past_employee_urls):
            return past_employee_urls
        # Shuffle the indices and take the first `limit`, giving a uniform
        # random sample of profile URLs without replacement.
        ret = []
        ordering = list(range(len(past_employee_urls)))
        shuffle(ordering)
        for i in range(limit):
            ret.append(past_employee_urls[ordering[i]])
        return ret

More data is coming

Intrigued? We have more of these! Subscribe to our newsletter to make sure you don't miss out. Or sign up for our API and build this project yourself - if you identify something interesting, give us a shout at hello@nubela.co & maybe we'll feature you in another post!