How we enriched 7,084 records for two professors and a PhD student at the Hong Kong University of Science and Technology (HKUST)


Steven Goh

A few weeks ago, a student at the Hong Kong University of Science and Technology (HKUST) contacted me. The student represented a group of academics who needed to enrich a list of people with work experience and education history.
In this article, I will share how I did this profile enrichment exercise with the Proxycurl API.

The problem

The list came with first_name, last_name, name of current employer, title of current employment, location, and work email.

Given a list of people, my job is to find their corresponding LinkedIn profiles and enrich the list with work and education history.

As it turns out, the data that HKUST provided was outdated.

Resolving people to their LinkedIn Profile

The list provided by HKUST contained general but identifiable information about each person: first name, last name, employer name, and role within the organization. These bits of information are an exact match for the input parameters of Proxycurl's Profile General Resolution Endpoint.

To resolve these loose bits of information about a person to their LinkedIn profile, I wrote the following function in Python.

import sys

import httpx

# Configuration constants; the values here are illustrative.
PROXYCURL_HOST = 'https://nubela.co/proxycurl'
PROXYCURL_API_KEY = 'your-api-key-here'
RETRY_COUNT = 3
PROXYCURL_XHR_DEFAULT_TIMEOUT = 60


async def resolve_profile_url(first_name, last_name, title, country, city,
                              coy_name, company_domain):
    """Resolve a person's identifying details to their LinkedIn profile URL."""
    last_exc = Exception()
    for _ in range(RETRY_COUNT):
        try:
            api_endpoint = f'{PROXYCURL_HOST}/api/linkedin/profile/resolve'
            header_dic = {'Authorization': 'Bearer ' + PROXYCURL_API_KEY}
            async with httpx.AsyncClient() as client:
                params = {'company_domain': f"{coy_name} {company_domain}",
                          'title': title,
                          'first_name': first_name,
                          'last_name': last_name,
                          'location': f"{country} {city}",
                          }
                resp = await client.get(api_endpoint,
                                        params=params,
                                        headers=header_dic,
                                        timeout=PROXYCURL_XHR_DEFAULT_TIMEOUT
                                        )
                if resp.status_code != 200:
                    print(resp.status_code)
                assert resp.status_code == 200
                return resp.json()['url']
        except KeyboardInterrupt:
            sys.exit()
        except Exception as exc:
            # Remember the failure and retry; re-raised once RETRY_COUNT is exhausted.
            last_exc = exc
    raise last_exc

With the profile resolution code written, I iterated through the CSV list of people provided by HKUST and resolved each person to a matching LinkedIn profile.
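
For illustration, here is a minimal sketch of that driver loop, kept sequential for clarity (the concurrent variant is covered in the tips section below). The column order (a leading record id followed by the fields listed earlier) and the reuse of the work email's domain as company_domain are my assumptions, not the exact HKUST layout:

import asyncio
import csv

async def resolve_all(path_to_input: str) -> list:
    """Resolve every row of the input CSV to a (record id, LinkedIn URL) pair."""
    with open(path_to_input, 'r') as f:
        rows = list(csv.reader(f))

    results = []
    for row in rows:
        row_id, first_name, last_name, coy_name, title, location, email = row[:7]
        # The list has a single location field, so pass it as the country
        # argument with an empty city; the work email's domain doubles as
        # company_domain.
        domain = email.split('@')[-1] if '@' in email else ''
        try:
            url = await resolve_profile_url(first_name, last_name, title,
                                            location, '', coy_name, domain)
            results.append((row_id, url))
        except Exception:
            # Could not be resolved even after retries; keep for manual review.
            results.append((row_id, None))
    return results

# resolved = asyncio.run(resolve_all('hkust_input.csv'))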

Bulk scraping LinkedIn profiles

The next step is the easy part, and it is Proxycurl's core competency. Now that I have a list of LinkedIn profile URLs, all I need to do is send them to Proxycurl's Person Profile Endpoint.

As with the resolution endpoint, I wrote a function that takes a LinkedIn profile URL and returns the profile's structured data.

async def get_person_profile(url: str) -> dict:
    """Fetch structured profile data for a LinkedIn profile URL.

    Shares the imports and configuration constants defined alongside
    resolve_profile_url() above.
    """
    last_exc = Exception()
    for _ in range(RETRY_COUNT):
        try:
            api_endpoint = f'{PROXYCURL_HOST}/api/v2/linkedin'
            header_dic = {'Authorization': 'Bearer ' + PROXYCURL_API_KEY}
            async with httpx.AsyncClient() as client:
                params = {'url': url}
                resp = await client.get(api_endpoint,
                                        params=params,
                                        headers=header_dic,
                                        timeout=PROXYCURL_XHR_DEFAULT_TIMEOUT
                                        )
                assert resp.status_code == 200
                return resp.json()
        except KeyboardInterrupt:
            sys.exit()
        except Exception as exc:
            last_exc = exc
    raise last_exc
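
To feed the final step, each profile's raw JSON just needs to be stored next to its record id. Here is a minimal sketch, assuming the (record id, URL) pairs from the resolution step and a two-column CSV of record id and raw JSON, which is the format the massaging function below reads; the file name is hypothetical:

import asyncio
import csv
import json

async def scrape_all(resolved: list, path_to_output: str) -> None:
    """Fetch each resolved LinkedIn profile and store (record id, raw JSON) rows."""
    with open(path_to_output, 'w', newline='') as f:
        writer = csv.writer(f)
        for row_id, url in resolved:
            if url is None:
                continue  # unresolved rows were set aside earlier
            profile = await get_person_profile(url)
            writer.writerow([row_id, json.dumps(profile)])

# asyncio.run(scrape_all(resolved, 'hkust_profiles.csv'))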

Proxycurl API tips

You will notice that the functions I wrote are:

  • asynchronous
  • tolerant of unexpected exceptions, with retry as the default action

The functions are asynchronous because each request takes 10 seconds on average. To maximize throughput, I adopted Proxycurl's best practice of sending concurrent requests. In my script, I used 100 workers with Python's `httpx` library to send concurrent asynchronous API requests.
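
Here is one way such a worker pool could look with `asyncio`'s built-in semaphore, capping the number of in-flight requests at 100. This is a sketch of the pattern, not the exact script I ran, and it assumes each row is already a tuple of resolve_profile_url's arguments:

import asyncio

MAX_WORKERS = 100

async def resolve_all_concurrently(rows: list) -> list:
    """Run resolve_profile_url() for every row, at most MAX_WORKERS at a time."""
    semaphore = asyncio.Semaphore(MAX_WORKERS)

    async def worker(row):
        async with semaphore:
            try:
                return await resolve_profile_url(*row)
            except Exception:
                return None  # give up on this row after its retries

    return await asyncio.gather(*(worker(row) for row in rows))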

Each request sent to Proxycurl's API is an on-demand scrape job, so there is a non-zero chance of a client-side or network error. When this happens, the right thing to do is always to retry.
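
The functions above retry immediately. If you want to be gentler on transient network errors, adding a jittered exponential backoff between attempts is a common refinement; here is a sketch of that variant (not something my original script did):

import asyncio
import random

async def with_backoff(make_request, retries: int = 3):
    """Await make_request(), sleeping ~1s, ~2s, ~4s between failed attempts."""
    last_exc = Exception()
    for attempt in range(retries):
        try:
            return await make_request()
        except Exception as exc:
            last_exc = exc
            await asyncio.sleep(2 ** attempt + random.random())
    raise last_exc

# profile = await with_backoff(lambda: get_person_profile(url))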

With these two tips, I could scrape the entire HKUST file in one go.

Massaging data for output

The team at HKUST needed the output in a specific (Excel-friendly) format. After iterating through the list to resolve LinkedIn profiles and fetching their corresponding profile data, we now have the raw data we need. All that is left is to massage the raw data into the CSV format that HKUST wanted.

And this is how I did it:

import csv
import json

# PATH_2_INPUT:  the original HKUST list (first column is the record id)
# PATH_2_OUTPUT: CSV of (record id, raw profile JSON) from the scraping step
# PATH_2_EXP:    destination CSV of flattened work experiences
# _format_date:  a small helper (defined elsewhere) that renders Proxycurl's
#                date objects as strings

def massage_data_for_experience():
    # Load the input record ids once instead of rescanning the input file
    # for every output row.
    with open(PATH_2_INPUT, 'r') as input_f:
        input_ids = {row[0] for row in csv.reader(input_f)}

    with open(PATH_2_OUTPUT, 'r') as output_f, \
            open(PATH_2_EXP, 'a+', newline='') as exp_f:
        writer = csv.writer(exp_f)
        for counter, output_row in enumerate(csv.reader(output_f), start=1):
            output_id = output_row[0]
            profile = json.loads(output_row[1])

            if output_id not in input_ids:
                continue

            # Flatten each work experience into one CSV row.
            for exp in profile['experiences']:
                starts_at = _format_date(exp['starts_at'])
                ends_at = _format_date(exp['ends_at'])
                linkedin_profile_url = \
                    f"https://www.linkedin.com/in/{profile['public_identifier']}"
                writer.writerow([output_id, exp['company'], exp['title'],
                                 exp['employment_type'], exp['location'],
                                 starts_at, ends_at, exp['description'],
                                 linkedin_profile_url])
            print(f"{counter}: Done.")

And this is what the final result looked like.

Do you have an enrichment task?

Hey, I would love to help you out. Let me know if you have a task at hand that requires bulk LinkedIn data scraping. You can shoot us an email at [email protected].

And if you enjoy reading weekly anecdotes about how we solve business problems with our data tools, click here to subscribe to our email list :)
