How we scraped Ahrefs Evolve 2024 attendees via Micepad and enriched them with LinkedIn data to identify noteworthy people to network with (Spreadsheet Included)

My marketing team and I are attending the Ahrefs Evolve 2024 conference in two days' time. My objective in attending (as the CEO of the company) is to identify a few brilliant individuals whom I might be able to hire to bolster Proxycurl's marketing team. Just yesterday, I logged into Micepad, the official platform for the conference, to check out the event agenda and schedule, and I found that the full list of attendees is available in the dashboard for my browsing.

As a product CEO and a software developer, I noticed that the attendees were lazily loaded as you scrolled towards the end of the page, which means the full list of attendees can be fetched. The problem is that I lack finer details about the attendees. What kind of companies do they work for? How noteworthy are they? Instantly, I realized that Proxycurl's Person Lookup Endpoint is the solution I was seeking. I already have each attendee's first and last name and the company they work for, so I can pair each attendee with their LinkedIn profile URL and then enrich it further with profile data such as follower count.

But before I jump into it, here's the attendee list for Ahrefs Evolve 2024.

Scraping And Paginating Attendees

I opened up my handy developer tools in Firefox, scrolled to the end of the Micepad attendee page, and checked out the XHR requests that were happening. I saw a request that can be reproduced with this curl command:

curl 'https://app.micepad.co/api/web2/getInstanceAttendees' \
  --compressed \
  -X POST \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br, zstd' \
  -H 'Referer: https://micepad.co/' \
  -H 'Content-Type: application/json' \
  -H 'Origin: https://micepad.co' \
  -H 'DNT: 1' \
  -H 'Connection: keep-alive' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-site' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'TE: trailers' \
  --data-raw '{
    "expressions":[],
    "moduleid":221220,
    "searchTerm":"",
    "offset":4,
    "apikey":"xxx-censored-xxx",
    "instanceid":6751,
    "languageid":7277
  }'

Lo and behold, the attendees were returned in JSON like this:

[
  {
    "email": "",
    "firstname": "Aaron",
    "lastname": "Taylor",
    "title": "SEO Director",
    "company": "Prosperity Media",
    "country": "AU",
    "phonenumber": "",
    "industry": "",
    "photoPath": "",
    "biography": "",
    "businessCardPath": "",
    "isValid": true,
    "fullName": "Aaron Taylor",
    "userid": 1027150,
    "instanceid": 6751,
    "instanceUser": {
      "userid": 1027150,
      "type": "user",
      "bmstatus": 1
    },
    "bmChoicesIds": [],
    "realScore": 0,
    "totalScore": 0,
    "rank": 13,
    "allowInitiateText": 1,
    "allowReceiveText": 1,
    "allowInitiateVideoCall": 0,
    "allowAnswerVideoCall": 0,
    "allowRequestMeeting": 1,
    "allowAcceptMeeting": 1,
    "onlineStatus": 0
  },
  {
    "email": "",
    "firstname": "Aaron",
    "lastname": "Sin",
    "title": "Specialist, Marketing Communications & Investor Relations",
    "company": "Tiong Woon Crane & Transport Pte Ltd",
    "country": "SG",
    "phonenumber": "",
    "industry": "",
    "photoPath": "https://data.micepad.co/data/uploads/6751/profile/SXdxlp_AaronSinNoBackground_6de0bb99-a619-48f2-bbc2-84b148283712_resized512.png",
    "biography": "",
    "businessCardPath": "",
    "isValid": true,
    "fullName": "Aaron Sin",
    "userid": 1044715,
    "instanceid": 6751,
    "instanceUser": {
      "userid": 1044715,
      "type": "user",
      "bmstatus": 1
    },
    "bmChoicesIds": [],
    "realScore": 0,
    "totalScore": 0,
    "rank": 18,
    "allowInitiateText": 1,
    "allowReceiveText": 1,
    "allowInitiateVideoCall": 0,
    "allowAnswerVideoCall": 0,
    "allowRequestMeeting": 1,
    "allowAcceptMeeting": 1,
    "onlineStatus": 0
  },
  ...
]

Analyzing the request, it's clear that I can paginate through the entire attendee list by incrementing the offset value from 0 upwards.

Here's the parameter set in question:

{
    "expressions":[],
    "moduleid":221220,
    "searchTerm":"",
    "offset":4, // <--- increment this number starting from 0
    "apikey":"xxx-censored-xxx",
    "instanceid":6751,
    "languageid":7277
  }

So, with some trusty Python, which I barely touch these days, I wrote this up to paginate through Micepad and get the full list of attendees:

import json

import requests


def scrape_attendees():
    url = 'https://app.micepad.co/api/web2/getInstanceAttendees'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'en-US,en;q=0.5',
        'Content-Type': 'application/json',
        'Origin': 'https://micepad.co',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Referer': 'https://micepad.co/',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-site',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
    }

    payload = {
        "expressions": [],
        "moduleid": 221220,
        "searchTerm": "",
        "offset": 0,
        "apikey": "xxx-censored-xxx",
        "instanceid": 6751,
        "languageid": 7277
    }

    all_attendees = []
    total_records = None

    while True:
        response = requests.post(url, headers=headers, json=payload)
        data = response.json()

        if total_records is None:
            total_records = data['totalRecords']
            print(f"Total records: {total_records}")

        attendees = data['attendees']
        all_attendees.extend(attendees)
        print(f"Fetched {len(all_attendees)} attendees so far...")

        if len(attendees) == 0 or len(all_attendees) >= total_records:
            print(f"Fetched {len(all_attendees)} attendees in total")
            break

        payload['offset'] += 1

    print(f"Total attendees fetched: {len(all_attendees)}")

    with open('attendees.json', 'w') as f:
        json.dump(all_attendees, f, indent=2)

Now I have a list of attendees. But that isn't enough: I want to enrich each of these attendees with their LinkedIn profile URL and sort them by follower count.

Enriching Attendees With LinkedIn Profile URLs And Profile Data

It turns out that Proxycurl has the perfect tool for this. With a person's first and last name and the company they work for, there is a very good chance we'll be able to match that person to their LinkedIn profile URL via the Person Lookup Endpoint. On top of that, it can enrich the matched profile with the full dataset, all within one API request.

Given that there are 522 attendees and I don't have all day, I decided to sprint through the attendee list by making API calls to the Person Lookup Endpoint concurrently, with a worker pool of 10. (Coincidentally, while doing so, I found a bug in our API endpoint where we were returning the wrong status code when we failed to match any result. This is now being fixed, hah.)

Here's how I did it:

import json
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


def enrich_attendee(attendee):
    api_key = PROXYCURL_API_KEY  # your Proxycurl API key, defined elsewhere

    url = 'https://nubela.co/proxycurl/api/linkedin/profile/resolve'
    headers = {
        'Authorization': f'Bearer {api_key}'
    }
    params = {
        'company_domain': attendee.get('company', ''),
        'first_name': attendee.get('firstname', ''),
        'last_name': attendee.get('lastname', ''),
        'enrich_profile': 'enrich',
        'location': f"{attendee.get('city', '')} {attendee.get('country', '')}".strip(),
        'title': attendee.get('title', '')
    }

    retries = 0
    max_retries = 2
    while retries < max_retries:
        response = requests.get(url, headers=headers, params=params)
        if response.status_code == 200:
            enriched_data = response.json()
            attendee['enriched_data'] = enriched_data
            break
        elif response.status_code == 503:
            retries += 1
            if retries < max_retries:
                print(f"Received 503 error. Retrying ({retries}/{max_retries})...")
                time.sleep(2 ** retries)  # Exponential backoff
            else:
                print(f"Failed to enrich attendee after {max_retries} retries: {attendee['fullName']} - Status code: {response.status_code}")
                print(f"Params for failed request: {params}")
        else:
            print(f"Failed to enrich attendee: {attendee['fullName']} - Status code: {response.status_code}")
            break

    return attendee

def enrich_attendees():
    with open('attendees.json', 'r') as f:
        attendees = json.load(f)

    # Load existing enriched attendees (the file is newline-delimited JSON,
    # one attendee object per line)
    try:
        with open('attendees_enriched.json', 'r') as f:
            enriched_attendees = [json.loads(line) for line in f if line.strip()]
    except FileNotFoundError:
        enriched_attendees = []

    # Create a set of already enriched attendee IDs for quick lookup
    enriched_ids = set(attendee['userid'] for attendee in enriched_attendees)

    # Create a lock for thread-safe writing
    write_lock = threading.Lock()

    def enrich_and_save(attendee):
        if attendee['userid'] not in enriched_ids:
            enriched_attendee = enrich_attendee(attendee)
            with write_lock:
                with open('attendees_enriched.json', 'a') as f:
                    json.dump(enriched_attendee, f)
                    f.write('\n')
            print(f"Enriched and saved attendee: {enriched_attendee['fullName']}")
            return enriched_attendee
        else:
            print(f"Skipped already enriched attendee: {attendee['fullName']}")
            return None

    executor = ThreadPoolExecutor(max_workers=10)
    futures = []

    try:
        for attendee in attendees:
            futures.append(executor.submit(enrich_and_save, attendee))

        for future in as_completed(futures):
            enriched_attendee = future.result()
            if enriched_attendee:
                enriched_attendees.append(enriched_attendee)

    except KeyboardInterrupt:
        print("\nScript interrupted by user. Aborting remaining tasks and saving progress...")
        executor.shutdown(wait=False, cancel_futures=True)
    finally:
        executor.shutdown(wait=True)
        print(f"Enriched {len(enriched_attendees)} attendees. Results saved to attendees_enriched.json")

Quite easily done. This was completed in a few minutes.

Exporting to CSV

JSON is not my favourite format to work with. I much prefer a spreadsheet UI, and my spreadsheet of choice is Google Sheets. So I decided to export the dataset to CSV, which I can easily open with Google Sheets.

The last step was to export the data to a CSV file, which I then uploaded to Google Sheets. You can view it here.

This was how I did it:

import csv
import json


def export_to_csv():
    try:
        with open('attendees_refreshed.json', 'r') as f:
            attendees = [json.loads(line) for line in f]
    except FileNotFoundError:
        print("Error: attendees_refreshed.json not found. Please run the 'refresh' command first.")
        return

    csv_filename = 'attendees_enriched.csv'

    with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['First Name', 'Last Name', 'Latest Job Title', 'Current Company',
                      'LinkedIn Profile URL', 'Current Company LinkedIn Profile URL',
                      'Follower Count', 'Bio', 'Headline', 'Country']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        for attendee in attendees:
            enriched_data = attendee.get('enriched_data', {})
            profile_data = enriched_data.get('profile', {})
            experiences = profile_data.get('experiences', [])
            current_job = experiences[0] if experiences else {}

            writer.writerow({
                'First Name': attendee.get('firstname', ''),
                'Last Name': attendee.get('lastname', ''),
                'Latest Job Title': attendee.get('title', ''),
                'Current Company': attendee.get('company', ''),
                'LinkedIn Profile URL': enriched_data.get('url', ''),
                'Current Company LinkedIn Profile URL': current_job.get('company_linkedin_profile_url', ''),
                'Follower Count': profile_data.get('follower_count', ''),
                'Bio': profile_data.get('summary', ''),
                'Headline': profile_data.get('headline', ''),
                'Country': profile_data.get('country_full_name', '')
            })

    print(f"Exported {len(attendees)} attendees to {csv_filename}")

I Lied, I Did More

Yeah, it turns out that the Person Lookup Endpoint returns data with varying levels of freshness, and I wanted fresh profile data. So I made another pass over the attendee data via the Person Profile Endpoint with the use_cache=if-recent parameter to fetch fresh profile data.

Of course, this is completely optional, and I really did it because it was cool. It was trivial given that I already had each LinkedIn profile URL, so I will largely leave it as an exercise to the reader, but a rough sketch follows below.
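
If you want a starting point, here's a rough sketch of what that refresh pass could look like. It assumes the Person Profile Endpoint lives at https://nubela.co/proxycurl/api/v2/linkedin and accepts url and use_cache query parameters (verify against Proxycurl's docs), and it writes to attendees_refreshed.json, the file the CSV export above reads from:

import json

import requests

# Assumed endpoint and parameter names for the Person Profile Endpoint;
# double-check against Proxycurl's documentation before running.
PROFILE_ENDPOINT = 'https://nubela.co/proxycurl/api/v2/linkedin'


def refresh_attendees():
    # Read the newline-delimited enriched attendees from the earlier pass
    with open('attendees_enriched.json', 'r') as f:
        attendees = [json.loads(line) for line in f if line.strip()]

    headers = {'Authorization': f'Bearer {PROXYCURL_API_KEY}'}  # same key as earlier

    with open('attendees_refreshed.json', 'w') as out:
        for attendee in attendees:
            linkedin_url = (attendee.get('enriched_data') or {}).get('url')
            if linkedin_url:
                response = requests.get(
                    PROFILE_ENDPOINT,
                    headers=headers,
                    params={'url': linkedin_url, 'use_cache': 'if-recent'},
                )
                if response.status_code == 200:
                    # Keep the same shape the CSV exporter expects
                    attendee['enriched_data']['profile'] = response.json()
            out.write(json.dumps(attendee) + '\n')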

Enrichment Is Really Useful

I whipped this up in 2-3 hours, and I barely write code these days. Within a day, I had a list of attendees for the conference I'm attending, complete with detailed profile data.

If I can do this for an attendee list, just imagine what you can do for your business or your CRM.

Anyway, have fun! If you're attending Evolve 2024, hit up the marketing team at Proxycurl via [email protected], and we'll be happy to have a chat with you!

Steven Goh | CEO