Regular Expressions & the Proxycurl Search API

📢
As of 1 March 2024, the Proxycurl Search API no longer supports regex; we've switched to Boolean search instead.
Check out the announcement post here.
That said, this is still a great guide for learning regex.

Welcome to the definitive guide to regular expressions (regex) & the Proxycurl Search API!

This guide is aimed at developers of all skill levels. Perhaps you've been using boundary conditions for years, but you're here to learn our specific syntax for invoking case-insensitive mode. Or maybe you're good with capture groups, but "negative lookahead" is a mystery to you. Whatever your skill level, there'll be something that catches your fancy, and we've put everything underneath its own top-level heading. We encourage you to make use of the Table of Contents (TOC) of this article.

Ready to get started?

Guarantee a field is non-empty

One of the earliest customer questions we received was, "How can I make sure that a field is non-empty?" You'll notice that filling in .* doesn't work because .* means "0 or more characters," which an empty field also satisfies. But its cousin .+ (1 or more characters) does! In fact, we don't even need the + quantifier: since a pattern only has to match somewhere in the field, a single . (any character) will suffice.

We can specify the following:

params = {
    'current_company_linkedin_profile_url': 'https://www.linkedin.com/company/apple',
    'linkedin_groups': '.',
}

This snippet finds Apple employees who are members of at least one LinkedIn group.
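To see the difference between .*, .+ and . for yourself, here is a minimal local sketch. It uses Python's built-in re module as a stand-in for Search (the engine behind the API may behave differently in edge cases), and the helper name field_matches and the group name are made up for illustration.

```python
import re

def field_matches(pattern: str, value: str) -> bool:
    """Return True if the pattern matches anywhere in the field value."""
    return re.search(pattern, value) is not None

empty_field = ""
groups_field = "Apple Alumni Network"  # hypothetical group name

# '.*' accepts zero characters, so even an empty field passes.
star_on_empty = field_matches(".*", empty_field)

# '.' and '.+' both need at least one character to be present.
dot_on_empty = field_matches(".", empty_field)
plus_on_empty = field_matches(".+", empty_field)
dot_on_filled = field_matches(".", groups_field)

print(star_on_empty, dot_on_empty, plus_on_empty, dot_on_filled)
```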

Search for multiple phrases at a time

We can search for multiple phrases at a time using something called "alternation" but better known as "the pipe character" (this: |). With this character, you can search for multiple phrases, companies, cities, etc., at the same time. Depending on whether you want to "cover" the entire field with your phrase, you may want to also add anchors and a capture group to your pattern. Check out this example:

'current_company_name': '^(Amazon|Apple)$',
'current_job_description': '(?i)software engineer|\\bswe\\b'

We're just demonstrating how to use regular expressions in your Search requests, so let's ignore that this isn't the optimal way to search for software engineers (it would be better to use current_role_title). There are two different types of lookups here:

  • With current_company_name, we want to exact-match the entire field. So we've alternated (aka used the | character) Amazon and Apple inside of a capture group, with ^ and $ on each side.
  • With current_job_description, we want to match part of the field. So we've alternated software engineer and swe without the anchors on each side. Since swe is a very short phrase that can appear inside of other words, we still add \b to each side of it (the \ is escaped for Python).
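The two lookups above can be tried locally before you spend any credits. This is only a sketch using Python's re module as a stand-in for Search; the company names and job descriptions are invented sample data.

```python
import re

# Exact match on the whole field: anchors plus a group around the alternation.
company_pattern = re.compile(r"^(Amazon|Apple)$")
# Partial match: no anchors, but \b keeps "swe" from matching inside words.
description_pattern = re.compile(r"(?i)software engineer|\bswe\b")

companies = ["Apple", "Amazon", "Apple Music", "Amazon Web Services"]
exact_companies = [c for c in companies if company_pattern.search(c)]

descriptions = [
    "Senior Software Engineer on the payments team",
    "SWE intern",
    "I answered customer questions",  # "answered" contains the letters "swe"
]
matched_descriptions = [d for d in descriptions if description_pattern.search(d)]

print(exact_companies)
print(matched_descriptions)
```

Without the group, ^Amazon|Apple$ would mean "starts with Amazon" or "ends with Apple," which is why the parentheses matter in the anchored version.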

Whole words only (boundary condition)

This gotcha is something that I fell for when looking up LLMs, and you can see it in the previous section with SWEs as well.

  • Avoid this: |LLMs?|
  • Good: |\\bLLMs?\\b|

Here, the \b is a "boundary condition," which is an example of a "0-width assertion" (remember, in Python, you need to escape each backslash character once, so every regex escape sequence looks like \\). By "0-width," we mean that it doesn't consume any characters in the target match; instead, it "asserts" that the boundary (thus, "boundary condition") of a word is here.
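Here is a small sketch of that gotcha, again with Python's re module standing in for Search. We've assumed a case-insensitive search and invented the sample titles; words like "Enrollment" happen to contain the letters "llm."

```python
import re

titles = [
    "LLM researcher",
    "Enrollment advisor",      # en-ro-LLM-ent
    "Builds LLMs for search",
    "Fulfillment lead",        # fulfi-LLM-ent
]

loose = re.compile(r"(?i)LLMs?")
strict = re.compile(r"(?i)\bLLMs?\b")

loose_hits = [t for t in titles if loose.search(t)]
strict_hits = [t for t in titles if strict.search(t)]

# \b is zero-width: it checks a position but adds nothing to the match.
matched_text = strict.search("Builds LLMs for search").group(0)

print(loose_hits)
print(strict_hits)
print(matched_text)
```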

Be careful when searching for positions like CTO

While a query for "CTO" might succeed, it may succeed a little...too much. You may be bringing up several people who hold or held titles like The Assistant to the CTO. People in positions like this tend to be indispensable to their organizations; however, they're probably not who you were searching for. You may want to include anchors (^ and $) when searching for such positions.
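As a rough illustration (Python's re module as a local stand-in, invented titles), here is what the anchors change. Note the trade-off: an anchored pattern also drops titles such as "CTO & Co-founder," so decide which side of that line you want to be on.

```python
import re

titles = [
    "CTO",
    "The Assistant to the CTO",
    "CTO & Co-founder",
]

unanchored = re.compile(r"CTO")
anchored = re.compile(r"^CTO$")

unanchored_hits = [t for t in titles if unanchored.search(t)]
anchored_hits = [t for t in titles if anchored.search(t)]

print(unanchored_hits)
print(anchored_hits)
```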

Start and end of a phrase

We figure that usually, you're looking for one or more words "somewhere in the middle of" (to use a technical term) the entire target text. You can imagine that your text is surrounded by .* on each side: If you give us kittens, we'll give you a field that exact-matches .*kittens.*. (To be clear, this isn't literally what's going on; we're using our database. But it's a helpful mental model.)
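If it helps, the mental model can be checked locally. In this sketch (Python's re module, made-up field values), searching for kittens anywhere in a field gives the same answer as exact-matching .*kittens.* against the whole field.

```python
import re

fields = ["kittens", "I love kittens and puppies", "puppies only"]

# "Somewhere in the middle": a partial match anywhere in the field.
partial = [re.search(r"kittens", f) is not None for f in fields]

# The mental model: pad the phrase with .* and match the entire field.
padded = [re.fullmatch(r".*kittens.*", f) is not None for f in fields]

print(partial)
print(padded)
```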

Case sensitivity

Search is implemented differently from the rest of the Proxycurl API, so case sensitivity behaves differently in Search than it does in our other endpoints. If you're ever unsure which behavior applies, you can always check our documentation.

  • Non-search regex: Always ignores case (case insensitive).
  • Search regex: By default, respects case (case sensitive); you can add the (?i) flag to make it ignore case (case insensitive).

Here are some examples of why you might want to use a Search keyword with or without the (?i) flag:

  • 'current_role_title': '(?i)(software engineer|\\bswe\\b)',: This ignores case (case insensitive) so that you get people who capitalized only one letter, put the entire thing in all-caps, etc. Notice the (?i) flag at the beginning.
  • 'current_company_name': '^(Amazon|Apple)$': This respects case (case sensitive). We want people only from one of these two companies. (A more precise way to do this search would be to break it up into two separate queries and use the parameter current_company_linkedin_profile_url, but see the section on combining fields with OR for why you might prefer this option.)
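A quick local comparison of the two modes, sketched with Python's re module (which also understands the (?i) flag) and a few invented role titles:

```python
import re

titles = ["Software Engineer", "SOFTWARE ENGINEER", "software engineer", "Swe"]

case_sensitive = re.compile(r"(software engineer|\bswe\b)")
case_insensitive = re.compile(r"(?i)(software engineer|\bswe\b)")

sensitive_hits = [t for t in titles if case_sensitive.search(t)]
insensitive_hits = [t for t in titles if case_insensitive.search(t)]

print(sensitive_hits)
print(insensitive_hits)
```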

Negate a phrase with negative lookahead

All of the following examples start with a ^ followed by the syntax for "negative lookahead." That means they're asserting something is true at the start of the regular expression. What we assert depends on our use case.

The following code ensures that the pattern .*San Francisco cannot match from the start of the city field; in other words, San Francisco does not appear anywhere in the city field:

'city': '^(?!.*San Francisco)'

If you know for a fact that the phrase you're trying to remove would appear at the start of the field, should it appear at all (in other words, something like aaaaaaaaaaSan Francisco is not in your data set), then we can make our query a bit more performant and omit the .*, like this:

'city': '^(?!San Francisco)'

To summarize:

  1. 'city': '^(?!.*San Francisco)': Anyone from San Francisco will not be returned to you, nor will anyone from aaaaaaaaaaSan Francisco.
  2. 'city': '^(?!San Francisco)': Anyone from San Francisco will not be returned to you, but someone from aaaaaaaaaaSan Francisco could be.
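The difference between the two variants is easy to see on a tiny sample. This sketch uses Python's re module as a local stand-in and swaps aaaaaaaaaaSan Francisco for a real-looking value, South San Francisco.

```python
import re

cities = ["San Francisco", "South San Francisco", "Oakland"]

# Variant 1: the phrase may not appear anywhere in the field.
anywhere = re.compile(r"^(?!.*San Francisco)")
# Variant 2: the phrase may not appear at the very start of the field.
at_start = re.compile(r"^(?!San Francisco)")

kept_anywhere = [c for c in cities if anywhere.search(c)]
kept_at_start = [c for c in cities if at_start.search(c)]

print(kept_anywhere)
print(kept_at_start)
```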

What if you want both to negate & match at the same time?

Great question. Lookahead cannot happen retroactively. So if you're using a negative lookahead, you have to put it first. And, remember what we said about automatically surrounding your phrase with .*? We did two things in this lookahead:

  1. Anchor to the start of the field with ^.
  2. Specify the negative lookahead with (?!).

Step 1 was necessary because otherwise, literally everything would match - after all, the last character of a field doesn't have San Francisco after it!

So, we no longer have the surrounding .* at the start, and you will have to specify it yourself. Thus, if you want to (hypothetically) search for a city that isn't San Francisco and does have the string ran somewhere - anywhere - in its name, you will have to write this:

'city': '^(?!.*San Francisco).*ran'
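Here is a local sketch of both points: why the ^ anchor is needed, and how the combined pattern behaves. It relies on Python's re module rather than Search itself, and the city list is invented (conveniently, "San Francisco" itself contains "ran").

```python
import re

cities = ["San Francisco", "South San Francisco", "Grand Rapids", "Orange", "Oakland"]

# Without ^, the lookahead can succeed one character in ("an Francisco"),
# so even "San Francisco" counts as a match.
unanchored = re.compile(r"(?!.*San Francisco)")
unanchored_matches_sf = unanchored.search("San Francisco") is not None

# Exclude "San Francisco" anywhere, and require "ran" somewhere.
exclude_and_include = re.compile(r"^(?!.*San Francisco).*ran")
kept = [c for c in cities if exclude_and_include.search(c)]

# Dropping the .* inside the lookahead only excludes fields that START with the phrase.
start_only = re.compile(r"^(?!San Francisco).*ran")
kept_start_only = [c for c in cities if start_only.search(c)]

print(unanchored_matches_sf)
print(kept)
print(kept_start_only)
```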

Gosh, that was a lot

It turns out you don't really have to understand negative lookahead to use it. All you have to do is copy the three examples:

  • The excluded text could be anywhere in the field -> 'city': '^(?!.*San Francisco)'
  • The excluded text is guaranteed to be at the start of the field -> 'city': '^(?!San Francisco)'
  • Exclude & include text -> 'city': '^(?!.*San Francisco).*ran'

In each case, replace San Francisco (and possibly ran) with the text you want.

Remember your escape characters

Remember to escape if you are trying to use a backslash character in Python (or any other programming language in which \ is reserved). You will not get a syntax error in Python for typing this:

    'current_role_title': '(?i)\bswe\b',

However, in Python, \b means "backspace." You must escape the backslash (\) if you wish to use a boundary condition (regex \b) here.

The following code will give you the correct result:

    'current_role_title': '(?i)\\bswe\\b',

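Note: In Python, another option is to use a raw string (r'...'), which leaves backslashes untouched. Don't try using re.escape: (a) it will escape your ? if you are doing a case-insensitive search ((?i)), and (b) it escapes the backslash in \b, so the boundary condition turns into literal text.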
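You can confirm all of this in a Python shell. The sketch below compares the three spellings and shows what re.escape does to the pattern; the sample title is made up.

```python
import re

plain = '(?i)\bswe\b'       # Python turns each \b into a backspace (\x08)
escaped = '(?i)\\bswe\\b'   # doubled backslashes survive as regex \b
raw = r'(?i)\bswe\b'        # a raw string leaves backslashes alone

same_pattern = escaped == raw
has_backspace = '\x08' in plain

title = "SWE intern"
plain_matches = re.search(plain, title) is not None
raw_matches = re.search(raw, title) is not None

# re.escape treats the whole thing as literal text, flag and boundary included.
over_escaped = re.escape(raw)
over_escaped_matches = re.search(over_escaped, title) is not None

print(same_pattern, has_backspace, plain_matches, raw_matches, over_escaped_matches)
```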

Combining fields - using cross-field OR efficiently

Search fields are always combined with AND. This behavior is ideal for our customers because it makes Search easy to use. Sometimes, though, you might want an OR between fields. You can do this easily by performing multiple search queries in your application logic and concatenating the results. But what if there are three or more fields that you want to OR simultaneously? You would have to pairwise concatenate these, leading to exponential growth in the number of queries that you perform. It turns out that there are some strategies to circumvent this issue:

  • Use a proxy field that allows for alternation. For example, if you wanted to use the Person Search Endpoint parameter current_company_linkedin_profile_url, which only accepts a single URL, you could instead use current_company_name with an alternation of company names. Keep the regex case-sensitive, anchor it with ^ and $, and you will get very few (if any) false positives.
  • Choose only a subset of your choices to query. For example, you could start out with a preliminary analysis by querying only one-fourth of the data points and then move on to the rest.
  • Do a sanity check to see if all possible intersections even contain any data. For example, it may turn out that 75% of them are empty, and choosing a subset of fields is the same as querying the complete result set.
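For the basic "run several queries and concatenate" approach, a sketch might look like the following. Everything here is hypothetical: run_search stands for whatever function you use to call the API and return profile URLs, and fake_search is an in-memory stand-in with invented profiles so the example runs without network access.

```python
import re
from typing import Callable, Dict, List

Params = Dict[str, str]

def search_any(run_search: Callable[[Params], List[str]],
               base_params: Params,
               or_fields: Params) -> List[str]:
    """OR across fields: one query per field, merged and de-duplicated."""
    seen = set()
    merged = []
    for field, pattern in or_fields.items():
        # Each query still ANDs the shared base parameters with one extra field.
        for url in run_search({**base_params, field: pattern}):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical in-memory data standing in for real search results.
PROFILES = [
    {"url": "p/1", "city": "Seattle", "current_role_title": "CTO"},
    {"url": "p/2", "city": "Austin", "current_role_title": "Engineer"},
    {"url": "p/3", "city": "Seattle", "current_role_title": "Designer"},
    {"url": "p/4", "city": "Austin", "current_role_title": "Designer"},
]

def fake_search(params: Params) -> List[str]:
    """Mimic AND semantics: every field's pattern must match."""
    return [
        p["url"] for p in PROFILES
        if all(re.search(pat, p.get(field, "")) for field, pat in params.items())
    ]

results = search_any(
    fake_search,
    base_params={},
    or_fields={"city": "^Seattle$", "current_role_title": "^(CTO|Engineer)$"},
)
print(results)
```

De-duplicating on a stable key (here the profile URL) matters because a profile that satisfies more than one branch of the OR would otherwise show up once per query.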

The biggest LPT of all

If you're about to run a search for 20,000 records, step individually through the first 100 or even 500! This investment of time will make you aware of some improvements that you could add to the script. Maybe you're surprised to see fields matching "Director," and then you realize, "Ahhh, we should have added a \b around CTO." Or maybe you're seeing something missing - perhaps you left out an (?i).
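One way to make that manual pass quicker is to print which part of each field actually triggered the match. The helper below is a hypothetical sketch (Python's re module, invented titles), not part of the API; it shows how "Director" sneaks into a case-insensitive search for CTO.

```python
import re
from typing import List, Optional, Tuple

def explain_matches(pattern: str, values: List[str],
                    limit: int = 100) -> List[Tuple[str, Optional[str]]]:
    """Pair each of the first `limit` values with the text that matched (or None)."""
    compiled = re.compile(pattern)
    rows = []
    for value in values[:limit]:
        match = compiled.search(value)
        rows.append((value, match.group(0) if match else None))
    return rows

sample_titles = ["CTO", "Director of Sales", "Cto and co-founder", "Doctor"]

loose_report = explain_matches(r"(?i)cto", sample_titles)
fixed_report = explain_matches(r"(?i)\bcto\b", sample_titles)

for title, hit in loose_report:
    print(f"{title!r} matched on {hit!r}")
```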

We've given you a lot of power with Search, and with great power comes a great need to manually verify that things are working, which is a form of great responsibility. Also, it can be kind of fun to click a bunch of LinkedIn profiles and see how they match the conditions you told the API to use and figure out whether or not these are the same conditions you wanted. (I've spent a lot of time doing this myself now.)

As always, please let us know if you have any tips or questions you'd like to see added to this post. We're picturing this article as a living document, and we'd love to add your contribution. Reach out at hello@nubela.co & let us know what you've got. We can't wait to hear from you!