LinkedIn profiles have become a powerful source of data on individuals. Different parties put them to different uses, such as LinkedIn data mining, profile research, or lead generation.
There are many methods out there to procure the information we need. In this tutorial, we’ll be using one of the simpler ways to obtain the data: using Python to simulate a Google search and collect the URLs returned in the search results. Further processing can then be done, either manually or via automation, as we discuss in our other tutorial, How to Build LinkedIn Automation Tools with Python With a Code Example.
The main reason for this approach is to get familiar with one of the most common methods of web scraping: browser-based scraping. For those unfamiliar, this is where we simulate human behaviour on the web by driving a real browser. We will utilize a tool called Selenium, an open-source testing framework for web applications that lets us launch a browser from a script. Its usage goes far beyond testing, however, as we will soon demonstrate.
Before we start, we need to set up the project and install its dependencies:
Let’s simulate a Google search. We will be making use of one of the lesser-known Google search features: Google search operators, also referred to as advanced operators. These are special characters and commands that apply stricter criteria to our search term, hence narrowing down our search results. You can read here for a comprehensive list of operators.
def create_search_url(title, location, *include):
    base_url = "http://www.google.com/search?q=+-intitle:%22profiles%22+site:linkedin.com/in/+OR+site:linkedin.com/pub/"
    quote = lambda x: "%22" + x + "%22"
    result = base_url
    result += quote(title) + "+" + quote(location)
    for word in include:
        result += "+" + quote(word)
    return result
We first start by defining our base URL. As you can see, some search operators are already present. The -intitle operator, with its leading minus sign, tells the engine to exclude pages with the word 'profiles' in the title tag, which filters out LinkedIn's profile directory pages. We also use the site: operator to ensure we only get pages with LinkedIn profile URLs: default profile URLs that start with linkedin.com/pub/ or personalised URLs that start with linkedin.com/in/, combined using the OR operator.
Then, we narrow our search even more by using double quotation marks for exact matching. Since we can’t put literal double quotes straight into the URL, we either use the escape character (“\”) before the quotes in a Python string or, as here, use the URL-encoded equivalent, %22. We place the desired title/position and the country passed as arguments inside the quotes. To make our function less rigid, we also allow additional arguments if you want to specify more criteria: anything passed as include will be added for exact matching.
You can test this function yourself by printing out the result and pasting that result in your browser search bar. You can actually generate more specific Google search URLs by using free online tools like Recruit’em XRay Search.
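As a quick sanity check, calling the function might look like this (the function is repeated here so the snippet runs on its own; the search terms are just example inputs):

```python
def create_search_url(title, location, *include):
    # same function as above, repeated so this snippet is self-contained
    base_url = "http://www.google.com/search?q=+-intitle:%22profiles%22+site:linkedin.com/in/+OR+site:linkedin.com/pub/"
    quote = lambda x: "%22" + x + "%22"
    result = base_url + quote(title) + "+" + quote(location)
    for word in include:
        result += "+" + quote(word)
    return result

url = create_search_url("software engineer", "singapore", "python")
print(url)
# the URL ends with: %22software engineer%22+%22singapore%22+%22python%22
```

Pasting the printed URL into a browser should show only LinkedIn profile pages matching all three quoted terms.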
Now we will make use of Selenium’s WebDriver to spin up our own automated browser. We can use any of the browsers supported by Selenium (Google Chrome, Mozilla Firefox, Opera, etc.), but for now, we will use Chrome. Here we can also specify whether to run the browser in headless mode, which is to say without the browser window popping up (no GUI), by adding options.add_argument('headless') before creating the driver.
import time
from selenium import webdriver

max_page = 5
all_urls = []

options = webdriver.ChromeOptions()
# options.add_argument('headless')
# specifies the path to the chromedriver.exe
driver = webdriver.Chrome(options=options)

# always start from page 1
page = 1
driver.get(create_search_url("software engineer", "singapore", "developer", "nubela"))

while True:
    time.sleep(3)
    # find the urls
    urls = driver.find_elements_by_class_name('r')
    urls = [url.find_element_by_tag_name('a') for url in urls]
    urls = [url.get_attribute("href") for url in urls]
    all_urls = all_urls + urls

    # move to the next page
    page += 1
    if page > max_page:
        print('\n end at page: ' + str(page - 1))
        break
    try:
        next_page = driver.find_element_by_css_selector("a[aria-label='Page " + str(page) + "']")
        next_page.click()
    except:
        print('\n end at page: ' + str(page - 1) + ' (last page)')
        break

print(all_urls)
We then navigate to the URL returned by the previous function using
driver.get(). This will open the page in the browser. Then it's time to get our hands on the profile URLs on this page of Google results. We can use the browser's 'Inspect' developer tool to see the HTML tags of the page links. Every link on the page sits inside an HTML div with the class name r, and, as per convention, the hyperlink, i.e. the profile URL, is embedded in the href attribute of the a tag inside that div.
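To make that structure concrete, here is a small standalone sketch of extracting href values from markup shaped like those result entries, using Python's built-in html.parser instead of Selenium; the sample HTML below is invented for illustration:

```python
from html.parser import HTMLParser

# invented sample mimicking the div.r > a structure described above
SAMPLE = """
<div class="r"><a href="https://www.linkedin.com/in/jane-doe">Jane Doe - LinkedIn</a></div>
<div class="r"><a href="https://www.linkedin.com/pub/john-doe">John Doe - LinkedIn</a></div>
"""

class ResultLinkParser(HTMLParser):
    """Collects href attributes of <a> tags nested in <div class="r">."""
    def __init__(self):
        super().__init__()
        self.in_result_div = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "r":
            self.in_result_div = True
        elif tag == "a" and self.in_result_div and "href" in attrs:
            self.urls.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_result_div = False

parser = ResultLinkParser()
parser.feed(SAMPLE)
print(parser.urls)
# → ['https://www.linkedin.com/in/jane-doe', 'https://www.linkedin.com/pub/john-doe']
```

The Selenium code above does the same thing, but against the live page rendered in the browser.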
Every time we finish scraping the Google page for URLs, we navigate to the next page by finding the button using
driver.find_element_by_css_selector() and simulating a click action. We also keep a page counter and a maximum page number, which you can set to any reasonable value. If the next page doesn't exist, we stop crawling and return our results.
You have now obtained the LinkedIn profile URLs that you need. Easy, right? All you have to do now is gather the data available on these profiles. Unfortunately, bulk web scraping of LinkedIn itself won’t be this easy, because LinkedIn has measures in place to prevent it. Read our other article, Why You Shouldn’t Use LinkedIn Automation Tools using YOUR OWN Account, to understand why even some LinkedIn automation tools can endanger your LinkedIn account.
Good news: you can now scrape LinkedIn profiles WITHOUT risking your account with Proxycurl! You can read the introduction to Proxycurl's LinkedIn API here. With Proxycurl as a LinkedIn profile API, you can feed in your generated profile URLs and get all the data you need! We also have another tutorial on scraping LinkedIn with Python using Proxycurl, How to Build LinkedIn Automation Tools with Python With a Code Example, to complete the data mining process.
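As a rough sketch of how that hand-off might look, each collected profile URL would be passed to the Proxycurl API in an authenticated GET request. Note that the endpoint path, parameter name, and header format below are assumptions for illustration; check Proxycurl's official API documentation for the current details:

```python
# Sketch only: endpoint path, parameter name, and header format are
# assumptions -- consult Proxycurl's official API docs before use.
import urllib.parse

API_ENDPOINT = "https://nubela.co/proxycurl/api/v2/linkedin"  # assumed endpoint
API_KEY = "YOUR_PROXYCURL_API_KEY"  # placeholder, not a real key

def build_profile_request(profile_url):
    """Builds the (request_url, headers) pair for one profile lookup."""
    params = urllib.parse.urlencode({"url": profile_url})
    headers = {"Authorization": "Bearer " + API_KEY}
    return API_ENDPOINT + "?" + params, headers

request_url, headers = build_profile_request("https://www.linkedin.com/in/jane-doe")
print(request_url)
```

Sending that request with a library like requests (and a real API key) would return the structured profile data for the URL.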