This article does not, by any means, constitute legal advice. Views expressed are purely opinion and are not legally binding. Consult with your local legal counsel for advice specific to your project, country or application.
First things first
If you’re new to the blog or web scraping (welcome!), here’s a TLDR: A web scraper is a bot that automatically and systematically goes through a target website and extracts specific information from it. The resulting data set can then be used for a variety of applications, such as automating hiring, lead generation, or product, price and sentiment analysis.
In our previous post, we talked about why every salesperson should scrape LinkedIn.
Some cool end applications include sentiment analysis, product & price analysis, geospatial analysis, etc.
So is it legal or not?
Short answer: Yes… and no. More on that below.
As standalone entities, web scrapers and crawlers are not illegal. You could scrape your own website without any repercussions whatsoever. However, the lines blur if you scrape someone else’s website without their explicit permission or in disregard of their Terms of Service (ToS). This is where things become a little tricky. Freely extracting data from another site could be argued to be trespassing or theft.
That being said, should legal action be taken, each situation would be evaluated on a case-by-case basis. For instance, the plaintiff (website operator) would be required to prove that the defendant (scraper) has directly and primarily caused damages, e.g. loss of revenue, loss of traffic, bandwidth costs, etc. There are, of course, numerous defense approaches that the scraper may take.
It is also noteworthy that web scrapers have been gaining momentum. Large companies are starting to pay attention to them, and media coverage has shot up over the past few years. Here are two recent cases of legal action taken against web scrapers.
LinkedIn v HiQ
As of September 2019, a US federal appeals court has rejected LinkedIn’s bid to stop San Francisco company HiQ from scraping its members’ publicly available information.
The 9th US Circuit Court of Appeals ruled that scraping publicly available data was unlikely to violate the Computer Fraud and Abuse Act (CFAA), originally passed in 1986 as the federal anti-hacking law. The law imposes criminal and civil liability on anyone who accesses a computer connected to the Internet “without authorization” or who “exceeds authorized access”. However, it does not clearly define “without authorization”, and its interpretation has become increasingly murky in the modern Internet age. More expansive definitions risk criminalizing innocuous online behavior.
HiQ had been scraping publicly available data to fuel its HR tech analytics tools that could, for instance, predict when employees would jump ship. LinkedIn slapped HiQ with a cease-and-desist letter and announced that technical measures would be put in place for added deterrence. In a turn of events, HiQ fired back, filing a suit and obtaining a preliminary injunction at the district court, on the premise that it was ‘likely to not be in violation of the CFAA’. LinkedIn then appealed to the 9th US Circuit Court of Appeals. HiQ won.
Circuit Judge Marsha Berzon stated that allowing tech giants such as LinkedIn to have such control over who has access to public user data could risk the formation of ‘information monopolies’, which could harm the public interest.
“LinkedIn has no protected property interest in the data contributed by its users, as the users retain ownership over their profiles. And as to the publicly available profiles, the users quite evidently intend them to be accessed by others” - Berzon
The court also pointed out that breaching authorization would imply accessing something that was not otherwise available, whereas the default LinkedIn profile allows free access. As such, using scrapers to access the data did not constitute the kind of ‘breaking and entering’ into computers that the CFAA was meant to address.
Facebook v Power Ventures
Kicking off in 2008, the Facebook v Power Ventures lawsuit nearly mirrors the recent LinkedIn v HiQ fiasco. Both narratives involved tech titans suing smaller players for web scraping, citing a violation of the CFAA, and both eventually ended up at the US 9th Circuit Court of Appeals. The two cases share several other similarities, albeit with polar opposite outcomes.
Power Ventures served as an “all your friends in one place” platform, allowing users to access several of their social media channels such as AOL, Facebook, LinkedIn, Twitter, Myspace, etc. Features included being able to view updates and profile pages, or send messages to multiple friends across multiple websites. Part of the lawsuit’s focus was on Power Ventures’ scraping of content for and from users on Facebook and into Power Ventures’ interface. Specifically, the CFAA was said to have been broken when Power Ventures allowed users to access Facebook data after Facebook had blocked a specific IP address that Power Ventures was using to connect to that data.
Power Ventures allegedly circumvented Facebook’s technical measures to prevent scraping. While Facebook did not own the rights to its users’ profile data, it did hold copyright in the arrangement and creative design of the website. According to Facebook, Power Ventures’ scrapers operated in a manner that involved copying the entire site wholesale in order to extract user data.
In February 2012, the district court for the Northern District of California found Power Ventures liable for making unauthorized copies of Facebook’s site, among other claims. Power Ventures was subsequently ordered to pay Facebook $3 million in damages.
In 2016, the Ninth Circuit held that Power Ventures had violated the CFAA, on the grounds that it had ignored Facebook’s cease-and-desist letter and its explicit revocation of Power Ventures’ access to its system.
A Shift in Views?
Following the Ninth Circuit’s arguably poorly reasoned ruling on the Facebook v Power Ventures case, there has been a large influx of attempts to use the CFAA to threaten competitors, the most prominent example being LinkedIn v HiQ.
The claim is that in those earlier cases, authorization was generally required, with the data not being public per se, or the website operator revoking authorization or never granting it at all. However, the ways in which ‘public information’ and ‘authorized access’ can be interpreted are becoming increasingly broad. This could be one of the factors contributing to the unpredictability of such legal proceedings. For instance, a wealth of ‘private’ information is available behind a login wall on Facebook. But this information is available to every user, which makes it as good as publicly available.
On the other hand, these cases have brought about increased awareness of the issue. There has been growing sentiment that the CFAA must not be abused as a means to ‘bully’ smaller players into having limited access to public information on the web. This is especially imperative in an age where we are already beginning to lose access to information, with governments and large private corporations making opaque decisions.
Looking at past and ongoing legal proceedings, it seems that there are no black and white laws that clearly distinguish between legal and illegal. The lack of adequate precedents means that lawsuit outcomes are likely to be unpredictable, with the key variable being how the companies’ legal counsel present and argue their case.
Lastly, it seems that the two broad concerns that would incite large companies to take legal action are that the web scraper is:
- Causing them technical setbacks, leading to inefficiencies
- Threatening to eat into their revenue pie
“It’s public data anyway.”
Despite the data being public, the ‘creative arrangement’ of data can be copyrighted.
“Facts cannot be copyrighted. However, the creative selection, coordination, and arrangement of information and materials forming a database or compilation may be protected by copyright. Note, however, that the copyright protection only extends to the creative aspect, not to the facts contained in the database or compilation.” - cendi.gov
Hence, it is important to be mindful of how data is scraped. Copying a website wholesale could be considered a copyright violation.
In the US, copyrighted material is protected by the Digital Millennium Copyright Act (DMCA).
“I have no plans to publish, distribute or sell this data. It’s strictly for my personal use”
You’d still have to comply with the website’s ToS. Facebook’s policy is one example.
“This isn't any different from manually collecting data anyway.”
Well... You’re not wrong but other players might not be pleased if you’re, for instance, causing a strain on their bandwidth or taking a slice of their revenue pie. Regardless, anything that leads to a loss for them could result in a lawsuit for you, so tread with caution.
“Isn’t Google a crawler too?”
Over the decades, and with a first-mover advantage, Google has established itself as one of the titans of industry. It has amassed deep enough pockets to face the financial repercussions of web crawling. Essentially, Google is large enough to deal with the legal system.
Having read all that, here’s some general advice on how to proceed with caution:
- When in doubt, consult your lawyer before proceeding
- Use an API instead if one is available
- Asking for permission is probably a good idea
- Respect the website’s Terms of Service
- Respect their robots.txt
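Respecting a site’s robots.txt is one piece of this advice that can be checked in code before a scraper sends any request. The following is only a minimal sketch, assuming Python’s standard-library urllib.robotparser module. The robots.txt content, the “example-bot” user agent and the example.com URLs are hypothetical placeholders, not taken from any real site, and passing this check says nothing about a site’s Terms of Service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "example-bot" is a hypothetical user agent name.
public_ok = may_fetch(parser, "example-bot", "https://example.com/public/page")
private_ok = may_fetch(parser, "example-bot", "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests, or None if unset.
delay = parser.crawl_delay("example-bot")

print(public_ok, private_ok, delay)
```

In a real scraper, the same parser can load the live file by calling set_url() with the site’s robots.txt address followed by read(), and the crawl delay can be honored with a pause between requests.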
Proxycurl’s Opinion of Legality
To conclude, web scrapers are not illegal when regarded as standalone entities. Issues only arise when one ‘trespasses’ onto another’s domain to extract data without permission.
While there seems to be an increasingly favorable legal position for web scrapers, it is important to note that these types of cases are relatively new. Owing to that, there is still a huge element of unpredictability.
Typically, it seems that the two broad issues that would incite legal consequences from website operators are when:
- The scraper is causing them technical difficulties, resulting in inefficiencies.
- The scraper is eating into their revenue pie.
Of course, the scraping would have to be significant enough to appear on their radar.
Nonetheless, it would make sense to simply be respectful, as far as possible.