Scraping E-Commerce Sites
As retailers shift away from traditional brick-and-mortar stores towards e-commerce, there is no questioning the importance of data. Flash sales such as 10.10 and 11.11 have gained traction over the years, spurring consumers to buy, buy, buy.
In a previous article, we briefly discussed the use cases of data extracted from e-commerce behemoths such as Lazada, Shopee, and Taobao.
A look at Shopee:
From just Shopee’s homepage, one can extract valuable information such as product pricing, sales volume, and promotion strategies.
On the product page, reviews and ratings can be seen:
With the above data set, one could conduct price sensitivity, consumer behavior, and consumer sentiment studies. This could be used for your own product, competitor analysis, or market analysis. For the last two, a large data set is needed to avoid bias or skew.
A Further Illustration: Lazada Indonesia
A quick search on Lazada Indonesia reveals that there are:
- 7,301,369 Products
- 51,422 Brands
- 48,793 Sellers
- 114,167 Categories
Now imagine all that you could do if you had an index, tailored to your preferences, at your fingertips.
Pricing Intelligence
With a multitude of substitutes available on the platform, consumers are likely to be more price-sensitive than when shopping in a physical retail store. For retailers planning to pursue an online strategy, pricing analysis is therefore key. What price point optimizes sales revenue? What promotion entices a consumer to hit ‘check out’? Which competitor’s products move the most volume? This information is crucial in crafting the ideal product pricing strategy, and a clean, scaled data set is needed to support the approach.
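As a toy illustration, here is a minimal Python sketch of how scraped price and sales-volume data might be sliced to see which price bands generate the most revenue. The file name and column names are assumptions for the example, not the output of any particular scraper:

```python
import pandas as pd

# Hypothetical CSV of scraped listings: one row per product snapshot,
# with assumed columns "price" and "units_sold".
listings = pd.read_csv("scraped_listings.csv")

# Revenue per observation, then average revenue per price band.
listings["revenue"] = listings["price"] * listings["units_sold"]
by_price_band = (
    listings.groupby(pd.cut(listings["price"], bins=10))["revenue"]
    .mean()
    .sort_values(ascending=False)
)

print("Price bands ranked by average revenue:")
print(by_price_band.head())
```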
Market Research
With the correct time-series data set, one could answer questions such as: Is sales volume in a given product category growing or declining? Has consumer satisfaction increased or decreased? Aggregated sales volume, consumer reviews/ratings, and pricing data provide a bird's-eye view across a specific platform, which can feed a product analysis and determine the next steps to take in a particular product category. These findings, in turn, inform decisions on entering or exiting a category, capital investment, and long-term business strategy.
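By way of illustration, here is a minimal sketch of rolling such a time series up into monthly, category-level sales and rating trends. It assumes you already hold scraped snapshots in a CSV; the file name and column names are hypothetical:

```python
import pandas as pd

# Hypothetical CSV of scraped product snapshots; the column names
# ("date", "category", "units_sold", "rating") are assumptions.
df = pd.read_csv("scraped_snapshots.csv", parse_dates=["date"])

# Monthly sales volume and average rating per category.
monthly = (
    df.groupby(["category", pd.Grouper(key="date", freq="M")])
      .agg({"units_sold": "sum", "rating": "mean"})
)

# Month-over-month growth in sales volume within each category.
monthly["sales_growth_pct"] = (
    monthly.groupby(level="category")["units_sold"].pct_change() * 100
)
print(monthly.tail())
```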
Data Extraction Challenges
Say, for instance, that your company is planning to launch a new wireless earpiece and you wish to conduct a competitor analysis. A quick search for ‘wireless earphone’ on Shopee pulls up over 48,000 product results. Given how dynamic product pricing on Shopee is, and the scale involved, it is simply not practical to extract the data manually. The obvious path forward is to use a web scraper.
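For illustration, here is a bare-bones sketch of what such a scraper does conceptually. The URL, query parameter, and CSS selectors are placeholders; in practice, sites like Shopee render results with JavaScript and serve data through internal APIs, so a real scraper is considerably more involved:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder search endpoint; not a real shop.
SEARCH_URL = "https://example-shop.com/search"

resp = requests.get(
    SEARCH_URL,
    params={"keyword": "wireless earphone"},
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for item in soup.select("div.product-card"):        # placeholder selector
    name = item.select_one(".product-name")          # placeholder selector
    price = item.select_one(".product-price")        # placeholder selector
    if name and price:
        print(name.get_text(strip=True), price.get_text(strip=True))
```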
Two daunting challenges come with building a large-scale e-commerce scraper: handling the sheer number of requests being made, and building an intelligence layer to address the problems that those requests bring.
Volume of Requests… and Their Challenges
Should one wish to take on a large scraping project, one challenge is managing the large volume of requests (more than 20 million) being made. A large pool of IP addresses is needed in your proxy pool so that requests do not get flagged and blocked. Furthermore, the pool needs IPs spanning the globe, because e-commerce retailers have been known to display different prices to IPs in different regions. The only reliable workaround is to collect data points from multiple regions and compare them.
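To illustrate the idea, here is a minimal sketch of routing requests through region-tagged proxies so the same page can be compared across regions. The proxy addresses and product URL are placeholders, not real endpoints:

```python
import random
import requests

# Placeholder proxy pool tagged by region; the addresses are hypothetical.
PROXY_POOL = {
    "sg": ["http://user:pass@sg-proxy-1:8080", "http://user:pass@sg-proxy-2:8080"],
    "id": ["http://user:pass@id-proxy-1:8080"],
    "us": ["http://user:pass@us-proxy-1:8080"],
}

def fetch_via(region: str, url: str) -> requests.Response:
    """Fetch a URL through a randomly chosen proxy from the given region."""
    proxy = random.choice(PROXY_POOL[region])
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )

# Compare what different regions see for the same product page.
for region in PROXY_POOL:
    resp = fetch_via(region, "https://example-shop.com/product/12345")
    print(region, resp.status_code)
```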
Did you know: Proxycurl crawls the web with 150,000 unique residential IP addresses located around the world.
One responsibility that comes with a large proxy pool is dedicating the time and resources to manage it. This is no small feat: it is not uncommon for developers and data scientists to spend more time managing proxies and troubleshooting data quality issues than analyzing the extracted data.
Hence, to scrape the web at scale, it is crucial to implement a robust intelligence layer to support your proxy management. Here is a non-exhaustive list of capabilities of a sophisticated, automated proxy management layer (a minimal sketch follows the list):
- Ban identification: Some examples include rate limiters, captchas, blocks, and redirects. An intelligent proxy solution would be able to detect, troubleshoot and fix numerous types of bans.
- Retry errors: Should your proxies experience errors, bans, blocks, etc., the request needs to be retried with a different proxy.
- Add delays: Randomized delays can be added automatically so that requests do not look like they come from a bot, circumventing anti-scraping measures.
- Geographical targeting: Sometimes you need to configure proxies from specific geographical regions to be used on certain websites.
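To make the list concrete, here is a minimal, hypothetical sketch of a retry loop that rotates proxies, applies crude ban detection, and adds randomized delays. The proxy addresses and ban heuristics are placeholders and do not reflect any particular provider's implementation:

```python
import random
import time
from typing import Optional

import requests

# Placeholder proxy addresses; a real pool is far larger and managed dynamically.
PROXIES = ["http://proxy-1:8080", "http://proxy-2:8080", "http://proxy-3:8080"]

def looks_banned(resp: requests.Response) -> bool:
    """Crude ban detection: rate-limit/forbidden status codes or a captcha page."""
    return resp.status_code in (403, 429) or "captcha" in resp.text.lower()

def fetch(url: str, max_attempts: int = 5) -> Optional[requests.Response]:
    for _ in range(max_attempts):
        proxy = random.choice(PROXIES)              # retry with a different proxy
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                headers={"User-Agent": "Mozilla/5.0"},
                timeout=10,
            )
            if not looks_banned(resp):
                return resp
        except requests.RequestException:
            pass                                    # network error: fall through and retry
        time.sleep(random.uniform(2, 8))            # randomized delay between attempts
    return None
```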
On top of the above, you would also need to create and manage a database that tracks the various issues across the websites being scraped, which is no easy task.
Conclusion
All in all, there’s really no disputing the importance of data in today’s digital age. Needless to say, retailers recognize the increasing value of data-driven insights. Web scraping tools are one viable method of capturing ever-changing price information quickly and at scale. However, building, maintaining, and managing an ongoing large-scale scraping project is no small feat. This begs the question: should you build your own scraper or have it outsourced? (Here are some options for non-programmers).
Questions on your next steps? Feel free to chat with us at hello@proxycurl.com
Or jump right in with an API trial here!
Good luck!