In a digital world, content-gobbling, intellectual-property-scraping web bots pose a tremendous risk to today’s organisations. Besides stealing intellectual property and data, bots that perform content scraping can generate so many requests that they cause a denial-of-service condition. A company can also lose revenue to aggregators and price-comparison websites, or through information leakage.
Web scraping refers to the use of software tools that harvest data from websites for a variety of purposes. If a browser can render it, it can be scraped. There are four main use cases for web scraping:
- Content scraping (lifting content from a site and posting it elsewhere without permission).
- Price comparison.
- Data monitoring (weather, stocks, etc.).
- Website change detection.
What can be done to prevent scraping of online assets? Some information that can be considered sensitive has to be public in order to be useful. Common examples include airfares, hotel rates and lists of physicians. In some cases, sites try to obfuscate their data. Using dynamic grids, AJAX and/or WebSockets to download the actual data, they aim to make it much more difficult to scrape data records.
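In practice, that kind of obfuscation often only raises the bar slightly: if a page fetches its records over AJAX as JSON, a scraper can skip the rendered HTML entirely and parse the same payload the browser receives. A minimal sketch, where the payload shape, hotel names and field names are all invented for illustration:

```python
import json

# Hypothetical JSON payload, as a dynamic page might fetch it over AJAX.
# The field names and values are illustrative, not from any real site.
ajax_payload = """
{
  "rates": [
    {"hotel": "Example Inn",   "price": 129.0},
    {"hotel": "Sample Suites", "price": 189.0}
  ]
}
"""

def extract_rates(payload: str) -> dict:
    """Parse the AJAX-style response and map each hotel name to its price."""
    data = json.loads(payload)
    return {r["hotel"]: r["price"] for r in data["rates"]}

print(extract_rates(ajax_payload))
```

The scraper never needs to execute the page’s JavaScript; it only needs to observe which endpoint the page calls, which any browser’s developer tools reveal.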
Whatever a browser can render to the screen will be retained in memory as part of a structured Document Object Model (DOM) and that content will be accessible from scripting or programming libraries.
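To illustrate how little tooling that requires, here is a minimal sketch using only Python’s standard-library HTML parser to pull prices out of markup; the markup and the `price` class name are invented for the example:

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Collect the text of every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the element's attributes.
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False

# Illustrative markup, standing in for a rendered product listing.
markup = ('<ul><li><span class="price">$19.99</span></li>'
          '<li><span class="price">$4.50</span></li></ul>')

scraper = PriceScraper()
scraper.feed(markup)
print(scraper.prices)  # ['$19.99', '$4.50']
```

Anything the browser lays out on screen is reachable this way, whether through a parser like this one or through the live DOM of an automated browser.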
Ironically, most of the commonly used scraping tools were designed for another purpose entirely: quality assurance. Selenium and similar tools are used for web application testing. They enable developers to simulate and automate user interactions and to verify the web app’s responses. But that same functionality makes it possible to use Selenium and similar tools to automate the scraping of any publicly available data. Headless or real browser clients can also be used to make bot detection even harder. These techniques help bots mimic user behaviour, pass challenges and thwart other bot-detection algorithms.
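The “mimicking user behaviour” part is often nothing more sophisticated than browser-like headers and randomised pacing. A sketch using only Python’s standard library, with illustrative User-Agent strings and an example URL (no real endpoint is assumed):

```python
import random
import time
import urllib.request

# Pool of browser-like User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def build_request(url: str) -> urllib.request.Request:
    """Build a request with browser-like headers and a rotated User-Agent."""
    return urllib.request.Request(url, headers={
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-GB,en;q=0.9",
    })

def polite_delay(base: float = 2.0, jitter: float = 3.0) -> None:
    """Sleep a randomised interval so the request timing looks human."""
    time.sleep(base + random.random() * jitter)

req = build_request("https://example.com/products")
print(req.get_header("User-agent"))
```

Detection systems that key on a single static User-Agent or on machine-regular request intervals are defeated by exactly this kind of trivial variation, which is why behavioural analysis has to look deeper.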
Scraping services are just a Google search away. Just as it has become easy to acquire DDoS-as-a-Service, it is also fast and simple to access online services for web scraping.
Abused By Bots: Scraping Stories From the Frontlines
Securing online bargains is among the most common uses for web scraping. Web-scraping tools make it relatively easy to track online prices and fire off numerous requests the moment a price drop is identified. Compared to humans, bots are far more efficient at generating requests, whether real or fake, producing many per minute. A possible result: emptying the online store inventory so grey marketers can resell the goods at a higher price.
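That speed disparity is also what makes such bots detectable. A minimal sketch of one common countermeasure, a sliding-window request counter that flags any client exceeding a per-minute threshold; the window size and threshold are illustrative values, not a recommendation:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 30  # illustrative threshold; humans rarely sustain this rate

# Per-client timestamps of recent requests.
_history = defaultdict(deque)

def is_suspicious(client_id: str, now: float = None) -> bool:
    """Record a request and report whether this client exceeds the rate limit."""
    now = time.time() if now is None else now
    q = _history[client_id]
    q.append(now)
    # Discard timestamps that have aged out of the sliding window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS

# A bot firing 40 requests in four seconds trips the limit.
for i in range(40):
    bot_flagged = is_suspicious("bot", now=float(i) * 0.1)
print(bot_flagged)  # True
```

Real bot-management products combine many such signals (rates, header fingerprints, behavioural cues), since sophisticated scrapers deliberately slow down and distribute their requests to stay under any single threshold.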
Chances are, you’ve experienced this firsthand. Think about the last time you heard about an upcoming concert. The very moment tickets were available online, you tried to buy some. Yet all the good seats were already gone! Later, you found those seats—at five to eight times the cost—on ticket-broker websites. You can thank web scraping for that.
Airlines are another common target of web scraping. In one case, bots were programmed to “scrape” certain flights, routes and classes of tickets. With the bots acting as faux buyers, continuously creating but never completing reservations on those tickets, the airline was unable to sell the seats to real customers. In essence, the airline’s inventory was held hostage, and a growing number of flights were taking off with empty seats that could have been sold.
In the UK, a name-brand website was operating from behind a paywall. The company was not concerned about web scraping until they found their entire website was being scraped and offered for free on a Chinese hosting site.
Many years ago, I helped an online store that was plagued by competitors putting $99,000 of merchandise into their shopping carts and proceeding to the checkout phase. The snag? Even though the competitors never actually completed the checkout process, the inventory appeared depleted. For real customers, everything was showing as “out of stock” and had to be back-ordered. Beyond that, this online store was finding that its competitors were visiting its site to do price comparisons. Once the scraper bots were blocked, the site’s web traffic bandwidth decreased by 66%.
With those bots out of the mix, website page speed and performance doubled. That demonstrates how much bandwidth had been serving bad bots!