Preventing Web Scraping: Best Practices for Keeping Your Content Safe
Many content producers and site owners get understandably anxious at the thought of a web scraper harvesting all of their data, and wonder whether there is any technical means of stopping automated harvesting.
Unfortunately, if your website presents information in a way that a browser can access and render for the average visitor, then that same content can be scraped by a script or application.
Any content that can be viewed on a webpage can be scraped. Period.
You can try checking the headers of the requests – like User-Agent or Cookie – but those are so easily spoofed that it’s not even worth doing.
But while it may be impossible to completely prevent your content from being lifted, there are still many things you can do to make a web scraper's life difficult enough that they'll give up, or never even attempt your site at all.
Having written a book on web scraping and spent a lot of time thinking about these things, here are a few things I’ve found that a site owner can do to throw major obstacles in the way of a scraper.
1. Rate Limit Individual IP Addresses
If you’re receiving thousands of requests from a single computer, there’s a good chance that the person behind it is making automated requests to your site.
Blocking requests from computers that are making them too fast is usually one of the first measures sites will employ to stop web scrapers.
Keep in mind that some proxy services, VPNs, and corporate networks present all outbound traffic as coming from the same IP address, so you might inadvertently block lots of legitimate users who all happen to be connecting through the same address.
If a scraper has enough resources, they can circumvent this sort of protection by setting up multiple machines to run their scraper on, so that only a few requests are coming from any one machine.
Alternatively, if time allows, they may just slow their scraper down so that it waits between requests and appears to be just another user clicking links every few seconds.
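The rate-limiting idea above can be sketched as a sliding window of recent timestamps kept per IP address. The window length and request cap below are placeholder values for illustration, not recommendations:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10  # how far back we look (illustrative value)
MAX_REQUESTS = 5     # requests allowed per window (illustrative value)

_requests = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip, now=None):
    """Return True if this IP is under the limit, False if it should be throttled."""
    now = time.time() if now is None else now
    window = _requests[ip]
    # Drop timestamps that have aged out of the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```

In practice you'd do this at the web server or load balancer rather than in application code, but the logic is the same: count recent requests per address and refuse the excess.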
2. Require a Login for Access
HTTP is an inherently stateless protocol, meaning that no information is preserved from one request to the next, although most HTTP clients (like browsers) will store things like session cookies.
This means that a scraper usually doesn't need to identify itself when accessing a page on a public website. But if that page is protected by a login, then the scraper has to send identifying information along with each request (the session cookie) in order to view the content, which can be traced back to see who is doing the scraping.
This won’t stop the scraping, but will at least give you some insight into who’s performing automated access to your content.
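As a rough sketch of how that attribution works (the session tokens and usernames here are invented for illustration), every request for protected content is mapped back to an account before it's served:

```python
from collections import Counter

# Illustrative session store: in a real site these tokens would be issued
# at login and tied to accounts in a database.
SESSIONS = {"abc123": "alice", "def456": "bob"}

request_counts = Counter()  # username -> number of requests observed

def record_request(session_token):
    """Attribute a request to a logged-in user, or reject it if there's no valid session."""
    user = SESSIONS.get(session_token)
    if user is None:
        return None  # no login, no access to the protected content
    request_counts[user] += 1
    return user
```

Once every request is tied to an account like this, a scraper's unusually high request count stands out in your logs, and you know exactly whose access to suspend.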
3. Change Your Website’s HTML Regularly
Scrapers rely on finding patterns in a site’s HTML markup, and they then use those patterns as clues to help their scripts find the right data in your site’s HTML soup.
If your site’s markup changes frequently or is thoroughly inconsistent, then you might be able to frustrate the scraper enough that they give up.
This doesn't mean you need a full-blown website redesign; simply changing the id attributes in your HTML (and in the corresponding CSS files) should be enough to break most scrapers.
Note that this might end up driving your web designers insane as well.
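To see why markup changes are so effective, here's a toy extractor hard-coded to a single id attribute, the kind of pattern most scrapers depend on (the tag and id names are invented for illustration):

```python
import re

# A toy scraper hard-coded to one id attribute -- exactly the kind of
# assumption that rotating your markup is meant to break.
def extract_price(html):
    m = re.search(r'<span id="price">([^<]+)</span>', html)
    return m.group(1) if m else None

old_markup = '<span id="price">$9.99</span>'
new_markup = '<span id="p-x7f3">$9.99</span>'  # same data, rotated id
```

The same price is present in both snippets, but renaming the id silently returns nothing to the scraper, forcing its author to notice the breakage and rewrite their patterns.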
4. Embed Information Inside Media Objects
Most web scrapers assume that they’ll simply be pulling a string of text out of an HTML file.
If the content on your website is inside an image, movie, PDF, or other non-text format, then you've just added another sizable hurdle for a scraper: parsing text out of a media object.
Note that this might make your site slower to load for the average user, way less accessible for blind or otherwise disabled users, and make it a pain to update content.
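A quick illustration of why this works: most scrapers extract visible text by stripping the tags out of the HTML, and an image carries no text to strip. (The page snippets and filename below are made up.)

```python
import re

def strip_tags(html):
    # Naive text extraction of the kind most scrapers start with
    return re.sub(r"<[^>]+>", "", html).strip()

text_page = "<p>Call us: 555-0123</p>"
image_page = '<img src="/contact-number.png" alt="">'  # same info, as a picture
```

The phone number survives tag-stripping on the first page but simply isn't there on the second; the scraper would need to download the image and run OCR on it to recover the same data.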
5. Use CAPTCHAs When Necessary
CAPTCHAs are specifically designed to separate humans from computers by presenting problems that humans generally find easy, but computers have a difficult time with.
While humans tend to find the problems easy, they also tend to find them extremely annoying. CAPTCHAs can be useful, but should be used sparingly.
Maybe only show a CAPTCHA if a particular client has made dozens of requests in the past few seconds.
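The "CAPTCHA only after a burst of requests" idea might be sketched like this, reusing the sliding-window bookkeeping from the rate-limiting section (the window and threshold are arbitrary placeholder values):

```python
import time
from collections import defaultdict, deque

RECENT_WINDOW = 5        # seconds (placeholder value)
CAPTCHA_THRESHOLD = 24   # "dozens" of requests (placeholder value)

_recent = defaultdict(deque)  # ip -> timestamps of recent requests

def needs_captcha(ip, now=None):
    """Record a request and decide whether this client should see a CAPTCHA."""
    now = time.time() if now is None else now
    q = _recent[ip]
    while q and now - q[0] > RECENT_WINDOW:
        q.popleft()
    q.append(now)
    return len(q) > CAPTCHA_THRESHOLD
```

Ordinary visitors never trip the threshold and never see a challenge, while a bot hammering the site gets one almost immediately.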
6. Create “Honey Pot” Pages
Honey pots are pages that a human visitor would never visit, but a robot that’s clicking every link on a page might accidentally stumble across. Maybe the link is set to
display:none in CSS or disguised to blend in with the page’s background.
Honey pots are designed more for web crawlers – that is, bots that don’t know all of the URLs they’re going to visit ahead of time, and must simply click all the links on a site to traverse its content.
Once a particular client visits a honey pot page, you can be relatively sure they’re not a human visitor, and start throttling or blocking all requests from that client.
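A honey-pot check can be as simple as flagging any client that ever requests a trap URL; the path and status codes below are illustrative:

```python
# The trap link would be hidden from humans in the page markup, e.g.:
#   <a href="/trap" style="display:none">secret</a>
HONEYPOT_PATHS = {"/trap"}
flagged_clients = set()

def handle_request(ip, path):
    """Return an HTTP-style status: 403 for flagged clients, 200 otherwise."""
    if path in HONEYPOT_PATHS:
        flagged_clients.add(ip)
    if ip in flagged_clients:
        return 403  # block everything from a client that touched the trap
    return 200
```

Note that once a client is flagged, every subsequent request from it is refused, not just the request for the trap page itself. You'd also want to exclude trap URLs in robots.txt so that well-behaved crawlers like Googlebot don't get caught.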
7. Don’t Post the Information on Your Website
This might seem obvious, but it’s definitely an option if you’re really worried about scrapers stealing your information.
Ultimately, web scraping is just a way to automate access to a given website. If you’re fine sharing your content with anyone who visits your site, then maybe you don’t need to worry about web scrapers.
After all, Google is the largest scraper in the world and people don’t seem to mind when Google indexes their content. But if you’re worried about it “falling into the wrong hands” then maybe it shouldn’t be up there in the first place.
Any steps that you take to limit web scrapers will probably also harm the experience of the average visitor. If you're posting information on your website for the public to view, then you probably want to allow fast and easy access to it.
This is not only convenient for your visitors, it’s great for web scrapers as well.
Learn More About How Scrapers Work
This article is an expanded version of my answer on Quora to the question: What is the best way to block someone from scraping content from your website?