Preventing Web Scraping: Best Practices for Keeping Your Content Safe
Many content producers and site owners are understandably anxious at the thought of a web scraper pulling down all of their data, and wonder whether there is any technical means of stopping automated harvesting.
Unfortunately, if your website presents information in a way that a browser can access and render for the average visitor, then that same content can be scraped by a script or application.
Any content that can be viewed on a webpage can be scraped. Period.
You can try checking the headers of incoming requests – like User-Agent or Cookie – but those are so easily spoofed that it’s not even worth doing.
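To see why header checks prove so little, consider a minimal sketch (the URL and header values here are illustrative, not from the original text): a scraper sets whatever headers it likes, and from the server's perspective the request is indistinguishable from one sent by a real browser.

```python
from urllib.request import Request

# A scraper can claim to be any browser simply by setting headers;
# the values are arbitrary strings chosen entirely by the client.
req = Request(
    "https://example.com/articles",  # hypothetical target URL
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0 Safari/537.36",
        "Cookie": "session=copied-from-a-real-browser",
    },
)

# The server sees exactly the headers the client chose to send --
# which is why filtering on User-Agent or Cookie proves nothing.
print(req.get_header("User-agent"))
```

Sending this request with any HTTP client would produce traffic that a server-side User-Agent filter waves straight through.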
You can check whether the client executes JavaScript, but bots can run that as well. Any behavior a browser performs can be replicated by a determined and skilled web scraper.
But while it may be impossible to completely prevent your content from being lifted, there are still many things you can do to make a web scraper’s life difficult enough that they give up – or never even attempt your site at all.
Having written a book on web scraping and spent a lot of time thinking about the problem, I’ve found a few things a site owner can do to throw major obstacles in a scraper’s way.