Hartley Brody

Web Scraping Boilerplate: Everything You Need to Start Your New Python Scraping Project (Batteries Included)

Over the years I’ve worked on dozens of web scraping projects big and small. While every scrape has its own specific needs in terms of finding and extracting data, I’ve found that there’s lots of generic logic that ends up being pretty common across scrapes.

I always found myself digging through older projects to find bits of code I had come up with to solve generic scraping challenges like:

  • using a database to store scraped data
  • helpers for making requests and handling network errors
  • getting python packages installed (requests, beautiful soup)
  • setting up redis to manage a queue of work
  • rotating proxies and detecting ones that aren’t working
  • keeping track of which data was collected when
  • managing changes to the database model over time

That’s why I finally decided to create a web scraping boilerplate project.

I love the idea of boilerplate projects (I wrote one for flask web apps last year). It keeps all of those common bits of code stored and organized in one place, where they’re easy to reuse between projects. No more reinventing the wheel – it helps me get up to speed with new scraping projects more quickly, and I hope it will help you as well!

The code is available for free on github.

While the project’s readme.md has information about how to set things up locally on your computer, I thought I’d add a bit more detail about how to use the code and why I set things up the way I did.

HTTP Request Logic

Every site that you scrape will have its own patterns of requests that you’ll need to make, as well as patterns in the markup that you’ll need to hunt through to get the data you’re hoping to collect. But while that logic will vary between sites and projects, there are some common tasks that any web scraping program will have to perform.

This is what’s in the utils.py file.

The handiest function is make_request. We’re using kennethreitz’s famous requests library to handle the underlying HTTP requests, but we’re also adding a bit more logic on top. This includes:

  • handling network exceptions (intermittent failures)
  • setting a timeout on requests so they don’t “hang” forever
  • picking a proxy to use for the request
  • detecting whether the target website has blocked this proxy
  • returning the response content to our calling code
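To make those steps concrete, here is a minimal sketch of what a function like this could look like. It is not the code from utils.py: the constants, the blocked-status check, and the session and retry_delay parameters are all assumptions made for illustration, and a real site may signal a block in a different way (a CAPTCHA page, for example).

```python
import time

import requests

# Hypothetical defaults; the boilerplate's real values may differ.
REQUEST_TIMEOUT = 10               # seconds before giving up on a slow response
MAX_ATTEMPTS = 3                   # how many times to retry intermittent failures
BLOCKED_STATUS_CODES = {403, 429}  # assumed signals that the proxy is blocked


def make_request(url, proxy=None, session=None, retry_delay=1.0):
    """Fetch `url` and return the response body as text, or None on failure.

    `proxy` is an optional "host:port" string. `session` can be any object
    with a requests-style `get` method (defaults to the `requests` module).
    """
    session = session or requests
    proxies = None
    if proxy:
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}

    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response = session.get(url, proxies=proxies, timeout=REQUEST_TIMEOUT)
        except requests.exceptions.RequestException as exc:
            # Covers timeouts, connection errors, and proxy errors.
            print(f"attempt {attempt} for {url} failed: {exc}")
            time.sleep(retry_delay)
            continue

        if response.status_code in BLOCKED_STATUS_CODES:
            # Retrying through the same proxy is unlikely to help.
            print(f"proxy {proxy} looks blocked ({response.status_code})")
            return None

        return response.text

    return None
```

Returning None on failure keeps the calling code simple: it can check for a missing response and decide whether to put the URL back on the queue.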

There are also a few other functions in there related to choosing a proxy to use for the request.

I’ve written more about using proxies for web scraping before, but the gist is that we want to read in a big list of IP addresses and ports, and then go down the list for each request, rotating back to the top once we get to the end. This ensures even rotation of requests through each proxy server, so that none appears to make a suspicious number of requests to your target website.
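The rotation described above can be sketched in a few lines with itertools.cycle. The names load_proxies and ProxyRotator are hypothetical, not functions from utils.py, and this version makes no attempt to drop proxies that have stopped working.

```python
import itertools


def load_proxies(lines):
    """Parse "host:port" entries, skipping blank lines and # comments."""
    proxies = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            proxies.append(line)
    return proxies


class ProxyRotator:
    """Hand out proxies in order, wrapping back to the top at the end."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)


rotator = ProxyRotator(load_proxies(["10.0.0.1:8080", "", "10.0.0.2:8080"]))
print([rotator.next_proxy() for _ in range(3)])
# ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.1:8080']
```

In practice the list would come from a file (load_proxies accepts an open file handle as well as a list), and each request would call next_proxy() once.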

Storing Data in a Relational Database

The next chunk of logic that gets handled for us is data storage. While some scrapers simply download the full contents of every web page to parse through them later, I usually find it simpler to capture the data as we go.

For those who are new to relational databases, imagine it like a spreadsheet with rows and columns. Each record is a new row that we’ll add to the table, but first we need to define all of the columns and what type of data should go in each.

We define the records we want to store in the models.py file. In there is a BaseMixin class to impart some generic logic to all of our models, like giving them an ID attribute, plus first_seen_at and last_seen_at timestamps.

Any record type that you want to store should subclass both the base class returned by sqlalchemy’s declarative_base() and the BaseMixin, define a __tablename__ attribute, and then define any other attributes that you want to store on your model.
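As a rough illustration of that structure, here is a sketch of a mixin and one model built on it. The Team model and its columns are hypothetical (they are not in models.py), the timestamp defaults are an assumption about how BaseMixin might work, and the import path assumes SQLAlchemy 1.4 or newer.

```python
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


def utcnow():
    """Current UTC time as a naive datetime, for portable storage."""
    return datetime.now(timezone.utc).replace(tzinfo=None)


class BaseMixin:
    """Columns shared by every model: an ID and two timestamps."""

    id = Column(Integer, primary_key=True)
    first_seen_at = Column(DateTime, default=utcnow)
    last_seen_at = Column(DateTime, default=utcnow)


class Team(BaseMixin, Base):
    """Hypothetical example model; not part of the boilerplate."""

    __tablename__ = "teams"

    name = Column(String, nullable=False)
    year = Column(Integer)
```

With this layout, both timestamps are filled in when a row is first inserted, and the scraper would set last_seen_at again each time it re-encounters the same record.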

sqlalchemy is a handy library that I use a lot as a web developer, which makes it simple to query and work with records from your relational database. I included some boilerplate code for a somewhat esoteric but common use-case, keeping track of items, queries and search results, with the search_results table being a mapping between the queries and the items.

We also know that our data models might evolve over time, as the needs of our scrape change or we discover new fields that we want to add. Rather than blowing away the entire database and starting again, there’s a handy library called alembic that allows you to apply migrations to your tables.

This library looks at the changes you made to your models in models.py, compares them to the current state of your database, and then generates a migration file that specifies the commands you need to run on your database to make it look like what’s in your models.py file. This is another library I used a lot as a web developer.

Keeping Track of Work in a Queue

Storing and managing work in queues is a vastly complicated topic that’s outside the scope of a web scraping article. However, I’ve found that a simple setup using redis covers 90% of my use cases for scraping.

Queues are a great idea for web scraping, for two main reasons:

  1. They allow you to spread the backlog of work out across multiple workers, to increase the rate at which you’re collecting data.
  2. They keep a persistent state of what work is still left to do, in case you need to pause and restart without losing your place.

Of course, when we’re talking about scraping, work on the queue usually means a URL that needs to be visited. Any time a scrape involves making requests to over 1000 URLs, I try to use some sort of work queue.

The queue.py file has some basic code for connecting to a redis server, and then putting things into a queue or taking things out of a queue.

Note that I’m using redis sets as the default queue implementation. The benefit of this is that it automatically dedupes items in the set, which means that if I try to add the same URL to the queue multiple times, it will only show up once. This helps ensure that we don’t scrape the same URL multiple times.

A potential downside of sets is that there’s no implied ordering of items. It’s not first-in-first-out (FIFO) or last-in-first-out (LIFO). When your worker goes to pull a URL from the queue, it will be a random one from the set of all URLs in the queue. It’s up to you and your needs whether it’s important that URLs are visited in a specific order.
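Here is a small sketch of a set-backed queue along those lines. It is not the code from queue.py: the SetQueue class, the connect helper, and the "scrape:queue" key are hypothetical names. The Redis commands it relies on (SADD, SPOP, SCARD) are real, and redis-py exposes them as sadd, spop, and scard.

```python
class SetQueue:
    """A deduplicating, unordered work queue backed by a Redis set."""

    def __init__(self, client, key="scrape:queue"):
        # `client` is anything exposing sadd/spop/scard, e.g. redis.Redis.
        self.client = client
        self.key = key

    def enqueue(self, item):
        """Add an item; returns True if it was not already queued."""
        return self.client.sadd(self.key, item) == 1

    def dequeue(self):
        """Remove and return an arbitrary item, or None if the set is empty."""
        item = self.client.spop(self.key)
        if isinstance(item, bytes):
            # redis-py returns bytes unless decode_responses=True is set.
            item = item.decode("utf-8")
        return item

    def __len__(self):
        return self.client.scard(self.key)


def connect(url="redis://localhost:6379/0"):
    """Build a client from a URL (assumed default shown)."""
    import redis  # third-party: pip install redis

    return redis.Redis.from_url(url)
```

One caveat worth knowing: the set only dedupes items that are still waiting. Once a URL has been popped, adding it again puts it back on the queue, so if you need to avoid ever revisiting a URL you have to track completed work separately.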

Setting up Your Environment

Since the scraper will need to connect to a database server and a redis server, and manage a few third-party libraries, it’s important that we have a bit of configuration code in place so we can easily set up a new environment on our local computer, or on whatever server we will be running the code from.

Most of the details for setting this up are in the project’s readme.md file. Make sure you run through all of the steps in there to get your local environment set up.

One thing to pay attention to is the .env file in the project. Normally, you wouldn’t check this in with your code since it’s where you put secret values like database passwords, but I’ve provided one with the project to help you get set up.

You’ll need to make sure that you have both your database URL and your redis URL specified in there, so the code knows how to connect to the database and redis servers. Note that these servers could live directly on your computer, in which case the URLs would simply point to localhost.
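If you’re curious how values like these get from a file into running code, here is one generic way to do it using only the standard library. This is a sketch, not a description of how the boilerplate loads its .env file, and the variable names DATABASE_URL and REDIS_URL are assumptions; check the provided .env for the names the project actually uses.

```python
import os


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines into a dict, ignoring blanks and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            # Split on the first "=" only, since URLs can contain "=" too.
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("'\"")
    return values


def get_setting(name, env_values, default=None):
    """Prefer a real environment variable, then the .env value."""
    return os.environ.get(name, env_values.get(name, default))
```

Letting a real environment variable win over the file makes it easy to point the same code at a different database or redis server on a production machine without editing the .env file.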

Finally, once you’ve got your environment set up, make sure you create a migration file using alembic, and then apply the migration to your database so that it’s set up and ready to accept incoming data. The commands for these are in the readme.md file.

Running Your First Scrape

The rest of the scraping logic is up to you, and will depend on the target site you’re scraping. I provided some example code in example.py that scrapes my own website – Scrape This Site – which is designed for beginners who want to learn web scraping. This is intended to show you how to put all of the various pieces together to build a scraper.

At the top of example.py, you can see that we’re importing from queue.py, utils.py and models.py. Then we define the search function that takes in a keyword, makes a request to the page, and stores the data it finds in the database.

Notice this pattern of having a single function that takes a simple input, looks it up and stores data about it. This pattern of processing one input at a time is very easy to split out into a worker. This allows us to create multiple workers pulling inputs from the queue and then running this function on those inputs at the same time.

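As a general rule, try to avoid writing loops that make a request on each iteration. A loop like that is hard to pause, restart, or split across multiple workers, whereas a function that handles one input at a time can simply be fed from the queue.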

Note that the example code won’t actually store any of the hockey team statistics from the page we’re scraping, since the example data we’re pulling doesn’t correspond with the attributes we’ve defined on our Item class in models.py.

The bottom of the file (everything under if __name__ == '__main__') is a bit dense, but it specifies a few different ways that we can run our code from the command line. If you want to run the code with a single keyword, you can specify the keyword on the command line, like so:

python example.py -k bruins

If you’d like to specify a text file of keywords, with one keyword per line, you can run it like so:

python example.py -f input/keywords.txt

Note that running it this way only dumps the keywords from the file into the queue. It doesn’t actually run the search function on each keyword.

Finally, to run the code in “worker” mode, pulling keywords from the queue, performing the lookup, and then repeating that process over and over until there’s no work left, you can run it like so:

python example.py -w

Note that once a worker finds that there’s nothing left in the queue, it’ll sleep for 60 seconds before checking the queue again. You could open multiple terminal windows and run the same worker command in each one, and then you’d have multiple workers pulling keywords from the queue in parallel.
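The three modes above can be sketched with argparse and a simple polling loop. This is an illustration rather than the contents of example.py: only the -k, -f, and -w flags come from the project, while the long option names, the choice to make the flags mutually exclusive, and the helper names (build_parser, enqueue_file, run_worker) are assumptions. The enqueue, dequeue, and search arguments stand in for whatever your queue and scraping function provide.

```python
import argparse
import time


def build_parser():
    parser = argparse.ArgumentParser(description="Example scraper entry point")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-k", "--keyword", help="look up a single keyword now")
    group.add_argument("-f", "--file", help="enqueue one keyword per line")
    group.add_argument("-w", "--worker", action="store_true",
                       help="pull keywords from the queue until stopped")
    return parser


def enqueue_file(path, enqueue):
    """Push every non-blank line of `path` onto the queue; return the count."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            keyword = line.strip()
            if keyword:
                enqueue(keyword)
                count += 1
    return count


def run_worker(dequeue, search, idle_seconds=60, max_idle_polls=None,
               sleep=time.sleep):
    """Process queued keywords one at a time, sleeping when the queue is empty.

    `max_idle_polls` exists only so the loop can be stopped in tests;
    leave it as None to keep polling forever.
    """
    idle_polls = 0
    while max_idle_polls is None or idle_polls < max_idle_polls:
        keyword = dequeue()
        if keyword is None:
            idle_polls += 1
            sleep(idle_seconds)
            continue
        idle_polls = 0
        search(keyword)
```

Because run_worker only ever calls search on one keyword at a time, starting a second or third copy of the process is all it takes to work through the queue in parallel.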

I suggest trying the code both in “one-off keyword” mode with the -k flag, and in “keyword file” mode then “worker” mode to get a feel for how it all comes together.

Once you’re familiar with that, it’s off to the races!

If you’re looking for quick recipes or scraping patterns, you can check out my web scraping cheat sheet for copy/paste-able code samples to help with tricky situations.

If you get stuck with an error you don’t understand, you can check out my tips for debugging code for new developers.

Good luck, and happy scraping!