Hartley Brody

In its simplest form, web scraping is about making requests and extracting data from the response. For a small web scraping project, your code can be simple. You just need to find a few patterns in the URLs and in the HTML response and you’re in business.

But everything changes when you’re trying to pull over 1,000,000 products from the largest ecommerce website on the planet.

When crawling a sufficiently large website, the actual web scraping (making requests and parsing HTML) becomes a very minor part of your program. Instead, you spend a lot of time figuring out how to keep the entire crawl running smoothly and efficiently.

This was my first time doing a scrape of this magnitude. I made some mistakes along the way, and learned a lot in the process. It took several days (and quite a few false starts) to finally crawl the millionth product. If I had to do it again, knowing what I now know, it would take just a few hours.

In this article, I’ll walk you through the high-level challenges of pulling off a crawl like this, and then run through all of the lessons I learned. At the end, I’ll show you the code I used to successfully pull 1MM+ items from amazon.com.

I’ve broken it up as follows:

  1. High-Level Challenges I Ran Into
  2. Crawling At Scale Lessons Learned
  3. Site-Specific Lessons I Learned About Amazon.com
  4. How My Finished, Final Code Works

First there were desktop software products, then everything moved to the web. Then there were email-based products and even SMS-based ones. The latest craze in software interfaces is messenger bots, and Facebook has the largest chat platform by a long shot.

In this tutorial, I’ll show you how to build your own Facebook Messenger chat bot in Python. We’ll use Flask for some basic web request handling, and we’ll deploy the app to Heroku.

Let’s get started.


In my work as a full stack web developer, I often meet clients who request that I sign a Non-Disclosure Agreement (NDA) at various stages of the project.

By signing an NDA, the client is basically asking me to agree that I won’t take their idea and work on it myself, or share it with anyone else who will.

For most entrepreneurs, that sounds like a smart idea, right?

Here’s why I won’t sign them.


As more and more people learn to code, there is an increasing number of new developers in the workforce. Code bootcamps that promise to land candidates a job with only a few months of experience are springing up everywhere.

Having worked with a number of junior developers (and having been one myself, at one point ;) ) I’ve noticed a lot of the same mistakes crop up.

In an effort to help junior developers level up to eventually become engineers, I decided to enumerate some of the mistakes I see most often, as well as my recommended solutions and takeaways.


It’s the classic champagne problem that most successful web apps will deal with – there are so many users on your site that things are starting to get bogged down.

Pages load slowly, network connections start timing out and your servers are starting to creak under heavy load. Congratulations – your web app has hit scale!

But now what? You need to keep everything online and want the user’s experience to be fast – speed is a feature after all.

Scaling Comes at a Price

But before we go any further, an important caveat – you shouldn’t attempt to “scale” your web app before you’ve actually run into real scaling problems.

While it may be fun to read about Facebook’s architecture on their engineering blog, it can be disastrous to think that their solutions apply to your fledgling project.

A lot of the solutions to common scaling bottlenecks introduce complexity, abstraction and indirection, which make systems more difficult to reason about. This can create all sorts of problems:

  • Adding new features takes longer
  • Code can be harder to test
  • Finding and fixing bugs is more frustrating
  • Getting local and production environments to match is more difficult

You should only be willing to accept these tradeoffs if your app is actually at the limits of what it can handle. Don’t introduce complexity until it’s warranted.

As the famous quote goes:

Premature optimization is the root of all evil.
— Donald Knuth


While there are tons of resources designed to help people learn to code, there aren’t as many resources for helping people learn to build software products, at a higher level.

“Writing code” is largely a vocational skill, just like swinging a hammer is – but presumably you’re using that skill to actually build something.

In my experience, the reason software projects fail – take too long, go over budget, are too complex – isn’t necessarily bad coding practices. It’s that there was too much focus on writing code, and not enough on building a product.

It’s deceptively simple (and all too common) for a non-technical person to come up with a few sentences describing an idea, shoot it to someone with coding chops and say “code this for me.”

It’s essentially the modern-day equivalent of someone sketching a building on a napkin, handing it to a carpenter and telling them to start swinging their hammer.

There are often huge structural decisions – as well as a million tiny implementation details – that need to be fleshed out before you even start writing a line of code.

If the first thing you do with a product idea is start coding, your project is almost certainly doomed.


A few weeks ago, a friend texted me asking for advice. She was interested in learning how to code and wanted to know how I had done it.

While I did take a handful of Computer Science classes in college, I consider most of my relevant, day-to-day software development skills to be things I picked up through self-guided learning.

My initial advice for her was going to be pretty banal – sign up for Codecademy or Treehouse or one of the many “learn to code in 12 weeks” bootcamps.

But before I could send her that text, I realized that my own path to becoming a full-time, freelance software developer didn’t really look anything like that.

While there are a growing number of programs, classes and websites that purport to teach you coding skills in a short amount of time – and I’ve played with a few of them myself – I don’t really see them as an effective path to learning the kinds of skills one needs to be a competent software developer.

And so, to answer her question, I decided to take some time to look back on what I actually did, and what got me to the point I’m at today, earning a living writing code for people.

Text files ending in .html

It all really got started for me out of sheer boredom over winter break in 2008, during my freshman year of college. Having refreshed my Facebook News Feed for the millionth time that day and not found anything interesting, I decided to click the magic “view source” option in the browser and see if I could understand any of the HTML. Of course, it was all completely indecipherable to me, but I did notice the “.php” extension in the facebook.com/home.php URL.

That piqued my interest and after a bit of googling I discovered that PHP was some kind of language that would produce HTML for a browser to read. It all sounded really complicated so I figured I’d just start with the HTML part.

I opened up Notepad on my Windows laptop and saved a file to the desktop, making sure to change the extension from “homepage.txt” to “homepage.html”. From there, I read through the w3schools tutorials on HTML and built a page using table elements for layouts.

I’ll always remember the first time I opened a new tab in my browser and opened the HTML file on my desktop and saw a freaking web page that I had just freaking made. I mean look at it! It looks like a web page, and I made it!

I have made fire!

It was probably the first big “AHAHAHA!” moment that got me hooked on building stuff with code. I felt like a superhuman. Maybe I could build a site like Facebook! But not quite yet…

I bought a domain, picked a $5 webhost and figured out how to upload my shiny new HTML file so that the world could see it. I proudly emailed my site to some of the guys that worked on the campus life blog.


Formerly titled “The Rise of the Server-less Web Stack”

JavaScript has lots of cool stuff built on top of it now. These days, there are tons of well-worn frameworks that bring all sorts of powerful programming paradigms into the browser.

Want easy object-orientation? Use Backbone. More of a functional programmer? There’s Underscore, Lodash and many others. And I can’t keep up with the latest template rendering libraries, but there are dozens.

Plus ECMAScript 6 is rolling out quickly and with it, some long-awaited language features, syntactic sugar and new APIs.

Additionally, there are a lot of JS SDKs and simple integrations for things like accepting payments (Stripe) and analytics (Mixpanel, Customer.io, etc.) if you don’t want to write the code or support the infrastructure to do those things yourself.

With all of these features, one can build an entire, bona fide web application in pure JavaScript. This certainly isn’t a new idea – single-page JavaScript applications have been around for years.

But what if we take the power of JavaScript to its logical conclusion – making the entire app live in the user’s browser?

Do we even need to deal with setting up servers and maintaining a separate codebase for a server-side backend at all?


When I started as the first employee at Burstworks, the cofounders and I could easily hold the information about who was working on what at any given moment in our brains.

But as we worked on new projects and the scope and size of the engineering team grew, all of our code mostly stayed organized in one central repository:

  • Our high-performance ad server
  • Data Pipeline
  • One-off scripts
  • Nightly jobs
  • Everything...

While we generally weren’t working on the exact same files at the same time, there was still lots of stepping on toes. Having your git push rejected was a common occurrence.

Inevitably we had issues with merge conflicts, which led me to send this tweet from our company account:

And so I decided to take a step back and think about how we managed our version control system at Burstworks.

I definitely didn’t want to come up with something heavy-handed or overly prescriptive. The goal was to come up with just enough process to grease the wheels, and not slow things down.

I did some reading, came up with some initial ideas and pitched them to the team. We iterated a bit and here’s what we came up with.

I should start out by saying that it’s nothing revolutionary or new. It’s what I would consider the Minimum Viable Git Best Practices™ for a small engineering organization.


A few months ago, I got a chance to dive into some Python code that was performing slower than expected.

The code in question was taking tiny bits of data off of a queue, translating some values from strings to primary keys, and then saving the data back to another queue for another worker to process.

The translation step should have been fast. We were loading the data into memory from a MySQL database during initialization, and had organized the data structure so that the id -> string lookups were constant time.
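To make the lookup idea concrete, here is a minimal sketch of that kind of in-memory table. The rows, the `campaign` field name and the `translate` helper are all made up for illustration; they are not the original code. The sketch maps strings to primary keys, and a table for the reverse direction would be built the same way.

```python
# Hypothetical rows, standing in for the result of a "SELECT id, name ..." query
# that would run once during initialization.
rows = [(1, "alpha"), (2, "beta"), (3, "gamma")]

# Build the lookup table once; dict lookups are constant time on average.
name_to_id = {name: pk for pk, name in rows}


def translate(message, field="campaign"):
    """Return a copy of `message` with the string in `field` swapped for its primary key."""
    translated = dict(message)
    translated[field] = name_to_id[message[field]]
    return translated


print(translate({"campaign": "beta", "clicks": 3}))
```

With a structure like this, the per-message translation cost is a single hash lookup, which is why the slowdown had to be coming from somewhere else.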

Finding the Problem

In order to figure out where the bottleneck(s) were, I used Python’s built-in cProfile module, and combed through the results using the awesome cprofilev package, written by a former Quora intern.
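For anyone who hasn’t profiled Python code before, a rough sketch of the approach using only the standard library looks something like this. The `slow_step`, `fast_step` and `worker_loop` functions are placeholders I made up for the example, not the real worker.

```python
import cProfile
import io
import pstats


def slow_step(n):
    # Placeholder for an expensive piece of work.
    return sum(i * i for i in range(n))


def fast_step(n):
    # Placeholder for a cheap piece of work.
    return n + 1


def worker_loop(iterations=200):
    total = 0
    for _ in range(iterations):
        total += slow_step(1000)
        total += fast_step(1000)
    return total


def profile(func, top=10):
    """Run `func` under cProfile and return the stats report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile(worker_loop)
print(report)
```

The functions with the largest cumulative time show up at the top of the report, which is usually enough to point at the bottleneck.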

After letting the script run for a while, the bottleneck jumped out right away – the workers were spending about 40% of their time serializing and deserializing data.

In order to keep messages on the queue for other workers to pick up, we were serializing the Python dicts to JSON using the standard library’s json module.

Our worker was reading the text data from the queue, deserializing it into a Python dict, changing a few values and then serializing it back into text data to save onto a new queue.

The translation steps were taking up about 40% of the total runtime.
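The round trip described above can be sketched roughly as follows. The message fields and the `handle` function are invented for the example, and the timing is only a way to get a feel for the overhead on your own machine, not a reproduction of the original measurements.

```python
import json
import timeit


def handle(raw_text):
    """Deserialize a message, change a value, and serialize it again."""
    message = json.loads(raw_text)       # text from the queue -> Python dict
    message["status"] = "translated"     # stand-in for the real translation step
    return json.dumps(message)           # Python dict -> text for the next queue


# Hypothetical message as it might sit on the incoming queue.
sample = json.dumps({"id": 42, "campaign": "beta", "status": "new"})
result = handle(sample)

# Time many round trips; almost all of this is json.loads and json.dumps.
elapsed = timeit.timeit(lambda: handle(sample), number=10_000)
print(f"10,000 round trips took {elapsed:.3f} seconds")
```

When the actual work between the two calls is a dict lookup, the serialization on either side can easily dominate the runtime.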

So I set out to see if there was a faster way to serialize a Python dict.