Hartley Brody

Asynchronize All The Things

One of the first steps of tackling a new web application is figuring out where bottlenecks might arise.

Now, you don’t want to go too far down this route, or you’ll end up designing yourself into oblivion. As Donald Knuth famously said:

Premature optimization is the root of all evil.

But it’s still good to have a sense of where your system might get bogged down under heavy load. For example, database calls are often the first place you’ll start noticing slowness in your app. Or if you have to make a network request – say, hitting a third party API – that can add more than a full second to the amount of time it takes your application to serve a response to a visitor.

Break It Down

I’ve heard it said that good code should only serve a single, specific purpose. If you can take a given process and break it down into two or more simpler processes, that’s usually the best way to go.

I always kind of knew that was a good design goal, but I didn’t realize how powerful it was until I started playing around with message queues.

Analytics for Noobs

I recently started working on a new side project, and one of the core parts of the application is an analytics system that shows artists how their fans are interacting with their music. That might not sound too challenging – after all, it’s just writing lines to a database.

But the reason this challenge seemed particularly daunting was because the service would likely receive hundreds or even thousands of bits of analytics information coming in each second during peak load.

I had never designed a system that operated under that type of environment.

Mvsic Architecture

I had learned a few lessons in scalability when my blog had made it to the front page of HN for a few hours, but that was expensive and resulted in lots of downtime.

If this was going to be a serious service that paying clients were relying on, I had to do much better.

“Log Now, Process Later”

At HubSpot, the analytics team has built an incredibly robust analytics pipeline that aggregates information from all of our 7000+ customers’ websites. I talked to a few engineers who work on the system to see how it was built.

At a high level, we log every analytics request immediately into a log file. Then, there are separate processes that constantly chew through the log files, pulling out the data and saving it into our analytics database.

This ensures that we never lose any data – if something goes wrong with the processors or they get overwhelmed with a sudden burst of traffic, we keep on logging and simply spin up some new processors to chew through the logs faster and get everything up to date. Data is never lost because of a bug.

This separates the serving of our customers’ websites from the actual tracking of that traffic. This allows us to serve the sites as soon as possible, and then process the analytics data asynchronously, so that visitors aren’t waiting for our analytics system before they can see the page.

Do as little as possible to serve the page, and do the rest asynchronously.

The Speed/Reliability Tradeoff

This all sounded great. It allowed for fast page load times with no need to hit a database at all during the request/response cycle. The only problem with it was that the processing of analytics data was slow.

While we never lost any data for our customers, it would take minutes – occasionally, hours – before the data our customers were seeing in their dashboard was in line with the actual traffic patterns on their websites. Chewing through logs and rotating them constantly was slow and surprisingly error prone, so our system was always a bit behind.

Again, I don’t actually work on these systems at HubSpot, so I can’t speak to their actual complexity, but this is just what I observed from talking to a few people and keeping an eye on some of our internal dev dashboards.

This is all in line with the trade-offs proposed in the CAP Theorem. Conjectured in 2000 by UC Berkeley professor Eric Brewer, the theorem states that a distributed computer system can only provide two of the following three guarantees:

  1. Consistency: data that shows up in one part of the system is immediately available to all other parts
  2. Availability: the system always responds to every request
  3. Partition tolerance: the overall system doesn't lose data, even if parts of it fail

HubSpot’s system is essentially focused on #2 & #3. And this is in line with the business goals of the software: traffic and leads information is crucial for our customers – we can’t lose any of it – but our customers don’t necessarily need that data in real time. So we have developed a system that is slower, but reliable.

But it got me wondering – what if I conceded #3 in favor of #1? That is, I was willing to lose some data here and there if it meant I could report analytics to my customers in soft real-time. What would that system look like?

Message Queues

In a stroke of impeccable timing, the CEO of Crashlytics was scheduled to give a presentation called “Scaling Crashlytics: Designing Web Backends for 100 Million Events per Day” at the local Boston Tech meetup later that week.

I attended eagerly, and listened as he explained how his company uses RabbitMQ to process millions of events each day, and send those crash reports to their customers in real time.

Boston Tech meetup

I had heard of RabbitMQ before, but it had sounded complicated and scary. But after his talk, I checked out the well-written tutorials and realized how powerful RabbitMQ – and message queuing in general – could be.

Essentially, RabbitMQ is an open-source message queuing service. You can take a “message” (any arbitrary data), put it in a queue, and then have one or more “workers” that pull messages off the queue and process them – putting them in a database, for example.
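To show the shape of that pattern without requiring a running broker, here’s a toy sketch that uses Python’s standard-library `queue.Queue` and threads to stand in for RabbitMQ and its workers (a real setup would talk to RabbitMQ through a client library like pika instead):

```python
import queue
import threading

def worker(q, db, lock):
    """Pull messages off the queue and 'process' them --
    here, just tallying events by type into a dict."""
    while True:
        message = q.get()
        if message is None:  # sentinel value: no more work, shut down
            q.task_done()
            return
        with lock:
            db[message] = db.get(message, 0) + 1
        q.task_done()

def run_pipeline(messages, num_workers=2):
    """Publish messages onto a queue and let several workers drain it."""
    q = queue.Queue()
    db, lock = {}, threading.Lock()
    threads = [threading.Thread(target=worker, args=(q, db, lock))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for m in messages:        # the "producer" side: publish messages
        q.put(m)
    for _ in threads:         # one shutdown sentinel per worker
        q.put(None)
    for t in threads:
        t.join()
    return db
```

The producer never waits on the processing – it just drops messages on the queue and moves on, which is the whole point.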

RabbitMQ is written in Erlang – a language designed for fault-tolerant, ultra-reliable telecommunications networks. The CEO described how – in his experience – it had extremely low memory and processing requirements. One box could easily handle the volume of requests I was expecting.

But the best part about using RabbitMQ is how easily it scales.

Queues can support an unlimited number of workers, and each worker goes off and does its work independently of the other workers – they can be on totally different machines. This has a number of awesome benefits:

  • Tasks are done in parallel. Rather than doing 10 slow operations in a row, you can have 10 workers each doing the operation at the same time for a 10x speed increase.
  • Workers are isolated. If there's an unhandled exception while one of the workers is doing its thing, the other workers can keep chugging right along, pulling messages out of the queue as if nothing happened.
  • Messages aren't lost. If something goes wrong while a worker is processing a task, that message gets put back into the queue and passed along to the next worker. The queue can even be set up so that if the machine it's running on dies, the messages that were in the queue at the time can still be restored.
  • The number of workers is flexible. If there's a big burst of activity or you're not working through your queue fast enough, you simply bind more workers to the queue and your throughput will increase.

For the analytics pipeline I’m building, these are all fantastic benefits, especially that last one. The fact that I can easily dial the number of workers up or down means I can start out small. I don’t need to have lots of machines running workers when there is little activity on the site. I just need one or two for most of the time, and then I can add more as the site grows over time, or if a sudden burst of activity starts to fill up the queue.

Using a message queuing design for my analytics pipeline means that the analytics processing and storage happens asynchronously, freeing up my web servers to focus exclusively on serving web pages.

By dividing up the services like this, it’ll be easier to work on, test, and scale each component of the project separately, in accordance with its own specific resource needs.

A One-Off Project

With all of its awesome benefits, you might think that RabbitMQ is hard to set up or configure. Fortunately, it’s not.

Last week, I was working on a project in the office where I needed to look up several thousand companies’ websites and run them through our Marketing Grader API. At first, I dashed off a quick Python script that took the list of companies, looked up their websites, and then ran the websites through Marketing Grader one at a time.

Twelve hours after it started running, the script was only about halfway through the list when it stumbled on a weird character in one of the company names and crashed.

It dawned on me that the slow, potentially buggy API calls would work much better if they were set up as a message queue system. I set up some workers that would take a company name and look up the website, and other workers that would take that website and look up the Marketing Grade.

If the API calls timed out, or returned some unexpected responses, a worker could be restarted gracefully without having to restart the entire process from the first company.
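Here’s a simplified sketch of that two-stage pipeline. The two lookup functions are hypothetical stand-ins for the real website lookup and the Marketing Grader API call, and the sequential loops stand in for independent workers draining each queue:

```python
import queue

def lookup_website(company):
    """Hypothetical stand-in for the real company -> website lookup."""
    return company.lower().replace(" ", "") + ".com"

def lookup_grade(website):
    """Hypothetical stand-in for the Marketing Grader API call."""
    return len(website) % 100  # fake score for illustration

def run_two_stage_pipeline(companies):
    website_q, grade_q, results = queue.Queue(), queue.Queue(), {}
    for c in companies:
        website_q.put(c)
    # Stage 1 "workers": company name -> website, feeding the next queue
    while not website_q.empty():
        company = website_q.get()
        grade_q.put((company, lookup_website(company)))
    # Stage 2 "workers": website -> grade
    while not grade_q.empty():
        company, site = grade_q.get()
        results[company] = (site, lookup_grade(site))
    return results
```

Because each stage reads from its own queue, a failure in one stage only requires restarting that stage’s workers – everything already sitting in the queues is preserved.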

Dozens of workers processing a RabbitMQ queue

With 12 workers running, my desktop got a little hectic, but the script ran through the full list in under an hour. It could have gone even faster if I had spun up more workers.

Using message queues to separate out each step of the process was not only much faster and more reliable, but also really simple to set up.

You can check out the script I used here: get_inc_5000.py

Asynchronize All The Things

A well designed web application should do as little as possible to serve the visitor’s request, so that it can return a response quickly. Any processing, database hits or network calls that aren’t necessary for serving a response should be done asynchronously with a message queuing system like RabbitMQ.

And even outside the context of web applications, message queues allow multiple jobs to be run in parallel and are massively horizontally scalable. Whenever you find your code getting tripped up on a slow bottleneck, it’s worth thinking about whether it makes sense to pull that process out and run it asynchronously.

They don’t necessarily make sense for every task, but it’s a great tool to have in your tool belt, and one that I’m glad I’ve picked up.