Scaling Your Web App 101: Lessons in Architecture Under Load
It’s the classic champagne problem that most successful web apps will deal with – there are so many users on your site that things are starting to get bogged down.
Pages load slowly, network connections start timing out and your servers are starting to creak under heavy load. Congratulations – your web app has hit scale!
But now what? You need to keep everything online and want the user’s experience to be fast – speed is a feature after all.
Scaling Comes at a Price
But before we go any further, an important caveat – you shouldn’t attempt to “scale” your web app before you’ve actually run into real scaling problems.
While it may be fun to read about Facebook’s architecture on their engineering blog, it can be disastrous to think that their solutions apply to your fledgling project.
A lot of the solutions to common scaling bottlenecks introduce complexity, abstraction and indirection, which make systems more difficult to reason about. This can create all sorts of problems:
- Adding new features takes longer
- Code can be harder to test
- Finding and fixing bugs is more frustrating
- Getting local and production environments to match is more difficult
You should only be willing to accept these tradeoffs if your app is actually at the limits of what it can handle. Don’t introduce complexity until it’s warranted.
As the famous quote goes:
Premature optimization is the root of all evil.
— Donald Knuth
Find the Actual Bottleneck using Metrics
The first step to alleviating any problem – in software or otherwise – is to clearly and accurately define what the problem actually is.
A problem well stated is a problem half-solved.
— Charles Kettering
For a web app that’s under too much load, that means finding out what resource your application is running out of on the server.
At a high level, the answer is usually going to be one of four things:
- CPU
- Memory (RAM)
- Network I/O
- Disk I/O
Until you figure out what resource your application is bounded by, no one can help you scale your app and any solutions you come up with will be complete guesses.
Figuring out what you’re bounded by means checking your resource monitoring – or adding some if you’ve never done it before.
What gets measured, gets managed
— Peter Drucker
If you’re managing your own servers, installing Munin is a great first step. If you’re running on Amazon’s EC2, AWS offers some decent instance monitoring out of the box. If you’re on Heroku, New Relic seems to be the best approach.
Use the graphs to look for spikes or flat tops. These usually imply that some resource was overwhelmed or completely at capacity and couldn’t handle any new operations.
If you don’t see any resources that seem to be at capacity, but your app is just slow in general, sprinkle some logging throughout heavily-used operations and check the timings to see if some step is taking a long time – perhaps a resource that’s loaded over the network.
It could be that another server is introducing delays – potentially your database server or a third-party API.
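If your framework doesn’t already provide per-operation timing, a small context manager is often enough to get started. A minimal sketch in Python – the `timed` helper, the `timings` dict and the labels are illustrative, not from any particular library:

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("timing")

timings = {}  # label -> last measured duration in ms, handy for inspection

@contextmanager
def timed(label):
    """Log how long the wrapped block of work takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings[label] = elapsed_ms
        log.info("%s took %.1f ms", label, elapsed_ms)

# Wrap suspect operations to see where the time actually goes
with timed("user lookup"):
    time.sleep(0.05)  # stand-in for a database or third-party API call
```

Grepping the resulting logs for the largest numbers usually points straight at the bottleneck.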
If you host your database on a different machine than your web servers (which you should) it’s important to check your resource monitoring for that machine as well as for your web servers.
The database is usually the first place scaling issues start to show up.
Scaling a Web App from 10,000 Feet
Now that you’ve got a much better sense of what the problem is, you should start to tackle it by trying the simplest solution that directly addresses the issues – remember, we’re always trying to avoid adding unnecessary complexity.
At a high level, the goal of any scaling solution should be to make your web stack do less work for the most common requests.
If you’ve already figured out the answer to a query, reuse it. Or if you can avoid computing it or looking it up altogether, even better.
In a tangible sense, this usually means one of the following:
- Store results of common operations so you're not repeating work
- Reuse data you've already looked up, even if it's a bit stale
- Avoid doing complex operations in the request-response cycle
- Don't make requests from the client for things it already has
These all basically boil down to some form of caching.
Memory is not only inexpensive to add to a server, it’s usually many orders of magnitude faster for accessing data when compared to disk or the network.
Use a Load Balancer
Whether your application is hosted in the cloud or on hardware, some part of your stack will inevitably fail. You should host and arrange your web servers to take this into account.
Your domain should point to some sort of load balancer, which should then route requests between two or more web servers.
Not only does this setup make it easy to survive failures, it also makes handling increased load easier as well.
With a load balancer in front of two web servers, you can horizontally scale your application by bringing up new web servers and putting them behind the load balancer. Now the requests are spread across more machines, meaning each one is doing less work overall.
This allows you to grow your application gracefully over time, as well as handle temporary surges of traffic.
I should also add that setting up a load balancer and two web servers is a one-time setup that doesn’t add much on-going complexity, so it’s something you should consider doing up-front, even before you’ve run into scaling problems.
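As a rough illustration of what the balancer is doing, here’s a minimal round-robin sketch in Python – real load balancers like nginx or HAProxy also handle health checks, timeouts and connection draining, none of which appear here:

```python
import itertools

class RoundRobinBalancer:
    """Hand each incoming request to the next backend in turn."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

balancer = RoundRobinBalancer(["10.0.0.1", "10.0.0.2"])
picks = [balancer.pick() for _ in range(4)]
# requests alternate between the two backends
```

Adding capacity is then just a matter of adding another address to the backend list.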
Cache Database Queries
This is one of the simplest improvements you can make. There are usually a few common queries that make up the majority of the load on your database.
Most databases support query logging, and there are many tools that will ingest those logs and run some analysis to tell you what queries are run most frequently, and what queries tend to take the longest to complete.
Simply cache the responses to frequent or slow queries so they live in memory on the web server and don’t require a round-trip over the network or any extra load on the database.
Obviously, data that’s cached can grow “stale” or out-of-date quickly if the underlying information in the database is updated frequently. Your business or product requirements will dictate what can or can’t be cached.
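As a sketch of the idea, here’s a tiny in-process cache with a time-to-live. The names (`QueryCache`, `fake_db`) are invented for illustration, and a real deployment would more likely use a shared cache like memcached or redis so all web servers benefit:

```python
import time

class QueryCache:
    """In-memory cache with a time-to-live, so stale entries expire."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # query -> (result, cached-at timestamp)

    def get(self, query, fetch):
        entry = self._store.get(query)
        if entry is not None and time.monotonic() - entry[1] < self.ttl:
            return entry[0]          # cache hit: no database round-trip
        result = fetch(query)        # cache miss: hit the database
        self._store[query] = (result, time.monotonic())
        return result

calls = []
def fake_db(query):
    """Stand-in for a real database call; records how often it runs."""
    calls.append(query)
    return ["alice", "bob"]

cache = QueryCache(ttl_seconds=60)
cache.get("SELECT * FROM users", fake_db)
cache.get("SELECT * FROM users", fake_db)  # second call served from memory
```

The TTL is where your staleness tolerance gets encoded: a leaderboard might accept sixty seconds, a shopping cart probably accepts zero.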
Add Database Indexes
Database indexes ensure that needle-in-a-haystack type lookups are O(log n) instead of O(n).
In layman’s terms, this means the database can find the right row immediately, rather than having to compare the queried conditions against every single row in the table.
If your table has tens of thousands of rows, this could shave a noticeable amount of time off of any queries that use that column.
As a very simple example, if your application has profile pages that look up a user by their handle or username, an un-indexed query would examine every single row in the users table, looking for the ones where the “handle” column matched the handle in the URL.
By simply adding an index to that table for the “handle” column, the database could pull out that row immediately without requiring a full table scan.
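You can watch the query planner change its mind directly. A small sketch using Python’s built-in sqlite3 module – the table and index names are hypothetical, but the same `EXPLAIN`-style tooling exists in Postgres and MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, handle TEXT)")
conn.executemany("INSERT INTO users (handle) VALUES (?)",
                 [(f"user{i}",) for i in range(1000)])

def plan_for(query):
    """Ask SQLite how it intends to execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(str(row[-1]) for row in rows)

# Without an index the planner falls back to a full table scan
before = plan_for("SELECT * FROM users WHERE handle = 'user42'")

conn.execute("CREATE INDEX idx_users_handle ON users (handle)")

# With the index it can seek straight to the matching row
after = plan_for("SELECT * FROM users WHERE handle = 'user42'")
```

Running the two plans side by side shows the scan disappear once the index exists.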
Rethink Session Storage
A lot of applications handle sessions by storing a session ID in a cookie, and then storing the actual key/value data for each and every session in a database table.
If you find your database is getting slammed and your application does a lot of reading and writing to session data, it might be smart to rethink how and where you store your session data.
One option is to move your session storage to a faster, in-memory caching tool like redis or memcached.
Since these use volatile memory rather than persistent disk storage (which most databases use) they’re usually much faster to access – but the tradeoff is that you run the risk of losing all of your session data if the caching system needs to reboot or go offline.
Another option is to move the session information into the cookie itself. This obviously leaves it open to being tampered with by the user, so it shouldn’t be used if you’re storing anything private or sensitive in the session.
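If you do keep session data in the cookie, signing it lets the server detect tampering – though the contents are still readable by the user, so the caveat about sensitive data stands. A minimal sketch using Python’s standard hmac module; the secret and session fields are placeholders:

```python
import hashlib
import hmac
import json
from base64 import urlsafe_b64decode, urlsafe_b64encode

SECRET = b"change-me"  # placeholder; a real key comes from config, not source

def sign_session(data):
    """Serialize the session and append an HMAC so edits are detectable."""
    payload = urlsafe_b64encode(json.dumps(data).encode()).decode()
    mac = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + mac

def verify_session(cookie):
    """Return the session dict, or None if the cookie fails verification."""
    payload, _, mac = cookie.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return None
    return json.loads(urlsafe_b64decode(payload))

cookie = sign_session({"user_id": 42, "theme": "dark"})
# a cookie the user has edited no longer matches its MAC and is rejected
tampered = "x" + cookie
```

Most web frameworks ship a signed-cookie session backend that does exactly this, so prefer the built-in one over rolling your own.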
By moving session data out of the database, you’ll likely eliminate several database queries per page load, which can help your database’s performance tremendously.
Run Computations Offline
If you have some long-running queries or complex business logic that takes several seconds to run, you probably shouldn’t be running it in the request-response cycle during a page load.
Instead, move it “offline” and have a pool of workers chug away at it, putting the results in a database or in-memory cache.
Then when the page loads, your web server can simply and quickly pull the precomputed data out of the cache and show it to the user.
A drawback here is that the data you’re showing the user is no longer “real time,” but having data that’s a few minutes old is often good enough for many use-cases.
If the data really takes a long time to generate, see if it can be parallelized so that multiple workers can work on different parts of the computation at the same time.
You’ll probably want to set up another cluster of machines for the work queue and the workers, since they will likely have different scaling properties than your web servers.
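As a toy model of the pattern, here’s a single in-process worker that drains a queue and writes results into a cache the web tier can read. A production setup would use a persistent job queue (Redis-backed or similar) with workers on separate machines:

```python
import queue
import threading

jobs = queue.Queue()
results_cache = {}  # page loads read precomputed values from here

def worker():
    """Drain the queue, storing each result where the web tier can find it."""
    while True:
        key, compute = jobs.get()
        results_cache[key] = compute()
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# A request handler enqueues the slow work instead of running it inline...
jobs.put(("top_sellers", lambda: sorted([3, 1, 2], reverse=True)))
jobs.join()  # only for this demo; a real page just reads whatever is cached

# ...and later page loads serve the precomputed answer instantly
```

The page load itself becomes a dictionary lookup, no matter how expensive the computation was.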
To take this architectural style to its logical conclusion, you can generate the HTML for your entire web app offline and simply serve it to users as static files.
This is the inspiration behind static site generators that are used to power a growing number of blogs (including this one), and it’s what the New York Times did to serve election night results.
HTML Fragment Caching
If you’re rendering HTML templates on the server side, you don’t want your template engine wasting CPU cycles on every request regenerating the same HTML over and over for content that doesn’t change often.
If there are certain sections of your site’s markup that change very infrequently – say the navigation, footer or sidebar – then that HTML should be cached somewhere and reused between requests.
Pay special attention to high-traffic pages. Sometimes you’ll be able to cache most of the page except for a few more dynamic or “real time” sections.
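A fragment cache can be as simple as a dictionary keyed by fragment name with a maximum age. This sketch is framework-agnostic and the names are invented – most template engines (Rails, Django, Jinja2 plugins) ship their own fragment-caching helpers that do this for you:

```python
import time

_fragment_cache = {}  # fragment name -> (rendered HTML, rendered-at time)

def render_fragment(name, render, max_age=300):
    """Reuse rendered HTML for slow-changing fragments like the footer."""
    entry = _fragment_cache.get(name)
    if entry and time.monotonic() - entry[1] < max_age:
        return entry[0]  # skip the template engine entirely
    html = render()
    _fragment_cache[name] = (html, time.monotonic())
    return html

renders = []
def build_footer():
    """Stand-in for an expensive template render; counts its invocations."""
    renders.append(1)
    return "<footer>© Example Corp</footer>"

page = render_fragment("footer", build_footer) + render_fragment("footer", build_footer)
# the footer template ran only once despite two uses
```

Composing a page then becomes mostly string concatenation of cached fragments plus the few truly dynamic sections.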
Putting Work Into Queues
We talked about using queues and workers for caching output that takes a long time to generate. You can also use workers to process large amounts of input asynchronously.
This takes large, slow chunks of work, breaks them out of the main request-response cycle and moves them completely off your web servers.
Say you have a way for someone to import a CSV of their contacts, and several people upload 50MB files. Instead of sending all of that data to a web server and having it take up memory and block the CPU while it’s being processed – put it on a static file host like S3 and have a worker that periodically checks for new file uploads and processes them offline.
You do have to be careful when you start putting lots of business logic into workers. You have to make sure you’re keeping track of what still needs to be processed, what is currently being processed, and what failed and needs to be processed again.
You also need to make sure you have enough workers running, otherwise the work queue will grow longer and longer, leading to silent delays that are easy to miss.
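One way to keep that bookkeeping explicit is to track each job’s state and attempt count. A minimal in-memory sketch – a real system would persist this in a database or rely on the job-queue backend’s retry machinery:

```python
class JobTracker:
    """Track each job's state so stuck or failed work is never silently lost."""
    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.jobs = {}  # job_id -> {"state": ..., "attempts": ...}

    def enqueue(self, job_id):
        self.jobs[job_id] = {"state": "pending", "attempts": 0}

    def start(self, job_id):
        job = self.jobs[job_id]
        job["state"] = "processing"
        job["attempts"] += 1

    def fail(self, job_id):
        job = self.jobs[job_id]
        # Retry until the attempt budget is spent, then park it for inspection
        job["state"] = "pending" if job["attempts"] < self.max_attempts else "dead"

    def succeed(self, job_id):
        self.jobs[job_id]["state"] = "done"

tracker = JobTracker(max_attempts=2)
tracker.enqueue("import-contacts-42")
tracker.start("import-contacts-42")
tracker.fail("import-contacts-42")   # first failure: goes back to pending
tracker.start("import-contacts-42")
tracker.fail("import-contacts-42")   # second failure: attempt budget spent
```

A periodic alert on the count of “pending” and “dead” jobs catches both an under-provisioned worker pool and a poison-pill job.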
Client Side Improvements
Of course, another great way to decrease the load on your web servers is to decrease the number of requests they have to deal with.
Even with the same number of users in your app, there are a number of client-side improvements that can lower the number of requests your web stack has to deal with.
Just like you want to cache database queries to avoid regenerating answers you already know, you should avoid having the browser ask for content that it has already downloaded.
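HTTP’s conditional requests are the standard mechanism for this: the server hands the browser a validator (an ETag), and on repeat visits it can answer 304 Not Modified with no body at all. A simplified sketch of the server-side logic, with all framework and header plumbing omitted:

```python
import hashlib

def make_etag(body):
    """Derive a validator the browser echoes back on its next request."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body, if_none_match):
    """Return (status, payload); 304 with an empty body if the copy is current."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, b""  # browser reuses its cached copy, nothing re-sent
    return 200, body     # full response; an ETag header would carry the tag

page = b"<html>hello</html>"
first_status, _ = respond(page, None)                    # first visit: full page
repeat_status, payload = respond(page, make_etag(page))  # repeat visit: 304
```

Long-lived Cache-Control headers on static assets go one step further: the browser doesn’t even ask.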
Content Delivery Network
Ideally, your web servers wouldn’t serve any static content at all. You don’t need the overhead of loading your entire web framework or language runtime in order to serve a static file off of disk.
You should host your static content on a static file host that’s purpose built for sending files over the network. You can set up a simple one with nginx or use a dedicated service like Amazon’s S3.
When you’re splitting out your static content, it’s good to give the static file host a different CNAME or subdomain.
Once you’ve got that set up, it’s usually pretty straightforward to add a Content Delivery Network in front of your static file host.
This distributes your static content even faster – often without the requests ever reaching your web stack.
The CDN is essentially a geographically-distributed file cache that serves up copies of your static files that are often geographically closer to the end-user than your server.
The CDN will often have options to minify and gzip your static content to make the files even smaller and faster to send to the client.
When All Else Fails
If you’re still having issues with high load after caching everything you can, and your server budget is maxed out, there are still some less-than-ideal options for handling the excess.
Back pressure is just a way of telling people they have to wait because the system is slow. If you’ve ever walked up to a cafe, seen a line out the door and decided to go elsewhere – that’s a great example of back pressure.
Even without much work on your end, back pressure can be implicit. The site will be slow for users, which will discourage them from clicking around as much.
They also might see increased error rates when they try to load things – think Twitter’s Fail Whale.
You can also make back pressure explicit by adding messaging to the UI telling people that parts of your application are temporarily disabled due to high demand.
The other option is to shed load. You’re acknowledging that you’re not able to respond to everyone’s requests, so you’re not even going to try.
This is the nuclear option. Set aggressive timeouts in your web server, drop requests and return blank pages. Some people’s data will get lost in cyberspace.
Try to add some temporary messaging to let people know that you’ll be back soon, but be prepared for a PR fallout.
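As a sketch of what explicit shedding looks like, here’s a counter that rejects requests beyond a fixed concurrency limit – the web server would turn each rejection into an immediate 503 with a friendly message rather than letting the request queue up:

```python
import threading

class LoadShedder:
    """Reject new requests outright once too many are already in flight."""
    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False  # shed: answer 503 immediately instead of queueing
            self.in_flight += 1
            return True

    def release(self):
        """Call when a request finishes so its slot frees up."""
        with self.lock:
            self.in_flight -= 1

shedder = LoadShedder(max_in_flight=2)
accepted = [shedder.try_acquire() for _ in range(3)]
# the first two requests get through, the third is shed
```

Failing fast like this keeps the requests you do accept snappy, instead of letting every request become uniformly slow.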
Did you know I do consulting? If you need help scaling your web application, drop me a line!