Peeling Back the ORM: Demystifying Relational Databases For New Web Developers
Most web developers building dynamic websites interact with databases every day. Relational databases like MySQL or Postgres are usually the first tool people reach for when their application needs to store data.
But with the recent proliferation of web frameworks like Rails and Django, many web developers rely totally on Object-Relational Mappers (ORMs) for interacting with their database.
In fact, many new web developers see “writing raw SQL” or interacting directly with the database as something scary that should be avoided at all costs.
The reality is that relational databases are actually fairly easy to tame, and are built on top of lots of great ideas. Understanding the relational database that your application runs on will give you a much richer understanding of your web stack and make you a more powerful, proficient developer.
This article is a version of some notes I wrote for the new web developers who just started at Burstworks. At the end I link to the major resources I used, in case you want to learn more about this stuff.
Why Use Databases At All?
Before we dive into relational databasea, we should spend a minute thinking about why we’d want to use them at all. For example, our application could simply write JSON-encoded data to a file on disk, or use some csv-like “tables”.
The problem with these home-brewed solutions are many. For one, you don’t want to have to chase pointers around on the disk, reading and writing to files yourself. You also don’t get any “advanced” querying from these systems, and there’s really no way to optimize your lookups.
Once you have more than a few records in your application, you really need to use something more powerful, like a relational database.
Relational database management systems (RDBMS) tend to have the following characteristics:
- Massive. They can store terabytes of data -- usually much more than can fit into memory -- using disk as storage which since it's much cheaper.
- Persistent. Data outlives a script or application's execution, and can be accessed multiple ways at the same time
- Multi User. There's some sort of concurrency control that handles situations where multiple users read/write data at the same time
- Convenient Access. There are some high-level abstractions that make accessing and manipulating data much nicer for the application developer.
- Efficient. Databases can usually run thousands of simple queries per second and can optimize complex lookups so they're fast.
Databases have also been around for a really long time, so there’s lots of information to help you if you get stuck.
Convinced databases are worth using? Next we’ll look at some of the more academic underpinnings of relational databases, and some of the “math” behind them.
Relational Databases & Your Objects
Unless you have a specific reason, the data in your application should be organized in a fairly “normalized” fashion, mapping the real world objects that your application uses into classes.
These are your models and they make up the “M” in most MVC frameworks. Most frameworks will automatically take the models you define in your code, and translate those directly into relations (ie tables).
Formally speaking, a database is a set of “relations” (or tables). Each relation has a set of “attributes” (or columns), and each attribute has a type – either something atomic (float, int, string, etc) or more structured (date, time, etc). Within the relations, each “tuple” (or row) has a value for each attribute. You can imagine it looks something like this:
The names in the quotes above are the more formal labels for each concept, but people usually refer to them using the names I included in parentheses. For the rest of the article, I’ll refer to them as tables, columns and rows.
The ORM in your favorite web framework is saving each instance of a model as a new row in that model’s table. So if you have a model called “Movies” and you’ve created 20 movies in your application, then you have a table called “movies” and it’s got 20 rows.
The attributes of your models are what make up the columns in the table. That’s why adding a new attribute to your models usually requires a database migration – the RDBMS needs to change the structure of the corresponding table in the database to accommodate the new column.
Rows also contain a unique key, which is an attribute whose value is unique to that row. Keys are sometimes defined as a set of attributes that are unique when used together.
Keys are important because they’re used to uniquely identify a row inside a table, and they’re also used to define relationships between rows in different tables.
This is why a lot of ORMs will perform searches by primary key by default – those lookups are fast (thanks to indexes, which we look at below) and guaranteed to return a single, unique result.
Relational algebra may sound scary, but it’s not really that math-heavy. If you want to understand relational databases, you should really understand the basics of relational algebra.
Relational algebra is a formal language (meaning it’s just high-level concepts) but it forms the underpinnings of implemented languages like SQL (which we’ll look at in a bit).
Here are the core topics you need to know from relational algebra:
- Select (choose only certain rows)
- Project (choose only certain columns)
- Join (span relationships across tables)
- ... and a bunch of stuff from set theory (union, cross product, etc)
If you want to learn more about what each of these mean, skip down to the “resources” section at the bottom.
Because each of these operations are so formally specified, databases can look at a given query and come up with the least costly way of performing that operation.
This is called algebraic optimization and is one of the major reasons you should use relational databases if you want to run complex queries.
((2*z) + (3*z) + 0) / 1 can be optimized to simply
z * (2 + 3) – making it two operations instead of five – relational databases look at the query you’re giving it and figure out the most efficient way to run it without having to do things like scan the entire table multiple times.
Other database systems (like map-reduce) leave those optimizations up to the programmer.
Another important concept of relational algebra is closure. Specifically, every operation that applies to a table, also returns a table. This important property allows for operations to be chained together and queries to be nested inside each other.
There’s “closure” in the sense that you can’t escape the system by performing operations. Just like doing mathematical operations on numbers always returns other number, and not, say, letters. Running an operation on a table will always return another table.
Structured Query Language (SQL)
In a more hands-on sense, the language you actually use to talk to a relational database is some form of SQL, which stands for Structured Query Language.
SQL is a very standardized query language which has been around since the early 1970s, making it one of the oldest technologies that modern web developers still use on a daily basis.
The official standard is thousands of pages long, but you can get started pretty simply. There are tons of great tutorials and references around the internet, so I won’t cover the details here. But the most commonly used SQL operations provide exactly what you need to build a CRUD app:
INSERT ... INTO table;
SELECT ... FROM table;
UPDATE table SET ...;
DELETE FROM table WHERE ...;
SQL is a declarative language – as we’ve seen, you leave the how up to the database and simply tell it what you want.
By contrast, you’re probably used to building your applications using an imperative programming language, where you describe a sequence of commands you want the computer to take.
It takes a bit of getting used to at first, but it’s a really nice abstraction. You don’t have to describe how you want the RDBMS to scan the table and perform each operation. You just tell it what you want and let algebraic optimization take care of finding the most efficient way to get it done.
Indexes are an incredibly powerful feature of relational databases. In fact, using indexes properly can often improve the performance of your queries by an order of magnitude.
Indexes allow you to “ask questions” (ie find rows) without having to scan the entire table on disk. They’re really useful for answering “needle in a haystack” types of queries, where you’d otherwise have to look through many rows to pick out only the few that you need.
An index is a data structure that’s stored persistently on disk, and is typically implemented as either:
- "Balanced Tree" (ie B-Tree) -- fast for all sorts of comparisons (ie, >, <, ==)
- "Hash Table" -- only fast for equality comparison (ie, ==)
Without an index, performing a
SELECT ... FROM table WHERE ...; statement requires the RDBMS system to scan through every single row in the table to see if it matches the
WHERE clause. If you have more than a few hundreds rows in a table, this takes a non-negligible amount of time.
But by adding an index to a column that you frequently run lookups on, you can decrease this amount of time substantially. Instead of having to scan
n rows (where
n is the total number of rows in the database), the database can return your query by looking at either log(
n) rows (if indexes are implemented as a b-tree) or simply one row (if indexes are implemented as a hash table).
Once you’ve set an index on a particular column, you don’t need to do anything to make subsequent queries use it. The query planner in the RDBMS will automatically use the index if it needs to.
While these are awesome benefits to using indexes, there are a few downsides to be aware of. Since the index is stored as a separate data structure from the rest of the data in your table, indexes take up additional space on disk.
There’s also the overhead of index creation and maintenance. Whenever a new row is added to the table, the index’s data structure must also be updated so that the index remains accurate.
If you’re willing to sacrifice a bit of performance on writes to get much better performance on reads, then indexes are a great tool to use.
Transactions & ACID Compliance
If you care about not losing your data, transactions are one of the most essential features you’ll find in a relational databases.
Transactions solve two independent problems:
- Multiple users accessing the same row at the same time.
- Resilience to system crashes and failures.
A transaction is simply one or more operations that are treated as a single, atomic unit. Either the entire transaction happens, or none of it does. You are guaranteed to never have a partial completion.
A common example that’s used when talking about transactions is the notion of transferring funds from one bank account to another. You might have multiple operations such as debiting one account and crediting another, and it would be really bad if only some of those operations were successful while others failed. To be safe, all of those operations should be wrapped in a single transaction to ensure things stay consistent.
More formally, transactions are also used to guarantee that relational databases are “ACID compliant”. The four ACID properties are the core tenants of reliability for relational databases:
- Atomicity. If there's a crash or failure during a transaction before it's been fully committed, the database knows how to roll back to the way things were before the transaction. Nothing is left half done.
- Consistency. A query can't violate integrity constraints (ie writing a string to an integer field or leaving a
NOT NULLfield blank) or leave the database in an inconsistent state.
- Isolation. Changes made in one transaction aren't visible to any other transaction until the first one is completely finished. It must appear to each client that their transactions happened in order.
- Durability. Changes that have been committed are stored permanently.
These may seem like obvious features, but there’s a lot that goes on under the hood to ensure these conditions are always met.
For example, the database will lock certain rows or tables when updates are being performed to ensure multiple clients aren’t writing to the same rows at the same time. But those locks are granular enough that other parts of the database can still be updated so that multiple, unrelated queries can be processed at the same time.
To force a series of operations to behave as a single transaction, you wrap everything with
BEGIN...COMMIT. Some databases might have “autocommit” mode on where each operation is automatically wrapped in a transaction.
Downsides of Relational Databases
While relational databases come with a ton of powerful, battle-tested features out of the box, there are definitely some downsides to using them that are important to be aware of.
The most common complaint you hear about today is that relational databases “don’t scale” meaning they’re not really designed to be run across multiple machines in a cluster. Relational databases were developed in an era where data could easily fit on one hard drive and queries rarely took more than a few seconds. They’re not designed to work with huge amounts of data or queries that can run for hours and days.
If you’re really pushing the limits of what your database can store, you can try “sharding” your data – splitting it into different databases on multiple machines based on some shard key. But this is very difficult to implement and change once you get it set up. If your data and/or query load don’t fit on one machine, you might need to use something else besides an RDBMS.
Relational databases also have somewhat poor fault tolerance. While they have guarantees to ensure your data won’t become corrupted even in the event of hardware failure, they also require that you re-run any failed queries and simply try again. But if you’re trying to run long operations or use multiple machines, the likelihood of a failure mid-operation becomes nearly 100% and you won’t get much done if you need to keep restarting things.
Finally, while relational databases are good about enforcing integrity constraints, that doesn’t always make sense for every application. The schemas in a relational database are enforced when the data enters the RDBMS, making it impossible to store “dirty” data that still needs to be cleaned or processed in some way.
If you want to learn more about relational databases, I’d highly recommend checking out the following (free) video resources:
- Introduction to Databases by Jennifer Widom (Stanford)
- Introduction to Data Science by Bill Howe (University of Washington)
- From Relational Databases to MongoDB by Bryan Reinero (10Gen/MongoDB)
Still have questions? Maybe I can help. Shoot me an email here.