Lightning Fast Data Serialization in Python
A few months ago, I got a chance to dive into some Python code that was performing slower than expected.
The code in question was taking tiny bits of data off of a queue, translating some values from strings to primary keys, and then saving the data back to another queue for another worker to process.
The translation step should have been fast. We were loading the data into memory from a MySQL database during initialization, and had organized the data structure so that the id -> string lookups were constant time.
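As a rough sketch (the table and field names here are invented for illustration), the lookup structure was just a plain dict built once at startup, so each translation is an O(1) lookup:

```python
# Hypothetical in-memory lookup table, loaded once from MySQL at startup.
# The keys and values are stand-ins for the real data.
category_ids = {
    "sports": 1,
    "news": 2,
    "weather": 3,
}

def translate(record):
    """Replace a human-readable string with its primary key in place."""
    record["category_id"] = category_ids[record.pop("category")]
    return record
```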
Finding the Problem
After letting the script run for a while, the bottleneck jumped out right away – the workers were spending about 40% of their time serializing and deserializing data.
In order to keep messages on the queue for other workers to pick up, we were translating the Python dicts into JSON objects using the standard library's json module.
Our worker was reading the text data from the queue, deserializing it into a Python dict, changing a few values, and then serializing it back into text data to save onto a new queue. Those translation steps were taking up about 40% of the total runtime.
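The whole worker pass can be sketched in a few lines (field names and the lookup table are hypothetical; only the loads → modify → dumps shape matches what we were doing):

```python
import json

# Hypothetical string -> primary key table, held in memory.
USER_IDS = {"alice": 1, "bob": 2}

def process(raw_message):
    record = json.loads(raw_message)                   # text -> dict
    record["user_id"] = USER_IDS[record.pop("user")]   # translate one value
    return json.dumps(record)                          # dict -> text for the next queue
```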
So I set out to see if there was a faster way to serialize a Python dict.
When you’re optimizing code, it’s helpful to think about what sort of gains you’re looking for.
- Gaining a few percentage points usually isn't too challenging
- Gaining several times faster (i.e. 200-800%) requires more strategic thinking
- Gaining orders of magnitude in speed often requires rearchitecting or starting over
Since we were already using Python's builtin json module, we knew it'd be hard to eke out an order of magnitude improvement. But a few percentage points wasn't going to cut it. It had to be a meaningful speedup in order to take a big chunk out of that 40% of time spent on serialization and deserialization.
Each message of data was small – 5 keys with small values tipped the scales at a few dozen bytes each – so we weren’t worried about saturating the network card. Bandwidth and latency also weren’t a huge factor since the queue and all the workers were in the same availability zone on EC2.
I should note that all of the workers that’d be touching this data were in-house, so interoperability with common data serialization standards wasn’t a huge concern.
If the fastest way to encode data was to string it together with pipes (|) and backslashes (\), that was fine. We could update all of the workers to accommodate it.
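To make that concrete, a minimal pipe-delimited scheme (my invention for illustration here, not necessarily the format we shipped) might escape the delimiter and the escape character itself:

```python
def encode(values):
    """Join string values with '|', escaping backslashes and pipes."""
    return "|".join(
        v.replace("\\", "\\\\").replace("|", "\\|") for v in values
    )

def decode(text):
    """Split on unescaped '|' and undo the escaping."""
    fields, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i + 1])  # take the escaped char literally
            i += 2
        elif ch == "|":
            fields.append("".join(current))  # field boundary
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields
```

Schemes like this skip all the quoting and type-tagging JSON does, which is where the potential speed comes from – at the cost of only handling flat lists of strings.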
I searched for as many Python data serialization libraries as I could find – as well as coming up with my own serialization schemes.
I learned a bunch about the performance of different string-building functions while building my own home_brew'd serialization process. If you have any more ideas, let me know and I'll be sure to add them!
Note that some packages require a file handle in order to write the serialized data, while others just dump it to a string in memory.
The overhead of opening and closing the file was undetectable at the timescales I was examining, but I commented the file-handling code out for the packages that didn't need it, to reflect the actual cost of using each package in production.
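The benchmark script itself isn't reproduced here, but its shape was roughly this (a sketch using timeit; the message contents and iteration count are stand-ins):

```python
import json
import timeit

# A small message like ours: ~5 keys with small values.
message = {"user": "alice", "action": "click", "page": "/home",
           "ts": 1700000000, "ok": True}

def round_trip():
    """Serialize and deserialize once, like one worker pass."""
    json.loads(json.dumps(message))

# Time many round trips; swap in another package's dumps/loads to compare.
elapsed = timeit.timeit(round_trip, number=100_000)
print(f"json round trip x100k: {elapsed:.3f}s")
```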
Results & Observations
I ran the script 10 times for each package and took a rough average. That's the number you see listed in the comments next to each function.
We switched to using ujson and saw roughly a one-third increase in overall pipeline processing speed, which was in line with our expectations from the test results.
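Because ujson mirrors the stdlib json module's dumps/loads names for the common cases, the switch was close to a one-line change (the fallback import here is my own defensive addition, not necessarily what we deployed):

```python
# ujson (pip install ujson) is a fast C implementation with a
# json-compatible API for the common calls.
try:
    import ujson as json
except ImportError:
    import json  # fall back to the stdlib if ujson isn't installed

payload = json.dumps({"user_id": 2, "score": 7})
record = json.loads(payload)
```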
Any packages I missed? Different ideas for home brewed serialization? Shoot me a note on Twitter.