A few months ago, I got a chance to dive into some Python code that was performing slower than expected.
The code in question was taking tiny bits of data off of a queue, translating some values from strings to primary keys, and then saving the data back to another queue for another worker to process.
The translation step should have been fast. We were loading the data into memory from a MySQL database during initialization, and had organized the data structure so that the string -> primary key lookups were constant time.
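For illustration, here is a minimal sketch of that kind of in-memory lookup table. The table name, columns, and example rows are made up, not our actual schema:

```python
# A minimal sketch of the in-memory lookup table, assuming a query like
# "SELECT id, name FROM widgets" run once at startup; the example rows
# below are hypothetical.

def build_lookup(rows):
    """Map each string value to its primary key for constant-time lookups."""
    return {name: pk for pk, name in rows}

# In the real worker these rows came from MySQL during initialization.
rows = [(1, "apple"), (2, "banana"), (3, "cherry")]
NAME_TO_ID = build_lookup(rows)

assert NAME_TO_ID["banana"] == 2  # O(1) dict lookup
```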
Finding the Problem
After letting the script run for a while, the bottleneck jumped out: the workers were spending about 40% of their time serializing and deserializing data.
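For reference (not necessarily how we measured it), the standard library's cProfile module is one way to surface a hotspot like this; worker_loop below is a stand-in for the real worker:

```python
import cProfile
import pstats

def worker_loop():
    # Stand-in for the real worker: pull a message, translate it, re-queue it.
    pass

cProfile.run("worker_loop()", "worker.prof")
stats = pstats.Stats("worker.prof")
stats.sort_stats("cumulative").print_stats(10)  # top 10 calls by cumulative time
```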
To keep messages on the queue for other workers to pick up, we were translating the Python dicts into JSON using the standard library's json module.
Our worker was reading the text data from the queue, deserializing it into a Python dict, changing a few values, and then serializing it back into text to save onto a new queue. Those serialization and deserialization steps alone accounted for about 40% of the total runtime.
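Here is a rough sketch of that round trip with the standard library's json module; the message shape and the lookup table are illustrative, not our real schema:

```python
import json

NAME_TO_ID = {"apple": 1, "banana": 2}  # illustrative lookup table

def process(message_text):
    # Deserialize: text off the queue -> Python dict.
    record = json.loads(message_text)

    # Translate: swap the string value for its primary key.
    record["fruit_id"] = NAME_TO_ID[record.pop("fruit")]

    # Serialize: Python dict -> text for the next queue.
    return json.dumps(record)

print(process('{"fruit": "banana", "quantity": 3}'))
# prints: {"quantity": 3, "fruit_id": 2}
```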
So I set out to see if there was a faster way to serialize a Python dict.
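One way to start answering that question (a sketch, not the methodology used here) is a quick timeit micro-benchmark of json.dumps on a made-up payload; an alternative serializer can then be compared by timing its dumps() call the same way:

```python
import json
import timeit

# A made-up payload, roughly the shape of one queue message.
payload = {"fruit_id": 2, "quantity": 3, "warehouse": "east", "tags": ["fresh", "organic"]}

elapsed = timeit.timeit(lambda: json.dumps(payload), number=100_000)
print(f"json.dumps: {elapsed:.3f}s for 100k calls")

# An alternative library (e.g. ujson or orjson, if installed) can be
# benchmarked the same way by swapping in its dumps() function.
```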