Lightning Fast Data Serialization in Python

  • Monday, December 8th, 2014 10:55 pm GMT -5

A few months ago, I got a chance to dive into some Python code that was performing slower than expected.

The code in question was taking tiny bits of data off of a queue, translating some values from strings to primary keys, and then saving the data back to another queue for another worker to process.

The translation step should have been fast. We were loading the data into memory from a MySQL database during initialization, and had organized the data structure so that the string -> id lookups were constant time.
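A lookup like the one described above can be sketched with a plain dict. This is only an illustration, not the original code: the rows, the `name_to_id` table, and the `translate` helper are all made-up names. The point is that the table is built once at startup, so each translation is an average-case constant-time hash lookup.

```python
# Hypothetical rows as they might come back from a MySQL query at startup:
# (primary key, string value). Names and values are illustrative only.
rows = [(1, "red"), (2, "green"), (3, "blue")]

# Build the lookup table once, during initialization.
name_to_id = {name: pk for pk, name in rows}


def translate(value):
    """Return the primary key for a string value, or None if it is unknown."""
    return name_to_id.get(value)


print(translate("green"))
```

Using `dict.get` here means an unknown string comes back as `None` instead of raising a `KeyError`; whether that is the right behavior depends on how the worker should treat bad input.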

Finding the Problem

To figure out where the bottlenecks were, I used Python’s built-in cProfile module and combed through the results using the awesome cprofilev package, written by a former Quora intern.
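For anyone who hasn't used cProfile before, here is a minimal sketch of profiling a function and printing the most expensive calls with the standard library's pstats module. The `process_messages` function is a hypothetical stand-in workload, not the worker from this post.

```python
import cProfile
import io
import json
import pstats


def process_messages(n=1000):
    """Stand-in workload: deserialize and reserialize a small message n times."""
    raw = json.dumps({"id": 1, "color": "red"})
    for _ in range(n):
        json.dumps(json.loads(raw))


profiler = cProfile.Profile()
profiler.enable()
process_messages()
profiler.disable()

# Sort by cumulative time and print the ten most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

To profile a whole script instead, `python -m cProfile -o out.prof script.py` writes the stats to a file that a viewer can load.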

After letting the script run for a while, the bottleneck jumped out right away: the workers were spending about 40% of their time serializing and deserializing data.

In order to keep messages on the queue for other workers to pick up, we were translating the Python dicts into JSON objects using the standard library’s json package.

Our worker was reading the text data from the queue, deserializing it into a Python dict, changing a few values and then serializing it back into text data to save onto a new queue.
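The round trip described above looks roughly like this. It is a sketch under assumptions, not the original worker: the message shape, the `COLOR_IDS` table, and `handle_message` are hypothetical, and the queue reads and writes are left out.

```python
import json

# Hypothetical lookup table; in the post this data came from MySQL.
COLOR_IDS = {"red": 1, "green": 2, "blue": 3}


def handle_message(text):
    """Deserialize one message, swap a string for its key, and reserialize."""
    message = json.loads(text)
    message["color_id"] = COLOR_IDS[message.pop("color")]
    return json.dumps(message)


incoming = '{"user": 42, "color": "green"}'
outgoing = handle_message(incoming)
print(outgoing)
```

Every message pays for one `json.loads` and one `json.dumps`, which is why serialization can dominate when the work in between is just a dict lookup.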

The translation steps were taking up about 40% of the total runtime.

So I set out to see if there was a faster way to serialize a Python dict.


Preventing Web Scraping: Best Practices for Keeping Your Content Safe

  • Monday, August 11th, 2014 08:17 pm GMT -5

Many content producers or site owners get understandably anxious about the thought of a web scraper culling all of their data, and wonder if there’s any technical means for stopping automated harvesting.

Unfortunately, if your website presents information in a way that a browser can access and render for the average visitor, then that same content can be scraped by a script or application.

Any content that can be viewed on a webpage can be scraped. Period.

You can try checking the headers of the requests (like User-Agent or Cookie), but those are so easily spoofed that it’s not even worth doing.

You can see if the client executes Javascript, but bots can run that as well. Any behavior that a browser makes can be copied by a determined and skilled web scraper.

But while it may be impossible to completely prevent your content from being lifted, there are still many things you can do to make the life of a web scraper difficult enough that they’ll give up or not even attempt your site at all.

Having written a book on web scraping and spent a lot of time thinking about these things, here are a few things I’ve found that a site owner can do to throw major obstacles in the way of a scraper.


The Full Time Employee’s Guide to Generating Freelance Clients on the Side

  • Monday, July 7th, 2014 06:26 pm GMT -5

There are many reasons to begin freelancing while still maintaining a full time job. Whether you want to work on new projects outside the scope of your current role or make some extra money each month — or maybe you’re hoping to eventually jump into freelancing full time — getting your first few clients can be really rewarding.

But how should you get started? I’ll cover that in this article.

We’ll go over ways to build your expertise and generate demand for your time. Then we’ll talk about generating potential leads and how to convert them into paying clients. Finally, we’ll talk about some tips for your pricing conversations.

In a future article, I’ll talk about different styles of project management, easy ways to exceed your clients’ expectations, how to handle some of the legal and administrative issues you’ll face, and how to feel comfortable raising your rates. Make sure to subscribe for updates!

Personally, I’ve worked with dozens of clients over the past few years across several different business problem domains. Some of my clients are one-person operations while others are large organizations that have run Super Bowl ads.

I learned a bunch from both my successes and my mistakes along the way as I was getting started, so I figured I’d put together a guide for other people who might want to follow a similar path.


Moving a Static Site to S3 Before My Girlfriend Got Out of the Shower

  • Friday, June 6th, 2014 04:42 pm GMT -5

I’ve got an old Rackspace instance that I’ve been running a bunch of small sites on over the past 4 years. Lately it’s been causing me problems, and sites sporadically go down.

I have been meaning to move several of the static sites onto a more appropriate static-file hosting service like Amazon’s Simple Storage Service, also known as simply “S3”.

I’m on a trip in Denver with my girlfriend right now, so when I woke up to an email that one of my sites was down again, the last thing I wanted to do was waste precious vacation time doing server ops.

Fortunately, moving the static sites to S3 was so easy, I was able to get it done before my girlfriend even got out of the shower. No vacation time wasted!


Becoming a Cold Weather Adventurer: Notes from MIT Outing Club’s Winter School

  • Saturday, February 15th, 2014 03:15 pm GMT -5

Growing up, I was always an outdoorsy person, but cold New England winters kept me cooped up inside for a big chunk of the year. Last winter, I decided to take my first winter mountaineering and ice climbing lessons to start building the skills to become a year-round adventurer.

Preparing for a hike up Mt. Washington

Earlier this winter, a friend told me about MIT Outing Club’s annual Winter School in January. It was 16 hours of lectures, demonstrations and stories from trip leaders and outside speakers. The course was a great introduction for anyone looking to get outside more in the winter.

Guest lecturer at MIT Outing Club's Winter School

I’ve compiled some of my notes from the course here, and added my own anecdotes that I’ve picked up over the past year. While reading about this stuff is a great way to whet your appetite, some of the skills and more technical aspects should really be practiced before you go out and try using them.

I’d highly recommend Northeast Mountaineering’s Introduction to Mountaineering course if you’re in New England.


Peeling Back the ORM: Demystifying Relational Databases For New Web Developers

  • Tuesday, November 19th, 2013 10:48 pm GMT -5

Most web developers building dynamic websites interact with databases every day. Relational databases like MySQL or Postgres are usually the first tool people reach for when their application needs to store data.

But with the recent proliferation of web frameworks like Rails and Django, many web developers rely totally on Object-Relational Mappers (ORMs) for interacting with their database.

In fact, many new web developers see “writing raw SQL” or interacting directly with the database as something scary that should be avoided at all costs.

The reality is that relational databases are actually fairly easy to tame, and are built on top of lots of great ideas. Understanding the relational database that your application runs on will give you a much richer understanding of your web stack and make you a more powerful, proficient developer.
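To show how little is involved in talking to a database directly, here is a minimal sketch using Python’s built-in sqlite3 module. It uses SQLite rather than MySQL or Postgres only because it needs no server, and the `users` table and its columns are invented for the example.

```python
import sqlite3

# An in-memory database, so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Plain SQL: define a table, insert two rows, then query one back.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])
conn.commit()

# The ? placeholder lets the driver handle quoting instead of string formatting.
rows = conn.execute(
    "SELECT id, name FROM users WHERE name = ?", ("grace",)
).fetchall()
print(rows)

conn.close()
```

An ORM ends up generating statements much like these; writing them by hand once makes it easier to reason about what the ORM is doing.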

This article is a version of some notes I wrote for the new web developers who just started at Burstworks. At the end I link to the major resources I used, in case you want to learn more about this stuff.


The “Ultimate Guide to Web Scraping” is Now Available

  • Sunday, August 4th, 2013 09:45 pm GMT -5

I wrote an article on web scraping last winter that has since been viewed almost 100,000 times. Clearly there are people who want to learn about this stuff, so I decided I’d write a book.

A few months later, I’m happy to announce: The Ultimate Guide to Web Scraping.

No prior knowledge of web scraping is necessary to follow along — the book is designed to walk you from beginner to expert, honing your skills and helping you become a master craftsman in the art of web scraping.

The book talks about the reasons why web scraping is a valid way to harvest information — despite common complaints. It also examines various ways that information is sent from a website to your computer, and how you can intercept and parse it. We’ll also look at common traps and anti-scraping tactics and how you might be able to thwart them.

There are code samples in both Ruby and Python — I had to learn Ruby just so I could write the code samples! If anyone’s willing to translate the sample code into PHP or Javascript, I’ll give you a free copy of the book. Get in touch.