Hartley Brody

Many content producers or site owners get understandably anxious about the thought of a web scraper culling all of their data, and wonder if there’s any technical means for stopping automated harvesting.

Unfortunately, if your website presents information in a way that a browser can access and render for the average visitor, then that same content can be scraped by a script or application.

Any content that can be viewed on a webpage can be scraped. Period.

You can try checking the headers of the requests – like User-Agent or Cookie – but those are so easily spoofed that it’s not even worth doing.
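To see how little effort spoofing takes, here is a minimal sketch using Python's standard library. The URL, the browser string and the cookie value are all made up for illustration; no request is actually sent.

```python
from urllib.request import Request

# Hypothetical URL and header values, chosen only for illustration.
req = Request(
    "https://example.com/articles",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Cookie": "session=abc123",
    },
)

# The request now carries whatever headers the script chose to set.
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Cookie"))
```

From the server's side, a request built this way carries the same header values a real browser would send.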

You can see if the client executes JavaScript, but bots can run that as well. Anything a browser does can be copied by a determined and skilled web scraper.

But while it may be impossible to completely prevent your content from being lifted, there are still many things you can do to make the life of a web scraper difficult enough that they’ll give up or not even attempt your site at all.

Having written a book on web scraping and spent a lot of time thinking about these things, I’ve found a few things that a site owner can do to throw major obstacles in the way of a scraper.

There are many reasons to begin freelancing while still maintaining a full-time job. Whether you want to work on new projects outside the scope of your current role or make some extra money each month – or maybe you’re hoping to eventually jump into freelancing full time – getting your first few clients can be really rewarding.

But how should you get started? I’ll cover that in this article.

We’ll go over ways to build your expertise and generate demand for your time. Then we’ll talk about generating potential leads and how to convert them into paying clients. Finally, we’ll talk about some tips for your pricing conversations.

In a future article, I’ll talk about different styles of project management, easy ways to exceed your clients’ expectations, how to handle some of the legal and administrative issues you’ll face, and how to feel comfortable raising your rates. Make sure to subscribe for updates!

Personally, I’ve worked with dozens of clients over the past few years across several different business problem domains. Some of my clients are one-person operations while others are large organizations that have run Super Bowl ads.

I learned a bunch from both my successes and my mistakes along the way as I was getting started, so I figured I’d put together a guide for other people who might want to follow a similar path.

I’ve got an old Rackspace instance that I’ve been running a bunch of small sites on over the past 4 years. Lately it’s been causing me problems, and sites sporadically go down.

I have been meaning to move several of the static sites onto a more appropriate static-file hosting service like Amazon’s Simple Storage Service, also known as simply “S3”.

I’m on a trip in Denver with my girlfriend right now, so when I woke up to an email that one of my sites was down again, the last thing I wanted to do was waste precious vacation time doing server ops.

Fortunately, moving the static sites to S3 was so easy, I was able to get it done before my girlfriend even got out of the shower. No vacation time wasted!

Growing up, I was always an outdoorsy person, but cold New England winters kept me cooped up inside for a big chunk of the year. Last winter, I decided to take my first winter mountaineering and ice climbing lessons to start building the skills to become a year-round adventurer.

Preparing for a hike up Mt. Washington

Earlier this winter, a friend told me about MIT Outing Club’s annual Winter School in January. It was 16 hours of lectures, demonstrations and stories from trip leaders and outside speakers. The course was a great introduction for anyone looking to get outside more in the winter.

Guest lecturer at MIT Outing Club's Winter School

I’ve compiled some of my notes from the course here, and added my own anecdotes that I’ve picked up over the past year. While reading about this stuff is a great way to whet your appetite, some of the skills and more technical aspects should really be practiced before you go out and try using them.

I’d highly recommend Northeast Mountaineering’s Introduction to Mountaineering course if you’re in New England.

Most web developers building dynamic websites interact with databases every day. Relational databases like MySQL or Postgres are usually the first tool people reach for when their application needs to store data.

But with the recent proliferation of web frameworks like Rails and Django, many web developers rely totally on Object-Relational Mappers (ORMs) for interacting with their database.

In fact, many new web developers see “writing raw SQL” or interacting directly with the database as something scary that should be avoided at all costs.

The reality is that relational databases are actually fairly easy to tame, and are built on top of lots of great ideas. Understanding the relational database that your application runs on will give you a much richer understanding of your web stack and make you a more powerful, proficient developer.
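As a small taste of what writing raw SQL looks like, here is a sketch using Python's built-in sqlite3 module as a stand-in for MySQL or Postgres. The `users` table and its columns are hypothetical, invented just for this example.

```python
import sqlite3

# An in-memory SQLite database stands in for MySQL or Postgres here.
conn = sqlite3.connect(":memory:")

# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, signup_year INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO users (name, signup_year) VALUES (?, ?)",
    [("alice", 2012), ("bob", 2013), ("carol", 2013)],
)

# Count signups per year with plain SQL; the ? placeholder keeps the
# parameter out of the query string itself.
rows = conn.execute(
    "SELECT signup_year, COUNT(*) FROM users "
    "WHERE signup_year >= ? GROUP BY signup_year ORDER BY signup_year",
    (2012,),
).fetchall()
print(rows)
```

An ORM would generate a similar statement behind the scenes; writing it by hand just makes the query visible.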

This article is a version of some notes I wrote for the new web developers who just started at Burstworks. At the end I link to the major resources I used, in case you want to learn more about this stuff.

Feb 2017 Edit: The book has been updated.

I wrote an article on web scraping last winter that has since been viewed almost 100,000 times. Clearly there are people who want to learn about this stuff, so I decided I’d write a book.

A few months later, I’m happy to announce: The Ultimate Guide to Web Scraping.

No prior knowledge of web scraping is necessary to follow along – the book is designed to walk you from beginner to expert, honing your skills and helping you become a master craftsman in the art of web scraping.

The book talks about the reasons why web scraping is a valid way to harvest information – despite common complaints. It also examines various ways that information is sent from a website to your computer, and how you can intercept and parse it. We’ll also look at common traps and anti-scraping tactics and how you might be able to thwart them.

There are code samples in both Ruby and Python – I had to learn Ruby just so I could write the code samples! If anyone’s willing to translate the sample code into PHP or JavaScript, I’ll give you a free copy of the book. Get in touch.

How does HTTPS actually work? That was the question I set out to solve a few days ago for a project at work.

As a web developer, I knew that using HTTPS to protect users’ sensitive data was A Very Good Idea, but I didn’t have much understanding about how it actually worked.

How was data protected? How can a client and server create a secure connection if someone was already listening in on the wire? What is a security certificate and why do I need to pay someone to get one?

A Series of Tubes

Before we dive into how it all works, let’s talk briefly about why it’s important to secure connections in the first place, and what sorts of things HTTPS guards against.

When you make a request to visit your favorite website, that request must pass through many different networks – any of which could be used to potentially eavesdrop or tamper with your connection.

From your own computer to other machines on your local network, to the access point itself, through routers and switches all the way to the ISP and through the backbone providers, there are a lot of different organizations that ferry a request along. If a malicious user gets into any one of those systems, they can potentially see what’s traveling over the wire.

Normally, web requests are sent over regular ol’ HTTP, where a client’s request and the server’s response are both sent as plain text. There are lots of good reasons why HTTP doesn’t use secure encryption by default:

  • Security requires more computation power
  • Security requires more bandwidth
  • Security breaks caching

But sometimes, as the developer of a web application, you know that sensitive information like passwords or credit card data will be going over the connection, so it’s necessary to take extra precautions against snooping on those pages.
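To make the "plain text" point concrete, here is a rough sketch of the bytes a login form might put on the wire over regular HTTP. The host, path and credentials are invented for illustration, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode


def build_login_request(host, username, password):
    """Build the raw bytes of a simple HTTP/1.1 form POST (illustrative only)."""
    body = urlencode({"username": username, "password": password})
    lines = [
        "POST /login HTTP/1.1",  # hypothetical login path
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",
        "",  # blank line separates the headers from the body
        body,
    ]
    return "\r\n".join(lines).encode("ascii")


# Made-up host and credentials.
wire_bytes = build_login_request("example.com", "alice", "hunter2")

# Anyone who can read these bytes in transit can read the password too.
print(wire_bytes.decode("ascii"))
```

Without encryption, every system that relays the request sees the password exactly as the user typed it.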

In the software world, the terms “developer” and “engineer” are often used interchangeably to mean “someone who builds things with code.”

Sometimes the word “hacker” gets thrown into the mix if the company is a startup or is trying to make an open job position sound more enticing.

But what does it mean for someone to be a “developer” versus an “engineer”? Does it matter? If you’re trying to “level up” or make a career out of writing code, I’d say it matters a lot.


“Mobile” is easily one of the biggest buzzwords of the decade, and everyone is starting to grok it – from marketers and web developers to your parents and grandparents. But what do we really mean when we talk about “mobile” in a development context?

Usually, “mobile development” refers to developing native applications in Objective-C or Java that take advantage of device-specific APIs. In that sense, it might be more accurate to call it “iOS development” or “Android development,” since it’s really just development for a particular platform.

But sometimes the term “mobile” is used in a web context. What does it mean then? Usually, a few things:

  • Small screen size: no space can be wasted.
  • Constrained bandwidth: don't load anything that you don't need.
  • Slower processor: don't do any heavy rendering in the client.

However, these are design qualities that most networked applications should strive for.

With the rise of high-density displays and powerful mobile processors, there are undoubtedly many “mobile” devices with larger screens, faster networks and beefier processors than many desktops from a decade ago. Today, your cellphone has more computing power than NASA did when it put a man on the moon in 1969.

“Mobile” is relative.

I launched another small app yesterday called Rooster, an SMS service that messages you the forecast every morning.

It only took a few days to build and polish. But while the interface is remarkably simple, there’s a lot going on behind the scenes:

  • User-provided location strings have to be converted to latitude & longitude pairs.
  • Weather reports from dozens of agencies have to be aggregated and their raw data has to be converted into human-readable forecasts.
  • Text messages have to be delivered reliably through various carriers to phone numbers all over the world (so far including China, Australia, Germany, Canada and Algeria).
  • Timing information has to be reliably stored across various timezones around the world, so that messages are sent at the correct time, in the local user's timezone.
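The timezone point is the easiest of these to get subtly wrong, so here is one way it could be sketched in Python. This is not Rooster's actual code: the function name, the 7:00 send time and the choice to store times in UTC are all assumptions made for the example. It relies on the standard `zoneinfo` module (Python 3.9+) and the system's timezone database.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def next_send_time_utc(now_utc, tz_name, local_send_time):
    """Return the next moment, in UTC, when it is local_send_time in tz_name."""
    tz = ZoneInfo(tz_name)
    now_local = now_utc.astimezone(tz)

    # Try today's send time in the user's own timezone first.
    candidate = datetime.combine(now_local.date(), local_send_time, tzinfo=tz)
    if candidate <= now_local:
        # Already passed locally, so schedule for tomorrow instead.
        tomorrow = now_local.date() + timedelta(days=1)
        candidate = datetime.combine(tomorrow, local_send_time, tzinfo=tz)

    # Store and compare everything in UTC on the server side.
    return candidate.astimezone(timezone.utc)


# Hypothetical users who both want their forecast at 7:00 local time.
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
sydney = next_send_time_utc(now, "Australia/Sydney", time(7, 0))
berlin = next_send_time_utc(now, "Europe/Berlin", time(7, 0))
print(sydney.isoformat())
print(berlin.isoformat())
```

Building the local time first and converting to UTC afterwards means the timezone database handles daylight saving shifts, rather than a hard-coded offset.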

And those are just the specifics for this particular app. As with any application that’s available over a network, I also had to ensure that the following conditions were met.