Everything I tell my clients about how to purchase, configure and manage a strong proxy infrastructure for web scraping.
A bad client can be an absolute nightmare, robbing you of your time, your sanity, and in some cases your paycheck. Here’s what I’ve learned to look out for over the last 3 years.
Before you launch your new product to the world, make sure you’re following the basic security guidelines in this checklist.
I’ve now been an independent consultant for longer than I was ever a full-time employee. I figured it was a good opportunity to peel back the curtain a bit and offer a ‘behind the scenes’ look at what life as a consultant has been like for me.
The USB Rubber Ducky is an awesome device for infosec testing and general mischief. Here are the missing step-by-step instructions for setting it up.
A few days ago, the Facebook Messenger team launched version 1.4 of their platform. The most prominent changes for most bot developers are the enhancements to the persistent menu options.
A lot of commentators noticed that the updates seem to steer bot developers away from building conversational, chat-based interfaces. Indeed, bot developers can now completely disable the text composer, requiring users to tap or click on buttons for every interaction.
Many people are already heralding this as the beginning of the end for chat bots – if Facebook seems to be discouraging bot developers from using the primary chat interface, maybe chat bots aren’t really living up to the hype!
The truth is… right now, they’ve got a point.
After four years, I figured that my book, The Ultimate Guide to Web Scraping, was due for an upgrade.
The internet has changed, websites have gotten more advanced, and I’ve learned a lot of new skills and tactics over the years that I wanted to incorporate into the book.
I’ve also gotten a lot of feedback from many of the book’s 1,300+ readers, which helped make the second edition even more thorough and complete.
The book has been largely re-written and re-organized to focus on simple concepts, show you them in action, and then build on them for more advanced web scraping use cases.
There are over 40 new pages, with many more Python code samples and better coverage of advanced topics. The book also includes a discount code for my online web scraping video course: Scrape This Site (a $50 value).
Once you’ve put together enough web scrapers, you start to feel like you can do it in your sleep. I’ve probably built hundreds of scrapers over the years for my own projects, as well as for clients and students in my web scraping course.
Occasionally though, I find myself referencing documentation or re-reading old code looking for snippets I can reuse. One of the students in my course suggested I put together a “cheat sheet” of commonly used code snippets and patterns for easy reference.
I decided to publish it publicly as well – as an organized set of easy-to-reference notes – in case they’re helpful to others.
While it’s written primarily for people who are new to programming, I also hope that it’ll be helpful to those who already have a background in software or python, but who are looking to learn some web scraping fundamentals and concepts.
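To give a flavor of the kind of snippet the cheat sheet collects, here’s one pattern that comes up in nearly every scraper: resolving the relative links found in a page against the page’s own URL, then filtering to the same site. (This is an illustrative example using the standard library, not an excerpt from the cheat sheet itself; the URLs are made up.)

```python
from urllib.parse import urljoin, urlparse

# The URL of the page we scraped the links from.
base = "https://example.com/catalog/page-2.html"

# Hrefs as they might appear in the page's HTML: absolute-path,
# relative, and fully-qualified off-site links.
links = ["/about", "item-37.html", "https://other.com/x"]

# Resolve every href against the page's URL.
absolute = [urljoin(base, href) for href in links]

# Keep only links on the same domain as the page we scraped.
same_site = [u for u in absolute if urlparse(u).netloc == urlparse(base).netloc]

print(absolute)
print(same_site)
```

`urljoin` handles all three href styles correctly, which is easy to get wrong with hand-rolled string concatenation.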
This past summer, I launched a web scraping class targeted at total beginners.
After talking to a lot of people about their web scraping challenges over the past few years, I realized that many of them weren’t coders and didn’t have any technical background; many of those looking to learn web scraping had never written code before.
So, when I was creating the material and lessons for the course, I had to keep this front of mind. I couldn’t expect students to know what a for-loop is, or how to spot syntax errors in their text editors.
I really tried to hold students’ hands through all of the class material, writing the simplest possible code, explaining things line by line as I went, and intentionally running into common mistakes to show them how to power through.
But even with all of that assistance, the real world is a lot messier. Students would begin to stumble when they’d go out on their own. Their code “wouldn’t work” and they’d feel hopelessly stuck pretty quickly.
I realized that the process of troubleshooting and fixing bugs in your code isn’t intuitive to anyone who hasn’t already spent a long time learning to code. So I decided to collect some of my most common tips for beginners who feel stuck when their code is throwing errors or isn’t doing what they want it to do.
While these tips are generally geared towards beginners, many of them are actually the tips I still follow day-to-day when debugging issues with my own code. Even as someone who has been coding full time for over 7 years, I use these steps myself constantly (especially #9).
In its simplest form, web scraping is about making requests and extracting data from the response. For a small web scraping project, your code can be simple. You just need to find a few patterns in the URLs and in the HTML response and you’re in business.
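That request-and-extract loop really can be just a few lines. Here’s a minimal sketch of the two “patterns” involved — a URL template for paging through results, and an extraction rule for the HTML. The URL format and markup are hypothetical, and a real scraper would use a proper HTML parser rather than a regex:

```python
import re

# Hypothetical paginated URL pattern, discovered by browsing the site.
BASE_URL = "https://example.com/products?page={page}"

def page_urls(num_pages):
    """Generate the URL for each results page."""
    return [BASE_URL.format(page=n) for n in range(1, num_pages + 1)]

def extract_titles(html):
    """Pull product titles out of a known, repeated HTML pattern."""
    return re.findall(r'<h2 class="title">(.*?)</h2>', html)

# A stand-in for a fetched response body.
sample_html = '<h2 class="title">Widget A</h2><h2 class="title">Widget B</h2>'

print(page_urls(2))
print(extract_titles(sample_html))
```

For a small project, looping over `page_urls(...)`, fetching each one, and feeding the body to `extract_titles(...)` is genuinely the whole program.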
But everything changes when you’re trying to pull over 1,000,000 products from the largest ecommerce website on the planet.
When crawling a sufficiently large website, the actual web scraping (making requests and parsing HTML) becomes a very minor part of your program. Instead, you spend a lot of time figuring out how to keep the entire crawl running smoothly and efficiently.
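One small example of that “keep the crawl running” plumbing is retrying failed requests with exponential backoff. This is not the code from the article — just an illustrative helper, with made-up names, for the kind of resilience a million-page crawl needs:

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on failure.

    `fetch` is any callable that performs the actual request and raises
    an exception when the request fails.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Example: a fake fetch that fails twice before succeeding.
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("connection reset")
    return "<html>ok</html>"

result = fetch_with_retries(flaky_fetch, "https://example.com", base_delay=0.01)
```

On a big crawl, transient errors are a statistical certainty, so logic like this runs constantly even though it has nothing to do with parsing HTML.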
This was my first time doing a scrape of this magnitude. I made some mistakes along the way, and learned a lot in the process. It took several days (and quite a few false starts) to finally crawl the millionth product. If I had to do it again, knowing what I now know, it would take just a few hours.
In this article, I’ll walk you through the high-level challenges of pulling off a crawl like this, and then run through all of the lessons I learned. At the end, I’ll show you the code I used to successfully pull 1MM+ items from amazon.com.
I’ve broken it up as follows: