At Continua AI, we developed an AI podcast app that lets you quickly create a customized podcast about anything you can imagine. Although our application is, at its core, powered by large language models, we wanted to create something more compelling than what you get when you ask ChatGPT et al. to “write me a podcast dialogue about topic X”. We imagined a product that could synthesize and analyze the torrential overload of information available on the internet and turn it into a customized podcast that you’d actually love to listen to, that teaches you new things, and that keeps you informed about the topics you care about. In order to have the freshest, most relevant, and most diverse information, we need to download, index, score, and retrieve information from the web.
Scraping the Web
You can’t really hope to scrape and index the entire internet on the cheap. Sure, other companies do that at scale, but the internet is a vast place with a lot of documents, and we’re a lean startup, so we have to be smart and judicious about what, how, when, and why we scrape in order to get the biggest bang for the buck. Here are a few tips we learned along the way:
Scraping Tips
Don’t
Before you jump straight in, do you actually need to scrape (as in, send actual HTTP requests to actual servers in real time)? Can you use Common Crawl, which provides a regularly updated archive of web content that has already been scraped? If you can get away with using that source or something similar, your crawling costs are zero. Also, remember that it isn’t only your own costs you should consider, but also the cost to the site operator. The more of us who skip this check and scrape anyway (and yes, we decided to scrape directly - because we wanted fresh and timely updates so that the podcasts would have the latest information), the more site operators will invest in protecting their sites from scrapers, which will also disincentivize smaller site operators and publishers from distributing their content on the web in the first place. Everyone loses. So, if you don’t need to scrape, just don’t do it. And if you do need to do it, be sure to follow the other tips in this article.
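For instance, here is a rough sketch of checking the Common Crawl index before scraping a page yourself. It assumes the third-party requests library, and the collection name (CC-MAIN-2024-33) is only an example - each crawl gets a new one, so check index.commoncrawl.org for the current list.

```python
import gzip
import json
import requests

# Each crawl has its own collection name; "CC-MAIN-2024-33" is just an example.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"

def lookup(url_pattern):
    """Yield Common Crawl index records matching a URL pattern like 'example.com/*'."""
    resp = requests.get(INDEX, params={"url": url_pattern, "output": "json"}, timeout=30)
    resp.raise_for_status()
    for line in resp.text.splitlines():
        yield json.loads(line)

def fetch_capture(record):
    """Pull one archived page out of its WARC file with an HTTP range request."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    warc = requests.get("https://data.commoncrawl.org/" + record["filename"],
                        headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    warc.raise_for_status()
    # The range covers a single gzipped WARC record: WARC headers, HTTP headers, then the body.
    return gzip.decompress(warc.content)

for rec in lookup("example.com/*"):
    if rec.get("status") == "200":
        print(rec["timestamp"], rec["url"], len(fetch_capture(rec)), "bytes")
        break
```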
Have I seen this URL before?
In your scraper, implement an external store of all the URLs you have scraped already so that you can skip them the second, third, or forty-fifth time you see them. The most efficient way to send an HTTP request is to not send it, because you already did. Even better, load the Common Crawl index into your external store so that you don’t bother scraping pages that are already in the Common Crawl. Of course, this approach doesn’t work well for dynamic pages that change over time, but many pages on the internet rarely change. For the dynamic pages, ask how often you really need to re-scrape them, and try to find the absolute maximum interval that your product can stomach.
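As a sketch of what that external store can look like when you are starting out, here is a SQLite-backed version (the file and table names are illustrative); at scale you would swap in Redis, a database, or whatever your pipeline already uses.

```python
import sqlite3

class SeenUrls:
    """Tiny persistent set of already-scraped URLs."""

    def __init__(self, path="seen_urls.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def mark_if_new(self, url):
        """Record the URL and return True only if we had never seen it before."""
        cur = self.conn.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
        self.conn.commit()
        return cur.rowcount == 1

store = SeenUrls()
if store.mark_if_new("https://example.com/some-article"):
    pass  # fetch the page; otherwise skip it - we already have it
```

Combining the check and the insert into a single statement avoids a check-then-write race when multiple workers share the store.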
Note that URLs will often appear with a bunch of unnecessary query arguments that exist only for tracking. Be sure to strip these out (you can find a list online to start with, and periodically add to it by monitoring the URLs you are scraping for patterns) before checking whether you have seen the URL before; otherwise, these links will constantly look unique even though they are not.
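A sketch of that normalization step is below; the list of tracking parameters is a deliberately small starter set (grow it from your own logs), and the cleaned URL is what you would feed into the seen-URL store above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Starter set of tracking-only query arguments; extend it as you spot new ones.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "gclid", "fbclid", "mc_eid", "igshid"}

def canonicalize(url):
    """Drop tracking-only query arguments (and the fragment) before the seen-URL check."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonicalize("https://example.com/story?id=42&utm_source=newsletter&fbclid=abc"))
# -> https://example.com/story?id=42
```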
IP proxies are not really “it” anymore
One of the common things you’ll read online is that for scraping, you want to use a proxy to avoid getting blocked. The reality is that bot/scrape detection is much more advanced today than it was when a lot of this advice was first circulated: it can spot a likely bot/scraper using more sophisticated methods, like “fingerprinting” the browser with either simple signals (user agent, cookies, window size) or nuanced ones (render timing, feature-flag detection, fonts, anti-aliasing). So to make a long story short, yes, an IP proxy may help you with a few sites that detect and block you based on your IP address, but that is just one tool in an ever-expanding toolbox. If you operate a low-touch, well-behaved bot, you may be surprised to find that many sites will let you pass despite coming from a cloud provider IP address. However, the landscape is constantly evolving, and today many large and small publishers use services from Cloudflare, Akamai, and others to detect and block scrapers writ large.
Ultimately, the tack you have to take on anti-bot / anti-scrape avoidance depends on the sites you are aiming to scrape. Some sites are much harder to scrape than others. Generally speaking, the more valuable the data, the harder it is to scrape: social media, arbitrage opportunities (event tickets, airline/hotel/car rental pricing, cryptocurrency exchanges, etc.), and premium news sites will typically work the hardest to protect their data from being scraped. You may want to think deeply about why that is before setting out on a quest to crack open their defenses.
Save your work
It can be tempting to download a page over HTTP, parse the HTML, extract what you need, and save only the extracted content. However, it is very common to realize only later that you didn’t extract the thing you really needed, or perhaps it moved, or perhaps you just wish you had extracted something else, too. If this happens, you will have to re-scrape every single document to extract the thing you missed. Instead, save the response from the original request in its entirety, so that you can go back and run extraction over it again later. We use cloud storage since our scrapers are distributed, but if you are just getting started, you can fit a lot of webpages on your hard disk! Make sure not to request images, PDFs, fonts, and any other large files that you don’t need (remember - don’t just skip saving them, make sure you don’t request them in the first place).
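Here is a rough sketch of that pattern with requests and a local directory; the directory name, extension list, and key scheme are all illustrative, and in production we write to cloud storage instead.

```python
import hashlib
import pathlib
import requests

RAW_DIR = pathlib.Path("raw_pages")
RAW_DIR.mkdir(exist_ok=True)

# Don't even send requests for large binary assets we will never extract from.
SKIP_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".pdf",
                   ".woff", ".woff2", ".ttf", ".mp4", ".zip")

def fetch_and_archive(url, session):
    """Download a page and archive the untouched body so extraction can be re-run later."""
    if url.lower().split("?", 1)[0].endswith(SKIP_EXTENSIONS):
        return None
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    key = hashlib.sha256(url.encode()).hexdigest()
    (RAW_DIR / f"{key}.html").write_bytes(resp.content)  # raw bytes, not the parsed output
    return resp.content

html = fetch_and_archive("https://example.com/story", requests.Session())
```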
Wait on line
As consumers, we expect websites and apps to load at blazing speeds, and site operators spend a lot of time and engineering energy to make that so. As scrapers proliferate, they can harm a site’s ability to serve its human users, because more and more of the server capacity gets used up serving clients which, unlike humans, are extremely latency-insensitive. A human will quickly get bored and wander away, perhaps leaving that impulse buy lingering in their cart un-purchased, that article unread - a site operator’s worst nightmare. But a scraper has all the time in the world to wait! So be sure to rate-limit your scraper so that it is not consuming unnecessary resources from the server - for example, limit HTTP requests to the same domain to one every three seconds for a very conservative scrape that is unlikely to disrupt human traffic. Check robots.txt for the right amount of time to wait (the “Crawl-delay” directive) if the operator has configured it. Otherwise, there may not be a single “right” rate limit to choose, since it depends on the site’s other clients and the server resources available. One trick I recommend: use the response latency as a signal and multiply it by three. So if an HTTP response takes 5 seconds, wait 15 seconds before the next request, giving other clients a chance. Latency is a signal any server sends about how loaded it is, regardless of whether it is a Microsoft IIS 2.0 server running on a forgotten VM or the latest Nginx build running on a horizontally auto-scaling cloud deployment.
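A minimal sketch of that back-off rule, using requests and the standard library’s robots.txt parser, might look like the following; the user-agent string is a placeholder, and in a real crawler you would cache the robots.txt lookup per domain rather than re-reading it for every URL.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit
import requests

USER_AGENT = "ExampleBot/0.1 (+https://example.com/bot)"  # placeholder identity

def crawl_delay_for(url):
    """Return the site's Crawl-delay from robots.txt, or None if it isn't set."""
    parts = urlsplit(url)
    robots = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        robots.read()
    except OSError:
        return None
    return robots.crawl_delay(USER_AGENT)

def polite_fetch(urls):
    """Fetch URLs one at a time, waiting 3x the observed latency (or Crawl-delay) between them."""
    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT
    for url in urls:
        start = time.monotonic()
        resp = session.get(url, timeout=30)
        latency = time.monotonic() - start
        yield resp
        # Honor an explicit Crawl-delay if the operator set one; otherwise back off
        # proportionally to how long the server took to answer.
        time.sleep(crawl_delay_for(url) or 3 * latency)
```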
Use open source
If you are just getting started, pick an existing scraping framework in your favorite language/ecosystem. Many of the top frameworks/libraries implement some of these “best practices” by default, or at least make them easier to achieve. It is tempting to whip out your ecosystem’s HTTP library and start blasting requests - try to resist that urge and learn from the prior art. Using a framework makes it easier to run low-touch, well-behaved scrapers that minimally escalate the anti-bot, anti-scrape arms race. I personally recommend Scrapy. It is likely that you will want a framework that integrates with Playwright and/or Puppeteer in order to control a real browser (Scrapy supports this, as does Crawlee, as do other libraries). If you choose a library that mimics a browser but does not actually control one, you may eventually run into a site or anti-scrape mechanism that you simply cannot get around (although, even with a real browser, you can still be detected and blocked). For lightweight scraping, however, you may not need a browser at all.
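For example, a minimal Scrapy spider with conservative, good-citizen settings might look like this (the spider name, domain, start URL, and user-agent string are placeholders):

```python
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "article_sketch"                      # placeholder name
    allowed_domains = ["example.com"]            # placeholder domain
    start_urls = ["https://example.com/news"]    # placeholder start URL

    custom_settings = {
        "ROBOTSTXT_OBEY": True,                  # respect robots.txt out of the box
        "USER_AGENT": "ExampleBot/0.1 (+https://example.com/bot)",
        "DOWNLOAD_DELAY": 3,                     # conservative floor between requests
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,     # one request at a time per site
        "AUTOTHROTTLE_ENABLED": True,            # adapt the delay to observed latency
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    }

    def parse(self, response):
        # Keep the raw body for later re-extraction, then follow in-domain links.
        yield {"url": response.url, "html": response.text}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

You can run a standalone file like this with `scrapy runspider spider.py -o pages.jl`, then layer in the URL canonicalization, seen-URL store, and raw-response archiving described above (Scrapy’s built-in request deduplication and HTTP cache cover some of the same ground).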
Conclusion
Data from the web is important for building new kinds of products and experiences. At Continua, data from the web feeds our AI podcasts product and our AI Agent Platform in general. However, in building a web data pipeline, developers should also consider the impact that scraping can have on site operators and use best practices that minimize the burden our client activities place on their servers. The internet is a remarkable confluence of commercial and non-commercial interests cohabiting according to a certain set of norms and standards, and scrapers should carefully consider their role in this delicate ecosystem.