How to Make Web Scraping Faster
Speed up web scraping with optimization techniques, proxy usage, and API solutions. Learn how to avoid slowdowns and extract data efficiently.
Web scraping is a powerful way to collect data from websites and turn scattered, unstructured information into actionable insights. However, if you don’t know how to scrape efficiently, the process can be slow, frustrating, and prone to challenges like IP bans, rate limits, and server issues.
In this guide, you’ll learn how to identify and overcome common problems that can slow down the web scraping process.
As you start scraping larger data sets, you’ll face frustrating challenges that slow down your process. Common culprits include rate limits, IP bans, CAPTCHAs, slow page loads, and inefficient, sequential scraping code.
To help you learn how to speed up web scraping, this tutorial shows you how to scrape Open Library, a website that offers an extensive catalog of book records.
Note: Before scraping any website, make sure that you review and abide by the website’s privacy policy and terms of use. Many sites have explicit rules against scraping, and adhering to these policies is crucial.
In this tutorial, you’ll build a scraper for Open Library that extracts books about birds.
The first thing you need to do is inspect the elements of the page to identify the HTML structure. You can do this by right-clicking any element of the web page and selecting Inspect. This opens the browser’s Developer Tools, where you can see the HTML tags associated with each part of the page:
To scrape book titles, inspect the page and note that each title is an <a> link nested inside an <h3> tag.
After you’ve examined the HTML of the page, you need to set up a Python environment. Make sure you have Python 3 installed and create a virtual environment in your project directory:
python -m venv venv
To activate your environment, run source venv/bin/activate on macOS and Linux, and venv\Scripts\activate on Windows.
source venv/bin/activate
venv\Scripts\activate
After activating the environment, install the following libraries:
pip install requests beautifulsoup4
Requests is used to send HTTP requests to the target website, and Beautiful Soup parses the HTML content of the pages.
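To confirm that your environment is set up correctly, you can run a quick check against the first results page. This is a minimal sketch that uses the h3 > a structure identified during inspection and simply prints the first few titles:

import requests
from bs4 import BeautifulSoup

# Fetch the first search results page for "birds"
url = "https://openlibrary.org/search?q=birds&mode=everything&page=1"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Print the first five titles matched by the h3 > a selector
for book in soup.select('h3 > a')[:5]:
    print(book.text)

If this prints a handful of book titles, both libraries are installed and the selector works.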
To demonstrate how manual scraping works, let’s take a look at a basic scraping script. The following code scrapes book titles from the first fifty pages returned when you search for “birds” and saves them to a file:
import requests
from bs4 import BeautifulSoup
import time

# Create/open a file to write the titles
with open('book_titles.txt', 'w', encoding='utf-8') as f:
    start_time = time.time()

    # Loop through the first 50 pages
    for page in range(1, 51):
        url = f"https://openlibrary.org/search?q=birds&mode=everything&page={page}"
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract book titles from the current page
        for book in soup.select('h3 > a'):
            title = book.text
            f.write(title + '\n')

print(f"{(time.time() - start_time):.2f} seconds")
This script uses the Requests and Beautiful Soup libraries to scrape book titles related to birds from the Open Library website. It sends an HTTP GET request to each page, extracts the book titles using the select() method to find <a> tags nested within <h3> tags, and saves the titles to a file named book_titles.txt. The time library measures how long the entire scraping process takes.
To run this script, save it to a file named app.py and execute it using python app.py. On a 75 Mbps connection, this script took approximately 164 seconds. Keep in mind this timing may vary based on your network speed and hardware.
While this approach is simple, it can be quite slow as each request is sent sequentially. Additionally, without any retry or optimization, it’s vulnerable to issues like IP blocks or rate limiting.
To speed up the scraping process and make it more reliable, let’s apply a few optimization techniques, starting with concurrency.
Instead of making requests sequentially, you can make them concurrently to reduce scraping time. The Python aiohttp library is an HTTP client similar to Requests that allows you to send asynchronous nonblocking HTTP requests. Combined with asyncio, which manages asynchronous operations, you can efficiently send multiple requests at once.
To use the Python aiohttp and asyncio libraries to send multiple requests, start by installing aiohttp:
pip install aiohttp
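Before diving into the full script, here is a minimal sketch of the asynchronous pattern: a single page is fetched inside an aiohttp session, and asyncio.run() drives the coroutine:

import asyncio
import aiohttp

async def fetch(url):
    # aiohttp requests are made inside async context managers
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

# Fetch one page asynchronously and print the size of the response
html = asyncio.run(fetch("https://openlibrary.org/search?q=birds&mode=everything&page=1"))
print(f"Fetched {len(html)} characters")

The speedup comes from launching many of these coroutines at once, as shown in the optimized version below.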
Here is an optimized version of the scraping code using aiohttp and asyncio:
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import time

async def fetch_and_process_page(session, url):
    async with session.get(url) as response:
        html = await response.text()
        soup = BeautifulSoup(html, 'html.parser')
        return [book.text for book in soup.select('h3 > a')]

async def main():
    start_time = time.time()
    async with aiohttp.ClientSession() as session:
        with open('book_titles.txt', 'w', encoding='utf-8') as f:
            # Create and execute tasks for all pages
            tasks = [
                fetch_and_process_page(session, url=f"https://openlibrary.org/search?q=birds&mode=everything&page={page}")
                for page in range(1, 51)
            ]
            results = await asyncio.gather(*tasks)

            # Write all titles at once
            for titles in results:
                for title in titles:
                    f.write(title + '\n')

    print(f"{(time.time() - start_time):.2f} seconds")

asyncio.run(main())
To run this code, create a new file named async_scraper.py and run it using python async_scraper.py.
You’ll notice that using aiohttp drastically reduces the scraping time compared to the sequential approach. This test took only 14.39 seconds.
Websites often block requests if they see too many from the same IP address. A common way to avoid this is to use a proxy, which acts as an intermediary between your computer and the internet. It masks your IP address, making it appear as though your requests are coming from a different location.
To further enhance this, you can use rotating proxies, which distribute your requests across multiple IPs. In this scenario, each request appears to come from a different IP address, making it much less likely that you’ll be blocked.
You can find free proxy services online or use libraries that return free proxies from a curated list. Keep in mind that free proxies are shared among many users, increasing the chances of getting blocked. For more reliable and efficient scraping, using managed proxy services is recommended. These services offer rotating proxies that automatically distribute your requests across a large pool of IP addresses, ensuring higher success rates and faster performance.
Here’s how you can modify the previous script to use a simple rotating proxy setup:
import requests
import random
from bs4 import BeautifulSoup

# List of proxies
proxies = [
    "http://username:[email protected]:port",
    "http://username:[email protected]:port",
    "http://username:[email protected]:port"
]

# Loop through pages with rotating proxies
for page in range(1, 51):
    url = f"https://openlibrary.org/search?q=birds&mode=everything&page={page}"
    proxy = random.choice(proxies)
    response = requests.get(url, proxies={"http": proxy, "https": proxy})
    soup = BeautifulSoup(response.text, 'html.parser')

    for book in soup.select('h3 > a'):
        print(f"Scraped: {book.text}")
This code defines a list of proxies and randomly chooses one for each request using the random.choice() function. This approach allows you to rotate through different proxies, significantly reducing the risk of getting blocked.
Rotating proxies are especially helpful when scraping large volumes of data, as they make your requests appear to come from different users, which makes it harder for servers to identify you as a bot.
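The same rotation idea carries over to the asynchronous scraper from earlier. The sketch below combines aiohttp with a randomly chosen proxy per request; the proxy URLs are placeholders that you’d replace with your own credentials and endpoints:

import asyncio
import random
import aiohttp

# Placeholder proxy URLs; replace with your own credentials and endpoints
proxies = [
    "http://username:[email protected]:port",
    "http://username:[email protected]:port",
    "http://username:[email protected]:port"
]

async def fetch_page(session, page):
    url = f"https://openlibrary.org/search?q=birds&mode=everything&page={page}"
    # Pick a different proxy for each request
    async with session.get(url, proxy=random.choice(proxies)) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch_page(session, p) for p in range(1, 6)))
        print(f"Fetched {len(pages)} pages through rotating proxies")

asyncio.run(main())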
When scraping, some requests may fail due to network issues or rate limits. Implementing automatic retries can help you make your scraper more robust.
Here’s an example of how to add retry logic using the Requests library:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup

session = requests.Session()
retry = Retry(total=5, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

url = "https://openlibrary.org/search?q=birds&mode=everything&page=1"
response = session.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract book titles
for book in soup.select('h3 > a'):
    print(f"Scraped: {book.text}")
This code snippet uses Retry from urllib3 to retry failed requests up to five times with exponential backoff. This means that if a request fails, it waits a bit before trying again, increasing the wait time with each failure. This approach helps ensure that temporary issues, such as network hiccups or rate limits, don’t stop your scraping process entirely.
While the previous optimizations are useful, managing retries, proxies, and rate limits yourself can be cumbersome and time-consuming. Alternatively, you can use a managed Web Scraper API, which automatically handles proxies, bypasses CAPTCHAs, and manages rate limits, removing the need for manual intervention and reducing complexity. The API is built for scalability, allowing thousands of URLs to be scraped simultaneously without risking server bans. It also provides data in structured formats like JSON or CSV, making it easy to integrate the results directly into your projects.
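The exact integration depends on the provider, but most scraping APIs follow a similar pattern: you send the target URL to an API endpoint and get back parsed data. The endpoint, token, and parameters below are hypothetical placeholders, so check your provider’s documentation for the real values:

import requests

# Hypothetical endpoint and token; consult your provider's documentation
API_URL = "https://api.example-scraper.com/v1/scrape"
API_TOKEN = "YOUR_API_TOKEN"

payload = {
    "url": "https://openlibrary.org/search?q=birds&mode=everything&page=1",
    "format": "json",  # request structured output
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
print(response.json())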
By using such an API, you can focus more on analyzing data and extracting insights rather than maintaining the scraping infrastructure.
In addition to using a managed scraping API, you can further streamline your data collection process with precollected data sets. These data sets are compiled from over a hundred popular websites, such as LinkedIn, Amazon, Twitter, and Airbnb, and cover a wide range of topics, providing you with clean, ready-to-use data. Utilizing these data sets allows you to focus on analyzing data and extracting insights while avoiding the complexities of data collection.
Speeding up web scraping requires a combination of technical optimizations and smart tools. In this guide, you explored asynchronous requests with aiohttp and asyncio, rotating proxies, automatic retries, and managed scraping APIs and precollected data sets as alternatives to maintaining your own infrastructure.
Whether you’re working alone or within a data team, leveraging advanced tools and strategies can enhance the speed and reliability of your web scraping processes. By combining efficient coding practices with managed services, you can focus on what truly matters: extracting actionable insights from your data.
Several factors can slow down web scraping, including high page load times, JavaScript-heavy websites, rate limits, CAPTCHAs, and inefficient scraping code. Optimizing these aspects can improve scraping speed.
You can speed up web scraping by using asynchronous requests, rotating proxies, caching data, minimizing unnecessary requests, and leveraging headless browsers or APIs.
Proxies help distribute requests across multiple IP addresses, reducing the chances of getting blocked. Rotating proxies can improve speed and access to restricted content.
Using a managed web scraping API can save time and resources by handling IP rotation, bypassing CAPTCHAs, and ensuring reliable data extraction without extensive coding.