Web Scraping with Crawlee
Today, we live in a data-driven world where even the smallest decisions, such as product recommendations, are made based on data. Companies use various methods to collect that data, and web scraping is one of the most popular among them.
So, as developers, it is worth being familiar with the tools and techniques used for web scraping so that you can contribute more directly to business outcomes.
Therefore, let’s discuss how easily you can get started with web scraping using Crawlee and how to improve the scraping process with proxies.
Crawlee is an open-source web scraping and browser automation library. It has more than 14,000 GitHub stars and 22,500+ weekly npm downloads, and it is well suited to scraping modern web applications that rely heavily on JavaScript.
If you’re using Crawlee, there are several key features you should know about: a single interface for both HTTP and headless-browser crawling, automatic scaling and concurrency management, proxy rotation and session management, and persistent storage for requests and results.
Getting started with Crawlee is pretty straightforward. Here is what you need to do:
Before getting started, make sure you have Node.js (version 16 or higher) and npm installed.
You can install Crawlee using Crawlee CLI or manually add it to an existing project.
Using Crawlee CLI
To install Crawlee from the CLI, use the command below. It will install all the dependencies and create a boilerplate project.
npx crawlee create my-crawler
Then, navigate to the newly created folder and start the project.
cd my-crawler
npm start
Note: If starting the project throws an error saying Failed to launch the browser, try installing Playwright manually with the following command:
npx playwright install --with-deps
Manual Installation
Instead of creating a project from scratch, you can install Crawlee on an existing project using the command below. Here, you also need to install Playwright since Crawlee uses PlaywrightCrawler.
npm install crawlee playwright
If you have used Crawlee CLI to create a project, you already have a simple crawler created by default like the one below:
// For more information, see https://crawlee.dev/
import { PlaywrightCrawler } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
    // Uncomment this option to see the browser window.
    // headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);
This is a simple script that visits https://crawlee.dev and extracts the titles of 20 web pages. The results are logged to the console and saved as JSON files under ./storage/datasets/default.
In the above example, scraping happens in headless mode. You can switch to headful mode by uncommenting the headless: false option, which opens a browser window with multiple tabs as the crawler works.
You can find more information on basic Crawlee setup in their official documentation.
Now, let’s modify the above script to scrape data from this website: https://quotes.toscrape.com/
It contains popular quotes from various authors and scientists, and I will guide you through scraping and logging each quote along with its author’s name.
Before you start writing code, you need to understand the HTML elements and classes associated with the quotes and authors. To do so, right-click on a quote and select Inspect. The browser’s developer tools will show you the selectors you need.
As you can see, each quote sits inside a div tag with the class quote. The quote text has the class text, and the author name has the class author.
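If you want to sanity-check these selectors before writing the crawler, you can run a quick query in the browser’s DevTools console on the quotes page. The snippet below is only a rough sketch of that check, using the .quote, .text, and .author selectors mentioned above:

// Run in the DevTools console on https://quotes.toscrape.com/
// to verify that the selectors match the expected elements.
document.querySelectorAll('.quote').forEach((el) => {
    console.log(el.querySelector('.text').textContent, '-', el.querySelector('.author').textContent);
});

If the console prints a list of quotes and authors, the selectors are correct and can be used in the crawler.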
First, we need to wait until all the div tags with the .quote class are loaded.
await page.waitForSelector('.quote');
Then, capture all the elements with the .quote class from the page.
const quotesData = await page.$$eval('.quote', (els) => {…});
Now, you can iterate over the selected div tags and capture the text content of the child elements with the classes .text and .author.
const quotesData = await page.$$eval('.quote', (els) => {
    return els.map((el) => {
        const quoteElement = el.querySelector('.text');
        const authorElement = el.querySelector('.author');
        const quote = quoteElement ? quoteElement.textContent : 'No quote found';
        const author = authorElement ? authorElement.textContent : 'No author found';
        return { quote, author };
    });
});
Finally, you can log the details in the console or write them into JSON files.
quotesData.forEach(({ quote, author }, i) => {
    console.log(`Quote_${i + 1}: ${quote}\nAuthor: ${author}\n`);
    Dataset.pushData({ Quote: quote, Author: author });
});
Here is the complete code example of this scraping task, and you can find the complete Crawlee project in this GitHub repository.
import { Dataset, PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        await page.waitForSelector('.quote');

        const quotesData = await page.$$eval('.quote', (els) => {
            return els.map((el) => {
                const quoteElement = el.querySelector('.text');
                const authorElement = el.querySelector('.author');
                const quote = quoteElement ? quoteElement.textContent : 'No quote found';
                const author = authorElement ? authorElement.textContent : 'No author found';
                return { quote, author };
            });
        });

        quotesData.forEach(({ quote, author }, i) => {
            console.log(`Quote_${i + 1}: ${quote}\nAuthor: ${author}\n`);
            Dataset.pushData({ Quote: quote, Author: author });
        });
    },
});
await crawler.run(['https://quotes.toscrape.com/']);
Since you now have a basic Crawlee web scraping setup, let’s discuss some of its advanced features with examples.
While web scraping provides businesses with valuable data for decision-making, most businesses are not willing to let others scrape their data, so they use various techniques to block web scrapers. IP address blocking, for example, is one of the most common methods.
As a result, most web scraping tools and libraries now come with features to rotate IP addresses using proxies to bypass these restrictions. Here is how Crawlee uses proxies to bypass IP restrictions.
Crawlee provides a class named ProxyConfiguration that allows you to pass a set of proxy addresses as an array.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.com',
        'http://proxy-2.com',
    ],
});
Then, you can use these proxy URLs within the crawler like below:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // ...
});
You can also get more information about the currently used proxy through the proxyInfo object.
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ proxyInfo }) {
        console.log(proxyInfo);
    },
    // ...
});
Crawlee provides a class named SessionPool that combines IP rotation with cookie handling to manage sessions. SessionPool automatically rotates IP addresses and removes any addresses from the pool if they get blocked.
Furthermore, it keeps the information tied to a single IP address, such as cookies, auth tokens, and headers, together. This tight coupling significantly reduces the likelihood of websites blocking the IP addresses. The example below shows how to use SessionPool with PlaywrightCrawler.
const proxyConfiguration = new ProxyConfiguration({…});
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // Activates the Session pool (default is true).
    useSessionPool: true,
    // Overrides default Session pool configuration.
    sessionPoolOptions: { maxPoolSize: 100 },
    // Set to true if you want the crawler to save cookies per session.
    persistCookiesPerSession: true,
    async requestHandler({ page, session }) {
        const title = await page.title();
        if (title === 'Blocked') {
            session.retire();
        } else if (title === 'Not sure if blocked, might also be a connection error') {
            session.markBad();
        } else {
            // session.markGood() - this step is done automatically in PlaywrightCrawler.
        }
    },
});
Crawlee mainly provides two request storage types to manage URLs during the web scraping process: the request queue (RequestQueue), which can grow dynamically while the crawl is running, and the request list (RequestList), a static list of URLs defined up front. The example below shows the basic RequestQueue operations.
import { RequestQueue } from 'crawlee';
// Open the default request queue associated with the crawler run
const requestQueue = await RequestQueue.open();

// Enqueue the initial batch of requests
await requestQueue.addRequests([
    { url: 'https://example.com/1' },
    { url: 'https://example.com/2' },
    { url: 'https://example.com/3' },
]);

// Open a named request queue
const namedRequestQueue = await RequestQueue.open('named-queue');

// Remove the named request queue
await namedRequestQueue.drop();
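For completeness, here is a minimal sketch of handing a request queue to a crawler; the URLs are placeholders, and new requests can still be added to the queue (for example via enqueueLinks) while the crawl runs:

import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Open the default queue and seed it with a starting URL (placeholder).
const requestQueue = await RequestQueue.open();
await requestQueue.addRequests([{ url: 'https://example.com/1' }]);

const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Newly discovered links are added to the same queue while crawling.
        await enqueueLinks();
    },
});

// No start URLs are passed here because the queue has already been seeded.
await crawler.run();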
The RequestList, in contrast, holds a static list of URLs that cannot be extended once the crawl starts:

import { RequestList, PuppeteerCrawler } from 'crawlee';

// Prepare the sources array with URLs to visit
const sources = [
    { url: 'http://www.example.com/page-1' },
    { url: 'http://www.example.com/page-2' },
    { url: 'http://www.example.com/page-3' },
];

// Open the request list.
const requestList = await RequestList.open('my-list', sources);

const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page, request }) {
        // Process the page (extract data, take a page screenshot, etc.).
        // No more requests can be added to the request list here.
    },
});
Usually, these storages are automatically purged before each crawl, but you can also clear them manually using the purgeDefaultStorages() method. You can find more details on Crawlee request storages here.
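As a minimal sketch, a manual purge can look like this (assuming the default local storage directory is used and the function is imported from the main crawlee package):

import { purgeDefaultStorages } from 'crawlee';

// Clears the default request and result storages so the next crawl starts fresh.
await purgeDefaultStorages();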
Similar to request storages, Crawlee provides two result storage types for the scraped data: key-value stores and datasets. The example below shows basic key-value store operations.
import { KeyValueStore } from 'crawlee';
// Read the INPUT record of the default key-value store
const input = await KeyValueStore.getInput();

// Write the OUTPUT record to the default key-value store
await KeyValueStore.setValue('OUTPUT', { myResult: 123 });

// Open a named key-value store
const store = await KeyValueStore.open('some-name');

// Write and then read back a record
await store.setValue('some-key', { foo: 'bar' });
const value = await store.getValue('some-key');

// Delete a record by setting its value to null
await store.setValue('some-key', null);
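Key-value stores are also a good fit for binary records such as screenshots. The following is only a rough sketch; the key name screenshot and the target URL are examples, not part of the original setup:

import { KeyValueStore, PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        // Capture the page as a PNG buffer and store it with an explicit content type.
        const screenshot = await page.screenshot();
        await KeyValueStore.setValue('screenshot', screenshot, { contentType: 'image/png' });
    },
});

await crawler.run(['https://crawlee.dev']);

The second result storage type, the dataset, stores scraped results as a table-like collection of rows: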
import { Dataset } from 'crawlee';

// Write a single row to the default dataset
await Dataset.pushData({ col1: 123, col2: 'val2' });

// Open a named dataset
const dataset = await Dataset.open('some-name');

// Write a single row to the named dataset
await dataset.pushData({ foo: 'bar' });

// Write multiple rows at once
await dataset.pushData([{ foo: 'bar2', col2: 'val2' }, { col3: 123 }]);
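To read the collected rows back after a crawl, the dataset can be opened and queried. This is a minimal sketch, assuming the default dataset was used:

import { Dataset } from 'crawlee';

// Open the default dataset and read back everything pushed so far.
const defaultDataset = await Dataset.open();
const { items } = await defaultDataset.getData();
console.log(`Collected ${items.length} rows`);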
You can find more details on Crawlee result storages here.
Web scraping is an essential part of collecting data for modern decision-making. This article discussed how to use Crawlee, a popular web scraping library, to build a custom web scraper. Beyond basic scraping tasks, Crawlee provides features like session and proxy management to avoid IP blocking, along with many other supporting functions that improve the scraping experience.
I hope this blog helped you to get started with Crawlee for your next web scraping project. Thank you for reading.