Web Scraping with Crawlee


Today, we live in a data-driven world where even the smallest decisions, like product recommendations, are backed by data. Companies use various methods to collect that data, and web scraping is one of the most popular among them.

So, as developers, it is worth getting familiar with the different tools and techniques used for web scraping so that you can contribute more to business outcomes.

Therefore, let’s discuss how easily you can get started with web scraping using Crawlee and how to improve the scraping process using proxies.


What is Crawlee?

Crawlee is an open-source web scraping and browser automation library. It has more than 14,000 GitHub stars and 22,500+ weekly npm downloads, and it is well suited to scraping modern web applications that rely heavily on JavaScript.

Features of Crawlee

If you’re using Crawlee, there are several key features you should know about. Some of the most important ones are:

  • HTTP Scraping – Creates HTTP requests that mimic browser headers and TLS fingerprints (see the sketch after this list).
  • Headless Browsers – Supports headless browsers as it is built on top of Puppeteer and Playwright.
  • JavaScript & TypeScript Support – Crawlee runs on Node.js, and it’s built on TypeScript.
  • Automatic Scaling and Proxy Management – Automatically scales crawling concurrency based on available system resources and handles proxy rotation.
  • Queue and Storage – Supports saving files, screenshots and JSON results.
  • Utils and Configurability – Has built-in tools for extracting social handles, phone numbers, infinite scrolling, blocking unwanted assets, and more.
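To make the HTTP scraping and autoscaling points above concrete, here is a minimal sketch of a plain HTTP crawler that never launches a browser. The target URL and the concurrency limits are illustrative assumptions, not values recommended by Crawlee:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Autoscaling keeps the actual concurrency within these bounds,
    // based on available CPU and memory.
    minConcurrency: 2,
    maxConcurrency: 10,
    async requestHandler({ request, $, log }) {
        // $ is a Cheerio handle over the downloaded HTML,
        // so no headless browser is involved.
        log.info(`Title of ${request.url} is '${$('title').text()}'`);
    },
});

await crawler.run(['https://crawlee.dev']);

Because it skips browser rendering, CheerioCrawler is usually much faster and lighter than the Playwright- or Puppeteer-based crawlers, which makes it a good default when the target site does not need JavaScript to render its content.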

Installing and Setting Up Crawlee

Getting started with Crawlee is pretty straightforward. Here is what you need to do:

Prerequisites

Before getting started, it’s important to make sure that you have the following installed:

  1. Node.js 16 or higher.

Installation

You can install Crawlee using Crawlee CLI or manually add it to an existing project.

Using Crawlee CLI

To scaffold a new Crawlee project from the CLI, you can use the command below. It will install all the dependencies and create a boilerplate project.

npx crawlee create my-crawler

Then, navigate to the newly created folder and start the project.

cd my-crawler
npm start

Note: If starting the project throws an error saying Failed to launch the browser, try installing Playwright manually using: npx playwright install --with-deps

Manual Installation

Instead of creating a project from scratch, you can install Crawlee in an existing project using the command below. You also need to install Playwright here, since the examples use PlaywrightCrawler.

npm install crawlee playwright

If you have used the Crawlee CLI to create a project, you already have a simple crawler created by default, like the one below:

// For more information, see https://crawlee.dev/
import { PlaywrightCrawler } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },

    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
    // Uncomment this option to see the browser window.
    // headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

This is a simple script that visits https://crawlee.dev and extracts the titles of 20 web pages. The results are logged to the console and saved to storage as JSON.

In the above example, scraping happens in headless mode. You can enable headful mode by uncommenting the headless: false option; you will then see a browser window with multiple tabs opening as the crawl runs.

You can find more information on basic Crawlee setup in their official documentation.
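One small tweak you may want to make to the default template: calling enqueueLinks() with no arguments follows every link found on the page. If you only want to stay within certain sections of a site, you can pass glob patterns to filter the enqueued URLs. The pattern below is just an illustrative assumption:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Visiting ${request.loadedUrl}`);
        // Only enqueue links that match the given glob patterns.
        await enqueueLinks({ globs: ['https://crawlee.dev/docs/**'] });
    },
    maxRequestsPerCrawl: 20,
});

await crawler.run(['https://crawlee.dev']);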


Basic Web Scraping with Crawlee

Now, let’s modify the above script to scrape data from this website: https://quotes.toscrape.com/

It contains popular quotes from various authors and scientists, and I will guide you through scraping and logging each quote along with the name of its author.

Before you start writing code, you need to understand the HTML elements and classes associated with quotes and authors. To do so, right-click on a quote and select Inspect. This will show you all the details you need about the relevant HTML selectors.

If you inspect the page, you will see that all the quotes sit inside div tags with the class quote. Within each of them, the quote text has the class text and the author name has the class author.

First, we need to wait until all the div tags with the .quote class are loaded.

await page.waitForSelector('.quote');

Then, capture all the elements with the .quote class from the page.

const quotesData = await page.$$eval('.quote', (els) => {

});

Now, you can iterate through the selected div tags and capture the text content of the elements with the .text and .author classes.

const quotesData = await page.$$eval('.quote', (els) => {
    return els.map((el) => {
        const quoteElement = el.querySelector('.text');
        const authorElement = el.querySelector('.author');
        const quote = quoteElement ? quoteElement.textContent : 'No quote found';
        const author = authorElement ? authorElement.textContent : 'No author found';
        return { quote, author };
    });
});

Finally, you can log the details in the console or write them into JSON files.

quotesData.forEach(({ quote, author }, i) => {
    console.log(`Quote_${i + 1}: ${quote}\nAuthor: ${author}\n`);
    Dataset.pushData({ Quote: quote, Author: author });
});

Here is the complete code for this scraping task, and you can find the full Crawlee project in this GitHub repository.

import { Dataset, PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        await page.waitForSelector('.quote');

        const quotesData = await page.$$eval('.quote', (els) => {
            return els.map((el) => {
                const quoteElement = el.querySelector('.text');
                const authorElement = el.querySelector('.author');
                const quote = quoteElement ? quoteElement.textContent : 'No quote found';
                const author = authorElement ? authorElement.textContent : 'No author found';
                return { quote, author };
            });
        });

        quotesData.forEach(({ quote, author }, i) => {
            console.log(`Quote_${i + 1}: ${quote}\nAuthor: ${author}\n`);
            Dataset.pushData({ Quote: quote, Author: author });
        });
    },
});

await crawler.run(['https://quotes.toscrape.com/']);


Advanced Web Scraping Features

Since you now have a basic Crawlee web scraping setup, let’s discuss some of its advanced features with examples.

1. Proxy Management

While web scraping provides businesses with valuable data for decision-making, most businesses are not willing to allow others to scrape their own data. They use various techniques to block web scrapers. For example, IP address blocking is one of the most common methods used by many websites.

As a result, most web scraping tools and libraries now come with features to rotate IP addresses using proxies. Here is how Crawlee uses proxies to bypass IP restrictions.

Crawlee provides a class named ProxyConfiguration that allows you to pass a set of proxy addresses as an array.

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.com',
        'http://proxy-2.com',
    ],
});

Then, you can use these proxy URLs within the crawler like below:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // …
});

You can also get more information on the currently used proxy via the proxyInfo object.

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ proxyInfo }) {
        console.log(proxyInfo);
    },
    // …
});
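If the built-in rotation over proxyUrls is not flexible enough, ProxyConfiguration also lets you supply your own selection logic through a newUrlFunction option. The sketch below picks a random proxy from a hypothetical list; treat it as an assumption and check the ProxyConfiguration documentation for the exact signature:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Hypothetical proxy list - replace with your provider's URLs.
const proxies = ['http://proxy-1.com', 'http://proxy-2.com'];

const proxyConfiguration = new ProxyConfiguration({
    // Called whenever the crawler needs a proxy URL for a request.
    newUrlFunction: () => proxies[Math.floor(Math.random() * proxies.length)],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // …
});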

2. Session Management

Crawlee provides a class named SessionPool that combines IP rotation with cookie handling to manage sessions. The SessionPool class can automatically rotate IP addresses and remove any that get blocked from the pool.

Furthermore, it ties information related to a single IP address, such as cookies, auth tokens, and headers, to that address. This tight coupling significantly reduces the likelihood of websites blocking your requests. The example below shows how to use SessionPool with PlaywrightCrawler.

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // ...
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,

    // Activates the Session pool (default is true).
    useSessionPool: true,
    // Overrides the default Session pool configuration.
    sessionPoolOptions: { maxPoolSize: 100 },
    // Set to true if you want the crawler to save cookies per session.
    persistCookiesPerSession: true,

    async requestHandler({ page, session }) {
        const title = await page.title();

        if (title === 'Blocked') {
            session.retire();
        } else if (title === 'Not sure if blocked, might also be a connection error') {
            session.markBad();
        } else {
            // session.markGood() - this step is done automatically in PlaywrightCrawler.
        }
    },
});

3. Request Storages

Crawlee mainly provides 2 request storage types to manage URLs during the web scraping process.

  • Request Queue – Used for deep crawling tasks, since it allows you to add URLs dynamically during the scraping process.

import { RequestQueue } from 'crawlee';

// Open the default request queue associated with the crawler run
const requestQueue = await RequestQueue.open();

// Enqueue the initial batch of requests
await requestQueue.addRequests([
    { url: 'https://example.com/1' },
    { url: 'https://example.com/2' },
    { url: 'https://example.com/3' },
]);

// Open the named request queue
const namedRequestQueue = await RequestQueue.open('named-queue');

// Remove the named request queue
await namedRequestQueue.drop();

  • Request List – Suitable for static scraping tasks, since it works with a predefined set of URLs that cannot be extended during the crawl.

import { RequestList, PuppeteerCrawler } from 'crawlee';

// Prepare the sources array with URLs to visit
const sources = [
    { url: 'http://www.example.com/page-1' },
    { url: 'http://www.example.com/page-2' },
    { url: 'http://www.example.com/page-3' },
];

// Open the request list.
const requestList = await RequestList.open('my-list', sources);

const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page, request }) {
        // Process the page (extract data, take a page screenshot, etc.).
        // No more requests can be added to the request list here.
    },
});

Usually, these storages are automatically purged before each crawl, but you can also clear them manually using the purgeDefaultStorages() helper. You can find more details on Crawlee request storages here.
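If you ever need to do that yourself, a minimal sketch looks like this (assuming the default local storage configuration):

import { purgeDefaultStorages } from 'crawlee';

// Explicitly clear the default (unnamed) request and result storages
// before starting a fresh crawl.
await purgeDefaultStorages();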


4. Result Storages

Similar to request storages, Crawlee provides two result storage types for the scraped data.

  • Key-Value Store: Stores data records/files with unique keys. Ideal for saving crawler states, screenshots, etc.

import { KeyValueStore } from 'crawlee';

const input = await KeyValueStore.getInput();
await KeyValueStore.setValue('OUTPUT', { myResult: 123 });

const store = await KeyValueStore.open('some-name');

await store.setValue('some-key', { foo: 'bar' });
const value = await store.getValue('some-key');

await store.setValue('some-key', null);
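Since the bullet above mentions screenshots, here is a minimal sketch of saving a Playwright screenshot into the key-value store. The key-sanitizing logic and the target URL are illustrative assumptions:

import { PlaywrightCrawler, KeyValueStore } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Capture the rendered page as a PNG buffer.
        const screenshot = await page.screenshot();
        // Key-value store keys allow only a limited character set,
        // so derive a safe key from the URL.
        const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '_');
        await KeyValueStore.setValue(key, screenshot, { contentType: 'image/png' });
    },
    maxRequestsPerCrawl: 5,
});

await crawler.run(['https://crawlee.dev']);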

  • Dataset: Stores structured data like tables, where each object is a row. Used for storing crawl results.

import { Dataset } from 'crawlee';

await Dataset.pushData({ col1: 123, col2: 'val2' });

const dataset = await Dataset.open('some-name');

await dataset.pushData({ foo: 'bar' });

await dataset.pushData([{ foo: 'bar2', col2: 'val2' }, { col3: 123 }]);
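Reading the stored rows back, for example for post-processing, is also straightforward. A minimal sketch using the same dataset name as above:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open('some-name');

// getData() returns the stored rows together with paging metadata.
const { items } = await dataset.getData();
console.log(`The dataset contains ${items.length} rows`, items);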

You can find more details on Crawlee result storages here.


Conclusion

Web scraping is an essential task in collecting data for modern decision-making processes. This article discussed how we can use Crawlee, a popular web scraping tool, to build a custom web scraper. Apart from simple web scraping tasks, Crawlee provides features like session and proxy management to prevent IP blocking and many other supporting functions to improve the scraping experience.

I hope this blog helped you to get started with Crawlee for your next web scraping project. Thank you for reading.
