Today, we live in a data-driven world where even the smallest decisions, such as product recommendations, are made based on data-driven insights. Companies use various methods to collect data, and web scraping is one of the most popular among them.
So, as developers, it is essential to get familiar with the different tools and techniques used for web scraping, so you can contribute more towards the business output.
Therefore, let’s discuss how easily you can get started with web scraping using Crawlee and how to improve the scraping process using proxies.
What is Crawlee?

Crawlee is an open-source web scraping and browser automation library. It has more than 14,000 GitHub stars and 22,500+ weekly NPM downloads, and it is well suited to scraping modern web applications that heavily rely on JavaScript.
Features of Crawlee
If you’re using Crawlee, there are several key features that you should know of. Some of the most important features are:
- HTTP Scraping – Creates HTTP requests that mimic browser headers and TLS fingerprints.
- Headless Browsers – Supports headless browsers as it is built on top of Puppeteer and Playwright.
- JavaScript & TypeScript Support – Crawlee runs on Node.js, and it’s built on TypeScript.
- Automatic Scaling and Proxy Management – Capable of automatically scaling based on available resources and rotating proxies.
- Queue and Storage – Supports saving files, screenshots and JSON results.
- Utils and Configurability – Has built-in tools for extracting social handles, phone numbers, infinite scrolling, blocking unwanted assets, and more.
Installing and Setting Up Crawlee
Getting started with Crawlee is pretty straightforward. Here is what you need to do:
Prerequisites
Before getting started, it’s important to make sure that you have the following installed:
- Node.js 16 or higher.
Installation
You can install Crawlee using Crawlee CLI or manually add it to an existing project.
Using Crawlee CLI
To install Crawlee from the CLI, you can use the below command. It will install all the dependencies and create a boilerplate project.
npx crawlee create my-crawler
Then, navigate to the newly created folder and start the project.
cd my-crawler
npm start
Note: If starting the project throws an error saying Failed to launch the browser, try installing Playwright manually using: npx playwright install --with-deps
Manual Installation
Instead of creating a project from scratch, you can install Crawlee in an existing project using the command below. Here, you also need to install Playwright, since the examples in this article use PlaywrightCrawler.
npm install crawlee playwright
If you have used Crawlee CLI to create a project, you already have a simple crawler created by default like the one below:
// For more information, see https://crawlee.dev/
import { PlaywrightCrawler } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
    // Uncomment this option to see the browser window.
    // headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);
This is a simple script that visits https://crawlee.dev and extracts titles from 20 web pages. The result is logged in the console and saved in storage as a JSON file.

In the above example, the scraping happens in headless mode. You can enable headful mode by uncommenting the headless: false option. Then, you will see a browser window with multiple tabs open, like below:

You can find more information on basic Crawlee setup in their official documentation.
Basic Web Scraping with Crawlee
Now, let’s modify the above script to scrape data from this website: https://quotes.toscrape.com/
It contains popular quotes from various authors and scientists, and I will guide you through how to scrape and log quotes and the name of the author.

Before you start writing code, you need to understand the HTML elements and classes associated with the quotes and authors. To do so, right-click on a quote and select Inspect. This will show you all the details you need about the HTML selectors.

As you can see, all the quotes are within div tags with a class named quote. Quotes have the class named text, and authors have the class named author.
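For reference, each quote on the page is marked up roughly like this (trimmed for brevity; the exact markup on the live page may differ slightly):

```html
<div class="quote">
    <span class="text">"The world as we have created it is a process of our thinking."</span>
    <span>by <small class="author">Albert Einstein</small></span>
</div>
```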
First, we need to wait until all the div tags with the .quote class are loaded.
await page.waitForSelector('.quote');
Then, capture all the elements with the .quote class from the page.
const quotesData = await page.$$eval('.quote', (els) => {
    …
});
Now, you can iterate through the selected div tags and capture the text content from spans with class names .text and .author.
const quotesData = await page.$$eval('.quote', (els) => {
    return els.map((el) => {
        const quoteElement = el.querySelector('.text');
        const authorElement = el.querySelector('.author');
        const quote = quoteElement ? quoteElement.textContent : 'No quote found';
        const author = authorElement ? authorElement.textContent : 'No author found';
        return { quote, author };
    });
});
Finally, you can log the details in the console or write them into JSON files.
quotesData.forEach(({ quote, author }, i) => {
    console.log(`Quote_${i + 1}: ${quote}\nAuthor: ${author}\n`);
    Dataset.pushData({ Quote: quote, Author: author });
});
Here is the complete code example of this scraping task, and you can find the complete Crawlee project in this GitHub repository.
import { Dataset, PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        await page.waitForSelector('.quote');

        const quotesData = await page.$$eval('.quote', (els) => {
            return els.map((el) => {
                const quoteElement = el.querySelector('.text');
                const authorElement = el.querySelector('.author');
                const quote = quoteElement ? quoteElement.textContent : 'No quote found';
                const author = authorElement ? authorElement.textContent : 'No author found';
                return { quote, author };
            });
        });

        quotesData.forEach(({ quote, author }, i) => {
            console.log(`Quote_${i + 1}: ${quote}\nAuthor: ${author}\n`);
            Dataset.pushData({ Quote: quote, Author: author });
        });
    },
});

await crawler.run(['https://quotes.toscrape.com/']);

Advanced Web Scraping Features
Since you now have a basic Crawlee web scraping setup, let’s discuss some of its advanced features with examples.
1. Proxy Management
While web scraping provides businesses with valuable data for decision-making, most businesses are not willing to allow others to scrape their own data. They use various techniques to block web scrapers. For example, IP address blocking is one of the most common methods used by many websites.
As a result, most web scraping tools and libraries now come with features to rotate IP addresses using proxies to bypass these restrictions. Here is how Crawlee uses proxies to bypass IP restrictions.
Crawlee provides a class named ProxyConfiguration that allows you to pass a set of proxy addresses as an array.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.com',
        'http://proxy-2.com',
    ],
});
Then, you can use these proxy URLs within the crawler like below:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // …
});
You can also get more information on the currently used proxy through the proxyInfo object.
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ proxyInfo }) {
        console.log(proxyInfo);
    },
    // …
});
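Conceptually, rotating through a proxyUrls array boils down to round-robin selection. Here is a minimal plain-JavaScript sketch of that idea (createProxyRotator is a hypothetical helper for illustration, not part of the Crawlee API):

```javascript
// Hypothetical helper illustrating round-robin rotation, the same idea
// ProxyConfiguration applies to a proxyUrls array.
function createProxyRotator(proxyUrls) {
    let index = 0;
    return () => {
        // Pick the next proxy, wrapping around at the end of the list.
        const url = proxyUrls[index % proxyUrls.length];
        index += 1;
        return url;
    };
}

const nextProxy = createProxyRotator(['http://proxy-1.com', 'http://proxy-2.com']);
console.log(nextProxy()); // http://proxy-1.com
console.log(nextProxy()); // http://proxy-2.com
console.log(nextProxy()); // http://proxy-1.com
```

In practice, Crawlee also lets you compute proxies dynamically, but the rotation principle stays the same.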
2. Session Management
Crawlee provides a class named SessionPool to combine and control the IP rotation with cookies to manage sessions. The SessionPool class can automatically rotate the IP addresses and remove any IP addresses from the pool if they get blocked.
Furthermore, it allows for the tight management of information related to a single IP address, such as cookies, auth tokens and headers. This tight coupling significantly reduces the likelihood of websites blocking IP addresses. The below example shows how to use SessionPool with PlaywrightCrawler.
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // …
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // Activates the Session pool (default is true).
    useSessionPool: true,
    // Overrides the default Session pool configuration.
    sessionPoolOptions: { maxPoolSize: 100 },
    // Set to true if you want the crawler to save cookies per session.
    persistCookiesPerSession: true,
    async requestHandler({ page, session }) {
        const title = await page.title();

        if (title === 'Blocked') {
            session.retire();
        } else if (title === 'Not sure if blocked, might also be a connection error') {
            session.markBad();
        } else {
            // session.markGood() - this step is done automatically in PlaywrightCrawler.
        }
    },
});
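To build intuition for retire() and markBad(), here is a simplified model of session health tracking (SimpleSession is an illustrative stand-in, not Crawlee's Session class, though the real class tracks an error score in a similar spirit):

```javascript
// Simplified, illustrative model of session health: soft failures raise
// an error score, and a retired or error-heavy session becomes unusable.
class SimpleSession {
    constructor(maxErrorScore = 3) {
        this.errorScore = 0;
        this.maxErrorScore = maxErrorScore;
        this.retired = false;
    }
    markBad() { this.errorScore += 1; }      // suspicious response, maybe blocked
    markGood() { this.errorScore = Math.max(0, this.errorScore - 1); } // success
    retire() { this.retired = true; }        // confirmed block: never reuse
    isUsable() { return !this.retired && this.errorScore < this.maxErrorScore; }
}

const session = new SimpleSession();
session.markBad();
console.log(session.isUsable()); // true (one soft failure is tolerated)
session.retire();
console.log(session.isUsable()); // false (retired sessions are dropped)
```

A pool of such sessions, each tied to its own proxy, cookies, and headers, is essentially what SessionPool manages for you.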
3. Request Storages
Crawlee mainly provides two request storage types to manage URLs during the web scraping process.
- Request Queue – Used for deep crawling tasks since it allows you to add URLs dynamically during the scraping process.
import { RequestQueue } from 'crawlee';

// Open the default request queue associated with the crawler run
const requestQueue = await RequestQueue.open();

// Enqueue the initial batch of requests
await requestQueue.addRequests([
    { url: 'https://example.com/1' },
    { url: 'https://example.com/2' },
    { url: 'https://example.com/3' },
]);

// Open a named request queue
const namedRequestQueue = await RequestQueue.open('named-queue');

// Remove the named request queue
await namedRequestQueue.drop();
- Request List – Suitable for static scraping tasks since it only allows a predefined set of URLs.
import { RequestList, PuppeteerCrawler } from 'crawlee';

// Prepare the sources array with URLs to visit
const sources = [
    { url: 'http://www.example.com/page-1' },
    { url: 'http://www.example.com/page-2' },
    { url: 'http://www.example.com/page-3' },
];

// Open the request list.
const requestList = await RequestList.open('my-list', sources);

const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page, request }) {
        // Process the page (extract data, take a page screenshot, etc.).
        // No more requests can be added to the request list here.
    },
});
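The key property of the request queue is that URLs are deduplicated and can keep arriving while the crawl runs, whereas a request list is fixed up front. The following simplified in-memory model illustrates the queue behavior (SimpleRequestQueue is for illustration only, not Crawlee's implementation):

```javascript
// Illustrative in-memory model of a request queue: duplicate URLs are
// skipped, and new URLs can be added at any point during the crawl.
class SimpleRequestQueue {
    constructor() {
        this.seen = new Set();   // tracks every URL ever enqueued
        this.pending = [];       // URLs waiting to be crawled
    }
    addRequest(url) {
        if (this.seen.has(url)) return false; // duplicate, ignored
        this.seen.add(url);
        this.pending.push(url);
        return true;
    }
    fetchNextRequest() {
        return this.pending.shift() ?? null; // null once the queue is drained
    }
}

const queue = new SimpleRequestQueue();
queue.addRequest('https://example.com/1');
queue.addRequest('https://example.com/1'); // deduplicated
queue.addRequest('https://example.com/2');
console.log(queue.fetchNextRequest()); // https://example.com/1
console.log(queue.fetchNextRequest()); // https://example.com/2
console.log(queue.fetchNextRequest()); // null
```

This deduplication is what makes enqueueLinks() safe to call on every page without re-crawling URLs you have already visited.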
Usually, these storages are automatically purged before each crawl, but you can also clear them manually using the purgeDefaultStorages() method. You can find more details on Crawlee request storages here.
4. Result Storages
Similar to request storages, Crawlee provides two result storage types to store the scraped data.
- Key-Value Store: Stores data records/files with unique keys. Ideal for saving crawler states, screenshots, etc.
import { KeyValueStore } from 'crawlee';

// Read the crawler input and write a result to the default store
const input = await KeyValueStore.getInput();
await KeyValueStore.setValue('OUTPUT', { myResult: 123 });

// Open a named key-value store
const store = await KeyValueStore.open('some-name');

// Write and read a record
await store.setValue('some-key', { foo: 'bar' });
const value = await store.getValue('some-key');

// Delete a record by setting its value to null
await store.setValue('some-key', null);
- Dataset: Stores structured data like tables, where each object is a row. Used for storing crawl results.
import { Dataset } from 'crawlee';

// Write a single row to the default dataset
await Dataset.pushData({ col1: 123, col2: 'val2' });

// Open a named dataset
const dataset = await Dataset.open('some-name');

// Write a single row
await dataset.pushData({ foo: 'bar' });

// Write multiple rows at once
await dataset.pushData([{ foo: 'bar2', col2: 'val2' }, { col3: 123 }]);
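To make the "each object is a row" idea concrete, here is a tiny in-memory model of a dataset (SimpleDataset is illustrative only; Crawlee actually persists rows under ./storage/datasets):

```javascript
// Illustrative model of a dataset: an append-only table where each
// pushed object is a row and the columns are the union of all keys.
class SimpleDataset {
    constructor() {
        this.rows = [];
    }
    pushData(data) {
        // Accept a single row object or an array of rows, like Dataset.pushData.
        const items = Array.isArray(data) ? data : [data];
        this.rows.push(...items);
    }
    columns() {
        return [...new Set(this.rows.flatMap((row) => Object.keys(row)))];
    }
}

const ds = new SimpleDataset();
ds.pushData({ col1: 123, col2: 'val2' });
ds.pushData([{ foo: 'bar2', col2: 'val2' }, { col3: 123 }]);
console.log(ds.rows.length); // 3
console.log(ds.columns());   // [ 'col1', 'col2', 'foo', 'col3' ]
```

Rows do not need identical keys; missing values simply stay empty when the data is later exported as a table.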
You can find more details on Crawlee result storages here.
Conclusion
Web scraping is an essential task in collecting data for modern decision-making processes. This article discussed how we can use Crawlee, a popular web scraping tool, to build a custom web scraper. Apart from simple web scraping tasks, Crawlee provides features like session and proxy management to prevent IP blocking and many other supporting functions to improve the scraping experience.
I hope this blog helped you to get started with Crawlee for your next web scraping project. Thank you for reading.
FAQs
What is Crawlee?
Crawlee is an open-source web scraping and automation library designed for JavaScript and TypeScript. It simplifies data extraction by handling browser automation, request management, and anti-bot evasion, making web scraping more efficient and scalable.
How is Crawlee different from basic HTTP request libraries?
Crawlee stands out due to its built-in support for headless browsers, proxy rotation, and request retries. Unlike basic HTTP request libraries, Crawlee is optimized for large-scale scraping and includes features to bypass anti-scraping mechanisms.
Can Crawlee bypass anti-scraping measures and CAPTCHAs?
Crawlee includes features like proxy rotation and session management to help bypass basic anti-scraping measures. However, for advanced CAPTCHAs, additional tools like CAPTCHA-solving services or AI-based solutions may be required.
Is Crawlee suitable for beginners?
Yes, Crawlee is beginner-friendly, offering simple APIs and extensive documentation. While some JavaScript or TypeScript knowledge is helpful, Crawlee's built-in features make it easier to start web scraping without deep technical expertise.