Web Scraping with Node.js: A Step-by-Step Guide
Whether aggregating market trends or simplifying data collection, web scraping, which is the process of automatically extracting content from websites, plays a crucial role in turning raw data into actionable insights. In this tutorial, you’ll learn how to scrape dynamic and static web content using Node.js.
Before beginning this tutorial, you need to install Node.js and a few libraries. Node.js is an event-driven, nonblocking runtime that runs JavaScript outside browsers, making it ideal for I/O-bound tasks like web scraping.
If you don’t have Node.js installed, you can refer to their official documentation for installation.
Once Node.js is installed, you need to create a new directory. Open your terminal and run the following:
mkdir web-scraping-nodejs
Navigate into the newly created directory:
cd web-scraping-nodejs
Then, initialize your project using npm:
npm init -y
The -y flag automatically answers “yes” to all prompts, creating a default package.json file.
To make HTTP requests to target websites with dynamic content and to parse HTML, you need two libraries: Puppeteer and Cheerio.
Cheerio is an ultralight, quick, and flexible framework that works with markup parsing. It can parse files in HTML and XML, and it has an API for navigating and searching the developed data structure. In contrast, Puppeteer controls web browsers through Node.js. It’s asynchronous and slightly slower than Cheerio, but since it evaluates JavaScript, it can scrape dynamic pages.
Use the following command to install the libraries:
npm install puppeteer cheerio
After installation, the libraries are automatically added to your project’s package.json file.
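Before wiring everything together, here’s a quick, minimal sketch (not part of the tutorial’s project files) that contrasts the two libraries: Cheerio parses markup you already have as a string, while Puppeteer drives a real browser so the page’s JavaScript runs before you read the DOM.

const cheerio = require("cheerio");
const puppeteer = require("puppeteer");

// Cheerio: parse markup you already have; no browser, no JavaScript execution
const $ = cheerio.load("<h1 class='title'>Hello</h1>");
console.log("Cheerio:", $(".title").text()); // "Hello"

// Puppeteer: launch a headless browser, let the page run its JavaScript, then read the DOM
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://quotes.toscrape.com/js/"); // JavaScript-rendered page
  await page.waitForSelector(".quote"); // Wait until the quotes have been rendered
  const firstQuote = await page.$eval(".quote .text", (el) => el.textContent);
  console.log("Puppeteer:", firstQuote);
  await browser.close();
})();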
Setting up your project directory is recommended as it helps you write more organized and manageable code. The folder structure for this tutorial looks like this:
/web-scraping-nodejs
|-- /data
|-- /node_modules
|-- /src
|   |-- makeHttpRequest.js
|   |-- scrapeDynamicContent.js
|   |-- scrapeStaticHtml.js
|   |-- scrapeWithProxy.js
|-- /utils
|   |-- scrapingUtils.js
|-- package-lock.json
|-- package.json
To make HTTP requests, create a new folder named src inside your project directory. Inside it, create a makeHttpRequest.js file by running the following commands:
mkdir src
cd src
touch makeHttpRequest.js
Then, open src/makeHttpRequest.js in your code editor and add the following code:
const https = require("https"); // Import the https module

const url = "https://quotes.toscrape.com/"; // URL to fetch

https
  .get(url, (res) => {
    let data = "";

    // Collect data chunks
    res.on("data", (chunk) => {
      data += chunk;
    });

    // Handle the end of the data
    res.on("end", () => {
      console.log(data); // Log the data
    });
  })
  .on("error", (err) => {
    console.log("Error: " + err.message);
  });
This script begins by importing the https module, which makes HTTPS requests. You specify the URL you want to fetch. Then, you use https.get to send a GET request to that URL, gather the data chunks from the response, and log the full response once it has been received. Finally, the code logs any errors that occur during the request.
Save the file and run the code using Node.js:
node src/makeHttpRequest.js
You should see something like this in your terminal:
$ node src/makeHttpRequest.js
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                    <a href="/login">Login</a>
                </p>
            </div>
        </div>
...omitted output...
Static HTML content is data that’s readily available in HTML and does not change dynamically. In this section, you’ll learn how to scrape static content from the Quotes to Scrape website using Cheerio.
To help keep your code clean, you need to create utility functions for tasks like making HTTP requests, generating file names, and exporting data to CSV. Separating these functions avoids repetition, enhances readability, and makes your code easier to debug and maintain.
Create a new folder called utils inside your project’s root folder. In utils, create a file named scrapingUtils.js and add the following code:
const fs = require("fs");
const https = require("https");

const fetchHtmlContent = (url) => {
  return new Promise((resolve, reject) => {
    https
      .get(
        url,
        {
          headers: {
            "User-Agent":
              "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
            "Accept-Language": "en-US,en;q=0.9",
          },
        },
        (response) => {
          let htmlData = "";

          response.on("data", (chunk) => {
            htmlData += chunk;
          });

          response.on("end", () => {
            resolve(htmlData);
          });

          response.on("error", (err) => {
            reject(err);
          });
        }
      )
      .on("error", (err) => {
        reject(err);
      });
  });
};

const createFilename = () => {
  const date = new Date();
  const filename = `${date.getFullYear()}-${
    date.getMonth() + 1
  }-${date.getDate()}-${date.getHours()}-${date.getMinutes()}-${date.getSeconds()}.csv`;
  return filename;
};

const exportDataToCsv = (quotes) => {
  const filename = createFilename();
  let csvContent = "Text,Author,Tags\n";

  quotes.forEach(({ text, author, tags }) => {
    const sanitizedText = text.replace(/,/g, "");
    const sanitizedTags = tags.join("|"); // Join tags with a pipe separator
    csvContent += `"${sanitizedText}","${author}","${sanitizedTags}"\n`;
  });

  const folder = "data";
  if (!fs.existsSync(folder)) {
    fs.mkdirSync(folder);
  }

  fs.writeFileSync(`${folder}/${filename}`, csvContent);
  return `./${folder}/${filename}`;
};

module.exports = {
  createFilename,
  exportDataToCsv,
  fetchHtmlContent,
};
In the utility file, fetchHtmlContent is a reusable version of makeHttpRequest.js. This utility function helps make HTTP requests to specified URLs using the built-in https module. Since it’s asynchronous, it returns a promise. It also includes the headers User-Agent and Accept-Language, making the request look like it’s coming from a browser. It then stores this received data inside the variable htmlData. The promise resolves with the full HTML content once all data is obtained and the response.on("end") event is triggered. If there is any error in the request, the promise is rejected, and the calling function handles the errors.
createFilename creates unique file names using timestamps. The JavaScript Date object captures the current date and time, which is then formatted into a string (e.g., 2024-8-23-14-30-15.csv) and returned as the file name.
exportDataToCsv generates a file name using createFilename(). Then, it initializes the CSV content with a header row containing the text, author, and tags. It sanitizes the text for each quote in the quotes array, and it builds the content of the CSV row by row. Finally, it saves the data to a file within the data folder, creating one if it does not exist. The content is written to the file using fs.writeFileSync.
Next, create a scrapeStaticHtml.js file in your src folder and add the following code:
const cheerio = require("cheerio");
const {
  fetchHtmlContent,
  exportDataToCsv,
} = require("../utils/scrapingUtils");

const targetURL = "https://quotes.toscrape.com/";

const scrapeData = async (targetURL) => {
  try {
    const htmlContent = await fetchHtmlContent(targetURL);
    const $ = cheerio.load(htmlContent);
    const scrapedQuotes = [];

    $(".quote").each((i, element) => {
      const quoteElement = $(element);
      const text = quoteElement.find(".text").text();
      const author = quoteElement.find(".author").text();
      const tags = [];

      quoteElement.find(".tags .tag").each((j, tagElement) => {
        tags.push($(tagElement).text());
      });

      if (text && author) {
        scrapedQuotes.push({ text, author, tags });
      }
    });

    // Save the quotes and keep the path of the generated CSV file
    const savedFile = exportDataToCsv(scrapedQuotes);

    console.log({
      total_quotes: scrapedQuotes.length,
      status: "Scraping completed successfully!",
      saved_file: savedFile,
    });
  } catch (error) {
    console.error({
      status: "Oops! Something went wrong during scraping.",
      details: error.message,
    });
    process.exit(1);
  }
};

// Call the scrapeData function with the URL
scrapeData(targetURL);
This script starts by importing the Cheerio library and the utility functions fetchHtmlContent and exportDataToCsv from the utils module. Next, using the fetchHtmlContent function, the script retrieves the HTML content of the page and loads it into Cheerio for parsing. It then iterates through each quote item, extracting the text, author, and tags. If both the text and author are present, the quote is added to the scrapedQuotes array. Finally, the exportDataToCsv function saves the scraped data to a CSV file, and the script logs the total number of quotes scraped, the path of the saved file, and any errors encountered during the process. The last line triggers the execution of the script.
Save the file and run the code:
node src/scrapeStaticHtml.js
You should see a new CSV file named with the current timestamp in your data folder that looks like this:
"....csv"
Text,Author,Tags
"“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”","Albert Einstein","change|deep-thoughts|thinking|world"
"“It is our choices Harry that show what we truly are far more than our abilities.”","J.K. Rowling","abilities|choices"
Dynamic web content refers to data generated or updated by JavaScript, meaning you have to run JavaScript to obtain the information. This section teaches you how to scrape dynamic content from the Quotes to Scrape website using Puppeteer, a headless browser automation library.
To scrape dynamic data, create a scrapeDynamicContent.js file in your project’s src folder and add the following code:
const puppeteer = require("puppeteer");
const { exportDataToCsv } = require("../utils/scrapingUtils");

(async () => {
  const browser = await puppeteer.launch({
    headless: false, // Set to true to run in headless mode
    defaultViewport: null,
    userDataDir: "./tmp", // Save browser session data to the tmp folder
  });

  const page = await browser.newPage();
  await page.goto("https://quotes.toscrape.com/js/page/1/"); // URL to scrape

  let isLastPage = false;
  const quotes = [];

  // Function to extract quote details
  const extractQuoteDetails = async (quoteElement) => {
    const text = await page.evaluate((el) => {
      const textElement = el.querySelector(".text"); // Get the quote text element
      return textElement ? textElement.textContent : null;
    }, quoteElement);

    const author = await page.evaluate((el) => {
      const authorElement = el.querySelector(".author"); // Get the author element
      return authorElement ? authorElement.textContent : null;
    }, quoteElement);

    const tags = await page.evaluate((el) => {
      const tagElements = el.querySelectorAll(".tag"); // Get all tag elements
      return Array.from(tagElements).map((tag) => tag.textContent);
    }, quoteElement);

    return { text, author, tags };
  };

  while (!isLastPage) {
    // Loop through all pages until the last page
    await page.waitForSelector(".quote"); // Ensure quote elements are loaded
    const quoteHandles = await page.$$(".quote");

    for (const quoteElement of quoteHandles) {
      try {
        const quoteDetails = await extractQuoteDetails(quoteElement);
        const { text, author } = quoteDetails; // Extract text and author from quote details
        if (text && author) {
          quotes.push(quoteDetails);
        }
      } catch (error) {
        console.error("Error extracting quote details:", error);
      }
    }

    const nextButton = await page.$("li.next > a");
    isLastPage = !nextButton;
    console.log("Is Last Page:", isLastPage);

    if (!isLastPage) {
      // If not the last page, click the next button
      await Promise.all([
        nextButton.click(), // Click the next page button
        page.waitForNavigation({
          waitUntil: "networkidle2", // Wait for the next page to be fully loaded
        }),
      ]);
    }
  }

  // Export all collected quotes to CSV after the scraping loop completes
  console.log("Total quotes scraped:", quotes.length);
  exportDataToCsv(quotes); // Export quotes to CSV file

  await browser.close();
})();
This script starts by launching a new instance of Puppeteer with specific options. With headless: false, the browser is visible on the screen while scraping. The option defaultViewport: null instructs Puppeteer to use the default viewport size for the browser instance it opens, and userDataDir: "./tmp" specifies the tmp directory as the location to save browser session data. Next, the script opens a new page in the browser and navigates to the target URL. Following that, the script enters a loop that scrapes page after page until there are none left. On each page, it extracts the text, author, and tags of every quote and pushes them into the quotes array. This cycle continues until no next button remains on the page, which indicates the last page has been reached.
When the scraping is complete, the exportDataToCsv function is called, and the array of quotes is stored in a CSV file.

Save the file and run the code:
node src/scrapeDynamicContent.js
The extracted quote details are saved to a CSV file, and the progress logs look like this in the terminal:
$ node src/scrapeDynamicContent.js
Is Last Page: false
Is Last Page: false
...
Is Last Page: false
Is Last Page: true
Total quotes scraped: 100
When scraping websites, you may find that your IP address is temporarily blocked or restricted. This happens when the website detects too many requests originating from the same IP address within a short period. Thankfully, this issue can be avoided using a proxy server. A proxy server acts as a middleman between your device and the websites that you visit. It masks your actual IP address with the IP address of the proxy server, concealing your identity.
If a website blocks one of your proxy’s IP addresses, you can switch to another IP and continue scraping without interruptions.
To set up a proxy server, you first need to select one. Both free and paid proxies are available from various sources; for this tutorial, you can pick one from a free proxy list such as ProxyNova.
Once you’ve selected a proxy, add it to your script. Here, you’ll use an elite proxy from China on ProxyNova.
When selecting a proxy, choose elite or high-anonymity proxies; with these, the target server will not know your IP address or that the request is being routed through a proxy server.
Create a scrapeWithProxy.js file in your src folder. Following is an example of how you can modify your script to use a proxy server:
const puppeteer = require("puppeteer");
const { exportDataToCsv } = require("../utils/scrapingUtils");

// List of proxy servers - you can add more working proxies here
const proxyList = ["101.37.12.43:8000"];

// Select a random proxy from the list
const randomProxy = proxyList[Math.floor(Math.random() * proxyList.length)];

(async () => {
  // Launch browser with proxy
  const browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
    args: [`--proxy-server=${randomProxy}`, "--ignore-certificate-errors"],
  });

  const page = await browser.newPage();

  // Set a longer timeout for navigation
  page.setDefaultNavigationTimeout(120000); // 2 minutes

  // Verify proxy IP
  try {
    await page.goto("http://httpbin.org/ip", { waitUntil: "domcontentloaded" });
    const proxyIp = await page.evaluate(() => document.body.innerText);
    console.log("Proxy IP:", proxyIp);
  } catch (error) {
    console.error("Error verifying proxy IP:", error);
  }

  // Navigate to the quotes page
  try {
    await page.goto("https://quotes.toscrape.com/js/page/1/", {
      waitUntil: "networkidle2",
      timeout: 120000, // 2 minutes
    });
  } catch (error) {
    console.error("Error navigating to the quotes page:", error);
    await browser.close();
    return;
  }

  let isLastPage = false;
  const quotes = [];

  // Function to extract quote details
  const extractQuoteDetails = async (quoteElement) => {
    const text = await page.evaluate((el) => {
      const textElement = el.querySelector(".text");
      return textElement ? textElement.textContent : null;
    }, quoteElement);

    const author = await page.evaluate((el) => {
      const authorElement = el.querySelector(".author");
      return authorElement ? authorElement.textContent : null;
    }, quoteElement);

    const tags = await page.evaluate((el) => {
      const tagElements = el.querySelectorAll(".tag");
      return Array.from(tagElements).map((tag) => tag.textContent);
    }, quoteElement);

    return { text, author, tags };
  };

  while (!isLastPage) {
    try {
      await page.waitForSelector(".quote", { timeout: 60000 }); // Increase timeout to 1 minute
      const quoteHandles = await page.$$(".quote");

      for (const quoteElement of quoteHandles) {
        try {
          const quoteDetails = await extractQuoteDetails(quoteElement);
          const { text, author } = quoteDetails;
          if (text && author) {
            quotes.push(quoteDetails);
          }
        } catch (error) {
          console.error("Error extracting quote details:", error);
        }
      }

      const nextButton = await page.$("li.next > a");
      isLastPage = !nextButton;
      console.log("Is Last Page:", isLastPage);

      if (!isLastPage) {
        await Promise.all([
          page.click("li.next > a"),
          page.waitForNavigation({ waitUntil: "networkidle2", timeout: 60000 }),
        ]);
      }
    } catch (error) {
      console.error("Error waiting for selector or navigating:", error);
      if (error.name === "TimeoutError") {
        console.log("Attempting to reload the page...");
        await page.reload({ waitUntil: "networkidle2", timeout: 120000 });
        continue; // Try again with the reloaded page
      }
      break; // Exit the loop for other types of errors
    }
  }

  console.log("Total quotes scraped:", quotes.length);
  exportDataToCsv(quotes);

  await browser.close();
})();
This script launches a headless browser with Puppeteer and uses one random proxy server from the list for its connections. The script then visits http://httpbin.org/ip, which returns the proxy’s IP address, and logs it to ensure the proxy is used in this session.
After confirming the proxy connection, the script navigates to the quotes website (https://quotes.toscrape.com/js/page/1/) and scrapes quotes from each page, including the text, author, and associated tags, and then stores them in an array. The script loops through multiple pages by clicking the "next" button, continuing until the last page is reached.
If there is an error in navigating or scraping the page, such as a timeout, the page is refreshed, and the script attempts to run again. Once all pages are scraped, the collected quotes are exported to a CSV file with the exportDataToCsv utility function.
Save the file and run the code:
node src/scrapeWithProxy.js
You should see logs similar to this in your terminal:
Proxy IP: {
  "origin": "101.37.12.43"
}
Is Last Page: false
Is Last Page: false
...
Is Last Page: false
Is Last Page: true
Total quotes scraped: 100
When a proxy is not working correctly, you’ll get an error like this:
Error: net::ERR_TIMED_OUT at http://httpbin.org/ip
ERR_TUNNEL_CONNECTION_FAILED at http://httpbin.org/ip
ERR_PROXY_CONNECTION_FAILED at http://httpbin.org/ip
This error indicates that the proxy server is unavailable, overloaded, or unresponsive. To resolve this issue, you can select another proxy from your list.
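If you’d prefer the script to recover on its own, one option is to loop over your proxy list and keep the first proxy that passes the httpbin.org/ip check. The following is a minimal sketch of that idea; the tryProxies helper and the second proxy address are hypothetical and not part of the tutorial’s code:

const puppeteer = require("puppeteer");

// Hypothetical helper: try each proxy until one responds, then return the working browser/page pair
const tryProxies = async (proxyList) => {
  for (const proxy of proxyList) {
    const browser = await puppeteer.launch({
      headless: true,
      args: [`--proxy-server=${proxy}`, "--ignore-certificate-errors"],
    });
    const page = await browser.newPage();
    try {
      // Quick health check: if the proxy can reach httpbin, keep this browser
      await page.goto("http://httpbin.org/ip", {
        waitUntil: "domcontentloaded",
        timeout: 30000,
      });
      console.log(`Proxy ${proxy} is working`);
      return { browser, page };
    } catch (err) {
      console.warn(`Proxy ${proxy} failed (${err.message}), trying the next one...`);
      await browser.close();
    }
  }
  throw new Error("No working proxy found");
};

// Example usage with placeholder proxy addresses
// (async () => {
//   const { browser, page } = await tryProxies(["101.37.12.43:8000", "203.0.113.5:3128"]);
//   // ...scrape as usual, then:
//   await browser.close();
// })();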
All the source code for this tutorial can be found in this GitHub repository.
Websites often implement challenges like reCAPTCHAs to identify human activity and prevent actions that could violate their terms of service. For scrapers, the most common consequence is intermittent scraping: these security features disrupt automated scripts by increasing complexity and requiring extra steps before scraping can continue. If your scraping activities trigger several reCAPTCHAs, the website may flag your IP address, which could result in IP blocking.
There are several ways you can circumvent these challenges. For instance, you could use proxies to mask the IP address from which requests come, lowering the possibility of reCAPTCHAs being triggered. You can also introduce random delays between requests to make your scraper more human-like in its behavior.
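For example, a small helper that waits a random amount of time between page navigations makes the request pattern less uniform. Here’s a minimal sketch; the randomDelay helper is illustrative and not part of the tutorial’s scripts:

// Pause for a random interval to make request timing look less robotic
const randomDelay = (minMs, maxMs) => {
  const ms = minMs + Math.floor(Math.random() * (maxMs - minMs));
  return new Promise((resolve) => setTimeout(resolve, ms));
};

// Example: wait 1-4 seconds before moving to the next page in the Puppeteer loop
// await randomDelay(1000, 4000);
// await Promise.all([nextButton.click(), page.waitForNavigation({ waitUntil: "networkidle2" })]);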
Another helpful approach is to use browser automation tools, such as Puppeteer, which can emulate user events and sometimes even bypass simpler reCAPTCHAs. You could also consider using third-party CAPTCHA-solving services like 2Captcha.
With any web scraping project, always respect a website’s terms of service to ensure your actions are ethical and legal. Avoid scraping personal data and limit request frequency to prevent server overload. Finally, make sure you use APIs whenever possible for ethical scraping.
In this article, you learned how to scrape dynamic and static web content using Node.js, Cheerio, and Puppeteer. While applying these skills, remember that IP bans, blocks, and CAPTCHAs are bound to occur. Thankfully, modern proxy solutions provide a reliable way to overcome such challenges. Residential proxies and high-speed data center proxies can help guarantee stable and uninterrupted access to your target data.
If you want to take your web scraping to the next level, finding the right proxy solution is essential. Proxies provide the reliability and performance you need and ensure that your scraping activities remain responsible and compliant.
Looking for a reliable proxy provider? Read our review of top proxy providers.