Web Scraping with Puppeteer

Learn the ins and outs of web scraping with Puppeteer in this guide.


Web scraping has become an essential part of the digital age: it helps businesses extract large amounts of data from websites and use it to make better decisions. And right now, there are hundreds of scraping tools to choose from.

But selecting the right tool is not an easy task. If you choose wrong, your tool might not cover the full range of your use cases. That’s where this guide comes in. Let’s take a look at one of the best tools for web scraping – Puppeteer.

Simply put, if you’ve got a complex scraping use case where you need to scrape dynamic content efficiently, this is your tool.


What is Puppeteer?

Puppeteer is a Node.js library developed by Google to manage and automate Chrome or Chromium through the DevTools Protocol.

It’s different from typical scraping solutions because it drives a headless browser, allowing you to render complete web pages and execute JavaScript, mimicking the behavior of an actual user interacting with dynamic content.


Features of Puppeteer

One of the main reasons to use Puppeteer is when your scraping use case requires you to work through JavaScript-heavy websites. Puppeteer renders the entire page, so all content, including dynamically loaded elements, is captured.

Apart from that, there are several features that make Puppeteer a powerful scraping solution. Some of these features include:

  • Automated Testing: A native automation interface for simulating user interaction with a web page, well suited for end-to-end testing.
  • PDF and Screenshot Generation: Produce high-quality PDFs and screenshots of web pages for reporting or content sharing (a small sketch follows this list).
  • Chrome/Chromium Control: Programmatic control of the browser, including network request interception.
  • Headless and Headful Modes: Run without a UI in headless mode, or in headful mode for tasks that demand visual feedback.
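
To make these features concrete, here is a small sketch that renders a page and saves both a screenshot and a PDF of it. The output file names are placeholders; user interactions can be simulated in the same way with methods such as page.click() and page.type().

import puppeteer from "puppeteer";

// Launches headless by default; pass { headless: false } to watch the browser work
// (note that PDF generation generally requires headless mode).
const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.goto("https://example.com");

// Capture the rendered page as a screenshot and as a PDF.
await page.screenshot({ path: "example.png", fullPage: true });
await page.pdf({ path: "example.pdf", format: "A4" });

await browser.close();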

Compared to heavier frameworks like Selenium, Puppeteer offers an easier-to-use, more up-to-date API and first-class support for Chrome/Chromium, which gives it a unique position among web scraping solutions.


Basic Web Scraping with Puppeteer

So, now that we’re familiar with Puppeteer, let’s take a look at how we can scrape a website using it.

Pre-requisites

First things first, you’ll need to make sure you’ve installed Puppeteer. To do so, run the following command in your terminal:

npm i puppeteer # Downloads compatible Chrome during installation.

This will install Puppeteer along with a compatible version of Chrome.
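Note that the examples in this guide use top-level await, so they need to run as ES modules, for example by adding "type": "module" to your package.json or by saving the scripts with an .mjs extension.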

Scraping a website using Puppeteer

Here’s a simple example of web scraping with Puppeteer. We’ll use a simple webpage whose URL is https://example.com.

Let’s see how we can write code to scrape the titles (h1 elements) on the page.

import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.goto("https://example.com");

const articleTitles = await page.evaluate(() => {
  const titles = Array.from(document.querySelectorAll("h1"));
  return titles.map((title) => title.innerText);
});

console.log(articleTitles);

await browser.close();

If you go through the code above, it’s easy to understand what is happening. It launches a new browser instance and opens a new page, navigates to https://example.com, uses page.evaluate() to collect the text of every h1 element on the page, logs the scraped titles, and finally closes the browser.

Running the example code prints the scraped titles to the console.


Advanced Web Scraping Techniques

But Puppeteer doesn’t stop there. There are several more advanced features you can leverage.

Handling Pagination with Puppeteer

For instance, you might be scraping a site that paginates its content.

Simply put, these are sites with a “Load More” or “Next” control that loads additional data onto the page.

In such cases, you need to make sure your scraping solution traverses all of the pages and scrapes the data from each one.

Here are the steps to handle pagination with Puppeteer:

  1. First, open the source you want to scrape, starting from the first page.
  2. Extract the data from the current page.
  3. Find the “Next” button on the current page and simulate a click to go to the next page.
  4. Repeat the steps above until the “Next” button can no longer be found, which means the last page has been reached.
  5. Collect the data from each page and save it in an array or a database.

The code snippet below scrapes data from a paginated web page. Here we are scraping a GitHub search result for “puppeteer pagination”, which is paginated.

import { launch } from "puppeteer";

const browser = await launch();
const page = await browser.newPage();
const url = "https://github.com/search?q=puppeteer+pagination&type=repositories";

await page.goto(url);

let data = [];
let hasNextPage = true;

while (hasNextPage) {
  // Scrape data from the current page
  const pageData = await page.evaluate(() => {
    // Scraping h3 titles
    const titles = Array.from(document.querySelectorAll("h3")).map(
      (title) => title.innerText
    );
    return titles;
  });

  data = data.concat(pageData);

  // Check if there's a "Next" button and click it
  hasNextPage = await page.evaluate(() => {
    const nextButton = document.querySelector("a.next");
    if (nextButton) {
      nextButton.click();
      return true;
    }
    return false;
  });

  // Wait for navigation to complete before scraping the next page
  if (hasNextPage) {
    await page.waitForNavigation({ waitUntil: "networkidle2" });
  }
}

// Log the collected titles and clean up
console.log(data);

await browser.close();
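
One caveat on the snippet above: clicking inside page.evaluate() and then calling page.waitForNavigation() only works when the “Next” control triggers a real page navigation. If the site swaps results in place without navigating, as many single-page apps do, a more robust approach is to click through Puppeteer’s own API and wait for the results to update. The following is only a sketch; ".pagination-next" and ".result-item" are placeholder selectors you would adapt to the target page.

// Hypothetical helper: advance to the next page and report whether one existed.
async function goToNextPage(page) {
  const nextButton = await page.$(".pagination-next");
  if (!nextButton) {
    return false; // no "Next" control, so we are on the last page
  }

  await Promise.all([
    // Resolves on full navigations; ignored if the page updates in place.
    page.waitForNavigation({ waitUntil: "networkidle2" }).catch(() => {}),
    nextButton.click(),
  ]);

  // Make sure results are rendered before the caller scrapes again.
  await page.waitForSelector(".result-item");
  return true;
}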

Dealing with JavaScript-heavy Sites with Puppeteer

If you’re scraping a site that’s JavaScript-heavy, it’s important to wait for the required content to load before you begin scraping. This can be done with Puppeteer’s built-in methods for triggering and observing activity on the page.

The following steps outline the general approach:

  1. Navigate to the target URL.
  2. Utilize page.waitForSelector() or page.waitForFunction() to ensure that the dynamic content has fully loaded before proceeding.
  3. Once the page has fully loaded, extract the required data as before.

Here is example code that scrapes data from a JavaScript-heavy website.

import { launch } from "puppeteer";

const browser = await launch();
const page = await browser.newPage();
const url = "https://quotes.toscrape.com/js/";

// Navigate to the URL
await page.goto(url, { waitUntil: "networkidle2" });

// Wait for the dynamic content to load by waiting for a specific selector
await page.waitForSelector(".quote");

// Scrape data after ensuring the content has loaded
const quotes = await page.evaluate(() => {
  const quoteElements = document.querySelectorAll(".quote");
  const quotesArray = [];
  quoteElements.forEach((quoteElement) => {
    const text = quoteElement.querySelector(".text").innerText;
    const author = quoteElement.querySelector(".author").innerText;
    quotesArray.push({ text, author });
  });
  return quotesArray;
});

// Log the scraped quotes
console.log(quotes);

await browser.close();
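
The example above waits with page.waitForSelector(), but step 2 also mentioned page.waitForFunction(), which is useful when there is no single selector that signals readiness. A minimal sketch, assuming we arbitrarily want at least ten quotes rendered before scraping:

// Wait until the page has rendered at least 10 quote elements.
await page.waitForFunction(
  () => document.querySelectorAll(".quote").length >= 10
);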

Capturing Network Requests with Puppeteer

Whenever you open a webpage in the browser, the browser makes many network calls to download HTML, CSS and JS files, images, data from APIs, and so on. Puppeteer lets you capture these requests so you can see the data being transmitted. You can use this to:

  1. Monitor particular API calls made by the website to obtain structured data that may be simpler to work with than scraped HTML.
  2. Pre-process outgoing requests, for example modifying request parameters or headers, or even blocking certain types of requests.
  3. Analyze responses by capturing what the server returns and parsing it for useful information.

For this example, we use a page on openweathermap.org that sends an API request to the https://api.openweathermap.org/data/2.5/weather?id=2172797&appid=5796abbde9106b7da4febfae8c44c232 endpoint. The following code extracts data from this API request and from its response.

import { launch } from "puppeteer";

const browser = await launch();
const page = await browser.newPage();
const url = "https://openweathermap.org/city/2172797";

// Intercept network requests
await page.setRequestInterception(true);

page.on("request", (request) => {
  if (request.url().includes("/data/2.5/weather")) {
    console.log(`Intercepted request: ${request.url()}`);
    console.log("Request method:", request.method());
    console.log("Request headers:", request.headers());
    console.log("Request post data:", request.postData());
  }
  request.continue(); // Continue the request without modification
});

page.on("response", async (response) => {
  if (response.url().includes("/data/2.5/weather")) {
    console.log(`Captured response from: ${response.url()}`);
    console.log("Response status:", response.status());
    const responseData = await response.json();
    console.log("Response data:", JSON.stringify(responseData, null, 2));
  }
});

await page.goto(url, { waitUntil: "networkidle2" });

await browser.close();
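
The handler above forwards every request untouched, but the same interception hook can also modify or block traffic, which is how the "decline certain types of requests" idea mentioned earlier is typically implemented. Here is a minimal sketch; use it in place of, not alongside, the handler above, since each intercepted request may only be handled once.

await page.setRequestInterception(true);

page.on("request", (request) => {
  // Skip heavy assets that are not needed for scraping.
  if (["image", "font", "media"].includes(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
});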


Common Issues in Web Scraping

Most times, you’ll run into issues where the server you’re trying to scrape data from blocks your requests because it flags you as a bot. In such cases, you’ll need to leverage tools like proxies, which keep your scraper functioning normally by:

  • IP Rotation: Websites often limit the number of requests allowed from a single IP address in a given period. By rotating IPs through a pool of proxies, you distribute requests across multiple addresses, so the website cannot easily identify and blacklist your scraper.
  • Geo Swapping: Proxies let you appear to browse from different regions. This is especially useful for scraping localized content and for dealing with restrictions that serve content based on the user’s physical location.
  • Adaptive Rate Control: Websites use rate limiting to cap the volume of requests coming from a particular user or IP address. Spreading requests over multiple IP addresses lets you scrape data without constantly hitting those limits.

Integrating proxies with Puppeteer is straightforward. You can set up Puppeteer to use a proxy server by specifying the proxy server’s address during the browser launch. Here’s how you can do it:

import { launch } from "puppeteer";

// Replace with your proxy server address and port
const proxyServer = "http://123.456.789.0:8080";

const browser = await launch({
  args: [`--proxy-server=${proxyServer}`],
});
const page = await browser.newPage();

// Optionally set your geolocation if the proxy server supports it
await page.setGeolocation({ latitude: 37.7749, longitude: -122.4194 }); // Example: San Francisco

// Navigate to a website to test the proxy
await page.goto("https://www.whatismyip.com", { waitUntil: "networkidle2" });

// Capture a screenshot to verify the IP
await page.screenshot({ path: "proxy-test.png" });

await browser.close();
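
If your proxy requires credentials, Puppeteer can supply them through page.authenticate(); call it right after creating the page, before navigating. The username and password below are placeholders:

// Authenticate against a proxy that requires credentials (placeholder values).
await page.authenticate({
  username: "your-proxy-username",
  password: "your-proxy-password",
});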


Concluding Thoughts

Puppeteer is a powerful tool that lets you handle all kinds of problems, from simple webpage scraping up to complex cases involving difficult web elements, dynamic content, real-time content, and API usage.

Techniques such as handling JavaScript-heavy websites, capturing network requests, and using proxies can make your scraping more efficient and effective.

However, as you apply these methods, it’s essential to scrape responsibly, adhering to the legal and ethical guidelines of the websites you interact with. Puppeteer’s versatility and power make it an invaluable tool for any data extraction needs, helping you gather insights efficiently and accurately.
