Web Scraping with Puppeteer
Learn the ins and outs of web scraping with Puppeteer in this guide.
Web scraping is an essential practice in the digital age, helping businesses extract large amounts of data from websites to support better decision making. And right now, there are hundreds of scraping tools to choose from.
But selecting the right tool is not an easy task. Choose wrong, and your tool might not cover the full range of your use cases. That's where this guide comes in. Let's take a look at one of the best tools for web scraping: Puppeteer.
Simply put, if you've got a complex scraping use case where you need to scrape dynamic content efficiently, this is your tool.
Puppeteer is a Node.js library developed by Google to manage and automate Chrome or Chromium through the DevTools Protocol.
It's different from other scraping solutions because it drives a headless browser, which lets you render complete web pages and execute JavaScript, mimicking the behavior of an actual user interacting with dynamic content.
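To make that concrete, here's a minimal sketch that launches a visible browser, renders a page, and reads a value out of the fully loaded document (headless: false is only there so you can watch it work; omit it to run invisibly):

import puppeteer from "puppeteer";

// Launch a visible browser so you can watch the automation happen.
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto("https://example.com");

// Read a value from the fully rendered page, just as a user's browser sees it.
console.log(await page.title()); // "Example Domain"

await browser.close();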
One of the main reasons to use Puppeteer is when your scraping use case requires working through JavaScript-heavy websites. Puppeteer renders the entire page, so all content, including dynamically loaded elements, is captured.
Apart from that, several features make Puppeteer a powerful scraping solution. Some of these include:
- Capturing screenshots and generating PDFs of rendered pages
- Simulating real user input such as clicks, typing, and form submission
- Intercepting and inspecting network requests
- Emulating devices, geolocation, and user agents
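As a quick illustration of the first of these, screenshots and PDFs each take a single call. Here's a minimal sketch (the output file names are arbitrary):

import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com", { waitUntil: "networkidle2" });

// Capture a full-page screenshot and render the same page to a PDF.
await page.screenshot({ path: "example.png", fullPage: true });
await page.pdf({ path: "example.pdf", format: "A4" });

await browser.close();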
Compared to heavier frameworks like Selenium, Puppeteer offers an easier-to-use, more modern API and first-class support for Chrome/Chromium, giving it a unique position among web scraping solutions.
So, now that we’re familiar with Puppeteer, let’s take a look at how we can scrape a website using it.
First things first, you’ll need to make sure you’ve installed Puppeteer. To do so, run the following command in your terminal:
npm i puppeteer # Downloads compatible Chrome during installation.
This will install Puppeteer along with a compatible Chromium runtime.
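As an aside, if you'd rather reuse a Chrome you already have instead of downloading one, the companion puppeteer-core package (npm i puppeteer-core) ships without a browser. A minimal sketch; the executablePath below is an example you'd replace with your own install's location:

import puppeteer from "puppeteer-core";

// puppeteer-core bundles no browser, so you must point it at an existing one.
// This path is an example; use the location of your own Chrome install.
const browser = await puppeteer.launch({
  executablePath: "/usr/bin/google-chrome",
});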
Here's a simple example of web scraping with Puppeteer, using the simple webpage at https://example.com. Let's see how we can write code to scrape the titles (h1) on the page.
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.goto("https://example.com");

const articleTitles = await page.evaluate(() => {
  const titles = Array.from(document.querySelectorAll("h1"));
  return titles.map((title) => title.innerText);
});

console.log(articleTitles);

await browser.close();
If you walk through the code, it's easy to understand what's happening. It launches a new browser instance, opens a new page, and navigates to https://example.com. The page.evaluate call collects the text of every h1 element, console.log prints the scraped titles, and browser.close shuts the browser down.
Running the example code prints the scraped titles; for https://example.com that's [ 'Example Domain' ].
But Puppeteer doesn't stop there. There are many more advanced features you can leverage.
For instance, you might be scraping a site that paginates its content. Simply put, these are sites with a "Load More" or "Next" action that loads more data onto the page. In such cases, you need to make sure your scraping solution traverses all of the pages and scrapes the data from each one.
Here are the steps to handle pagination with Puppeteer:
1. Scrape the data on the current page.
2. Check whether a "Next" control exists.
3. If it does, click it and wait for the navigation to complete.
4. Repeat until no next page is found.
The code snippet below scrapes data from a paginated web page: the GitHub search results for "puppeteer pagination", which are split across multiple pages.
import { launch } from "puppeteer";

const browser = await launch();
const page = await browser.newPage();
const url = "https://github.com/search?q=puppeteer+pagination&type=repositories";

await page.goto(url);

let data = [];
let hasNextPage = true;

while (hasNextPage) {
  // Scrape data from the current page
  const pageData = await page.evaluate(() => {
    // Scraping h3 titles
    const titles = Array.from(document.querySelectorAll("h3")).map(
      (title) => title.innerText
    );
    return titles;
  });

  data = data.concat(pageData);

  // Check if there's a "Next" button and click it
  hasNextPage = await page.evaluate(() => {
    const nextButton = document.querySelector("a.next");
    if (nextButton) {
      nextButton.click();
      return true;
    }
    return false;
  });

  // Wait for navigation to complete before scraping the next page
  if (hasNextPage) {
    await page.waitForNavigation({ waitUntil: "networkidle2" });
  }
}

console.log(data);
await browser.close();
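Note that a "Load More" button usually appends results in place rather than triggering a navigation, so waitForNavigation would hang. Here's a minimal sketch of that variant, reusing the page from above; the .load-more and .item selectors are placeholders you'd swap for the target site's real markup:

// Variant for "Load More" buttons that append items without navigating.
let loadMore = true;
while (loadMore) {
  // Count the items currently in the DOM (.item is a placeholder selector).
  const previousCount = await page.$$eval(".item", (items) => items.length);

  loadMore = await page.evaluate(() => {
    const button = document.querySelector(".load-more");
    if (button) {
      button.click();
      return true;
    }
    return false;
  });

  if (loadMore) {
    // Wait until new items have been appended before scraping again.
    await page.waitForFunction(
      (count) => document.querySelectorAll(".item").length > count,
      {},
      previousCount
    );
  }
}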
If you're scraping a site that's JavaScript-heavy, it's important to wait for the required content to load before you begin to scrape. This can be done with Puppeteer's built-in methods for triggering and observing page activity.
The following steps outline the general approach:
1. Navigate to the page and wait for network activity to settle.
2. Wait for a selector that only appears once the dynamic content has rendered.
3. Scrape the data once that selector is present.
Here is example code that scrapes data from a JavaScript-heavy website.
import { launch } from "puppeteer";

const browser = await launch();
const page = await browser.newPage();
const url = "https://quotes.toscrape.com/js/";

// Navigate to the URL
await page.goto(url, { waitUntil: "networkidle2" });

// Wait for the dynamic content to load by waiting for a specific selector
await page.waitForSelector(".quote");

// Scrape data after ensuring the content has loaded
const quotes = await page.evaluate(() => {
  const quoteElements = document.querySelectorAll(".quote");
  const quotesArray = [];
  quoteElements.forEach((quoteElement) => {
    const text = quoteElement.querySelector(".text").innerText;
    const author = quoteElement.querySelector(".author").innerText;
    quotesArray.push({ text, author });
  });
  return quotesArray;
});

// Log the scraped quotes
console.log(quotes);

await browser.close();
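If no single selector reliably signals readiness, page.waitForFunction lets you wait on an arbitrary condition instead. For example, you could hold until a certain number of quotes have rendered (the threshold of 10 here is arbitrary):

// Wait until at least 10 quotes exist in the DOM before scraping.
await page.waitForFunction(
  () => document.querySelectorAll(".quote").length >= 10
);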
Whenever you open a webpage in a browser, the browser makes many network calls to download HTML, CSS, and JavaScript files, images, data from APIs, and so on. Puppeteer lets you capture these requests so you can see exactly what data is being transmitted. You can use this to:
- Discover which API endpoints a page calls, and with what parameters
- Read structured JSON responses directly instead of parsing rendered HTML
- Block requests for resources you don't need, speeding up your scraper
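To illustrate that last use before the main example, here's a minimal sketch that aborts requests for images, stylesheets, and fonts so the scraper never downloads them:

import { launch } from "puppeteer";

const browser = await launch();
const page = await browser.newPage();

await page.setRequestInterception(true);
page.on("request", (request) => {
  // Abort resource types the scraper doesn't need; let everything else through.
  if (["image", "stylesheet", "font"].includes(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
});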
For the main example, we've used a page on openweathermap.org that sends an API request to the endpoint below. The following code extracts data from this API request and from its response.
https://api.openweathermap.org/data/2.5/weather?id=2172797&appid=5796abbde9106b7da4febfae8c44c232
import { launch } from "puppeteer";

const browser = await launch();
const page = await browser.newPage();
const url = "https://openweathermap.org/city/2172797";

// Intercept network requests
await page.setRequestInterception(true);

page.on("request", (request) => {
  if (request.url().includes("/data/2.5/weather")) {
    console.log(`Intercepted request: ${request.url()}`);
    console.log("Request method:", request.method());
    console.log("Request headers:", request.headers());
    console.log("Request post data:", request.postData());
  }
  request.continue(); // Continue the request without modification
});

page.on("response", async (response) => {
  if (response.url().includes("/data/2.5/weather")) {
    console.log(`Captured response from: ${response.url()}`);
    console.log("Response status:", response.status());
    const responseData = await response.json();
    console.log("Response data:", JSON.stringify(responseData, null, 2));
  }
});

await page.goto(url, { waitUntil: "networkidle2" });
Often, you'll run into issues where the server you're scraping flags you as a bot and blocks your requests. In such cases, you'll need to leverage tools like proxies, which keep your scraper functioning normally by:
- Routing your requests through different IP addresses, so no single address gets rate-limited or banned
- Masking the real origin of your traffic
- Letting you appear to browse from other geographic locations
Integrating proxies with Puppeteer is straightforward. You can set up Puppeteer to use a proxy server by specifying the proxy server’s address during the browser launch. Here’s how you can do it:
import { launch } from "puppeteer";

// Replace with your proxy server address and port
const proxyServer = "http://123.456.789.0:8080";

const browser = await launch({
  args: [`--proxy-server=${proxyServer}`],
});
const page = await browser.newPage();

// Optionally set your geolocation if the proxy server supports it
await page.setGeolocation({ latitude: 37.7749, longitude: -122.4194 }); // Example: San Francisco

// Navigate to a website to test the proxy
await page.goto("https://www.whatismyip.com", { waitUntil: "networkidle2" });

// Capture a screenshot to verify the IP
await page.screenshot({ path: "proxy-test.png" });

await browser.close();
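If your proxy requires authentication, page.authenticate supplies the credentials. A minimal sketch with placeholder values; call it before navigating so the proxy challenge can be answered:

// Placeholder credentials; run this before page.goto.
await page.authenticate({
  username: "proxy-user",
  password: "proxy-pass",
});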
Puppeteer is a powerful tool that can handle everything from simple webpage scraping to complex cases involving difficult web elements, dynamic content, real-time data, and API traffic. Techniques such as handling JavaScript-heavy websites, capturing network requests, and routing traffic through proxies can make your scraping more efficient and effective.
However, as you apply these methods, it’s essential to scrape responsibly, adhering to the legal and ethical guidelines of the websites you interact with. Puppeteer’s versatility and power make it an invaluable tool for any data extraction needs, helping you gather insights efficiently and accurately.