Web Scraping with Playwright
Learn everything you need to know about web scraping with Playwright
Web scraping has become an essential technique for extracting valuable data from the vast expanse of the internet. Whether for market research, competitive analysis, or feeding data-hungry machine learning models, the ability to efficiently gather and process web data is a game-changer.
Across the top five marketing journals, the share of web data-based publications has more than tripled, rising from about 4% in 2010 to 15% in 2020. This surge underscores the increasing reliance on web data, and web scraping accounts for 59% of those publications.
Playwright is an open-source web automation library developed by Microsoft. With this library, developers can automate browsers such as Chrome, Firefox, and WebKit, making it perfect for scraping tasks involving complex and dynamic web pages.
Key Features of Playwright:

- Cross-browser automation for Chromium, Firefox, and WebKit through a single API (see the sketch below)
- Automatic waiting for elements to be ready before clicking, typing, or reading them
- Headless and headed execution modes
- Built-in network interception, request blocking, and response inspection
- Language bindings for JavaScript/TypeScript, Python, Java, and .NET
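To illustrate the cross-browser support, here is a minimal sketch, assuming only that Playwright and its browsers are installed (the URL is a placeholder). It runs the same navigation in all three engines:

const { chromium, firefox, webkit } = require("playwright");

(async () => {
  // Launch each engine in turn and print the page title it sees
  for (const browserType of [chromium, firefox, webkit]) {
    const browser = await browserType.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto("https://example.com"); // Placeholder URL
    console.log(`${browserType.name()}: ${await page.title()}`);
    await browser.close();
  }
})();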
Here’s how to set up Playwright:
Create a New Project:
mkdir playwright-scraping
cd playwright-scraping
npm init -y
Install Playwright:
npm install playwright
Install Browsers:
npx playwright install
This command installs the necessary browser binaries for Chromium, Firefox, and WebKit.
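If you only need one engine, the same CLI can install a single browser instead of all three:

npx playwright install chromium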
Let’s start with a simple example: scraping the titles of articles from a news website.
script.js
const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://news.ycombinator.com/");

  const titles = await page.$$eval(".titleline > a", (links) =>
    links.map((link) => link.textContent.trim())
  );

  console.log(titles);
  await browser.close();
})();
Run the Script: Execute the script with node script.js.
This script navigates to the website and extracts the titles of the top articles.
Let’s go through this code step by step:
Importing Playwright:
const { chromium } = require("playwright");
First, import the chromium module from Playwright.
Launching the browser:
const browser = await chromium.launch({ headless: true });
We launch a Chromium browser instance.
The { headless: true } option runs the browser without a visible UI, which is faster and uses fewer resources.
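If you want to watch the scraper while developing, a common variation is to launch with headless: false and add slowMo, which pauses between actions; a small sketch:

const browser = await chromium.launch({
  headless: false, // Show the browser window
  slowMo: 250, // Slow each action down by 250 ms so the steps are easy to follow
});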
Creating a new page:
const page = await browser.newPage();
This creates a new page (tab) in our browser instance.
Navigating to the target website:
await page.goto("https://news.ycombinator.com/");
Scraping the titles:
page.$$eval() is a powerful method that combines selecting elements and evaluating a function in the page context.
.titleline > a is a CSS selector that targets all <a> elements that are direct children of elements with the class titleline.
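page.$$eval() can also return structured objects rather than plain strings. As a small sketch using the same Hacker News selector, this collects each story's title and link:

const stories = await page.$$eval(".titleline > a", (links) =>
  links.map((link) => ({
    title: link.textContent.trim(),
    url: link.href,
  }))
);
console.log(stories);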
Closing the browser:
await browser.close();
It’s essential to close the browser to free up resources.
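One way to guarantee this even when scraping throws an error is to wrap the work in a try/finally block; a minimal sketch:

const browser = await chromium.launch({ headless: true });
try {
  const page = await browser.newPage();
  await page.goto("https://news.ycombinator.com/");
  // ... scraping logic goes here ...
} finally {
  await browser.close(); // Runs whether the scrape succeeded or failed
}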
As websites become more sophisticated, so must our scraping techniques. Here are some advanced methods to enhance your web scraping capabilities with Playwright.
Many modern websites use infinite scrolling to load content dynamically as the user scrolls down the page. Here’s how you can handle this:
const { chromium } = require("playwright");

async function scrapeTweets(username, scrollTimes) {
  const browser = await chromium.launch({ headless: false }); // Set to true for headless mode
  const page = await browser.newPage();
  await page.goto(`https://twitter.com/${username}`);

  // Wait for tweets to load
  await page.waitForSelector('article[data-testid="tweet"]');

  for (let i = 0; i < scrollTimes; i++) {
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });
    await page.waitForTimeout(2000); // Wait for new content to load
  }

  const tweets = await page.$$eval('article[data-testid="tweet"]', (elements) =>
    elements.map((el) => {
      const tweetText = el
        .querySelector('div[data-testid="tweetText"]')
        ?.textContent.trim();
      const timestamp = el.querySelector("time")?.getAttribute("datetime");
      return { tweetText, timestamp };
    })
  );

  await browser.close();
  return tweets;
}

(async () => {
  const tweets = await scrapeTweets("elonmusk", 5); // Scrape tweets from Elon Musk's account
  console.log(JSON.stringify(tweets, null, 2));
})();
This script scrolls the page several times, waiting for new content to load after each scroll.
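If you do not know in advance how many scrolls are needed, one variation (a sketch, not part of the script above) is to keep scrolling until the page height stops growing:

let previousHeight = 0;
while (true) {
  const currentHeight = await page.evaluate(() => document.body.scrollHeight);
  if (currentHeight === previousHeight) break; // No new content was loaded
  previousHeight = currentHeight;
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(2000); // Give the page time to fetch more items
}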
To speed up the scraping process, you can scrape multiple pages in parallel:
const { chromium } = require("playwright");

async function scrapePage(url) {
  const browser = await chromium.launch();
  const context = await browser.newContext(); // Create a new browser context
  const page = await context.newPage(); // Create a new page in the context

  try {
    await page.goto(url, { waitUntil: "domcontentloaded" }); // Wait until the DOM is fully loaded

    const title = await page.title();
    const headlineText = await page
      .$eval("h1", (el) => el.textContent.trim())
      .catch(() => "No headline found");

    return { url, title, headline: headlineText };
  } catch (error) {
    console.error(`Error scraping ${url}:`, error);
    return { url, title: "Error", headline: "Error fetching headline" };
  } finally {
    await context.close(); // Close the context to free resources
    await browser.close(); // Close the browser
  }
}

async function scrapeInParallel(urls) {
  const results = await Promise.all(urls.map((url) => scrapePage(url)));
  return results;
}

(async () => {
  const urls = [
    "https://www.bbc.com/news",
    "https://www.cnn.com",
    "https://www.reuters.com",
  ];
  const results = await scrapeInParallel(urls);
  console.log(JSON.stringify(results, null, 2));
})();
This script scrapes multiple URLs concurrently, significantly reducing the total time compared to scraping the pages one by one.
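With a long URL list, launching every browser at once can exhaust memory. One refinement, sketched below with an illustrative batch size, is to reuse the scrapePage() function above but process the URLs in small batches:

async function scrapeInBatches(urls, batchSize = 3) {
  const results = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    // Scrape one batch at a time to cap how many browsers run concurrently
    results.push(...(await Promise.all(batch.map((url) => scrapePage(url)))));
  }
  return results;
}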
Modern web scraping faces challenges with dynamic content, especially when JavaScript is involved. Playwright effectively addresses these issues.
Challenges with Dynamic Content:

- Content is rendered by JavaScript after the initial HTML response, so a plain HTTP request often returns an empty shell
- Data may arrive through AJAX calls, infinite scrolling, or user interaction, often after an unpredictable delay
- Single-page applications update the DOM without full page navigations

How Playwright Overcomes These:

- It drives a real browser, so JavaScript executes exactly as it would for a human visitor
- Auto-waiting plus explicit helpers such as waitForSelector and the networkidle load state let scripts wait until content actually exists
- It can evaluate code inside the page context to read whatever the scripts have rendered
Example: Scraping Content Loaded via AJAX:
const { chromium } = require("playwright");

(async () => {
  // Launch the browser
  const browser = await chromium.launch({ headless: false }); // Set headless: false if you want to see the browser actions
  const page = await browser.newPage();

  // Go to the website
  await page.goto("http://quotes.toscrape.com", { waitUntil: "networkidle" });

  // Wait for the dynamic content to be loaded
  await page.waitForSelector(".quote");

  // Extract the first quote from the page
  const quote = await page.$eval(".quote .text", (el) => el.textContent.trim());

  // Print the quote
  console.log(quote);

  // Close the browser
  await browser.close();
})();
This script waits for the network to be idle, ensuring all dynamic content has loaded before attempting to scrape it.
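On pages that keep background connections open, networkidle may never fire. An alternative, shown here only as a sketch with a hypothetical URL pattern, is to wait for the specific AJAX response that carries the data:

// Wait for the API call that returns the data we need (the pattern below is illustrative)
const response = await page.waitForResponse(
  (res) => res.url().includes("/api/quotes") && res.status() === 200
);
const payload = await response.json();
console.log(payload);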
Some websites require user authentication to access their content. Playwright can handle login forms and manage authenticated sessions.
Example: Logging into a Website:
const { chromium } = require("playwright");

(async () => {
  // Launch the browser
  const browser = await chromium.launch({ headless: false }); // Set headless to false to watch the automation
  const page = await browser.newPage();

  try {
    // Go to the login page with an increased timeout
    await page.goto("https://the-internet.herokuapp.com/login", {
      waitUntil: "networkidle",
      timeout: 60000,
    });

    // Fill in the username and password
    await page.fill("#username", "tomsmith");
    await page.fill("#password", "SuperSecretPassword!");

    // Click the login button
    await page.click('button[type="submit"]');

    // Wait for the success message that confirms login
    await page.waitForSelector(".flash.success");

    // Extract and print the success message
    const successMessage = await page.$eval(".flash.success", (el) =>
      el.textContent.trim()
    );
    console.log("Login Success Message:", successMessage);
  } catch (error) {
    console.error("Error:", error);
  } finally {
    // Close the browser
    await browser.close();
  }
})();
This script fills in the username and password, clicks the login button, and waits for the success banner to appear, confirming that the login worked.
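If the site redirects after a successful login, another way to confirm it is to wait for the post-login URL. A minimal sketch, using the /secure path of the demo site above:

// Wait until the browser lands on the authenticated page
await page.waitForURL("**/secure");
console.log("Now on:", page.url());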
When scraping websites, especially those requiring user authentication or tracking, handling cookies and sessions can significantly increase the efficiency of your scraping efforts.
Importance of Managing Sessions and Cookies in Web Scraping:

- Reusing an authenticated session avoids logging in on every run, which is slower and more likely to trigger anti-bot checks
- Cookies carry state such as language, region, and consent choices between visits
- A saved session can be shared across browser contexts and across separate scraping runs
Example: Saving and Loading Cookies:
const { chromium } = require("playwright");const fs = require("fs");
(async () => {const browser = await chromium.launch({ headless: false }); // Set headless to false to see the actionsconst page = await browser.newPage();
// Step 1: Go to the login page and perform the loginawait page.goto("https://the-internet.herokuapp.com/login", {waitUntil: "networkidle",});
// Fill in username and passwordawait page.fill("#username", "tomsmith");await page.fill("#password", "SuperSecretPassword!");
// Wait for the login to completeawait page.waitForSelector(".flash.success");
// Step 2: Save the browser state (cookies, localStorage, etc.)await page.context().storageState({ path: "state.json" });console.log("Session saved to state.json");
// Close the browserawait browser.close();
// Step 3: Reuse the saved session for a new pageconst browser2 = await chromium.launch({ headless: false });const context = await browser2.newContext({ storageState: "state.json" });const page2 = await context.newPage();
// Step 4: Go to a page that requires login (reusing the session)await page2.goto("https://the-internet.herokuapp.com/secure");
// Confirm that you are logged in by checking for an element on the dashboardif (await page2.isVisible('a[href="/logout"]')) {console.log("Cookies and session restored successfully, and logged in!");} else {console.log("Failed to restore session.");}
// Close the second browserawait browser2.close();})();
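If you only need a few cookies rather than the whole storage state, Playwright's context.addCookies() can set them directly. A small sketch with placeholder values:

const context = await browser.newContext();
await context.addCookies([
  {
    name: "session_id", // Placeholder cookie name
    value: "abc123", // Placeholder value
    domain: "the-internet.herokuapp.com",
    path: "/",
  },
]);
const page = await context.newPage();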
CAPTCHAs pose significant challenges for web scraping by distinguishing between humans and bots. Key issues include:

- Automated requests being blocked until a challenge is solved
- Scraping runs halting midway when a CAPTCHA appears unexpectedly
- Repeated or high-frequency requests increasing the likelihood of being challenged
To overcome these obstacles, scrapers can integrate CAPTCHA-solving services such as 2Captcha or Anti-Captcha. These services allow scripts to solve CAPTCHAs automatically, enabling continued scraping.
Here’s a basic example using 2Captcha:
const { chromium } = require('playwright');
const axios = require('axios');
const FormData = require('form-data');

// Function to solve CAPTCHA using 2Captcha
const solveCaptcha = async (siteKey, pageUrl, apiKey) => {
  const form = new FormData();
  form.append('method', 'userrecaptcha');
  form.append('googlekey', siteKey);
  form.append('key', apiKey);
  form.append('pageurl', pageUrl);
  form.append('json', '1');

  const response = await axios.post('http://2captcha.com/in.php', form, {
    headers: form.getHeaders(),
  });
  const requestId = response.data.request;

  // Poll for the CAPTCHA solution
  while (true) {
    await new Promise((res) => setTimeout(res, 5000));
    const result = await axios.get(
      `http://2captcha.com/res.php?key=${apiKey}&action=get&id=${requestId}&json=1`
    );
    if (result.data.status === 1) {
      return result.data.request;
    }
  }
};

const main = async () => {
  const browser = await chromium.launch({ headless: true }); // Run in headless mode
  const page = await browser.newPage();
  const targetUrl = 'https://example.com'; // Replace with your target URL
  await page.goto(targetUrl);

  const siteKey = 'SITE_KEY_HERE'; // Replace with the reCAPTCHA site key
  const apiKey = '2CAPTCHA_API_KEY_HERE'; // Replace with your 2Captcha API key

  // Solve CAPTCHA
  const captchaSolution = await solveCaptcha(siteKey, targetUrl, apiKey);

  // Inject the CAPTCHA solution and submit the form
  await page.evaluate((token) => {
    document.getElementById('g-recaptcha-response').innerHTML = token;
  }, captchaSolution);
  await page.click('#submit-button'); // Replace with your form's submit button selector

  await page.waitForNavigation();
  console.log('CAPTCHA solved and form submitted.');
  await browser.close();
};

main().catch((err) => console.error(err));
Proxies play a crucial role in web scraping by providing an intermediary between the scraper and the target website.
Example: Using Proxies in Playwright:
const { chromium } = require('playwright');

(async () => {
  // Launch browser with proxy settings
  const browser = await chromium.launch({
    proxy: {
      server: 'http://your-proxy-server:port', // Replace with your proxy server
      username: 'your-username', // Replace with your username
      password: 'your-password', // Replace with your password
    },
    headless: true, // Set to false if you want to see the browser in action
  });

  // Create a new page in the browser
  const page = await browser.newPage();

  // Navigate to the target URL
  await page.goto('https://example.com', {
    waitUntil: 'networkidle', // Ensures that the page has fully loaded
  });

  // Perform scraping tasks - adjust the selector to match your target element
  const data = await page.textContent('.data-element');
  console.log('Scraped Data:', data);

  // Close the browser
  await browser.close();
})();
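Playwright also accepts a proxy option on browser.newContext(), which makes it possible to rotate proxies without relaunching the browser. A sketch with placeholder proxy addresses (note that some browser versions also require a proxy to be set at launch for per-context proxies to apply):

const browser = await chromium.launch({ headless: true });

// Each context gets its own proxy (the addresses below are placeholders)
for (const server of ['http://proxy-1:8000', 'http://proxy-2:8000']) {
  const context = await browser.newContext({ proxy: { server } });
  const page = await context.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle' });
  console.log(server, await page.title());
  await context.close();
}

await browser.close();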
This article explored Playwright's setup and advanced techniques such as handling infinite scrolling and parallel scraping. We covered managing sessions and cookies, overcoming CAPTCHAs with third-party solving services, and the role of proxies in improving scraping efficiency and avoiding detection. Mastering these techniques lets professionals navigate modern web scraping challenges and extract valuable data while upholding ethical and legal standards. As the digital landscape evolves, adapting your scraping strategy remains essential for turning web data into better decisions.