Web Scraping with Playwright

Learn everything you need to know about web scraping with Playwright


Web scraping has become an essential technique for extracting valuable data from the vast expanse of the internet. Whether for market research, competitive analysis, or feeding data-hungry machine learning models, the ability to efficiently gather and process web data is a game-changer.

Across the top five marketing journals, the share of publications based on web data more than tripled between 2010 and 2020, rising from about 4% to 15%, with web scraping accounting for 59% of these publications. This surge underscores how central scraped web data has become to research and industry alike.


What is Playwright?

Playwright is an open-source web automation library developed by Microsoft. With this library, developers can automate browsers such as Chromium, Firefox, and WebKit, making it well suited to scraping tasks involving complex and dynamic web pages.

Key Features of Playwright:

  • Cross-Browser Support: Playwright drives all the major browser engines, including Chromium, Firefox, and WebKit, so the same scraping script can run across different environments (see the sketch after this list).
  • Headless Browsing: Run your scripts without launching a browser UI; it’s faster and lighter.
  • Automatic Waiting: Playwright waits for elements to become actionable before interacting with them, minimizing the need for manual sleep calls.
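
Here is a minimal sketch of these features together, using example.com as a stand-in target; the same code runs unchanged on all three engines:

const { chromium, firefox, webkit } = require("playwright");

(async () => {
  // The same script runs on Chromium, Firefox, and WebKit
  for (const browserType of [chromium, firefox, webkit]) {
    const browser = await browserType.launch({ headless: true }); // No visible UI
    const page = await browser.newPage();

    // goto() and textContent() auto-wait, so no manual sleep calls are needed
    await page.goto("https://example.com");
    const heading = await page.textContent("h1");
    console.log(`${browserType.name()}: ${heading}`);

    await browser.close();
  }
})();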

Setting Up Playwright

Here’s how to set up Playwright:

Create a New Project:

mkdir playwright-scraping
cd playwright-scraping
npm init -y

Install Playwright:

npm install playwright

Install Browsers:

npx playwright install

This command installs the necessary browser binaries for Chromium, Firefox, and WebKit.
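
To confirm everything is installed correctly, you can run a quick smoke test; the file name test.js here is just an example:

// test.js: verify the Playwright installation
const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  console.log(await page.title()); // Should print "Example Domain"
  await browser.close();
})();

Run it with node test.js; if a page title prints, the setup works.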


Basic Web Scraping with Playwright

Let’s start with a simple example: scraping the titles of articles from a news website.

script.js

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://news.ycombinator.com/");

  const titles = await page.$$eval(".titleline > a", (links) =>
    links.map((link) => link.textContent.trim())
  );

  console.log(titles);
  await browser.close();
})();

Run the Script: Execute the script with node script.js.

This script navigates to the website and extracts the titles of the top articles.

Let’s go through this code step by step:

Importing Playwright:

const { chromium } = require("playwright");

First, import the chromium module from Playwright.

Launching the browser:

const browser = await chromium.launch({ headless: true });

We launch a Chromium browser instance.

The { headless: true } option runs the browser without a visible UI, which is faster and uses fewer resources.

Creating a new page:

const page = await browser.newPage();

This creates a new page (tab) in our browser instance.

Navigating to the target website:

await page.goto("https://news.ycombinator.com/");

Scraping the titles:

const titles = await page.$$eval(".titleline > a", (links) =>
  links.map((link) => link.textContent.trim())
);

page.$$eval() is a powerful method that combines selecting elements and evaluating a function in the page context.

.titleline > a is a CSS selector that targets all <a> elements that are direct children of elements with the class titleline.
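
For comparison, page.$eval() uses the same pattern but operates only on the first matching element; a quick sketch against the same page:

// $eval: first matching element only
const firstTitle = await page.$eval(".titleline > a", (link) =>
  link.textContent.trim()
);

// $$eval: every matching element
const allTitles = await page.$$eval(".titleline > a", (links) =>
  links.map((link) => link.textContent.trim())
);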

Closing the browser:

await browser.close();

It’s essential to close the browser to free up resources.
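
If the script throws before reaching browser.close(), the browser process is left running. A common safeguard, sketched below, is to wrap the work in try/finally so the browser always closes:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto("https://news.ycombinator.com/");
    // ... scraping work goes here ...
  } finally {
    // Runs even if the scraping code throws, so no orphaned browsers
    await browser.close();
  }
})();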


Advanced Web Scraping Techniques

As websites become more sophisticated, so must our scraping techniques. Here are some advanced methods to enhance your web scraping capabilities with Playwright.

Handling Infinite Scrolling

Many modern websites use infinite scrolling to load content dynamically as the user scrolls down the page. Here’s how you can handle this:

const { chromium } = require("playwright");

async function scrapeTweets(username, scrollTimes) {
  const browser = await chromium.launch({ headless: false }); // Set to true for headless mode
  const page = await browser.newPage();
  await page.goto(`https://twitter.com/${username}`);

  // Wait for tweets to load
  await page.waitForSelector('article[data-testid="tweet"]');

  for (let i = 0; i < scrollTimes; i++) {
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });
    await page.waitForTimeout(2000); // Wait for new content to load
  }

  const tweets = await page.$$eval('article[data-testid="tweet"]', (elements) =>
    elements.map((el) => {
      const tweetText = el
        .querySelector('div[data-testid="tweetText"]')
        ?.textContent.trim();
      const timestamp = el.querySelector("time")?.getAttribute("datetime");
      return { tweetText, timestamp };
    })
  );

  await browser.close();
  return tweets;
}

(async () => {
  const tweets = await scrapeTweets("elonmusk", 5); // Scrape tweets from Elon Musk's account
  console.log(JSON.stringify(tweets, null, 2));
})();

This script scrolls the page several times, waiting for new content to load after each scroll.
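
A fixed scroll count can stop too early or keep scrolling after the feed is exhausted. An alternative sketch, assuming the page grows in height as new content loads, scrolls until the height stops changing:

// Scroll until the page height stops growing or a round limit is hit
async function autoScroll(page, maxRounds = 20) {
  let previousHeight = 0;
  for (let i = 0; i < maxRounds; i++) {
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);
    if (currentHeight === previousHeight) break; // No new content appeared
    previousHeight = currentHeight;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(2000); // Give the site time to load more items
  }
}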

Parallel Scraping

To speed up the scraping process, you can scrape multiple pages in parallel:

const { chromium } = require("playwright");

async function scrapePage(url) {
  const browser = await chromium.launch();
  const context = await browser.newContext(); // Create a new browser context
  const page = await context.newPage(); // Create a new page in the context

  try {
    await page.goto(url, { waitUntil: "domcontentloaded" }); // Wait until the DOM is parsed

    const title = await page.title();
    const headlineText = await page
      .$eval("h1", (el) => el.textContent.trim())
      .catch(() => "No headline found");

    return { url, title, headline: headlineText };
  } catch (error) {
    console.error(`Error scraping ${url}:`, error);
    return { url, title: "Error", headline: "Error fetching headline" };
  } finally {
    await context.close(); // Close the context to free resources
    await browser.close(); // Close the browser
  }
}

async function scrapeInParallel(urls) {
  const results = await Promise.all(urls.map((url) => scrapePage(url)));
  return results;
}

(async () => {
  const urls = [
    "https://www.bbc.com/news",
    "https://www.cnn.com",
    "https://www.reuters.com",
  ];
  const results = await scrapeInParallel(urls);
  console.log(JSON.stringify(results, null, 2));
})();

This script scrapes multiple URLs concurrently, significantly reducing the total time needed to scrape them.
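
Promise.all launches a separate browser for every URL at once, which can exhaust memory on long URL lists. A sketch of simple batching (the batch size of 3 is arbitrary) reuses the scrapePage function above:

// Process URLs in fixed-size batches instead of all at once
async function scrapeInBatches(urls, batchSize = 3) {
  const results = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    // Each batch runs in parallel; batches run one after another
    results.push(...(await Promise.all(batch.map(scrapePage))));
  }
  return results;
}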


Handling Dynamic Content with Playwright

Modern web scraping faces challenges with dynamic content, especially when JavaScript is involved. Playwright effectively addresses these issues.

Challenges with Dynamic Content:

  • JavaScript Rendering: Many sites use JavaScript to load content, which traditional scrapers miss as they only capture static HTML.
  • Asynchronous Loading: AJAX calls load data after the initial page load, complicating scraping efforts.
  • Event-Driven Content: Some data appears after user actions like clicks or scrolling, requiring scrapers to mimic these interactions.

How Playwright Overcomes These:

  • Headless Browsing: Playwright runs JavaScript and renders pages like a real browser, capturing dynamic content.
  • Smart Waiting: It waits for elements to load before interacting, ensuring data is ready for scraping.
  • User Interaction Simulation: Playwright can mimic clicks and scrolling, accessing content loaded by user actions.

Example: Scraping Content Loaded via AJAX:

const { chromium } = require("playwright");

(async () => {
  // Launch the browser
  const browser = await chromium.launch({ headless: false }); // Set headless: false if you want to see the browser actions
  const page = await browser.newPage();

  // Go to the website
  await page.goto("http://quotes.toscrape.com", { waitUntil: "networkidle" });

  // Wait for the dynamic content to be loaded
  await page.waitForSelector(".quote");

  // Extract the first quote from the page
  const quote = await page.$eval(".quote .text", (el) => el.textContent.trim());

  // Print the quote
  console.log(quote);

  // Close the browser
  await browser.close();
})();

This script waits for the network to be idle, ensuring all dynamic content has loaded before attempting to scrape it.
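
When you know which AJAX request carries the data, waiting for that specific response is more precise than networkidle. A sketch, assuming the site's /scroll variant, which loads quotes from an /api/quotes endpoint:

// Wait for the specific AJAX response instead of general network idle
const [response] = await Promise.all([
  page.waitForResponse((res) => res.url().includes("/api/quotes") && res.ok()),
  page.goto("http://quotes.toscrape.com/scroll"),
]);
console.log("Quotes payload:", await response.json());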


Handling Authentication with Playwright

Some websites require user authentication to access their content. Playwright can handle login forms and manage authenticated sessions.

Example: Logging into a Website:

const { chromium } = require("playwright");

(async () => {
  // Launch the browser
  const browser = await chromium.launch({ headless: false }); // Set headless to false to watch the automation
  const page = await browser.newPage();

  try {
    // Go to the login page with an increased timeout
    await page.goto("https://the-internet.herokuapp.com/login", {
      waitUntil: "networkidle",
      timeout: 60000,
    });

    // Fill in the username and password
    await page.fill("#username", "tomsmith");
    await page.fill("#password", "SuperSecretPassword!");

    // Click the login button
    await page.click('button[type="submit"]');

    // Wait for navigation and verify the success message
    await page.waitForSelector(".flash.success");

    // Extract and print the success message
    const successMessage = await page.$eval(".flash.success", (el) =>
      el.textContent.trim()
    );
    console.log("Login Success Message:", successMessage);
  } catch (error) {
    console.error("Error:", error);
  } finally {
    // Close the browser
    await browser.close();
  }
})();

This script fills in the username and password, clicks the login button, and waits for the success message to appear before confirming the login.
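
In real projects, avoid hardcoding credentials in the script; a common approach, sketched here with example variable names, is to read them from environment variables:

// Read credentials from the environment instead of hardcoding them
const username = process.env.SCRAPER_USERNAME;
const password = process.env.SCRAPER_PASSWORD;
if (!username || !password) {
  throw new Error("Set SCRAPER_USERNAME and SCRAPER_PASSWORD first");
}
await page.fill("#username", username);
await page.fill("#password", password);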


Managing Sessions and Cookies with Playwright

When scraping websites, especially those requiring user authentication or tracking, handling cookies and sessions can significantly increase the efficiency of your scraping efforts.

Importance of Managing Sessions and Cookies in Web Scraping

  • State Management: Cookies store user preferences and session data, enabling consistent navigation as a logged-in user.
  • Avoiding Blocks: Mimic human behavior by managing sessions and cookies, reducing the risk of being flagged as a bot.

Example: Saving and Loading Cookies:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({ headless: false }); // Set headless to false to see the actions
  const page = await browser.newPage();

  // Step 1: Go to the login page and perform the login
  await page.goto("https://the-internet.herokuapp.com/login", {
    waitUntil: "networkidle",
  });

  // Fill in username and password
  await page.fill("#username", "tomsmith");
  await page.fill("#password", "SuperSecretPassword!");

  // Click the login button
  await page.click('button[type="submit"]');

  // Wait for the login to complete
  await page.waitForSelector(".flash.success");

  // Step 2: Save the browser state (cookies, localStorage, etc.)
  await page.context().storageState({ path: "state.json" });
  console.log("Session saved to state.json");

  // Close the browser
  await browser.close();

  // Step 3: Reuse the saved session in a fresh browser
  const browser2 = await chromium.launch({ headless: false });
  const context = await browser2.newContext({ storageState: "state.json" });
  const page2 = await context.newPage();

  // Step 4: Go to a page that requires login (reusing the session)
  await page2.goto("https://the-internet.herokuapp.com/secure");

  // Confirm that you are logged in by checking for the logout link
  if (await page2.isVisible('a[href="/logout"]')) {
    console.log("Cookies and session restored successfully, and logged in!");
  } else {
    console.log("Failed to restore session.");
  }

  // Close the second browser
  await browser2.close();
})();
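
Beyond storageState, the browser context also exposes cookies directly, which is useful for inspecting or injecting individual cookies; a short sketch (the cookie name and value are just examples):

// Read the cookies currently held by the context
const cookies = await context.cookies();
console.log(`Context holds ${cookies.length} cookies`);

// Inject a cookie manually
await context.addCookies([
  {
    name: "example_pref", // Hypothetical cookie for illustration
    value: "dark-mode",
    domain: "the-internet.herokuapp.com",
    path: "/",
  },
]);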


Handling Captchas with Playwright

CAPTCHAs pose significant challenges for web scraping by distinguishing between humans and bots. Key issues include:

  • Interruption: CAPTCHAs halt scraping until solved.
  • Automation Difficulty: Designed for human solving, challenging for bots.
  • Evolving Complexity: Newer versions like reCAPTCHA are increasingly complex.

To overcome these obstacles, scrapers can integrate CAPTCHA-solving services such as 2Captcha or Anti-Captcha. These services allow scripts to solve CAPTCHAs automatically, enabling continued scraping.

Here’s a basic example using 2Captcha:

const { chromium } = require('playwright');
const axios = require('axios');
const FormData = require('form-data');

// Function to solve CAPTCHA using 2Captcha
const solveCaptcha = async (siteKey, pageUrl, apiKey) => {
  const form = new FormData();
  form.append('method', 'userrecaptcha');
  form.append('googlekey', siteKey);
  form.append('key', apiKey);
  form.append('pageurl', pageUrl);
  form.append('json', 1);

  const response = await axios.post('http://2captcha.com/in.php', form, {
    headers: form.getHeaders(),
  });

  const requestId = response.data.request;

  // Poll for CAPTCHA solution
  while (true) {
    await new Promise((res) => setTimeout(res, 5000));
    const result = await axios.get(
      `http://2captcha.com/res.php?key=${apiKey}&action=get&id=${requestId}&json=1`
    );
    if (result.data.status === 1) {
      return result.data.request;
    }
  }
};

const main = async () => {
  const browser = await chromium.launch({ headless: true }); // Run in headless mode
  const page = await browser.newPage();
  const targetUrl = 'https://example.com'; // Replace with your target URL
  await page.goto(targetUrl);

  const siteKey = 'SITE_KEY_HERE'; // Replace with the reCAPTCHA site key
  const apiKey = '2CAPTCHA_API_KEY_HERE'; // Replace with your 2Captcha API key

  // Solve CAPTCHA
  const captchaSolution = await solveCaptcha(siteKey, targetUrl, apiKey);

  // Inject the CAPTCHA solution into the hidden response field and submit the form
  await page.evaluate((solution) => {
    document.getElementById('g-recaptcha-response').innerHTML = solution;
  }, captchaSolution);
  await page.click('#submit-button'); // Replace with your form's submit button selector

  await page.waitForNavigation();
  console.log('CAPTCHA solved and form submitted.');

  await browser.close();
};

main().catch((err) => console.error(err));


How Proxies Enhance Web Scraping with Playwright

Proxies play a crucial role in web scraping by providing an intermediary between the scraper and the target website.

  • IP Rotation: Avoid blocks by distributing requests across multiple IPs, mimicking human behavior.
  • Geolocation Access: Enables bypassing of geo-restrictions, allowing data collection from diverse sources.
  • Avoiding Rate Limits: Spread requests across IPs to reduce the impact of website-imposed request limits.

Example: Using Proxies in Playwright:

const { chromium } = require('playwright');

(async () => {
  // Launch browser with proxy settings
  const browser = await chromium.launch({
    proxy: {
      server: 'http://your-proxy-server:port', // Replace with your proxy server
      username: 'your-username', // Replace with your username
      password: 'your-password', // Replace with your password
    },
    headless: true, // Set to false if you want to see the browser in action
  });

  // Create a new page in the browser
  const page = await browser.newPage();

  // Navigate to the target URL
  await page.goto('https://example.com', {
    waitUntil: 'networkidle', // Ensures that the page has fully loaded
  });

  // Perform scraping tasks - Adjust the selector to match your target element
  const data = await page.textContent('.data-element');
  console.log('Scraped Data:', data);

  // Close the browser
  await browser.close();
})();
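
For IP rotation, you can also assign a different proxy to each browser context rather than relaunching the browser for every IP. A sketch with placeholder proxy servers (note that some Playwright versions require a proxy to also be configured at launch for per-context proxies to work in Chromium):

const { chromium } = require('playwright');

(async () => {
  // Placeholder proxies; replace with real servers
  const proxies = ['http://proxy-one:8080', 'http://proxy-two:8080'];

  const browser = await chromium.launch({ headless: true });
  for (const server of proxies) {
    // Each context gets its own proxy and its own cookie/session state
    const context = await browser.newContext({ proxy: { server } });
    const page = await context.newPage();
    await page.goto('https://example.com');
    console.log(`Fetched via ${server}:`, await page.title());
    await context.close();
  }
  await browser.close();
})();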


Let’s Review

This article covered Playwright's setup and advanced techniques such as handling infinite scrolling and parallel scraping. We discussed managing sessions and cookies, overcoming CAPTCHAs with third-party solving services, and the role of proxies in improving scraping efficiency and avoiding detection. Mastering these techniques lets you navigate the challenges of modern web scraping and extract valuable data while upholding ethical and legal standards. As the digital landscape evolves, adapting your scraping strategies remains essential for putting that data to work in decision-making.
