Web Scraping with Selenium Guide

In this guide, we will explore Selenium to give you a comprehensive understanding of web scraping, showcasing its ability to tackle complex scenarios.

Web scraping is not as complex or hard as it may seem. At its core, web scraping means running a program that extracts data from a website instead of copying it manually.

For example, some people and organizations scrape e-commerce sites to monitor prices, while others scrape user data from websites to gain insight into their interests. The right toolset and a bit of knowledge are all you need to start web scraping.


What is Selenium?

Selenium is one of the most widely used open-source tools for automating web browsers. It was originally designed for testing, but it is also widely used for web scraping. It can simulate real user activity and handles complex, dynamic sites where other tools may struggle.

Key features of Selenium include:

  • Cross-browser compatibility: Selenium supports all the major browsers, including Chrome, Firefox, Safari, and Edge. This allows us to run the same Selenium script across different browsers without modifying it (see the short sketch after this list).
  • Multi-language support: Selenium lets you write your scraping scripts in different languages, such as Python, Java, and JavaScript. This is possible because Selenium provides official language bindings (libraries) for each of them.
  • Dynamic content interaction: One of Selenium’s standout features is its ability to interact with dynamic, JavaScript-driven content, including content that loads after the initial page load, such as AJAX responses or real-time updates.
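
To illustrate the cross-browser point, here is a minimal sketch: the same script can target a different browser just by changing the driver you create. It assumes you have both Chrome and Firefox installed and that Selenium 4+ manages their drivers automatically.

from selenium import webdriver

# Launch Chrome
driver = webdriver.Chrome()

# ...or launch Firefox instead; the rest of the script stays the same
# driver = webdriver.Firefox()

driver.get("https://www.selenium.dev/")
driver.quit()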

Installing and Setting up Selenium

As we discussed, Selenium supports multiple programming languages, including Python, Java, C#, Ruby, and JavaScript. However, for this demonstration, we’ll use Python due to its simplicity and extensive library support.

Prerequisites:

  1. Python Installed: Ensure Python is installed on your machine. You can download it from Python’s official website.
  2. Pip (Python Package Installer): pip is essential for installing Selenium and other Python packages. It typically comes bundled with Python, but you can verify its installation by running pip --version in your terminal or command prompt.

Step 1: Install Selenium

Open your terminal or command prompt and run the following command to install Selenium with pip:

pip install selenium

This will install the latest version of Selenium, which includes built-in support for automatic WebDriver management. This means you don’t need to manually download or configure a WebDriver; Selenium will handle it for you.
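
If automatic driver management does not work in your environment (for example, on a machine without internet access), you can still point Selenium at a driver binary yourself. Here is a minimal sketch; the chromedriver path is only a placeholder that you would replace with your own:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical path to a manually downloaded chromedriver binary
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)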

Step 2: Write Your First Selenium Script

First, create a new Python file and import Selenium’s WebDriver:

from selenium import webdriver

Now, let’s set up the WebDriver for Chrome (or another browser of your choice). In this demonstration, we will use Chrome.

driver = webdriver.Chrome()

This line is enough to launch Chrome, and you can now navigate to a webpage using:

driver.get("https://www.selenium.dev/")

The line above navigates to Selenium’s official website; feel free to change the URL and experiment. After completing the task, close the browser with:

driver.quit()

Step 3: Run Your Script

Save your Python file and in the terminal, navigate to your project folder and execute the script:

python web_scraper.py

Replace web_scraper.py with the name of your file.

Complete code snippet:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.selenium.dev/")
driver.quit()

This simple script will open the browser, navigate to the specified webpage, and then close the browser. This is a basic setup of Selenium. Next, we will explore our main task: web scraping using Selenium. Stay tuned!


Basic Web Scraping with Selenium

Now it’s time to work through a simple web scraping example to better understand Selenium’s capabilities. For this demonstration, we’ll use the BBC website to scrape a few news headlines.

Step 1: Import Required Libraries and Set up the Code

Start by importing Selenium’s WebDriver and the By class, then set up the WebDriver for Chrome as we discussed earlier:

from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()

Step 2: Navigate to the BBC News “World” Section

Use the get() method to navigate to the URL:

driver.get("https://www.bbc.com/news/world")

Step 3: Set an Implicit Wait

Use an implicit wait to allow Selenium to wait for elements to appear before interacting with them:

driver.implicitly_wait(2)

This ensures that Selenium doesn’t attempt to interact with elements that haven’t loaded yet.
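
An implicit wait applies globally to every element lookup. If you need to wait for one specific element, Selenium also supports explicit waits. The sketch below waits up to 10 seconds for at least one <h2> element to appear; the tag name is just an example:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for an <h2> element to be present in the DOM
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, "h2")))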

Step 4: Locate and Extract the Headlines

We will locate the <h2> elements with the class name "sc-4fedabc7-3 bvDsJq" (auto-generated class names like this can change over time, so verify it in your browser’s developer tools). The driver.find_elements method finds all elements on the page that match the given criteria and returns them as a list:

headlines = driver.find_elements(By.CLASS_NAME, "sc-4fedabc7-3.bvDsJq")

Here we used By.CLASS_NAME to specify that we’re searching for elements by their class name. Selenium also provides other locators, such as By.NAME, By.ID, By.TAG_NAME, By.CSS_SELECTOR, and By.XPATH, to identify elements on a webpage.

Understanding how to inspect a webpage using developer tools and choose the right locator is essential for effective web scraping.
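
For example, the same headline elements could, in principle, be located with several different strategies. The selectors below are illustrative only; the exact values depend on the page’s current markup:

# By tag name: every <h2> on the page
headlines = driver.find_elements(By.TAG_NAME, "h2")

# By CSS selector: <h2> elements with a specific class (example value)
headlines = driver.find_elements(By.CSS_SELECTOR, "h2.sc-4fedabc7-3")

# By XPath: <h2> elements anywhere in the document (example expression)
headlines = driver.find_elements(By.XPATH, "//h2")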

Step 5: Loop Through and Print the Headlines

Loop through the list of headline elements and print out their text content:

for headline in headlines:
    print(headline.text)

When you run the script, the scraped headlines are printed to the terminal, one per line.

Step 6: Close the Browser

After extracting the data, close the browser:

driver.quit()

Complete code snippet:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Step 1: Set up the WebDriver
driver = webdriver.Chrome()

# Step 2: Navigate to the BBC News "World" section
driver.get("https://www.bbc.com/news/world")

# Step 3: Set an implicit wait
driver.implicitly_wait(2)

# Step 4: Locate the h2 elements with the specified class name
headlines = driver.find_elements(By.CLASS_NAME, "sc-4fedabc7-3.bvDsJq")

# Step 5: Loop through and print the headlines
for headline in headlines:
    print(headline.text)

# Step 6: Close the browser
driver.quit()


Advanced Web Scraping Techniques

Now we will look at how to handle more advanced tasks with Selenium, such as scraping dynamic content and handling pagination.

1. Handling Pagination with Selenium

Many websites use pagination to display lists of items like products or articles across multiple pages. To scrape all content, you must navigate through each page, either by dynamically generating URLs or by interacting with pagination elements like “Next” buttons or page links.

Method 1: Dynamically Generating URLs

Many websites have predictable URL patterns for different pages. For example, a website might use URLs like:

  • https://example.com/page/1
  • https://example.com/page/2
  • https://example.com/page/3

You can loop through these URLs in your Selenium script and scrape data from each one. Let’s look at how to do this on the Books to Scrape website to gather some data. This website is specifically designed for scraping practice and contains a wide variety of dummy books.

Example: Scrape data from Books to Scrape

On the Books to Scrape site, if you visit the 2nd page using the pagination controls, you will be redirected to the following URL:

https://books.toscrape.com/catalogue/page-2.html

If you look at the URL, the number before ‘.html’ indicates which page you are viewing. For instance, changing it to 1 takes you to the first page of the book list.

In this example, we will use this URL structure to extract data for the first one hundred items by browsing through multiple pages. Since each page lists 20 books, we need to scrape five pages. Let’s see how it works.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
books_data = []

# Loop through the first 5 pages to gather 100 items
for page in range(1, 6):  # 1 to 5, inclusive
    url = f"https://books.toscrape.com/catalogue/page-{page}.html"
    driver.get(url)

    # Extract book titles and prices
    books = driver.find_elements(By.CLASS_NAME, "product_pod")
    for book in books:
        title = book.find_element(By.TAG_NAME, "h3").text
        price = book.find_element(By.CLASS_NAME, "price_color").text
        books_data.append({'title': title, 'price': price})

    # Add a short delay between page loads to mimic human behavior
    time.sleep(2)

    # If we have collected 100 items, break the loop
    if len(books_data) >= 100:
        break

# Display the collected data
for i, book in enumerate(books_data[:100], start=1):
    print(f"{i}. Title: {book['title']}, Price: {book['price']}")

driver.quit()

Above, you can find the complete code snippet for this process. The first step is to dynamically generate the URL to visit multiple pages. To do this, we created a simple for loop that increments the page number programmatically, allowing us to visit each page and scrape the data:

for page in range(1, 6):  # 1 to 5, inclusive
    url = f"https://books.toscrape.com/catalogue/page-{page}.html"
    driver.get(url)

After visiting each page, we scrape the data we need and append it to a list:

books = driver.find_elements(By.CLASS_NAME, "product_pod")
for book in books:
    title = book.find_element(By.TAG_NAME, "h3").text
    price = book.find_element(By.CLASS_NAME, "price_color").text
    books_data.append({'title': title, 'price': price})

Now all the scraped data is in the books_data list, and we print it with a simple for loop:

for i, book in enumerate(books_data[:100], start=1):
    print(f"{i}. Title: {book['title']}, Price: {book['price']}")

Method 2: Handle Pagination by Clicking the “Next” Button

Dynamic URLs aren’t the only way to handle pagination. Selenium also lets us interact with the website through actions like clicking buttons.

On the Books to Scrape site, you’ll find a ‘Next’ button at the bottom right of the pagination bar. In this example, we’ll use this button to handle pagination and scrape the data. Let’s see how it works:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Chrome()
books_data = []
driver.get('https://books.toscrape.com/')

# Loop through pages by clicking the "Next" button until 100 items are collected
while True:
    books = driver.find_elements(By.CLASS_NAME, "product_pod")
    for book in books:
        title = book.find_element(By.TAG_NAME, "h3").text
        price = book.find_element(By.CLASS_NAME, "price_color").text
        books_data.append({'title': title, 'price': price})

    if len(books_data) >= 100:
        break

    # Try to find and click the "Next" button
    try:
        next_button = driver.find_element(By.CSS_SELECTOR, "li.next a")
        next_button.click()
    except NoSuchElementException:
        print("No more pages to load.")
        break

    time.sleep(2)

for i, book in enumerate(books_data[:100], start=1):
    print(f"{i}. Title: {book['title']}, Price: {book['price']}")

driver.quit()

In this example, the key focus is interacting with the website by clicking the ‘Next’ button. Similar to finding elements by class name, we locate the button and use Selenium’s click() command to interact with it.

next_button = driver.find_element(By.CSS_SELECTOR, "li.next a")
next_button.click()

2. Error Handling and Retries with Selenium

Selenium is powerful, but it doesn’t come with a built-in retry mechanism. Handling dynamic pages, network problems, and slow-loading elements can therefore be difficult with Selenium alone. One way to overcome these difficulties is to implement your own error handling and retry logic.

Implementing Error Handling and Retries

Instead of letting the script crash when a web element is not found or an interaction fails due to a transient issue, you may opt to catch the exception and retry the operation. This approach is particularly useful when dealing with elements that may take longer to load or if the network is slow.

Here’s how you can implement error handling and retries in Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
import time

driver = webdriver.Chrome()

def find_element_with_retries(driver, by, value, retries=3, delay=2):
    for attempt in range(retries):
        try:
            element = driver.find_element(by, value)
            return element
        except (NoSuchElementException, ElementClickInterceptedException) as e:
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay} seconds…")
            time.sleep(delay)
    raise Exception(f"Failed to find element after {retries} retries")

driver.get("https://example.com")

try:
    button = find_element_with_retries(driver, By.ID, "submit-button", retries=5, delay=3)
    button.click()
except Exception as e:
    print(f"Error: {e}. Could not complete the operation.")

# Close the browser
driver.quit()

The find_element_with_retries function demonstrates how to implement exception handling and retries in web scraping. It tries to find an element and, if that attempt fails with a NoSuchElementException or ElementClickInterceptedException, waits for a delay and tries again until the element is found or the retry limit is reached, at which point it raises an exception.

3. Storing and Processing Data with Selenium

After scraping, you usually need to store the data for analysis, further processing, and so on. One common way to save scraped data is to write it to a JSON file. JSON is a lightweight, easy-to-read data-interchange format that is well suited for storage.

Let’s take a look at how to store data in a JSON file after scraping from the Books to Scrape website:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time
import json

driver = webdriver.Chrome()
books_data = []
driver.get('https://books.toscrape.com/')

# Loop through pages by clicking the "Next" button until 100 items are collected
while True:
    books = driver.find_elements(By.CLASS_NAME, "product_pod")
    for book in books:
        title = book.find_element(By.TAG_NAME, "h3").text
        price = book.find_element(By.CLASS_NAME, "price_color").text
        books_data.append({'title': title, 'price': price})

    if len(books_data) >= 100:
        break

    # Try to find and click the "Next" button
    try:
        next_button = driver.find_element(By.CSS_SELECTOR, "li.next a")
        next_button.click()
    except NoSuchElementException:
        print("No more pages to load.")
        break

    time.sleep(2)

for i, book in enumerate(books_data[:100], start=1):
    print(f"{i}. Title: {book['title']}, Price: {book['price']}")

# Save the collected data to a JSON file
with open('books_data.json', 'w') as json_file:
    json.dump(books_data, json_file, indent=4)

driver.quit()

Here, we extended the previous code by adding the json.dump method, which converts the Python list into JSON and writes it to a file. To make this work, you need to import Python’s built-in json module.

# Import the module
import json

with open('books_data.json', 'w') as json_file:
    json.dump(books_data, json_file, indent=4)

After execution, you’ll find a books_data.json file containing the data list.
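
To process the data later, you can load it back into a Python list with the same json module. A minimal sketch:

import json

# Load the previously saved data back into a Python list
with open('books_data.json', 'r') as json_file:
    books_data = json.load(json_file)

print(f"Loaded {len(books_data)} books")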


How Proxies Enhance Web Scraping

A proxy serves as a ‘middleman’ that forwards your requests to the website you are scraping from another IP address. With proxies, you can hide your real IP address and make requests appear to come from different locations.

This is particularly useful when scraping large amounts of data or accessing content that varies by region. The benefits of proxies include:

  • Avoiding IP Bans: Distributes requests to prevent detection and blocking.
  • Geolocation Flexibility: Access region-specific content by routing through different locations.
  • Anonymity: Hides your real IP address, increasing privacy.
  • Bypassing Rate Limits: Sends requests from multiple IPs to avoid hitting rate limits.

Here is how you can configure a proxy:

Step 1: Set up proxy configuration

A proxy server allows you to route your web requests through a different IP address. You can explore both free and paid options to find a proxy server. For this example, you can get a free proxy address and add it to your code as a variable.

proxy = "http://your-proxy-address:port"

Step 2: Configure Selenium to use the Proxy

You’ll need to configure Selenium to use this proxy when launching the browser. This is done by passing the proxy settings as an argument to the WebDriver.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Step 1: Set up the proxy settings

proxy = "http://your-proxy-address:port" # Replace with your actual proxy
chrome_options = Options()
chrome_options.add_argument(f'--proxy-server={proxy}')

# Step 2: Set up the WebDriver with the proxy settings

driver = webdriver.Chrome(options=chrome_options)

Step 3: Navigate to a Website

Now, you can use Selenium as usual. The difference is that all your requests will be routed through the proxy, making it appear as though they are coming from the proxy’s IP address.

For example, you can visit the httpbin website to check your IP address:

# Step 3: Navigate to a website to check the IP address
driver.get("https://httpbin.org/ip")

# Step 4: Print the page source to confirm the proxy is working
print(driver.page_source)

Complete code snippet:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Step 1: Set up the proxy settings
proxy = "http://your-proxy-address:port"  # Replace with your actual proxy
chrome_options = Options()
chrome_options.add_argument(f'--proxy-server={proxy}')

# Step 2: Set up the WebDriver with the proxy settings
driver = webdriver.Chrome(options=chrome_options)

# Step 3: Navigate to a website to check the IP address
driver.get("https://httpbin.org/ip")

# Step 4: Print the page source to confirm the proxy is working
print(driver.page_source)

# Step 5: Close the browser
driver.quit()


Conclusion

Selenium is a great tool for collecting information, performing various operations, and navigating complex web structures. It provides all the functionality you need for web scraping and is especially well suited to dynamic websites. You can improve the process further with advanced techniques like pagination handling, retries, and proxies.

If you followed the examples in this article, you should have enough knowledge to start your own web scraping project with Selenium.
