In this guide, we will explore Selenium to give you a comprehensive understanding of web scraping, showcasing its ability to tackle complex scenarios.
Web scraping is not as complex or difficult as it may seem. At its core, web scraping means running a program that extracts data from a website automatically instead of copying it by hand.
For example, some people and organizations scrape e-commerce sites for price monitoring, while others scrape user data from websites to gain insights into their interests. The right toolset and a bit of knowledge are all you need to start web scraping.
Selenium is one of the most widely used open-source tools for automating web browsers. It was originally designed for testing, but it is also widely used for web scraping because it can simulate real user activity and handle the complex, dynamic sites where other tools struggle.
Key features of Selenium include support for all major browsers, bindings for several programming languages, the ability to simulate real user interactions such as clicking, typing, and scrolling, and support for dynamic, JavaScript-heavy pages.
As mentioned, Selenium supports multiple programming languages, including Java, C#, Ruby, and JavaScript. However, for this demonstration, we’ll be using Python due to its simplicity and extensive library support.
Before installing Selenium, make sure pip is available by checking its version:
pip --version
Then, open your terminal or command prompt and run the following command to install Selenium:
pip install selenium
This will install the latest version of Selenium, which includes built-in support for automatic WebDriver management. This means you don’t need to manually download or configure a WebDriver; Selenium will handle it for you.
First, create a new Python file and import Selenium’s WebDriver:
from selenium import webdriver
Now, let’s set up the WebDriver. Selenium supports several browsers, but in this demonstration we will use Chrome.
driver = webdriver.Chrome()
This single line is enough to launch Chrome, and you can now navigate to a webpage using:
driver.get("https://www.selenium.dev/")
The above line navigates to Selenium’s official website; change the URL to experiment as you wish. After completing the task, close the browser with:
driver.quit()
Save your Python file, then navigate to your project folder in the terminal and execute the script:
python web_scraper.py
Replace web_scraper.py with the name of your file.
Complete code snippet:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.selenium.dev/")
driver.quit()
This simple script will open the browser, navigate to the specified webpage, and then close the browser. This is a basic setup of Selenium. Next, we will explore our main task: web scraping using Selenium. Stay tuned!
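If you prefer to run Chrome without a visible window (for example on a server), you can launch it in headless mode. This is an optional sketch, assuming a recent Chrome version that supports the --headless=new flag; older versions use plain --headless.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chrome to run without opening a visible browser window
options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
driver.get("https://www.selenium.dev/")
print(driver.title)  # confirm the page loaded
driver.quit()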
Now it’s time to get to a simple Web Scraping example to better understand Selenium’s capabilities. For this demonstration, let’s use the BBC website to scrape a few news headlines.
Start by importing Selenium’s WebDriver along with the By class for locating elements, then set up the WebDriver for Chrome as we discussed earlier:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
Use the get() method to navigate to the URL:
driver.get("https://www.bbc.com/news/world")
Use an implicit wait to allow Selenium to wait for elements to appear before interacting with them:
driver.implicitly_wait(2)
This ensures that Selenium doesn’t attempt to interact with elements that haven’t fully loaded yet.
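Implicit waits apply globally to every element lookup. If you need finer control, Selenium also offers explicit waits, which pause until a specific condition is met. Here is a minimal sketch using WebDriverWait and expected_conditions; the ten-second timeout and the <h2> tag are just illustrative choices.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one <h2> element to appear in the DOM
wait = WebDriverWait(driver, 10)
first_heading = wait.until(EC.presence_of_element_located((By.TAG_NAME, "h2")))
print(first_heading.text)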
We will locate the headlines using the <h2> tags with the class name "sc-4fedabc7-3 bvDsJq". The driver.find_elements method finds all elements on the page that match the given criteria and returns them as a list:
headlines = driver.find_elements(By.CLASS_NAME, "sc-4fedabc7-3.bvDsJq")
Here we used By.CLASS_NAME to specify that we’re searching for elements based on their class name. Selenium also provides various other locators, such as By.NAME, By.ID, By.TAG_NAME, By.CSS_SELECTOR, and By.XPATH, to identify elements on a webpage.
Understanding how to inspect a webpage using developer tools and choose the right locator is essential for effective web scraping.
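To illustrate, here are a few of those locator strategies side by side. This is only a sketch: the IDs, names, and selectors below are made up for demonstration, so inspect the actual page to find attributes that really exist.
from selenium.webdriver.common.by import By

# All of the values below are hypothetical examples of each locator strategy
element_by_id = driver.find_element(By.ID, "main-heading")
element_by_name = driver.find_element(By.NAME, "q")
elements_by_tag = driver.find_elements(By.TAG_NAME, "h2")
element_by_css = driver.find_element(By.CSS_SELECTOR, "div.article > h2")
element_by_xpath = driver.find_element(By.XPATH, "//h2[contains(@class, 'headline')]")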
Loop through the list of headline elements and print out their text content:
for headline in headlines:
    print(headline.text)
When you run the script, the scraped headlines are printed in the terminal.
After extracting the data, close the browser:
driver.quit()

Complete code snippet:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Step 1: Set up the WebDriver
driver = webdriver.Chrome()

# Step 2: Navigate to the BBC News "World" section
driver.get("https://www.bbc.com/news/world")

# Step 3: Set an implicit wait
driver.implicitly_wait(2)

# Step 4: Locate the h2 elements with the specified class name
headlines = driver.find_elements(By.CLASS_NAME, "sc-4fedabc7-3.bvDsJq")

# Step 5: Loop through and print the headlines
for headline in headlines:
    print(headline.text)

# Step 6: Close the browser
driver.quit()
Now, we will look at how to handle advanced tasks like scraping dynamic content, and pagination with Selenium.
Many websites use pagination to display lists of items like products or articles across multiple pages. To scrape all content, you must navigate through each page, either by dynamically generating URLs or by interacting with pagination elements like “Next” buttons or page links.
Many websites have predictable URL patterns for different pages. For example, a website might use URLs like https://example.com/products?page=1, https://example.com/products?page=2, and so on.
You can loop through these URLs in your Selenium script and scrape data from each one. Let’s look at how to do this on the Books to Scrape website to gather some data. This website is specifically designed for scraping practice and contains a wide variety of dummy books.
Example: Scrape data from Books to Scrape
On the Books to Scrape site, if you visit the 2nd page using the pagination controls, you will be redirected to the following URL:
https://books.toscrape.com/catalogue/page-2.html
If you look at the URL, the number before ‘.html’ indicates which page you are viewing. For instance, changing it to 1 takes you to the first page of the book list.
In this example, we will use this URL structure to extract data for the first one hundred items by browsing through multiple pages. Since each page lists 20 items, you have to scrape five pages to gather them all. Let’s see how it works.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
books_data = []

# Loop through the first 5 pages to gather 100 items
for page in range(1, 6):  # 1 to 5, inclusive
    url = f"https://books.toscrape.com/catalogue/page-{page}.html"
    driver.get(url)

    # Extract book titles and prices
    books = driver.find_elements(By.CLASS_NAME, "product_pod")
    for book in books:
        title = book.find_element(By.TAG_NAME, "h3").text
        price = book.find_element(By.CLASS_NAME, "price_color").text
        books_data.append({'title': title, 'price': price})

    # Add a short delay between page loads to mimic human behavior
    time.sleep(2)

    # If we have collected 100 items, break the loop
    if len(books_data) >= 100:
        break

# Close the browser
driver.quit()

# Display the collected data
for i, book in enumerate(books_data[:100], start=1):
    print(f"{i}. Title: {book['title']}, Price: {book['price']}")
Above, you can find the complete code snippet for this process. The first step is to dynamically generate the URL so we can visit multiple pages. To do this, we created a simple for loop that increments the page number programmatically, allowing us to visit each page and scrape its data:

for page in range(1, 6):
    url = f"https://books.toscrape.com/catalogue/page-{page}.html"
    driver.get(url)
After visiting each page, our task is to scrape the necessary data and append it to a list:
books = driver.find_elements(By.CLASS_NAME, "product_pod")
for book in books:
    title = book.find_element(By.TAG_NAME, "h3").text
    price = book.find_element(By.CLASS_NAME, "price_color").text
    books_data.append({'title': title, 'price': price})
Now all the scraped data is in the books_data list, and we print it with a simple for loop:

for i, book in enumerate(books_data[:100], start=1):
    print(f"{i}. Title: {book['title']}, Price: {book['price']}")
Dynamic URLs aren’t the only way to handle pagination. Selenium also lets us interact with the website through actions like clicking buttons.
On the Books to Scrape site, you’ll find a ‘Next’ button at the bottom right of the pagination bar. In this example, we’ll use this button to handle pagination and scrape the data. Let’s see how it works:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Chrome()
books_data = []
driver.get('https://books.toscrape.com/')

# Loop through pages by clicking the "Next" button until 100 items are collected
while True:
    books = driver.find_elements(By.CLASS_NAME, "product_pod")
    for book in books:
        title = book.find_element(By.TAG_NAME, "h3").text
        price = book.find_element(By.CLASS_NAME, "price_color").text
        books_data.append({'title': title, 'price': price})

    if len(books_data) >= 100:
        break

    # Try to find and click the "Next" button
    try:
        next_button = driver.find_element(By.CSS_SELECTOR, "li.next a")
        next_button.click()
    except NoSuchElementException:
        print("No more pages to load.")
        break

    time.sleep(2)
In this example, the key focus is interacting with the website by clicking the ‘Next’ button. Similar to finding elements by class name, we locate the button and use Selenium’s click() command to interact with it.
next_button = driver.find_element(By.CSS_SELECTOR, "li.next a")
next_button.click()
Selenium is powerful, but it doesn’t come with a built-in retry mechanism. For instance, handling dynamic pages, network problems, and slow-loading elements can be difficult with Selenium alone. One way to overcome these difficulties is to implement your own error handling and retry logic.
Instead of letting the script crash when a web element is not found or an interaction fails due to a transient issue, you may opt to catch the exception and retry the operation. This approach is particularly useful when dealing with elements that may take longer to load or if the network is slow.
Here’s how you can implement error handling and retries in Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
import time

def find_element_with_retries(driver, by, value, retries=3, delay=2):
    for attempt in range(retries):
        try:
            element = driver.find_element(by, value)
            return element
        except (NoSuchElementException, ElementClickInterceptedException) as e:
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay} seconds…")
            time.sleep(delay)
    raise Exception(f"Failed to find element after {retries} retries")

driver = webdriver.Chrome()
driver.get("https://example.com")

try:
    button = find_element_with_retries(driver, By.ID, "submit-button", retries=5, delay=3)
    button.click()
except Exception as e:
    print(f"Error: {e}. Could not complete the operation.")

# Close the browser
driver.quit()
The find_element_with_retries function demonstrates how to implement exception handling and retries in web scraping. It tries to find an element and, if the attempt fails with a NoSuchElementException or ElementClickInterceptedException, waits for the specified delay and tries again until the element is found or the retry limit is reached.
After scraping, you usually need to store the data for analysis or further processing. One common method of saving scraped data is writing it to a JSON file. JSON is a lightweight, human-readable data-interchange format that is well suited for storage.
Let’s take a look at how to store data in a JSON file after scraping from the Books to Scrape website:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time
import json
# Save the collected data to a JSON file
with open('books_data.json', 'w') as json_file:
    json.dump(books_data, json_file, indent=4)
Here, we extended the previous script by adding the json.dump method, which converts the Python list into a JSON structure and writes it to a file. To make this work, you need to import Python’s built-in json module.
# Import the module
import json
After execution, you’ll find a books_data.json file containing the data list.
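If you want to verify the file or reuse the data later, you can load it back with the same json module. A minimal sketch, assuming books_data.json sits in your working directory:
import json

# Load the saved data back into a Python list of dictionaries
with open('books_data.json', 'r') as json_file:
    saved_books = json.load(json_file)

print(f"Loaded {len(saved_books)} books")
print(saved_books[0])  # inspect the first record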
A proxy serves as a ‘middleman’, forwarding your requests to the website you are scraping through another IP address. With proxies, you can hide your real IP address and make it appear that requests are coming from different locations.
This is particularly useful when scraping large amounts of data or accessing content that varies by region. The benefits of proxies include avoiding IP bans and rate limits, distributing requests across multiple addresses, and accessing region-specific content.
Here is how you can configure a proxy:
A proxy server allows you to route your web requests through a different IP address. You can explore both free and paid options to find a proxy server. For this example, you can get a free proxy address and add it to your code as a variable.
proxy = "http://your-proxy-address:port"
You’ll need to configure Selenium to use this proxy when launching the browser. This is done by passing the proxy settings as an argument to the WebDriver.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Step 1: Set up the proxy settings
proxy = "http://your-proxy-address:port"  # Replace with your actual proxy
chrome_options = Options()
chrome_options.add_argument(f'--proxy-server={proxy}')

# Step 2: Set up the WebDriver with the proxy settings
driver = webdriver.Chrome(options=chrome_options)
Now, you can use Selenium as usual. The difference is that all your requests will be routed through the proxy, making it appear as though they are coming from the proxy’s IP address.
For example, you can visit the httpbin website to check your IP address:
# Step 3: Navigate to a website to check the IP address
driver.get("https://httpbin.org/ip")

# Step 4: Print the page source to confirm the proxy is working
print(driver.page_source)
Selenium is a great tool for collecting information, performing various operations, and handling complex web structures. It provides all the functionality you need for web scraping and is especially well suited to dynamic websites. You can further improve this process by learning advanced techniques like pagination, retries, and using proxies.
If you followed the examples in this article, you should have enough knowledge to start a web scraping project using Selenium easily.