Learn how to scrape websites using BeautifulSoup in this step-by-step guide.
Modern organizations use web scraping to gather data from various sources and improve decision-making. But when collecting data at scale, the choice of tool matters, because the available tools differ widely in capability.
Among these tools, BeautifulSoup is a popular choice because of its ease of use and useful extra features. It is especially well suited to beginners in web scraping and to anyone who needs a simple solution for their data extraction needs. This blog walks you through what BeautifulSoup offers, how to install it, and the basics of web scraping with it.
BeautifulSoup is a Python library designed for quick and easy web scraping. It parses HTML and XML documents and makes it easy to navigate their structure and retrieve the data you need. Its straightforward syntax is accessible even to those with minimal coding background, which makes it a popular choice over more complex libraries. It also copes well with poorly formed HTML, a frequent problem when scraping real-world pages. BeautifulSoup does not parse documents itself; instead, it sits on top of one of several supported parsers:
html.parser: Python's built-in parser; no extra installation required.
lxml: a very fast parser; installed separately with pip install lxml.
html5lib: the most lenient parser, which handles markup the way a web browser does; installed separately with pip install html5lib.
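If you want to experiment with a parser other than the built-in one, install it first (for example, pip install lxml html5lib). The minimal sketch below uses only the standard library parser and shows where the parser is chosen:

from bs4 import BeautifulSoup

html = "<html><body><p>Hello, <b>world</b>!</p></body></html>"

# The second argument selects the parser; swap in 'lxml' or 'html5lib'
# once they are installed to compare behaviour.
soup = BeautifulSoup(html, "html.parser")
print(soup.p.text)  # Hello, world!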
Setting up BeautifulSoup is straightforward and involves just a few steps. Below is a guide to getting everything you need to start scraping.
Step 1: Install Python
Before you begin, make sure Python is installed on your system. You can download it from the official Python website.
Step 2: Install BeautifulSoup and Requests
BeautifulSoup is commonly paired with the requests library, which sends HTTP requests to fetch web pages. You can install both libraries using pip:
pip install beautifulsoup4 requests
Step 3: Verify the Installation
After installation, you can verify that everything is set up correctly by running the following commands in your Python environment:
import requests
from bs4 import BeautifulSoup
If the imports run without errors, BeautifulSoup and requests are installed correctly.
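As an optional sanity check, you can also print the installed version and parse a tiny snippet (the version number on your machine will likely differ):

import bs4
from bs4 import BeautifulSoup

print(bs4.__version__)  # e.g. 4.12.3 -- your version may differ
print(BeautifulSoup("<p>It works</p>", "html.parser").p.text)  # It works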
Now that you have BeautifulSoup set up, you’re ready to start scraping!
First, create a virtual environment using Python. Although this is optional, it keeps your project's dependencies isolated. To do that, run the following command:
python -m venv venv
Once the virtual environment is created, activate it. On Windows, run:
.\venv\Scripts\activate
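On macOS or Linux, activate the environment with:

source venv/bin/activate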
Then install the BeautifulSoup and Requests packages inside the virtual environment, as shown in Step 2 above.
In your project folder, create a new Python file. You can name it something like basic_scraping.py. Open the file in your preferred code editor and add the following code:
basic_scraping.py
import requests
from bs4 import BeautifulSoup

# URL of the Real Python blog
url = 'https://realpython.com/'

# Send a GET request to the website
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the titles of blog posts (assuming they are in <h2> tags with class 'card-title')
titles = soup.find_all('h2', class_='card-title')

# Open a text file to save the titles
with open('scraped_titles.txt', 'w') as file:
    for title in titles:
        file.write(title.text.strip() + '\n')
        print(title.text.strip())  # Also print titles to the terminal

print("Scraping complete. Titles have been saved to 'scraped_titles.txt'")
In your terminal or command prompt, navigate to the directory where your script is located. Make sure your virtual environment is activated, then run the script with the command below:
python basic_scraping.py
When you run the script, the titles of the latest blog posts from https://realpython.com/ are printed in your terminal and saved to scraped_titles.txt.
BeautifulSoup also has functionality beyond the basics that helps you handle more complex scraping situations.
Navigable strings are how BeautifulSoup represents the text inside HTML tags. They let you work with that text directly, which is invaluable when dealing with complex HTML structures where text is interleaved with various tags.
For example, consider this simple HTML snippet:
<p>This is a <b>bold</b> paragraph with <i>italic</i> text.</p>
So, let’s see how to manipulate this string with BeautifulSoup.
Step 1: Identify the Element
First, you need to find the HTML element containing the text. In this case, it’s a <p> tag.
from bs4 import BeautifulSoup

html_content = """
<div class="content">
    <p>This is a <b>bold</b> paragraph with <i>italic</i> text.</p>
</div>
"""

soup = BeautifulSoup(html_content, 'html.parser')
paragraph = soup.find('p')
Step 2: Navigate and Modify Text
Loop through the contents of the element, checking if they are strings (NavigableString objects) or other tags. You can then modify the text or perform other operations like cleaning or data extraction.
from bs4 import NavigableString
for content in paragraph.contents:
    if isinstance(content, NavigableString):
        print(f"Original Text: {content}")
        modified_text = content.replace("italic", "emphasized")
        print(f"Modified Text: {modified_text}")
    else:
        print(f"Tag: {content.name}, Text: {content.text}")
This approach lets you treat text nodes as ordinary string-like objects, which makes it easier to inspect, clean up, or extract data. Keep in mind that replace() only returns a new string; to change the parsed document itself, call replace_with() on the NavigableString, as shown below.
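For instance, in this snippet the word "italic" actually sits inside the <i> tag rather than in the paragraph's direct text, so a minimal sketch of updating the parsed tree (reusing the paragraph variable from Step 1) could look like this:

# The <i> tag's string is itself a NavigableString; replace_with()
# swaps it for new text directly in the parsed tree.
paragraph.i.string.replace_with("emphasized")

print(paragraph)
# <p>This is a <b>bold</b> paragraph with <i>emphasized</i> text.</p>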
CSS selectors are a more precise way to select elements than simple tag searches. They let you target elements by attribute, class, ID, and their relationships within the document structure. BeautifulSoup exposes them through soup.select(), which returns every matching element, and soup.select_one(), which returns only the first match. The example below selects single and multiple elements using CSS selectors; it assumes html_content contains elements matching those selectors (a self-contained version follows).
soup = BeautifulSoup(html_content, 'html.parser')

# Using CSS selectors to select elements
important_paragraph = soup.select_one('p.important')
highlighted_text = soup.select('div#unique-element .highlight')
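As a self-contained illustration (the markup below is made up for this example), the two selectors behave like this:

from bs4 import BeautifulSoup

# Illustrative markup containing elements that match the selectors above
html_content = """
<div id="unique-element">
    <p class="important">Key paragraph</p>
    <span class="highlight">First note</span>
    <span class="highlight">Second note</span>
</div>
"""

soup = BeautifulSoup(html_content, 'html.parser')

print(soup.select_one('p.important').text)  # Key paragraph
print([span.text for span in soup.select('div#unique-element .highlight')])
# ['First note', 'Second note']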
This method is extremely useful for complex HTML pages where elements need to be selected based on specific criteria.
Websites often block web scraping with measures such as IP bans, whether for security reasons or to limit resource consumption. Your scraping setup therefore needs a way to avoid or work around such restrictions to complete its tasks.
While BeautifulSoup does not handle network requests itself, you can use the requests library to route traffic through proxies and cope with IP bans and connection issues.
Rotating Proxies
You can define a pool of proxy addresses as a list, pick one at random, and pass it to the requests.get() function in the proxies dictionary that requests expects. Because consecutive requests no longer come from a single IP address, you are far less likely to trigger an IP ban.
import random

proxies = [
    'http://10.10.1.10:3128',
    'http://10.10.1.11:3128',
    'http://10.10.1.12:3128',
]

proxy = {'http': random.choice(proxies)}
response = requests.get(url, proxies=proxy)
Timeouts
You can set a timeout when making a request so that it doesn't hang for a long time because of network issues.
response = requests.get(url, proxies=proxy, timeout=10)
Retry Mechanisms
If a request fails, for example because of a connection issue, you need a way to retry it after a short delay so that you don't lose data. For that, you can use the requests library's HTTPAdapter together with urllib3's Retry class.
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(connect=5, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

response = session.get(url, proxies=proxy)
Using these advanced techniques with BeautifulSoup enhances your web scraping capabilities, allowing for more precise data extraction and robust handling of web requests.
BeautifulSoup is a flexible web scraping tool that offers both simplicity and sophistication. Whether you are starting with simple data extraction or handling complex requirements like advanced element selection with CSS selectors, BeautifulSoup caters to your needs. Furthermore, techniques like proxy rotation, timeouts, and retries let you deal effectively with restrictions on accessing data. Together, these features make your web scraping efforts efficient and reliable, and BeautifulSoup an essential tool for your data extraction projects.