A step-by-step guide to web scraping with Python. Learn how to scrape both dynamic and static websites.
Website data is useful for all kinds of things, including market research, price monitoring, lead generation, and analysis. However, not all websites provide APIs to access this data, which is where web scraping comes in. Web scraping allows you to programmatically extract useful information from websites, making it easy to quickly and efficiently gather large volumes of data.
In this tutorial, you’ll learn how to scrape both static and dynamic content using Python.
Essentially, scraping is like browsing the web programmatically: you fetch a page’s content, sift through the information, extract what you need, and store it for later use or analysis.
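To make that flow concrete, here is a minimal sketch of those four stages using the requests and Beautiful Soup libraries (both introduced below). The URL, the h1 lookup, and the output file name are illustrative placeholders, not part of this tutorial's target site:

import requests
from bs4 import BeautifulSoup

# 1. Fetch: download the page's HTML
response = requests.get('https://example.com')

# 2. Sift: parse the raw HTML into a searchable tree
soup = BeautifulSoup(response.text, 'html.parser')

# 3. Extract: pull out the piece you care about
heading = soup.find('h1').text

# 4. Store: keep it for later use or analysis
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(heading)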
There are two types of content you can scrape: static content, which arrives fully formed in the initial HTML response, and dynamic content, which changes as you interact with the site (in this tutorial's case, paginated job listings).
Python is a great language for web scraping because of its simplicity and the wide range of tools and libraries available. Following are some of the most popular Python libraries for web scraping. These libraries handle different aspects of the scraping process, from making web requests to parsing HTML and mimicking web browser sessions:
requests: sends HTTP requests and retrieves raw page content
Beautiful Soup (beautifulsoup4): parses HTML and XML documents so you can search for and extract specific elements
Selenium: automates a real web browser, which is useful for JavaScript-heavy pages and for mimicking browser sessions
Scrapy: a full-featured framework for building large-scale crawlers
Now that you know a little more about web scraping in Python, it’s time to build a web scraper that collects job listings from the Python Job Board and saves them into a CSV file. This type of scraping can help you automate your job search or conduct market research on the state of Python jobs (frequency, location, duties, salary, etc.):
Before you begin, make sure you have the following:
Python 3 installed on your machine, along with pip
Basic familiarity with Python and HTML
Before you start web scraping, you need to set up a virtual environment that will help you manage your dependencies and avoid conflicts with other installed packages. To do so, create a project directory where all your files will live by running the following command:
mkdir python_job_scraper
cd python_job_scraper
Then, in your project directory, create a virtual environment:
python3 -m venv venv
And activate it:
source venv/bin/activate  # macOS/Linux
venv\Scripts\activate     # Windows
You also need to have the following Python packages installed:
requests
beautifulsoup4
You’ll also use the csv module later to save your results, but it’s part of Python’s standard library, so there’s nothing extra to install for it.
You can install them using pip with the following command:
pip install requests beautifulsoup4
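Optionally, once the packages are installed, you can record their exact versions so the environment is reproducible later. This is a common convention rather than something the rest of this tutorial depends on:

pip freeze > requirements.txt

Anyone can then recreate the same setup in a fresh virtual environment with pip install -r requirements.txt.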
Finally, create a new file in your project directory and name it app.py. This is where you’ll write your web scraping code.
Now that your environment is set up, it’s time to start writing your script to scrape static content from the job board website. Open the app.py file and begin by importing the necessary libraries and fetching the HTML content of the job board:
import requests
from bs4 import BeautifulSoup

# Initialize the URL for the job listings
url = 'https://www.python.org/jobs/'

# Send a GET request to fetch the content of the page
response = requests.get(url)
html_content = response.text

print(html_content)
This code uses the requests library to send a GET request to the job board URL. The response contains the HTML content of the page, which you’ll parse soon.
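Before parsing, it’s also a good idea to confirm the request actually succeeded. Here’s a sketch of one defensive variation (the User-Agent string is an arbitrary example, not something python.org requires):

import requests

url = 'https://www.python.org/jobs/'

# Identify your client; some sites respond differently to anonymous scripts
headers = {'User-Agent': 'python-job-scraper-tutorial/1.0'}

# Fail fast if the site is slow or unreachable
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses

html_content = response.text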
You can run the code and all other snippets in this tutorial by opening your terminal and entering the following command:
python app.py
This code prints the HTML content of the job board website. Before you parse the HTML, it’s important to understand its structure. To do so, open the Python Job Board page in your browser. Then, right-click on a job listing and select Inspect (or use Ctrl + Shift + I on Windows/Linux or Cmd + Opt + I on macOS):
Now, look at the HTML structure to identify the elements containing the job listing data. For example, a typical job listing might look like this in HTML:
<ol class="list-recent-jobs list-row-container menu">
    <li>
        <h2 class="listing-company">
            <span class="listing-company-name">
                <a href="/jobs/7683/">Python Software Developer</a><br/>
                Maximum Information
            </span>
            <span class="listing-location">London, United Kingdom</span>
        </h2>
        <span class="listing-posted">Posted: <time datetime="2024-08-08T22:04:39.345328+00:00">08 August 2024</time></span>
    </li>
</ol>
This structure shows that each job is inside an <li> element, which is contained within an ordered list (<ol>) with the class list-recent-jobs. Knowing this helps you target the correct elements in your scraper.
Once you understand the HTML structure you’re working with, it’s time to parse it and extract the job listings. Add the following code to the end of your app.py file:
soup = BeautifulSoup(html_content, 'html.parser')

# Find the list of job postings
jobs_list = soup.find('ol', class_='list-recent-jobs')
This code initializes Beautiful Soup with the HTML content and uses 'html.parser' to parse it. The find method locates the <ol> element with the class list-recent-jobs, which contains all the job postings.
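If you’re more comfortable with CSS selectors, Beautiful Soup’s select_one method can target the same element. This is simply an alternative style; the rest of the tutorial sticks with find:

# Equivalent lookup using a CSS selector instead of find()
jobs_list = soup.select_one('ol.list-recent-jobs')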
Once you’ve identified the job listings, you can loop through each job and extract the relevant details by adding the following code to the app.py file:
job_listings = []

for job in jobs_list.find_all('li'):
    job_title = job.find('a').text.strip()
    company = job.find('span', class_='listing-company-name').text.strip()
    location = job.find('span', class_='listing-location').text.strip()
    posted_date = job.find('time').text.strip()
    job_link = job.find('a')['href']

    job_listings.append({
        'Job Title': job_title,
        'Company': company,
        'Location': location,
        'Posted': posted_date,
        'Link': job_link
    })

print(job_listings)
The preceding code extracts the key details of each job listing from the page’s HTML. Here is a breakdown of the key methods used:
find_all('li') returns every <li> element inside the job list, one per posting.
job.find('a').text.strip() gets the text of the listing’s anchor tag (the job title) and trims surrounding whitespace.
job.find('span', class_='listing-company-name').text.strip() extracts the company name from its span.
job.find('span', class_='listing-location').text.strip() extracts the job’s location.
job.find('time').text.strip() extracts the human-readable posted date.
job.find('a')['href'] reads the href attribute of the anchor, which is the relative link to the job’s detail page.
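Note that find returns None when an element is missing, so a listing without, say, a time tag would raise an AttributeError. If you want the loop to tolerate incomplete listings, one possible approach is a small helper like the hypothetical safe_text below (get_text is Beautiful Soup’s own method):

def safe_text(element):
    # Return the element's trimmed text, or an empty string if it's missing
    return element.get_text(strip=True) if element else ''

# For example, tolerate a missing <time> tag instead of crashing:
posted_date = safe_text(job.find('time'))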
The details are extracted, stored in a dictionary, and appended to a list for further processing. If you run this code with python app.py, it prints a list of all the jobs on the first page of the job board. To scrape beyond that, you need to dynamically find the next page and retrieve it so you can collect every job on the site. The next section shows you how.
Now, let’s handle dynamic content, which in this case involves pagination. Many job boards paginate their listings. This means if you want to scrape all the jobs, you need to navigate through the different pages. To do this, you can check whether the page contains a Next button and extract the href from the Next button’s anchor (a) tag:
next_page = soup.find('li', class_='next')
next_link = next_page.find('a') if next_page else None
url = f'https://www.python.org/jobs{next_link["href"]}' if next_link else None
This code checks for a Next button at the bottom of the page. If it exists, the URL for the next page is extracted, and you can repeat the extraction process until there are no more pages. You can incorporate this into the existing code by looping until there is no Next page. To do so, open app.py and replace the existing code with this modified version:
import requests
from bs4 import BeautifulSoup
import csv

# Initialize a list to store job details
job_listings = []

# Initialize the URL for the job listings
url = 'https://www.python.org/jobs/'

while url:
    # Send a GET request to fetch the content of the page
    response = requests.get(url)
    html_content = response.text

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find the list of job postings
    jobs_list = soup.find('ol', class_='list-recent-jobs')

    # Iterate through each job posting and extract relevant details
    for job in jobs_list.find_all('li'):
        job_title = job.find('a').text.strip()
        company = job.find('span', class_='listing-company-name').text.strip()
        location = job.find('span', class_='listing-location').text.strip()
        posted_date = job.find('time').text.strip()
        job_link = job.find('a')['href']  # Extract the job link

        # Append the job details to the list
        job_listings.append({
            'Job Title': job_title,
            'Company': company,
            'Location': location,
            'Posted': posted_date,
            'Link': job_link
        })

    # Find the next page element
    next_page = soup.find('li', class_='next')

    # Move to the next page if it exists; otherwise end the loop
    next_link = next_page.find('a') if next_page else None
    url = f'https://www.python.org/jobs{next_link["href"]}' if next_link else None
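When a loop like this walks through many pages, it’s courteous (and reduces the risk of being rate-limited or blocked) to pause between requests. As a minimal sketch, you could add a delay at the top of the loop body; the one-second value here is an arbitrary choice, not a requirement of the site:

import time
import requests

url = 'https://www.python.org/jobs/'

while url:
    # Pause briefly so consecutive requests don't hammer the server
    time.sleep(1)

    response = requests.get(url)

    # ...parse the page and extract jobs exactly as in the full script above...

    # The real loop updates url from the Next link; this sketch stops after one page
    url = None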
Once you’ve collected all the jobs from the Python Job Board, you need to save them into a CSV file so you can process or analyze them later.
Add the following code to the end of app.py:
import csv

with open('job_listings.csv', mode='w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ['Job Title', 'Company', 'Location', 'Posted', 'Link']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()
    for job in job_listings:
        writer.writerow(job)
This code creates a CSV file named job_listings.csv. The DictWriter class writes the job data (stored as dictionaries) into the file.
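To sanity-check the output, you can read the file back with the standard library’s csv.DictReader. This verification step is an optional extra, not part of the scraper itself:

import csv

# Read the CSV back and print the first five rows
with open('job_listings.csv', mode='r', newline='', encoding='utf-8') as csv_file:
    reader = csv.DictReader(csv_file)
    for i, row in enumerate(reader):
        print(row['Job Title'], '-', row['Company'])
        if i == 4:
            break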
To run your dynamic web scraper, open your terminal and enter the following command:

python app.py

This command runs the scraper and then saves the job listings to a file named job_listings.csv.
All the code for this tutorial is available in this GitHub repository.
In today’s data-driven world, web scraping is a powerful tool that allows you to gather data, conduct research, and automate processes. In this tutorial, you learned how to use Python to build a web scraper that’s capable of extracting job listings from both static and dynamic websites. Whether you’re looking to automate your job search, gather data for market research, or explore other applications, the tools and techniques you’ve learned will help you in your web scraping journey.
As you tackle more complex scraping challenges—like handling larger data sets or dealing with dynamic content—consider using reliable proxies to avoid IP blocks and CAPTCHAs. If you’re looking for the best proxies to pair with your Python web scraping projects, check out our curated list of top proxy services.