Python Web Scraping Guide

A step-by-step guide to web scraping with Python. Learn how to scrape both static and dynamic websites.


Website data is useful for all kinds of things, including market research, price monitoring, lead generation, and analysis. However, not all websites provide APIs to access this data, which is where web scraping comes in. Web scraping allows you to programmatically extract useful information from websites, making it easy to quickly and efficiently gather large volumes of data.

In this tutorial, you’ll learn how to scrape both static and dynamic content using Python.


Python Web Scraping Basics

Essentially, scraping is like browsing the web programmatically: you fetch a page’s content, sift through the information, extract what you need, and store it for later use or analysis.

There are two types of content you can scrape: static and dynamic.

  • Static content is fixed on a web page. This means that once the page is loaded, the content doesn’t change.
  • Dynamic content appears or changes after the page loads (e.g., more items load as you scroll down), typically via JavaScript. Scraping dynamic content is more difficult because you need to mimic a real user.

Python is a great language for web scraping because of its simplicity and the wide range of tools and libraries available. The following are some of the most popular Python libraries for web scraping; they handle different aspects of the scraping process, from making web requests to parsing HTML and mimicking browser sessions:

  • Requests and urllib3 are used to send HTTP requests and retrieve web pages. They’re great for collecting static content like product prices or listing information, but they retrieve only raw HTML and need to be paired with a parser like Beautiful Soup to navigate HTML structure and extract specific data elements.
  • Beautiful Soup helps parse HTML and XML. It’s great for extracting specific data from the HTML fetched by Requests or urllib3, and it is useful for tasks like lead generation and market research.
  • MechanicalSoup combines the capabilities of Requests and Beautiful Soup, and it enables you to interact with web forms and log in to websites. This is useful for scraping tasks where you need to retrieve data from membership sites or areas requiring user authentication.
  • Scrapy is a comprehensive framework for more complex tasks, such as crawling entire websites. Scrapy excels in large-scale projects, making it ideal for market analysis and in-depth research.
  • Selenium is a tool that automates browsers. It's typically used to test web applications, but it's also highly effective for scraping websites that rely heavily on JavaScript to load content dynamically. Selenium lets you programmatically mimic user behavior (e.g., button clicks, hovering, form submission, and scrolling); see the short sketch after this list.
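
For a taste of what that looks like, here's a minimal sketch (not part of this tutorial's scraper) that opens a page in Chrome and reads the rendered headings. It assumes you've installed Selenium with pip install selenium; recent Selenium versions download a matching browser driver automatically:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a Chrome session (Selenium Manager resolves the driver automatically)
driver = webdriver.Chrome()
driver.get('https://www.python.org/jobs/')

# Read text from the fully rendered page, including content added by JavaScript
for heading in driver.find_elements(By.TAG_NAME, 'h2'):
    print(heading.text)

driver.quit()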

Implementing a Web Scraper with Python

Now that you know a little more about web scraping in Python, it's time to build a web scraper that collects job listings from the Python Job Board and saves them to a CSV file. This type of scraping can help you automate your job search or conduct market research on the state of Python jobs (frequency, location, duties, salary, etc.):

Architecture diagram, courtesy of Michael Nyamande

Before you begin, make sure you have the following:

  • The latest version of Python.
  • An integrated development environment (IDE) of your choice, such as Visual Studio Code or PyCharm.

Set Up the Virtual Environment and Project

Before you start web scraping, you need to set up a virtual environment that will help you manage your dependencies and avoid conflicts with other installed packages. To do so, create a project directory where all your files will live by running the following commands:

   mkdir python_job_scraper
   cd python_job_scraper

Then, in your project directory, create a virtual environment:

   python3 -m venv venv

And activate it:

   source venv/bin/activate  # macOS/Linux
   venv\Scripts\activate     # Windows

You also need to have the following Python packages installed:

  • requests to fetch web pages.
  • beautifulsoup4 to parse the HTML content.
  • csv to save the data to a CSV file. This module is part of Python's standard library, so it doesn't need to be installed.

You can install them using pip with the following command:

   pip install requests beautifulsoup4

Finally, create a new file in your project directory and name it app.py. This is where you’ll write your web scraping code.

Fetch the Website Content (Static Scraping)

Now that your environment is set up, it’s time to start writing your script to scrape static content from the job board website. Open the app.py file and begin by importing the necessary libraries and fetching the HTML content of the job board:

import requests
from bs4 import BeautifulSoup

# Initialize the URL for the job listings
url = 'https://www.python.org/jobs/'

# Send a GET request to fetch the content of the page
response = requests.get(url)
html_content = response.text
print(html_content)

This code uses the requests library to send a GET request to the job board URL. The response contains the HTML content of the page, which you’ll parse soon.
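
Before parsing, it's worth confirming that the request actually succeeded. One simple option is the raise_for_status method that requests provides, which raises an exception for 4xx/5xx responses:

# Stop with an HTTPError if the server returned an error status code
response.raise_for_status()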

You can run the code and all other snippets in this tutorial by opening your terminal and entering the following command:

   python app.py

This code prints the HTML content of the job board website. Before you parse the HTML, it's important to understand its structure. To do so, open the Python Job Board page in your browser, right-click a job listing, and select Inspect (or press Ctrl + Shift + I on Windows/Linux or Cmd + Opt + I on macOS).

Now, look at the HTML structure to identify the elements containing the job listing data. For example, a typical job listing might look like this in HTML:

<ol class="list-recent-jobs list-row-container menu">
    <li>
        <h2 class="listing-company">
            <span class="listing-company-name">
                <a href="/jobs/7683/">Python Software Developer</a><br/>
 Maximum Information
            </span>
            <span class="listing-location">London, United Kingdom</span>
        </h2>
        <span class="listing-posted">Posted: <time datetime="2024-08-08T22:04:39.345328+00:00">08 August 2024</time></span>
    </li>
</ol>

This structure shows that each job is inside an <li> element, which is contained within an ordered list (<ol>) with the class list-recent-jobs. Knowing this helps you target the correct elements in your scraper.

Parse the HTML Content

Once you understand the HTML structure you’re working with, it’s time to parse it and extract the job listings. Add the following code to the end of your app.py file:

soup = BeautifulSoup(html_content, 'html.parser')

# Find the list of job postings
jobs_list = soup.find('ol', class_='list-recent-jobs')

This code initializes Beautiful Soup with the HTML content and uses 'html.parser' to parse it. The find method locates the <ol> element with the class list-recent-jobs, which contains all the job postings.
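
Note that find returns None when it can't locate a match, so if the page layout ever changes, the next steps would fail with an AttributeError. A small guard (a suggested addition, not part of the original script) makes that failure easier to diagnose:

# Fail fast with a clear message if the job list isn't where we expect it
if jobs_list is None:
    raise RuntimeError('Could not find the job list; the page structure may have changed.')

Beautiful Soup also supports CSS selectors, so soup.select_one('ol.list-recent-jobs') would locate the same element.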

Once you’ve identified the job listings, you can loop through each job and extract the relevant details by adding the following code to the app.py file:

job_listings = []

for job in jobs_list.find_all('li'):
    job_title = job.find('a').text.strip()
    # The company span also contains the job title link, so take its last text node
    company = list(job.find('span', class_='listing-company-name').stripped_strings)[-1]
    location = job.find('span', class_='listing-location').text.strip()
    posted_date = job.find('time').text.strip()
    job_link = job.find('a')['href']

    job_listings.append({
            'Job Title': job_title,
            'Company': company,
            'Location': location,
            'Posted': posted_date,
            'Link': job_link
    })
print(job_listings)

The preceding code extracts the key details of each job listing from the page's HTML. Here's a breakdown of the key methods used:

  • jobs_list.find_all('li') returns every <li> element (one per job listing) for the loop to iterate over.
  • job.find('a').text.strip() extracts the job title.
  • list(job.find('span', class_='listing-company-name').stripped_strings)[-1] extracts the company name, which is the last piece of text in the span (the first is the job title link).
  • job.find('span', class_='listing-location').text.strip() extracts the location.
  • job.find('time').text.strip() extracts the posted date.
  • job.find('a')['href'] extracts the job link.

The details are extracted, stored in a dictionary, and appended to a list for further processing. If you run this code with python app.py, it prints a list of all the jobs on the first page of the job board. To collect every job on the site, you need to find and fetch each subsequent page; the next section shows you how.

Handle Pagination (Dynamic Scraping)

Now, let’s handle dynamic content, which in this case involves pagination. Many job boards paginate their listings. This means if you want to scrape all the jobs, you need to navigate through the different pages. To do this, you can check whether the page contains a Next button and extract the href from the Next button’s anchor (a) tag:

from urllib.parse import urljoin
next_page = soup.find('li', class_='next')
next_link = next_page.find('a') if next_page else None
url = urljoin(url, next_link['href']) if next_link else None

This code checks for a Next button at the bottom of the page. If one exists, urljoin resolves the link's relative href against the current page URL to build the absolute URL of the next page; otherwise, url is set to None. You can incorporate this into the existing code by looping until there is no next page. To do so, open app.py and replace the existing code with this modified version:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv

# Initialize a list to store job details
job_listings = []

# Initialize the URL for the job listings
url = 'https://www.python.org/jobs/'

while url:
    # Send a GET request to fetch the content of the page
    response = requests.get(url)
    html_content = response.text

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find the list of job postings
    jobs_list = soup.find('ol', class_='list-recent-jobs')

    # Iterate through each job posting and extract relevant details
    for job in jobs_list.find_all('li'):
        job_title = job.find('a').text.strip()
        # The company span also contains the job title link, so take its last text node
        company = list(job.find('span', class_='listing-company-name').stripped_strings)[-1]
        location = job.find('span', class_='listing-location').text.strip()
        posted_date = job.find('time').text.strip()
        job_link = job.find('a')['href']  # Extract the job link

        # Append the job details to the list
        job_listings.append({
            'Job Title': job_title,
            'Company': company,
            'Location': location,
            'Posted': posted_date,
            'Link': job_link 
        })

    # Find the next page element and follow its link if present
    next_page = soup.find('li', class_='next')
    next_link = next_page.find('a') if next_page else None
    url = urljoin(url, next_link['href']) if next_link else None

Once you’ve collected all the jobs from the Python Job Board, you need to save them into a CSV file so you can process or analyze them later.

Add the following code to the end of app.py (the csv module is already imported at the top of the file):

with open('job_listings.csv', mode='w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ['Job Title', 'Company', 'Location', 'Posted', 'Link']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()

    for job in job_listings:
        writer.writerow(job)

This code creates a CSV file named job_listings.csv. The DictWriter class writes the job data (stored as dictionaries) into the file.
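
To quickly verify the output, you can read the file back with csv.DictReader, which yields each row as a dictionary keyed by the header names:

with open('job_listings.csv', mode='r', encoding='utf-8') as csv_file:
    for row in csv.DictReader(csv_file):
        print(row['Job Title'], '-', row['Company'])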

Run Your Script

To run your dynamic web scraper, open your terminal and enter the following command:

   python app.py

This command runs the scraper and saves the job listings to a file named job_listings.csv.

All the code for this tutorial is available in this GitHub repository.


Conclusion

In today’s data-driven world, web scraping is a powerful tool that allows you to gather data, conduct research, and automate processes. In this tutorial, you learned how to use Python to build a web scraper that’s capable of extracting job listings from both static and dynamic websites. Whether you’re looking to automate your job search, gather data for market research, or explore other applications, the tools and techniques you’ve learned will help you in your web scraping journey.

As you tackle more complex scraping challenges—like handling larger data sets or dealing with dynamic content—consider using reliable proxies to avoid IP blocks and CAPTCHAs. If you’re looking for the best proxies to pair with your Python web scraping projects, check out our curated list of top proxy services.
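
As a minimal illustration, requests accepts a proxies argument that routes traffic through a proxy server (the address below is a placeholder, not a real endpoint):

import requests

# Placeholder proxy endpoint; substitute your provider's credentials and host
proxies = {
    'http': 'http://user:pass@proxy.example.com:8080',
    'https': 'http://user:pass@proxy.example.com:8080',
}
response = requests.get('https://www.python.org/jobs/', proxies=proxies)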
