How to Get Data for Machine Learning
Learn how to get data for machine learning using public datasets, web scraping, and APIs. Clean, prepare, and store data efficiently for AI projects.
In machine learning (ML), computers are shown large amounts of data in which they detect patterns and learn the relationships between variables. Because of this, training an ML model requires collecting a large amount of data.
Since data is at the core of ML, high-quality datasets are essential for accurate ML models. In this article, you’ll learn how to gather data for your ML project.
Because the process of ML is impossible without data, collecting data is usually the first step in the ML project lifecycle. Following are some sources where you can gather the data you need.
In some cases, the necessary data may be readily available. You can find numerous datasets on websites like Kaggle; international bodies, such as the International Monetary Fund (IMF) or World Bank; or companies offering ready-to-use datasets.
Kaggle is a popular source of data for ML. It hosts thousands of public datasets, most of them in CSV format, which is widely used in ML. Downloading a dataset is easy, but keep in mind that some datasets can be tens of gigabytes in size.
Once you register for a free account and sign in, you can navigate to the Datasets hub, where you can choose from numerous types of datasets. For instance, you could download the NFL Big Data Bowl 2025 dataset, which is also an active competition that you can take part in. Once you’ve chosen a dataset, go to its page, join the competition, click the Download All button, and download a ZIP file.
The ZIP file contains CSV files with all the data you need to train an ML model on the NFL Big Data Bowl 2025 dataset.
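Once the ZIP file is extracted, the CSVs can be loaded straight into pandas for exploration. Below is a minimal sketch, assuming pandas is installed (pip install pandas) and that one of the extracted files is named plays.csv; the exact file names depend on the dataset you download:

import pandas as pd

# load one of the extracted CSV files into a DataFrame
# ("plays.csv" is an assumed file name -- adjust it to match your download)
plays = pd.read_csv("plays.csv")

# quick look at the size and the first few rows
print(plays.shape)
print(plays.head())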
Kaggle offers numerous datasets, and most of them are used for educational purposes.
If you’re looking for datasets tailored to business needs, consider exploring commercial dataset marketplaces. You can find hundreds of prebuilt datasets from some of the web’s most popular websites, including Amazon, Walmart, X (Twitter), LinkedIn, and Instagram.
One of the advantages of these datasets is that they’re already formatted for ML projects. The accuracy and quality of the data are often guaranteed, which isn’t necessarily the case with datasets from open platforms like Kaggle. If you need data from e-commerce websites, real estate, or social media sites, these marketplaces offer a great place to start.
For example, you can leverage prebuilt Twitter datasets to get various data points about X posts or profiles. You can use this data for performing sentiment analysis, monitoring brand reputation, or locating influencers.
Another way to collect data is through APIs, which many organizations provide. Some notable examples include the YouTube API, Google API, and Reddit API. For example, the YouTube API provides easy access to various data points about videos, channels, and playlists (e.g., title, description, view count, likes, comments, and upload date). You can also use the API to create playlists, add videos to them, edit settings, and more. The YouTube API is free, but it has a daily quota of 10,000 units.
Depending on the organization, APIs can either be free or paid. In any case, when you need data, it’s a good idea to check if it’s available through an API before turning to other sources for your data.
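As a quick illustration of collecting data through an API, the sketch below requests a single video’s title and statistics from the YouTube Data API v3 with the requests library. The API key and video ID are placeholders you’d replace with your own:

import requests

API_KEY = "YOUR_API_KEY"   # created in the Google Cloud console
VIDEO_ID = "dQw4w9WgXcQ"   # any public YouTube video ID

# request the snippet (title, description, publish date) and statistics (views, likes, comments)
response = requests.get(
    "https://www.googleapis.com/youtube/v3/videos",
    params={"part": "snippet,statistics", "id": VIDEO_ID, "key": API_KEY},
)
response.raise_for_status()
video = response.json()["items"][0]

print(video["snippet"]["title"])
print(video["statistics"]["viewCount"], "views")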
So far, we’ve focused on using prebuilt datasets or APIs to source data. However, in many cases, the specific data you need may not be readily accessible. In those cases, you have to rely on web scraping to gather the data yourself.
Web scraping allows you to extract data directly from web pages, regardless of how the information is structured. Later in this article, we’ll show you exactly how to do this.
Before you can start scraping data, there are a few things you need to keep in mind:
When it comes to web scraping, always keep in mind the terms and conditions of the website you’re scraping and comply with its rules regarding scraping its data. Virtually all websites have a robots.txt file that tells web scrapers which pages they can scrape and which are off-limits. This file can usually be found by appending /robots.txt to the root domain of a website. For example, you can find Amazon’s robots.txt file at https://www.amazon.com/robots.txt.
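You can even check a robots.txt file programmatically. Python’s standard library ships with urllib.robotparser for exactly this; the sketch below tests whether a generic crawler may fetch an example Amazon path:

from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url("https://www.amazon.com/robots.txt")
rp.read()

# check whether a generic user agent ("*") is allowed to fetch a given path
print(rp.can_fetch("*", "https://www.amazon.com/gp/wishlist/"))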
Finally, you must decide on the tools you’ll use for web scraping. This largely depends on the complexity of the task.
For a small scraping task, a Python library such as Beautiful Soup will usually do the trick. However, for complex tasks, you may want a more advanced tool. One option is to use a managed web scraper API, which is highly scalable, compliant with various data protection laws (such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA)), easy to use without advanced coding skills, and capable of handling anti-scraping mechanisms.
Having covered potential data sources, let’s take a look at the steps you need to take to scrape a website to gather data for ML purposes.
In this example, you’ll scrape quotes from the Quotes to Scrape website, which was created specifically for practicing web scraping. This example uses Python, so make sure you have it installed if you want to follow along.
To start, set up a virtual environment. This isn’t mandatory, but it’s generally a good practice.
On Windows, you can create a new virtual environment by running python -m venv quotes in the command prompt and then activate it with quotes\Scripts\activate. If it works, you’ll see the name of your new virtual environment, quotes, in the command prompt:
python -m venv quotes
quotes\Scripts\activate
If you’re activating a virtual environment in another operating system, you can refer to the instructions from “Python Virtual Environments: A Primer” for help.
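For reference, the equivalent commands on macOS or Linux are:

python3 -m venv quotes
source quotes/bin/activate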
Once you’re in the new virtual environment, you need to install the required libraries. Here, you’ll use Requests to send an HTTP request to the target website, Beautiful Soup to parse the HTML and extract the data you need, and pandas to clean and manipulate the data. All of these can be installed with pip install requests beautifulsoup4 pandas:
pip install requests beautifulsoup4 pandas
After you’ve installed the required libraries, open a Jupyter notebook or Google Colab, or create a new file named scrape.py. Then, import the dependencies with the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
Now that the scene has been set, the actual scraping can be performed. The web page you’ll scrape is https://quotes.toscrape.com/.
Using requests, you need to send an HTTP request for the desired web page and receive the response. Then, you can use Beautiful Soup to parse the received HTML file, making it easy to extract data from it later. To do so, run this code in your environment:
url = "https://quotes.toscrape.com/" response = requests.get(url) soup = BeautifulSoup(response.text, 'html.parser')
Once the HTML is parsed, you need to analyze its structure to see the data points available for each item, such as the quote text, its author, and its tags. The HTML for the first quote looks like this:
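A simplified version of that markup (attributes trimmed for readability) looks roughly like this:

<div class="quote">
    <span class="text">"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."</span>
    <span>by <small class="author">Albert Einstein</small></span>
    <div class="tags">
        Tags:
        <a class="tag">change</a>
        <a class="tag">deep-thoughts</a>
        <a class="tag">thinking</a>
        <a class="tag">world</a>
    </div>
</div>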
The same structure is used for the other quotes as well.
All the items are inside a <div> element with a class named quote. Given that, you can grab all the quotes with the following line of code:
quotes = soup.find_all("div", class_="quote")
Now you need to loop through all the quotes; grab their text, author, and the first tag (you’re taking only the first tag to simplify the example); and store all those in separate lists. You can perform all this with the following code snippet:
# creating empty lists to store the scraped data
texts = []
authors = []
tags = []

# loop through the quotes, extract the data, and store it in the lists
for quote in quotes:
    texts.append(quote.find("span", class_="text").text)
    authors.append(quote.find("small", class_="author").text)
    tags.append(quote.find("a", class_="tag").text)
If you’re writing the code in a script, you can use the terminal or command prompt to navigate to the folder where the script is and run python scrape.py. If you’re using a Jupyter Notebook, simply run all the cells, if you haven’t done so already.
At this point, you have separate Python lists for the different data points you’ve scraped. Later on, you’ll learn how you can take the information from those lists, clean it, and store it.
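Before moving on, you can quickly sanity-check the results by printing the first element of each list:

print(texts[0])    # the quote text
print(authors[0])  # the author
print(tags[0])     # the first tag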
While what you’ve done so far is fine for educational purposes, scraping a real website usually brings a few obstacles. A common one is dynamic content: content that is rendered or updated in the browser by JavaScript after the initial page load. While this is convenient for users, it creates problems for web scrapers.
The issue is that requests can only fetch the static HTML returned by the server; it can’t execute JavaScript, so any content rendered in the browser is missing from the response. To scrape dynamic content, you need a different tool, such as Selenium, which essentially lets you control a web browser programmatically.
Start by opening a new file like scrape_amazon.py or a new Jupyter Notebook. Proceed by installing and importing Selenium. If you’re using a Jupyter Notebook, you can perform the installations and the imports with the following piece of code:
!pip install selenium
!pip install webdriver-manager

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
If you created a scrape_amazon.py file, you can install these dependencies in a terminal or a command prompt. The only difference is that you should drop the ! at the beginning of the lines.
Once you’ve installed Selenium, let’s scrape Amazon’s product reviews to test it out.
To start, you need to identify a product so you can scrape its reviews. Here, you’ll scrape the number one best-selling Amazon product in electronics, which happens to be the Amazon Fire TV Stick.
After you’ve identified the product, set the URL to scrape (the reviews for the given product), use Selenium to initialize an instance of a web browser, and navigate to the desired web page. Finally, retrieve the HTML of the currently loaded page:
url = "https://www.amazon.com/product-reviews/B0CJM1GNFQ" driver = webdriver.Firefox() driver.get(url) html = driver.page_source
Then, utilize Beautiful Soup to parse the HTML and find all the reviews:
soup = BeautifulSoup(html, "html.parser")
reviews = soup.find_all("div", class_="review")
Now it’s time to loop through all the reviews, extract the data points you’re interested in, and store them in a Python list. For the sake of this example, let’s scrape the review itself, the rating, the product configuration for which the review is given, and the review date. This can be done with the following code:
product_reviews = []
ratings = []
configurations = []
review_dates = []

for review in reviews:
    comment = review.find("span", class_="review-text")
    product_reviews.append(comment.text)
    rating = review.find("i")
    ratings.append(rating.text)
    configuration = review.find("a", class_="a-size-mini")
    configurations.append(configuration.text)
    review_date = review.find("span", class_="review-date")
    review_dates.append(review_date.text.split("on ")[1])
As you may have noticed, Amazon shows only ten reviews per page. If a product has more than ten reviews, you need to visit the subsequent pages, which means your code has to check whether there is a next page and, if so, follow it. Optionally, you can make your code wait a few seconds between requests so you don’t send them too rapidly. This can be done with the sleep() function from the time module, which you have to import using import time.
You can also limit the number of reviews you want to scrape as popular products can have thousands of reviews. Note that sometimes Amazon can ask you to log in. If this happens, you need to make a free Amazon account, log in, and rerun the code. The final code for all these steps looks like this:
url = "https://www.amazon.com/product-reviews/B0CJM1GNFQ" product_reviews = [] ratings = [] configurations = [] review_dates = [] while True: driver.get(url) html = driver.page_source soup = BeautifulSoup(html, "html.parser") reviews = soup.find_all("div", class_="review") for review in reviews: comment = review.find("span", class_="review-text") product_reviews.append(comment.text) rating = review.find("i") ratings.append(rating.text) configuration = review.find("a", class_="a-size-mini") configurations.append(configuration.text) review_date = review.find("span", class_="review-date") review_dates.append(review_date.text.split("on ")[1]) try: next_page_url = soup.find("li", class_="a-last").find("a")["href"] except: break else: url = "https://www.amazon.com" + str(next_page_url) if len(product_reviews) > 18: break time.sleep(10)
If you’ve been writing the code in a scrape_amazon.py file, you can open a terminal or a command prompt, navigate to the folder where the script is located, and run python scrape_amazon.py. If you’ve been using Jupyter Notebook or Google Colab, simply run all the cells. Just like in the Quotes to Scrape example, you have separate Python lists for all the aspects of the Amazon product you want to scrape.
In the next section, you’ll learn what to do with the data in these Python lists.
Regardless of how sophisticated the hardware and algorithms are, without good data, the performance of the model will suffer. One of the most widely used phrases in ML is “garbage in, garbage out.” This reflects the fact that if you use bad data as the input, the trained ML model won’t be very useful. That’s why the process of cleaning and preparing data is so important.
If you’ve been following along, your data is now extracted, and it’s time to clean it and prepare it for analysis. For this, you’ll use pandas, the most popular Python library for data analysis and manipulation.
To begin, you need to create a DataFrame. A DataFrame is essentially a table with rows and columns, and it is one of the basic data structures in pandas. Go ahead and create a DataFrame with the following line of code:
df = pd.DataFrame(
    list(zip(product_reviews, ratings, configurations, review_dates)),
    columns=["product_review", "rating", "configuration", "review_dates"]
)
With the command print(df.head(5)), you can see the first five rows of the dataset you scraped.
pandas is useful for manipulating and cleaning data. In the following code snippet, you have some basic cleaning steps, including removing empty rows, removing duplicates, and handling missing values:
# removing empty rows
df.dropna(inplace=True)

# removing duplicates
df.drop_duplicates(inplace=True)

# imputing missing values with the next observation (backward fill)
df.bfill(inplace=True)
When scraping a website, you may see different peculiarities in the data, and you should make an effort to clean those as well. For example, in the Amazon reviews data, the review text contains the newline character \n, which you probably don’t need for further analysis. You can also see that the rating is stored as a whole string, when it would make more sense as an integer. The following code replaces all the \n characters with an empty string and keeps only the numerical part of the rating:
# remove newline characters from all string values
# (on pandas versions older than 2.1, use df.applymap instead of df.map)
df = df.map(lambda x: x.replace('\n', '') if isinstance(x, str) else x)

# keep only the numeric part of the rating (e.g., "5.0 out of 5 stars" becomes 5)
df["rating"] = df["rating"].str.split(' ').str[0].astype(float).astype(int)
Now, if you run print(df.head(5)), you get a cleaner dataset.
In general, you’ll want to export the scraped data for further analysis. Data can be exported and stored in numerous formats. One of the most commonly used formats is a CSV file. With pandas, exporting the scraped data into a CSV file is easy and uses only one line of code, like this:
df.to_csv("amazon_reviews.csv")
At this point, you can deactivate the virtual environment with the simple command: deactivate.
For larger amounts of data, you can consider a cloud-based option like Google BigQuery or Amazon Simple Storage Service (Amazon S3).
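For instance, uploading the exported CSV file to an Amazon S3 bucket takes only a few lines with the boto3 library. This is a minimal sketch: the bucket name is a placeholder, and it assumes your AWS credentials are already configured locally:

import boto3

# upload the exported CSV to an S3 bucket (bucket name is hypothetical)
s3 = boto3.client("s3")
s3.upload_file("amazon_reviews.csv", "my-ml-datasets", "amazon_reviews.csv")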
When scraping the web, keep in mind that many websites limit the number of requests a single IP can send and block IPs that exceed that limit, effectively stopping your scraper. To continue scraping, your best bet is to leverage rotating proxies. Because of the prevalence of anti-scraping mechanisms, proxies are virtually a must for any scraping at scale.
Different types of proxies are available, each with its own pros and cons. For web scraping, rotating proxies are the ideal choice as they provide a seamless way to avoid detection and maintain consistent access.
When it comes to proxies, choosing a reputable proxy provider with a large pool of rotating proxies is essential. A good proxy service typically offers millions of residential IPs, extensive geographical coverage, high network uptime, fast speeds, and high success rates. You can choose among residential, ISP, datacenter, or mobile proxies.
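As an illustration, routing requests through a rotating proxy only takes a proxies dictionary; the endpoint and credentials below are placeholders you’d replace with your provider’s details:

import requests

# placeholder proxy endpoint and credentials -- substitute your provider's details
proxy = "http://USERNAME:PASSWORD@proxy.example.com:8000"
proxies = {"http": proxy, "https": proxy}

response = requests.get("https://quotes.toscrape.com/", proxies=proxies, timeout=30)
print(response.status_code)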
Good datasets are crucial for any ML project, so curating a dataset is the first thing you need to do before you can leverage ML.
In this article, you learned what the different sources of data are, what the prerequisites are for collecting data from the web, how to collect the data, how to clean and prepare the data, and how to store it as a CSV file. You also learned the importance of proxies in web scraping.
If you decide to gather your data via web scraping, it’s beneficial to use reliable proxy providers and managed scraping services. With features such as high-quality rotating proxies, automated CAPTCHA solving, web scraping APIs, and prebuilt datasets, these solutions offer powerful ways to simplify large-scale data collection for ML.
Get started today to access high-quality datasets and make your ML projects more efficient and scalable.
Looking for a data provider? Read our Business Data Providers guide.