C++ Web Scraping Guide

Step by step guide to web scraping with C++.


In the modern data-driven world, web scraping is an indispensable tool for collecting data from websites that don’t offer an API.

In this article, you’ll learn how to write a web scraper in C++. You’ll use cpr and libxml2 to scrape a static site and Selenium to scrape a dynamic site. You’ll also learn how to use proxies in your scrapers.


Implementing Web Scraping in C++

To follow along with this tutorial, you need the following:

- A C++ compiler with C++20 support (the code uses std::format)
- CMake 3.14 or newer (FetchContent_MakeAvailable requires it)
- The libxml2 and libcurl development libraries

The dynamic-scraping section has a few extra requirements (Java, the Selenium Server, and a WebDriver), which are covered in that section.
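
On Debian or Ubuntu, for example, you can install the native dependencies with apt (package names differ on other distributions):

sudo apt-get install build-essential cmake libxml2-dev libcurl4-openssl-dev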


Scraping a Static Site

To scrape a static site, you’ll use the cpr library to make HTTP requests and libxml2 to parse the response and extract the data.

In this example, you’ll scrape all the text and author names from the Quotes to Scrape website, which is a sandbox website designed to help you practice web scraping.

Note: If you’re trying to scrape a real-life website, make sure to look at its terms and conditions to avoid any potential legal ramifications.

Before scraping any content, it’s a good idea to understand the structure of the HTML web page to learn how to extract the desired data. Open the Quotes to Scrape website in your browser and press CTRL + Shift + I to bring up the developer console.

You’ll find that each quote is wrapped in a div with the class quote. The body of the quote is inside a span with the class text, and the author is inside a small element with the class author:

The structure of the Quotes to Scrape web page
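
In simplified form (attributes trimmed for clarity), each quote block looks like this:

<div class="quote">
    <span class="text">"The world as we have created it is a process of our thinking..."</span>
    <span>by <small class="author">Albert Einstein</small></span>
</div>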

Create a directory on your machine for the project:

mkdir cpp-scraper
cd cpp-scraper

Then, create a CMakeLists.txt file where you define how to build the C++ project. You’ll use CMake as the build tool, so if you’re not familiar with CMake syntax, make sure you review the docs.

Paste the following code in CMakeLists.txt:

cmake_minimum_required(VERSION 3.14)
project(scraper VERSION 0.1.0)

include(FetchContent)
FetchContent_Declare(cpr GIT_REPOSITORY https://github.com/libcpr/cpr.git
                         GIT_TAG 3b15fa82ea74739b574d705fea44959b58142eb8)

FetchContent_MakeAvailable(cpr)

add_executable(static-scraper static-scraper.cpp)
target_compile_features(static-scraper PRIVATE cxx_std_20)

target_link_libraries(static-scraper PRIVATE cpr::cpr)

find_package(LibXml2 REQUIRED)
target_link_libraries(static-scraper PRIVATE LibXml2::LibXml2)

This file declares the project scraper with one executable named static-scraper, compiled from static-scraper.cpp. The FetchContent_Declare call declares the cpr dependency, pinned to a specific commit on GitHub, and FetchContent_MakeAvailable downloads it and makes it available for linking, which is done using target_link_libraries. (FetchContent_MakeAvailable is why CMake 3.14 or newer is required.) Finally, you use find_package to locate the libxml2 package and link your executable against it.

Create a file named static-scraper.cpp and include the following required headers:

#include <iostream>
#include "cpr/cpr.h"
#include "libxml/HTMLparser.h"
#include "libxml/xpath.h"
#include <format>

Then, create the main function where you write the logic for the scraper:

int main(int argc, char** argv) {

    return 0;
}

Inside the main function, make an HTTP GET request to https://quotes.toscrape.com/ using cpr and, if the response is not successful, abort:

cpr::Response r = cpr::Get(cpr::Url{"https://quotes.toscrape.com/"});
if (r.status_code != 200) {
    std::cerr << "Failed to fetch the page" << std::endl;
    return 1;
}

Next, use the htmlReadMemory function of libxml2 to parse the HTML response into an htmlDocPtr:

htmlDocPtr doc = htmlReadMemory(r.text.c_str(), r.text.length(), nullptr, nullptr, HTML_PARSE_NOWARNING | HTML_PARSE_NOERROR);
if (doc == nullptr) {
    std::cerr << "Failed to parse the page" << std::endl;
    return 1;
}

Note: When dealing with pointers, it’s a good idea to always check that they’re not null before attempting to use them. You’ll see this check performed at almost every step, but it won’t be explicitly called out again.

Now, it’s time to use XPath expressions to extract data from the doc. Create an XPath context from the document:

xmlXPathContextPtr context = xmlXPathNewContext(doc);
if (context == nullptr) {
    std::cerr << "Failed to create XPath context" << std::endl;
    return 1;
}

Then, to extract all the div elements with the class quote, use the XPath expression //div[@class='quote']:

xmlXPathObjectPtr quotes = xmlXPathEvalExpression((xmlChar *) "//div[@class='quote']", context);
if (quotes == nullptr || quotes->nodesetval == nullptr) {
    std::cerr << "Failed to evaluate XPath expression" << std::endl;
    return 1;
}

Iterate over each quote, extract the text and author, and print them. Note that each XPath result is checked before its node set is indexed:

for (int i = 0; i < quotes->nodesetval->nodeNr; i++) {
    xmlNodePtr quote = quotes->nodesetval->nodeTab[i];
    // Evaluate the relative XPath expressions against this quote node
    xmlXPathSetContextNode(quote, context);
    xmlXPathObjectPtr author = xmlXPathEvalExpression((xmlChar *) ".//small[@class='author']", context);
    if (author == nullptr || author->nodesetval == nullptr || author->nodesetval->nodeNr == 0) {
        std::cerr << "Failed to evaluate XPath expression" << std::endl;
        return 1;
    }
    xmlXPathObjectPtr text = xmlXPathEvalExpression((xmlChar *) ".//span[@class='text']", context);
    if (text == nullptr || text->nodesetval == nullptr || text->nodesetval->nodeNr == 0) {
        std::cerr << "Failed to evaluate XPath expression" << std::endl;
        return 1;
    }
    std::string author_text = reinterpret_cast<const char *>(xmlNodeGetContent(author->nodesetval->nodeTab[0]));
    std::string quote_text = reinterpret_cast<const char *>(xmlNodeGetContent(text->nodesetval->nodeTab[0]));

    std::cout << std::format("{}: {}", author_text, quote_text) << std::endl;
}
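
This evaluate-check-index pattern repeats for every expression, so in a larger scraper you might factor it into a small helper. Here’s a sketch of that idea (the function name firstNode is hypothetical and isn’t used in the listing below):

// Hypothetical helper: evaluate an XPath expression against the current
// context node and return the first matching node, or nullptr if none.
xmlNodePtr firstNode(const char *xpath, xmlXPathContextPtr ctx) {
    xmlXPathObjectPtr result = xmlXPathEvalExpression((const xmlChar *) xpath, ctx);
    if (result == nullptr) {
        return nullptr;
    }
    xmlNodePtr node = (result->nodesetval != nullptr && result->nodesetval->nodeNr > 0)
                          ? result->nodesetval->nodeTab[0]
                          : nullptr;
    xmlXPathFreeObject(result); // frees the node-set container, not the nodes
    return node;
}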

The whole code looks like this:

#include <iostream>
#include "cpr/cpr.h"
#include "libxml/HTMLparser.h"
#include "libxml/xpath.h"
#include <format>

int main(int argc, char** argv) {
    cpr::Response r = cpr::Get(cpr::Url{"https://quotes.toscrape.com/"});
    if (r.status_code != 200) {
        std::cerr << "Failed to fetch the page" << std::endl;
        return 1;
    }

    htmlDocPtr doc = htmlReadMemory(r.text.c_str(), r.text.length(), nullptr, nullptr, HTML_PARSE_NOWARNING | HTML_PARSE_NOERROR);
    if (doc == nullptr) {
        std::cerr << "Failed to parse the page" << std::endl;
        return 1;
    }

    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    if (context == nullptr) {
        std::cerr << "Failed to create XPath context" << std::endl;
        return 1;
    }

    xmlXPathObjectPtr quotes = xmlXPathEvalExpression((xmlChar *) "//div[@class='quote']", context);
    if (quotes == nullptr || quotes->nodesetval == nullptr) {
        std::cerr << "Failed to evaluate XPath expression" << std::endl;
        return 1;
    }

    for (int i = 0; i < quotes->nodesetval->nodeNr; i++) {
        xmlNodePtr quote = quotes->nodesetval->nodeTab[i];
        // Evaluate the relative XPath expressions against this quote node
        xmlXPathSetContextNode(quote, context);
        xmlXPathObjectPtr author = xmlXPathEvalExpression((xmlChar *) ".//small[@class='author']", context);
        if (author == nullptr || author->nodesetval == nullptr || author->nodesetval->nodeNr == 0) {
            std::cerr << "Failed to evaluate XPath expression" << std::endl;
            return 1;
        }
        xmlXPathObjectPtr text = xmlXPathEvalExpression((xmlChar *) ".//span[@class='text']", context);
        if (text == nullptr || text->nodesetval == nullptr || text->nodesetval->nodeNr == 0) {
            std::cerr << "Failed to evaluate XPath expression" << std::endl;
            return 1;
        }
        std::string author_text = reinterpret_cast<const char *>(xmlNodeGetContent(author->nodesetval->nodeTab[0]));
        std::string quote_text = reinterpret_cast<const char *>(xmlNodeGetContent(text->nodesetval->nodeTab[0]));

        std::cout << std::format("{}: {}", author_text, quote_text) << std::endl;
    }
    return 0;
}
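
The listing skips resource cleanup for brevity, since the process exits right after printing. In a longer-running program, you’d release the libxml2 objects before returning; a minimal sketch of what that would look like:

// Release libxml2 resources once scraping is done. Note that the xmlChar
// buffers returned by xmlNodeGetContent are also leaked in the loop above;
// production code would capture and xmlFree them as well.
xmlXPathFreeObject(quotes);
xmlXPathFreeContext(context);
xmlFreeDoc(doc);
xmlCleanupParser();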

To compile the code, create a build directory and move into it:

mkdir build
cd build

Generate the build files using CMake:

cmake ..

And build the executable:

make -j4

Finally, run the scraper:

./static-scraper

You should get the following output:

Albert Einstein: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
J.K. Rowling: "It is our choices, Harry, that show what we truly are, far more than our abilities."
Albert Einstein: "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
Jane Austen: "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
Marilyn Monroe: "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
Albert Einstein: "Try not to become a man of success. Rather become a man of value."
André Gide: "It is better to be hated for what you are than to be loved for what you are not."
Thomas A. Edison: "I have not failed. I've just found 10,000 ways that won't work."
Eleanor Roosevelt: "A woman is like a tea bag; you never know how strong it is until it's in hot water."
Steve Martin: "A day without sunshine is like, you know, night."

Scraping a Dynamic Website

The scraper you just wrote works well for static sites, but it can’t execute JavaScript, which means it can’t scrape dynamic websites that load data using JavaScript. To scrape dynamic websites, you need a browser automation tool, such as Selenium, that can launch a real browser, execute the page’s JavaScript, and let your scraper query the rendered DOM.

Unfortunately, Selenium does not have official bindings for C++, so you’ll use the Selenium Server and the webdriverxx library to pass commands to the server. The current version of the library runs into an issue with the latest version of the Selenium Server, so you’ll use an older commit and version 4.0.0 of the Selenium Server, which is the last version reported to be working.

To follow along, you need to install the latest version of Java on your computer. You also need a WebDriver and its corresponding browser. This article uses Geckodriver and Firefox, but you can use whatever you prefer.

In this example, you’ll scrape the Oscar Winning Films page, which is another sandbox site. If you open the page and click any of the year buttons, the table of Oscar-winning films for that year is loaded via Ajax. The film names are stored in td elements with the class film-title:

Structure of the film data in HTML code
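
In simplified form, each row of the loaded table looks something like this:

<tr class="film">
    <td class="film-title">Spotlight</td>
    <!-- additional columns omitted -->
</tr>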

Add another FetchContent_Declare line in CMakeLists.txt to fetch the webdriverxx library. Then, modify the FetchContent_MakeAvailable line to include webdriverxx:

FetchContent_Declare(webdriverxx GIT_REPOSITORY https://github.com/GermanAizek/webdriverxx
                         GIT_TAG 0b04c449b6f187ecff67addaf1a22ae23a43afd9
                         GIT_SUBMODULES "")

FetchContent_MakeAvailable(cpr webdriverxx) # Add webdriverxx here

At the end of the file, paste the following code. It declares another executable, adds the webdriverxx headers to its include path (the library is used header-only here), and links the executable against libcurl, which webdriverxx uses to communicate with the Selenium Server:

add_executable(dynamic-scraper dynamic-scraper.cpp)
target_compile_features(dynamic-scraper PRIVATE cxx_std_20)

find_package(CURL REQUIRED)

target_include_directories(dynamic-scraper PRIVATE "${webdriverxx_SOURCE_DIR}/src/include")

target_link_libraries(dynamic-scraper PRIVATE CURL::libcurl)

Create a file named dynamic-scraper.cpp and include the libraries:

#include "webdriverxx.h"
#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
using namespace webdriverxx;

In the main function, start an instance of Firefox:

int main() {
    WebDriver ff = Start(Firefox());

    return 0;
}

Navigate to the Oscar Winning Films web page and find the 2015 button. Click it and wait five seconds for the table to load:

ff.Navigate("https://www.scrapethissite.com/pages/ajax-javascript");
ff.FindElement(ById("2015")).Click();
std::this_thread::sleep_for(std::chrono::seconds(5));

Then, find the elements with the class film-title and print their contents. WaitForValue keeps calling the lambda until it returns successfully, which guards against querying before the page is ready:

auto find_films = [&]() { return ff.FindElements(ByClass("film-title")); };
std::vector<Element> films = WaitForValue(find_films);
std::cout << "Found " << films.size() << " films" << std::endl;
for (Element film : films) {
    std::cout << film.GetText() << std::endl;
}

The whole code looks like this:

#include "webdriverxx.h"
#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
using namespace webdriverxx;

int main() {
    WebDriver ff = Start(Firefox());
    ff.Navigate("https://www.scrapethissite.com/pages/ajax-javascript");
    ff.FindElement(ById("2015")).Click();
    std::this_thread::sleep_for(std::chrono::seconds(5));
    auto find_films = [&]() { return ff.FindElements(ByClass("film-title")); };
    std::vector<Element> films = WaitForValue(find_films);
    std::cout << "Found " << films.size() << " films" << std::endl;
    for (Element film : films) {
        std::cout << film.GetText() << std::endl;
    }
    return 0;
}
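
The same pattern extends to the other years on the page. As a sketch (assuming each button’s id is the year itself, as it is for 2015), you could loop over several buttons before returning:

// Sketch: scrape several years in one run. Assumes each year button's
// id attribute is the year, and reuses the article's crude fixed wait.
for (const char *year : {"2012", "2013", "2014"}) {
    ff.FindElement(ById(year)).Click();
    std::this_thread::sleep_for(std::chrono::seconds(5));
    for (Element film : ff.FindElements(ByClass("film-title"))) {
        std::cout << year << ": " << film.GetText() << std::endl;
    }
}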

Download the Selenium Server and run it with Java:

java -jar selenium-server-4.0.0.jar standalone

Keep this terminal open.

In another terminal, make sure you’re in the build directory and compile the project:

cmake .. && make -j4

Then, run the scraper:

./dynamic-scraper

An instance of Firefox launches, and after a short wait, you should get the following output:

Found 16 films
Spotlight
Mad Max: Fury Road
The Revenant
Bridge of Spies
The Big Short
The Danish Girl
Room
Ex Machina
The Hateful Eight
Inside Out
Amy
Bear Story
A Girl in the River: The Price of Forgiveness
Son of Saul
Spectre
Stutterer

Using a Proxy for Web Scraping

There are many different scenarios where you’d need to use proxies for web scraping. A proxy hides your IP address from the website, which helps protect your privacy and avoid IP bans. A proxy can also give you access to geoblocked or restricted content.

To use a proxy, you need access to a proxy server. For best results, you can buy a premium proxy, but for testing purposes, you can pick a free proxy from the Free Proxy List site. You need the IP address and port of the proxy server from the list.

To use a proxy with cpr, pass the proxy details using cpr::Proxies in the cpr::Get call. The map is keyed by the target URL’s scheme, so include an https entry when scraping an https site:

cpr::Response r = cpr::Get(cpr::Url{"https://quotes.toscrape.com/"}, cpr::Proxies{{"http", "http://IP_ADDRESS:PORT"}, {"https", "http://IP_ADDRESS:PORT"}});
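
To confirm that traffic is actually going through the proxy, you can request an IP-echo service and check which address comes back. A quick sketch (httpbin.org/ip is just one example of such a service):

cpr::Response check = cpr::Get(cpr::Url{"https://httpbin.org/ip"}, cpr::Proxies{{"https", "http://IP_ADDRESS:PORT"}});
std::cout << check.text << std::endl; // should print the proxy's IP, not yours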

If your proxy server requires authentication, pass the credentials using cpr::ProxyAuthentication (note the scheme keys again):

cpr::Response r = cpr::Get(cpr::Url{"https://quotes.toscrape.com/"}, cpr::Proxies{{"http", "http://IP_ADDRESS:PORT"}, {"https", "http://IP_ADDRESS:PORT"}}, cpr::ProxyAuthentication{{"http", cpr::EncodedAuthentication{"USERNAME", "PASSWORD"}}, {"https", cpr::EncodedAuthentication{"USERNAME", "PASSWORD"}}});

To use a proxy with webdriverxx, use SetProxy to pass the proxy details:

WebDriver ff = Start(Firefox().SetProxy(
    HttpProxy("IP_ADDRESS:PORT")
        .SetUsername("USERNAME")
        .SetPassword("PASSWORD")
));

That’s all you need to use proxies. You can compile and run the code as shown before. You’ll get the same output, but this time, your requests are routed through the proxy server.

You can find all the code for this article on GitHub.


Conclusion

In this article, you learned how to write a web scraper with C++. You learned how to integrate cpr and libxml2 to build a static website scraper and how to use Selenium to write a dynamic website scraper. With this knowledge, you can create more advanced scrapers that can take on real-life websites. Various scenarios, such as data mining and competitor analysis, require fast and efficient web scrapers, which you can write with C++.

This article also showed you how to add proxies to your C++ web scraper. Proxies are vital for protecting your privacy online: they mask your real IP address and help you circumvent IP bans and geoblocking.

Looking for the best proxy providers? Read our Best Proxy Providers review.
