Best HTTP Headers for Web Scraping
Learn how to enhance web scraping with HTTP headers like User-Agent, Accept-Language, and Cookies. Avoid detection and optimize your scraping efficiency.
HTTP headers are key-value pairs sent alongside HTTP requests to communicate with the server. If set properly, HTTP headers make your scraper appear more human, reducing the chance of being flagged and blocked.
In this article, you’ll learn how to use headers like User-Agent, Accept-Language, Referer, and Cookie to improve the success rate and efficiency of your web scraping projects.
HTTP headers are an essential part of the HTTP protocol because they inform the server about the request sender. Because of that, when scraping the web, you should take extra care configuring your headers; poorly configured HTTP headers are one of the first signs that something is wrong with a request. A good rule of thumb is to make HTTP requests appear as human as possible. You’d usually do so by appropriately setting and rotating your headers with each new request.
Each header you set serves a specific purpose, from mimicking browser behavior to maintaining session states. Let’s take a look at some of the best HTTP headers for web scraping.
The User-Agent header contains information about the browser, operating system, and device making the request. Even though this information might seem simple and unimportant, many websites use User-Agent parsers to analyze User-Agent headers of incoming requests. If a server detects any type of unusual pattern in the User-Agent headers, such as identical or fake headers across multiple requests, it can flag and eventually block the faulty requests.
To help ensure you won’t be flagged or blocked, it’s recommended that you rotate User-Agent values for each request. You can also mix desktop and mobile strings to emulate natural browsing patterns, making it harder for websites to detect automated behavior.
To rotate User-Agent headers for each request, maintain a list of genuine User-Agent strings from various browsers and devices. Then, pick one at random for each request you send, for example with the random.choice() method:
import random
import requests

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
    # Add more User-Agent strings here
]

# Pick a random User-Agent string
random_user_agent = random.choice(user_agents)
The random.choice() call picks a random User-Agent string from the user_agents list, which you can then use as the value of the User-Agent header for your request:
# Set the value for the User-Agent header
headers = {'User-Agent': random_user_agent}
Now, you can send an HTTP request using the configured header. Here’s an example where you’re sending a GET request to a simple HTTP request and response service, httpbin.org/headers:
response = requests.get('http://httpbin.org/headers', headers=headers)
The requests.get() method returns a requests.Response object that contains the server’s response to the request you sent. In this particular case, the response body is a JSON representation of the HTTP headers your request carried:
content = response.text
print(content)
Your output will look like this:
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36", "X-Amzn-Trace-Id": "Root=1-6717771f-5d8935cb6fa374011077878d" } }
Take a look at the "User-Agent" field in this output. Its value is one of the User-Agent strings you specified in the user_agents list, chosen at random.
"User-Agent"
For simplicity’s sake, this code doesn’t handle any potential errors. All production-ready code should do that. Additionally, make sure you keep your User-Agent list updated as browser versions change frequently. Current User-Agent strings are available from sources like UserAgentString.com.
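As a rough illustration of what that error handling might look like, here’s a minimal sketch that wraps the same request in a try/except block; the 10-second timeout is an arbitrary choice, not a requirement:

import random
import requests

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
]

headers = {'User-Agent': random.choice(user_agents)}

try:
    # Fail fast instead of hanging on a slow or unresponsive server
    response = requests.get('http://httpbin.org/headers', headers=headers, timeout=10)
    # Raise an exception for 4xx and 5xx status codes
    response.raise_for_status()
    print(response.text)
except requests.exceptions.RequestException as e:
    # Connection errors, timeouts, and HTTP error codes all land here
    print(f'Request failed: {e}')

Catching requests.exceptions.RequestException covers the library’s network-related failures in one place; a real scraper would likely add retries or logging on top of this.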
The Accept-Language header tells the server which languages the client can handle in the response. That information is a crucial part of the mechanism called language negotiation—a process that enables a server to deliver content in a language that matches the user’s preferences or region.
Of course, Accept-Language isn’t the only factor used to determine which language is served. Servers typically weigh other signals as well, such as the IP address the request was sent from, browser settings, and cookies. A mismatch between your declared language preferences and the geographic location of your IP address can raise red flags. For instance, if you’re scraping from an IP in Germany but your Accept-Language header only lists English, the request can look suspicious. It’s recommended that you always align this header with the country of your proxy or VPN.
To configure the Accept-Language header, you provide a list of languages, each with a language code and an optional q value (more on that in a moment):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    # Here's the configuration of the Accept-Language header
    'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7'
}
In this example, de-DE and de are language codes for German: de-DE (German as used in Germany) comes first with an implicit q value of 1.0, and de (German in general) has a q value of 0.9. Meanwhile, en-US is the code for English (United States) and en is for English in general, with q values of 0.8 and 0.7, respectively.
The q values indicate preferential order—the higher the q value, the more it’s preferred. For instance, the previous example sets German as the preferred language, with English as a backup.
Once the Accept-Language header is set up, you can send the request as usual:
response = requests.get('http://httpbin.org/headers', headers=headers)
content = response.text
print(content)
Since http://httpbin.org/headers returns a JSON representation of the headers sent alongside your request, sending this request results in the following:
{ "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7", "Host": "httpbin.org", "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1", "X-Amzn-Trace-Id": "Root=1-67184217-759391e6195a85946d99aff3" } }
Note how the Accept-Language header was set exactly as you configured it in the scraping code.
The goal of using the Accept-Language header is to make your requests look as natural as possible. If you’re scraping a multinational site, consider rotating your Accept-Language header along with your proxy locations.
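As a minimal sketch of that idea, assuming you already maintain a pool of proxies and know which country each one exits through (the proxy URLs and the exact pairings below are placeholders, not working endpoints):

import random
import requests

# Hypothetical proxy pool; each entry pairs a proxy with a matching Accept-Language value
proxy_pool = [
    {'proxy': 'http://de-proxy.example.com:8080', 'accept_language': 'de-DE,de;q=0.9,en;q=0.7'},
    {'proxy': 'http://fr-proxy.example.com:8080', 'accept_language': 'fr-FR,fr;q=0.9,en;q=0.7'},
    {'proxy': 'http://us-proxy.example.com:8080', 'accept_language': 'en-US,en;q=0.9'},
]

choice = random.choice(proxy_pool)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': choice['accept_language'],
}
proxies = {'http': choice['proxy'], 'https': choice['proxy']}

response = requests.get('http://httpbin.org/headers', headers=headers, proxies=proxies)
print(response.text)

Keeping the proxy and its language in a single structure makes it harder to accidentally send a mismatched pair.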
The Accept-Encoding header informs the server about the compression methods your client supports, such as gzip, deflate, or br. When the server knows which compression methods the client supports, it can shrink the response by compressing it into one of those formats. Compressed data is smaller, so it uses less bandwidth and transfers faster.
Here’s how you can implement the Accept-Encoding header to accept gzip, deflate, and Brotli (br) compressions:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, br'
}
The requests library in Python handles gzip and deflate decompression automatically (Brotli works too, provided the optional brotli package is installed), so you don’t need to worry about it. You just send the request and treat the response as usual:
response = requests.get('http://httpbin.org/headers', headers=headers)

# requests library automatically decompresses the content
content = response.text
print(content)
It’s up to the server to decide whether to compress the response data, and because requests decompresses it transparently, the body looks the same either way. As before, you’ll get a JSON representation of the headers sent alongside your request.
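If you do want to confirm whether the server compressed the payload, the response’s own headers tell you. Here’s a small sketch; advertising br assumes the optional brotli package is installed so the content can be decoded:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    # Listing br assumes the optional brotli package is available for decoding
    'Accept-Encoding': 'gzip, deflate, br'
}

response = requests.get('http://httpbin.org/headers', headers=headers)

# The Content-Encoding response header shows which compression (if any) the server applied
print(response.headers.get('Content-Encoding', 'no compression'))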
The Referer header indicates the URL of the previous web page where the current request originated. This helps servers track users’ navigation paths as they move between pages.
A properly configured Referer header makes requests look like they are part of natural browsing behavior. Websites often check this header to ensure the request is coming from a legitimate source, like a search engine or another page within the same site.
To mimic real user behavior, it’s best to set the Referer header dynamically, based on the page a real user would logically have arrived from. For example, if you’re scraping products from an e-commerce website, you’d first scrape each product category and then each product within those categories.
When accessing category pages, which users commonly reach from search engines, you can set a search engine as the Referer:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    # Category pages accessed from Google
    'Referer': 'https://google.com'
}

categories_to_scrape = [
    'https://example.com/category-1',
    'https://example.com/category-2',
    'https://example.com/category-3',
]

for url in categories_to_scrape:
    response = requests.get(url, headers=headers)
    content = response.text
    print(content)
With the category pages scraped, you can move on to the products within each category.
Say you’re scraping products from Category 1. Their pages are usually accessed from the category page, so the Referer should be set appropriately:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    # Product pages accessed from the Category 1 page
    'Referer': 'https://example.com/category-1'
}

products_to_scrape = [
    'https://example.com/product-1',
    'https://example.com/product-2',
    'https://example.com/product-3',
]

for url in products_to_scrape:
    response = requests.get(url, headers=headers)
    content = response.text
    print(content)
Unfortunately, things aren’t usually this simple. Category pages are typically paginated, and each successive page is accessed from the previous page. To accommodate that, you should keep track of the current URL, which has to be set as a Referer when sending a request to the next page:
import requests

# Starting URL
previous_url = 'https://example.com/category-page-1'

# Simulating the scraping workflow
urls_to_scrape = [
    'https://example.com/category-page-2',
    'https://example.com/category-page-3',
    'https://example.com/category-page-4',
]

for url in urls_to_scrape:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Referer': previous_url
    }

    response = requests.get(url, headers=headers)
    content = response.text
    print(content)

    # Set the current URL as the referer for the next page
    previous_url = url
The Cookie header carries data that the server has previously sent to the client. It stores session data, user preferences, and other vital information that websites use to recognize and track visitors.
When you visit a website, cookies are stored in your browser and sent back to the server with each request, helping the server recognize you. There are many types of cookies, but the following are several that you should be aware of when scraping the web:
jwt_token: a JSON Web Token, typically carrying signed authentication claims
auth_token: a general-purpose authentication token issued after login
sessionid: a session identifier that ties your requests to a server-side session
The Cookie header is especially important when you’re scraping websites that require login or use session-based navigation. Without the correct cookies, your scraper may not be able to maintain a session or could get blocked from accessing certain content.
To use the Cookie header when scraping websites, you can either capture cookies from a legitimate browser session and include them in your scraper’s requests, or let your HTTP client obtain them by logging in programmatically, the way a browser would. Either way, the goal is for your scraper to behave like a real user. The following example takes the login approach using a requests.Session:
import requests

session = requests.Session()

# Login to the site
login_data = {'username': 'your_username', 'password': 'your_password'}
session.post('https://example.com/login', data=login_data)

# Now you can make authenticated requests
response = session.get('https://example.com/protected-page')
print(response.text)
This approach automatically handles cookies for you. The Session object stores cookies received from the server and sends them back in subsequent requests.
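If you’re curious which cookies the session has picked up, you can inspect its cookie jar directly. This short sketch continues the example above; the cookie names you’ll see depend entirely on the site:

# Inspect the cookies the Session has accumulated so far
for cookie in session.cookies:
    print(cookie.name, cookie.value, cookie.domain)

# Or view them as a plain dictionary
print(session.cookies.get_dict())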
For more complex scenarios, you might need to set or modify cookies manually. In any real-world situation, handle user credentials and session tokens securely, and never hard-code sensitive information in your script; the literal values below are placeholders:
cookies = {'session_id': 'abc123', 'user_prefs': 'dark_mode'}
response = requests.get('https://example.com', cookies=cookies)
Remember, cookie policies can be complex. Some sites use secure, HTTP-only cookies that can’t be set via JavaScript. Others deploy anti-bot measures that look for inconsistencies in cookie handling, such as expired or unrelated cookies, missing session cookies, or a lack of CSRF token management, and promptly block any request that looks suspicious.
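To illustrate the CSRF point, here’s a hedged sketch that fetches a login page, pulls a token out of a hidden form field, and sends it back with the credentials. It assumes BeautifulSoup is installed, and the field name csrf_token and the URLs are hypothetical; real sites name and place their tokens differently:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Load the login page first so the server sets its initial cookies
login_page = session.get('https://example.com/login')

# Hypothetical: extract a CSRF token from a hidden form field named "csrf_token"
soup = BeautifulSoup(login_page.text, 'html.parser')
token_field = soup.find('input', {'name': 'csrf_token'})
csrf_token = token_field['value'] if token_field else ''

# Send the token back together with the credentials, as a browser would
login_data = {
    'username': 'your_username',
    'password': 'your_password',
    'csrf_token': csrf_token,
}
response = session.post('https://example.com/login', data=login_data)
print(response.status_code)

The key detail is reusing the same Session for both requests, so the cookies set by the login page travel along with the login POST.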
HTTP headers like User-Agent, Accept-Language, Accept-Encoding, Referer, and Cookie can significantly improve your web scraping. They help your requests appear more genuine, minimizing the chances of being blocked by anti-bot mechanisms.