Python Requests for Web Scraping: Headers, Sessions & Cookies
11 min read · by Nabeel


python · requests · beginner

The requests library is where most Python web scraping starts. Before you reach for Playwright or Scrapy, you should know how to make HTTP requests properly — with sessions, headers, cookies, and error handling.

This guide covers everything you need to use requests effectively for scraping.

Basic GET and POST Requests

```python
import requests

# GET request — fetching a page
response = requests.get("https://httpbin.org/get")
print(response.status_code)  # 200
print(response.text)         # the response body

# POST request — submitting data
response = requests.post("https://httpbin.org/post", data={"key": "value"})
print(response.json())  # parsed JSON response
```

Most scraping uses GET. You'll use POST when submitting forms or interacting with APIs that expect it.
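GET requests often need query parameters. Rather than building the query string by hand, you can pass `params=` and let requests encode it for you. The snippet below prepares the request offline just to show the resulting URL (example.com is a placeholder):

```python
import requests

# params= builds and URL-encodes the query string
req = requests.Request("GET", "https://example.com/search",
                       params={"q": "web scraping", "page": 2}).prepare()
print(req.url)  # https://example.com/search?q=web+scraping&page=2
```

This avoids subtle bugs with special characters like spaces and ampersands in search terms.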

Setting Headers and User Agents

Bare requests without headers are the easiest way to get blocked. Every request you send has a default user agent that screams "I'm a script."

```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
}

response = requests.get("https://example.com", headers=headers)
```

At minimum, always set a realistic User-Agent. The other headers make your requests look more like a real browser.

Using Sessions for Cookies

A Session object persists cookies across requests — exactly like a browser does. This is essential for sites that require login or track state.

```python
session = requests.Session()

# First request sets cookies
session.get("https://example.com")

# Subsequent requests automatically include those cookies
response = session.get("https://example.com/dashboard")

# You can also set default headers for the session
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/125.0.0.0",
})

# All requests through this session now use these headers
response = session.get("https://example.com/api/data")
```

Sessions also reuse TCP connections, making multiple requests to the same host faster.
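Session-level headers merge with any per-request headers, with the per-request values layered on top. One way to see the merged result without sending anything over the network is `Session.prepare_request` (the User-Agent string here is a made-up example):

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/1.0"})

# Prepare (but don't send) a request to inspect the merged headers
req = requests.Request("GET", "https://example.com/api",
                       headers={"Accept": "application/json"})
prepared = session.prepare_request(req)

print(prepared.headers["User-Agent"])  # my-scraper/1.0 (session default)
print(prepared.headers["Accept"])      # application/json (per-request)
```

This is also a handy debugging trick when a site blocks you and you want to confirm exactly which headers your scraper is sending.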

Handling Redirects

By default, requests follows redirects automatically. Sometimes you want to control this.

```python
# Follow redirects (default behavior)
response = requests.get("https://example.com/old-page")
print(response.url)  # shows the final URL after redirects

# Disable redirects to inspect them manually
response = requests.get("https://example.com/old-page", allow_redirects=False)
print(response.status_code)          # 301 or 302
print(response.headers["Location"])  # where it wants to redirect
```

Timeouts and Retries

Never make a request without a timeout. Without one, your scraper can hang forever on an unresponsive server.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Simple timeout
response = requests.get("https://example.com", timeout=10)  # 10 seconds

# Retry strategy for production scrapers
session = requests.Session()
retries = Retry(
    total=3,           # retry up to 3 times
    backoff_factor=1,  # exponential delays between retries
    status_forcelist=[429, 500, 502, 503, 504],
)
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

# This will automatically retry on server errors
response = session.get("https://example.com/api/data", timeout=10)
```

The retry adapter handles flaky servers and rate limiting automatically. The backoff_factor adds exponential delays between retries.

POST Requests for Form Submission

Some sites require form submissions to access data. Use POST with the form fields:

```python
# Form data (application/x-www-form-urlencoded)
response = requests.post("https://example.com/search", data={
    "query": "python web scraping",
    "page": 1,
})

# JSON data (application/json) — common for APIs
response = requests.post("https://example.com/api/search", json={
    "query": "python web scraping",
    "filters": {"category": "tutorials"},
})
```

Use data= for traditional form submissions and json= for API endpoints.
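If you're unsure which one an endpoint expects, it helps to see what each option actually sends. The prepared requests below are built offline just to compare the encoded bodies and the Content-Type header requests sets automatically (the URLs are placeholders):

```python
import requests

# data= encodes the dict as a form body
form = requests.Request("POST", "https://example.com/search",
                        data={"query": "python"}).prepare()
print(form.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form.body)                     # query=python

# json= serializes the dict to a JSON body
api = requests.Request("POST", "https://example.com/api/search",
                       json={"query": "python"}).prepare()
print(api.headers["Content-Type"])   # application/json
print(api.body)                      # b'{"query": "python"}'
```

Sending the wrong one is a common cause of mysterious 400 errors: the server gets a body it can't parse.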

Downloading Files

Downloading images, PDFs, or other files is straightforward:

```python
# Download a file
response = requests.get("https://example.com/report.pdf", stream=True)

with open("report.pdf", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
```

The stream=True parameter prevents loading the entire file into memory. Important for large files.
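To see the chunking in action without hitting the network, here's an offline sketch using a hand-built Response backed by an in-memory buffer, standing in for a 20 KB download (normally you'd get this object from requests.get):

```python
import io
import requests

resp = requests.models.Response()
resp.status_code = 200
resp.raw = io.BytesIO(b"x" * 20_000)  # pretend this is a 20 KB file body

# iter_content reads the body in pieces rather than all at once
chunks = list(resp.iter_content(chunk_size=8192))
print([len(c) for c in chunks])  # [8192, 8192, 3616]
```

With a real streamed download the same loop keeps memory flat no matter how large the file is.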

Response Handling

Different endpoints return different formats. Here's how to handle each:

```python
response = requests.get("https://example.com/page")

# HTML content — pass to BeautifulSoup
html = response.text

# JSON response — parse directly
data = response.json()

# Binary content (images, PDFs)
binary = response.content

# Check encoding
print(response.encoding)  # utf-8, ISO-8859-1, etc.

# Force encoding if auto-detection fails
response.encoding = "utf-8"
html = response.text
```
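Under the hood, response.text is just the raw bytes decoded with the guessed encoding. A tiny offline illustration of why forcing the right encoding matters:

```python
# Bytes as they might arrive over the wire, encoded as Latin-1
raw = "café".encode("latin-1")

print(raw.decode("latin-1"))                  # café — decoded correctly
print(raw.decode("utf-8", errors="replace"))  # caf� — wrong guess mangles the accent
```

If scraped text shows garbled characters like this, a bad encoding guess is usually the culprit.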

| Property | Returns | Use Case |
| --- | --- | --- |
| `.text` | String (decoded) | HTML pages |
| `.json()` | Dict/List | API responses |
| `.content` | Bytes | Files, images |
| `.status_code` | Integer | Error checking |
| `.headers` | Dict | Content-Type, cookies |

Error Handling Patterns

Production scrapers need proper error handling. Here's the pattern I use:

```python
import requests
import time

def fetch_page(url, session, max_retries=3):
    """Fetch a URL with error handling and manual retry logic."""
    for attempt in range(max_retries):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()  # raises an exception for 4xx/5xx
            return response
        except requests.exceptions.HTTPError:
            if response.status_code == 429:
                wait = 2 ** attempt  # exponential backoff
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)
            else:
                print(f"HTTP error {response.status_code} for {url}")
                return None
        except requests.exceptions.ConnectionError:
            print(f"Connection failed for {url}. Retrying...")
            time.sleep(1)
        except requests.exceptions.Timeout:
            print(f"Timeout for {url}. Retrying...")
    return None
```

Always call raise_for_status() to surface HTTP errors. requests doesn't raise for a 403 or 500 on its own, so it's easy to miss them if you only catch connection exceptions.
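To see what raise_for_status() actually does, here's an offline sketch with a hand-built Response (normally you'd get this object back from session.get; the URL is a placeholder):

```python
import requests

resp = requests.models.Response()
resp.status_code = 404
resp.reason = "Not Found"
resp.url = "https://example.com/missing"

try:
    resp.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(e)  # 404 Client Error: Not Found for url: https://example.com/missing
```

A 2xx response passes through silently; anything in the 4xx or 5xx range raises HTTPError with the status, reason, and URL baked into the message.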

What's Next

The requests library handles 80% of scraping tasks. For JavaScript-rendered pages, you'll need Playwright. For large-scale scraping, you'll want proxy rotation and concurrent requests.

The Master Web Scraping course builds on these fundamentals with real-world projects that put them all together.
