HTTP Requests in Web Scraping: GET, POST, Headers & More
An HTTP request is a message sent from a client (your scraper) to a web server asking for a resource. In web scraping, you send HTTP requests to fetch web pages, then parse the response HTML to extract data.
How HTTP Requests Work in Scraping
When your browser visits a page, it sends an HTTP GET request. The server responds with the HTML. Your scraper does the same thing, just without rendering the page visually.
import requests
# Basic GET request
response = requests.get("https://example.com/products")
print(response.status_code) # 200
print(response.text) # HTML content
# With headers (to look like a real browser)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get("https://example.com/products", headers=headers)
GET vs. POST
- GET: Fetches a page. Parameters go in the URL. Used for most scraping.
- POST: Sends data to the server. Used for login forms, search forms, and API endpoints.
# POST request (e.g., login)
data = {"username": "user", "password": "pass"}
response = requests.post("https://example.com/login", data=data)
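To make the difference concrete, here is a minimal sketch that builds one GET and one POST request without sending either, so you can see where the parameters end up. It uses the standard library's urllib rather than requests, purely so it runs offline. The /search path and the q and page parameters are hypothetical. With requests, the rough equivalents are the params= argument for GET and the data= argument for POST.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search parameters, for illustration only.
params = {"q": "laptop", "page": 2}

# GET: the parameters are encoded into the URL's query string.
get_req = Request(
    "https://example.com/search?" + urlencode(params),
    method="GET",
)

# POST: the same parameters travel in the request body instead.
post_req = Request(
    "https://example.com/search",
    data=urlencode(params).encode("utf-8"),
    method="POST",
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Nothing is sent here. The Request objects only describe what would go over the wire.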
Important Headers for Scraping
- User-Agent: Identifies your client. Set this to mimic a real browser.
- Accept: What content types you accept (text/html,application/json)
- Referer: The page you "came from". Some sites check this.
- Cookie: Session cookies for authenticated scraping
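The headers above can be collected into one dictionary and reused across requests. This is a small sketch, not a recommended set of values: build_headers is a hypothetical helper, and the header values simply repeat the examples already used on this page.

```python
from typing import Dict, Optional


def build_headers(
    referer: Optional[str] = None,
    cookie: Optional[str] = None,
) -> Dict[str, str]:
    """Return browser-like request headers (hypothetical helper)."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/json",
        "Accept-Language": "en-US,en;q=0.9",
    }
    # Referer and Cookie are optional, so add them only when provided.
    if referer:
        headers["Referer"] = referer
    if cookie:
        headers["Cookie"] = cookie
    return headers


headers = build_headers(referer="https://example.com/")
print(headers)
```

The resulting dictionary can be passed as headers=headers to requests.get or requests.post.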
Sessions and Cookies
Use requests.Session() to persist cookies across multiple requests — essential for scraping behind login pages.
session = requests.Session()
session.post("https://example.com/login", data={"user": "me", "pass": "secret"})
# Now all requests in this session include the login cookies
response = session.get("https://example.com/dashboard")
Status Codes to Watch For
- 200: Success
- 301/302: Redirect (requests follows these automatically)
- 403: Forbidden. The server refused the request, often because the site is blocking you.
- 429: Too many requests. You're being rate limited, so slow down before retrying.
- 503: Service unavailable. The server is overloaded or is blocking bots.
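One common way to react to 429 and 503 responses is to wait and retry with an increasing delay. The sketch below is one possible approach, not the only one. fetch_with_retries and backoff_delay are hypothetical helpers, the delay values are arbitrary, and it assumes the server may send a Retry-After header containing a number of seconds. The fetch argument is any callable that takes a URL and returns an object with status_code and headers, such as requests.get.

```python
import time
from typing import Callable, Optional

# Status codes that usually mean "try again later".
RETRYABLE = {429, 503}


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait after failed attempt number `attempt` (0-based)."""
    # Prefer the server's own hint when it is a plain number of seconds.
    if retry_after is not None and retry_after.isdigit():
        return min(float(retry_after), cap)
    # Otherwise double the delay on each attempt, up to the cap.
    return min(base * (2 ** attempt), cap)


def fetch_with_retries(
    fetch: Callable,
    url: str,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call fetch(url), retrying on 429/503 with backoff."""
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code not in RETRYABLE:
            return response
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, response.headers.get("Retry-After")))
    return response  # Still failing: hand the last response to the caller.
```

With requests installed, a call would look like fetch_with_retries(requests.get, "https://example.com/products"). A 403 is deliberately not retried here, since repeating a blocked request rarely helps.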