Session Cookies in Web Scraping: Authentication & Persistence
A session cookie is a small piece of data that a server sets and the browser stores and sends back, so the website can identify a user's session. In web scraping, managing session cookies is essential for accessing pages behind login walls and for keeping an authenticated state across multiple requests.
How Session Cookies Work
1. You log in to a website (POST username/password)
2. The server creates a session and sends back a cookie (e.g., session_id=abc123)
3. Your browser sends this cookie with every subsequent request
4. The server recognizes you and serves authenticated content
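The exchange above can be sketched with Python's standard http.cookies module. This is only an illustration, assuming a single Set-Cookie header like the one in step 2; the helper name cookie_header_from_set_cookie is hypothetical, and an HTTP client such as requests normally does this work for you.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value: str) -> str:
    """Turn a Set-Cookie response header value into a Cookie request header value.

    Hypothetical helper for illustration: attributes such as Path or HttpOnly
    are instructions for the client and are not sent back to the server.
    """
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Step 2: the server answers the login with a Set-Cookie header.
set_cookie = "session_id=abc123; Path=/; HttpOnly"

# Step 3: the client echoes only the name=value pair on later requests.
print(cookie_header_from_set_cookie(set_cookie))  # session_id=abc123
```

Only the name and value travel back to the server, which is why copying a cookie between tools mostly comes down to copying those two fields plus the domain and path they belong to.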
Managing Sessions in Python
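import requests

# requests.Session() stores cookies from responses and sends them back automatically
session = requests.Session()

# Log in; the cookie set by the server is kept in session.cookies
response = session.post("https://example.com/login", data={
    "username": "user@example.com",
    "password": "secret123",
})
response.raise_for_status()  # stop early on an HTTP error (4xx/5xx)

# Now all requests include the session cookie
profile = session.get("https://example.com/dashboard")
orders = session.get("https://example.com/orders")
# Both requests are authenticated, provided the login succeeded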
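To keep a login across separate runs of a scraper, the cookies held by the session can be written to disk and loaded again later. The following is a minimal sketch under a few assumptions: the function names save_cookies and load_cookies and the JSON layout are hypothetical, and only name, value, domain, and path are stored, so expiry information is lost.

```python
import json
from pathlib import Path

import requests


def save_cookies(session: requests.Session, path: str) -> None:
    """Write every cookie in the session's jar to a JSON file."""
    records = [
        {"name": c.name, "value": c.value, "domain": c.domain, "path": c.path}
        for c in session.cookies  # iterating the jar yields Cookie objects
    ]
    Path(path).write_text(json.dumps(records))


def load_cookies(session: requests.Session, path: str) -> None:
    """Restore cookies written by save_cookies into the session's jar."""
    for rec in json.loads(Path(path).read_text()):
        session.cookies.set(
            rec["name"], rec["value"], domain=rec["domain"], path=rec["path"]
        )
```

A typical flow would call save_cookies(session, "cookies.json") right after a successful login and load_cookies(new_session, "cookies.json") at the start of the next run. The server may still have expired the session in the meantime, so the scraper should be ready to log in again.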
With Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context()
    page = context.new_page()

    # Log in through the site's login form
    page.goto("https://example.com/login")
    page.fill("#email", "user@example.com")
    page.fill("#password", "secret123")
    page.click("button[type=submit]")
    page.wait_for_url("**/dashboard")

    # Save cookies for later use
    cookies = context.cookies()

    # Reuse cookies in a new browser context
    new_context = browser.new_context()
    new_context.add_cookies(cookies)

    browser.close()
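The cookies returned by context.cookies() are plain dictionaries, so they can also be kept on disk between runs. Below is a small sketch, assuming a JSON file is an acceptable store; the helper names save_context_cookies and restore_context_cookies are hypothetical, and the functions only rely on the cookies() and add_cookies() methods already used above.

```python
import json
from pathlib import Path


def save_context_cookies(context, path: str) -> None:
    """Dump the cookies of a Playwright browser context to a JSON file."""
    Path(path).write_text(json.dumps(context.cookies()))


def restore_context_cookies(context, path: str) -> None:
    """Load cookies from the JSON file into a (usually new) browser context."""
    context.add_cookies(json.loads(Path(path).read_text()))
```

In practice, save_context_cookies(context, "cookies.json") would run after the login completes, and restore_context_cookies(new_context, "cookies.json") before the first page.goto() of a later run. Playwright also offers context.storage_state(path=...) together with browser.new_context(storage_state=...), which stores local storage as well as cookies and may be the better fit when a site keeps tokens outside of cookies.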
Common Cookie Challenges
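- CSRF tokens: Some sites require a CSRF token, usually embedded in the login form or sent in a header, alongside the session cookie
- Cookie expiration: Sessions expire, so detect logged-out responses and re-authenticate
- Secure/HttpOnly flags: Secure cookies are sent only over HTTPS, and HttpOnly cookies can't be read by JavaScript running in the page
- Multiple cookies: Many sites use several cookies together, so keep the whole set rather than a single cookie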
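For the CSRF case, one common pattern is to fetch the login page first, read the token from a hidden form field, and send it back with the credentials. The sketch below uses only the standard library's html.parser; the field name csrf_token, the sample HTML, and the helper names are assumptions for illustration, since each site names and places its token differently.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html: str) -> dict:
    """Return all hidden form fields found in the given HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Hypothetical login page; a real one would come from session.get(login_url).text
login_page = (
    '<form method="post">'
    '<input type="hidden" name="csrf_token" value="tok-123">'
    '<input name="username"><input name="password" type="password">'
    "</form>"
)

# Merge the hidden fields into the login payload before posting it
payload = {
    **extract_hidden_fields(login_page),
    "username": "user@example.com",
    "password": "secret123",
}
```

The page fetch and the login POST should go through the same session, because the token is typically validated against the session cookie that was issued with the form.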
Anti-Bot and Cookies
Anti-bot systems like Cloudflare set their own cookies (e.g., cf_clearance). These cookies prove you passed their challenge. You need to:
1. Pass the challenge (in a headless browser)
2. Extract the cookies
3. Include them in subsequent requests
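The last two steps can be sketched as copying the cookies collected by the browser into a requests session. This is a rough outline rather than a guaranteed bypass: the helper name session_from_browser_cookies is hypothetical, the cookie dictionaries are assumed to have the shape returned by Playwright's context.cookies(), and reusing the browser's User-Agent is based on the assumption that clearance cookies are often tied to the client that earned them (commonly its User-Agent and IP address).

```python
import requests


def session_from_browser_cookies(cookies: list, user_agent: str) -> requests.Session:
    """Build a requests session that carries cookies collected in a browser.

    `cookies` is a list of dicts with at least "name", "value", and "domain",
    as returned by Playwright's context.cookies().
    """
    session = requests.Session()
    # Assumption: the site expects the same User-Agent that passed the challenge
    session.headers["User-Agent"] = user_agent
    for c in cookies:
        session.cookies.set(
            c["name"], c["value"], domain=c["domain"], path=c.get("path", "/")
        )
    return session


# Example wiring (not executed here):
#   cookies = context.cookies()
#   user_agent = page.evaluate("navigator.userAgent")
#   session = session_from_browser_cookies(cookies, user_agent)
#   response = session.get("https://example.com/orders")
```

If the site starts returning the challenge page again, the cookie has most likely expired or been rejected, and the browser step needs to be repeated.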