How to Bypass Anti-Bot Detection: Cloudflare, DataDome & More
If you've tried scraping anything beyond simple blogs, you've run into anti-bot detection. The 403 Forbidden response, the Cloudflare challenge page, the page that loads fine in your browser but comes back empty in your script.
This guide covers how modern anti-bot systems work and how to get past them.
How Anti-Bot Systems Detect Scrapers
Anti-bot systems combine multiple signals to tell humans apart from bots. Here's what they look at.
1. TLS Fingerprinting
This is the most important detection method in 2026, and most tutorials still don't cover it.
When your scraper connects to a website via HTTPS, it performs a TLS handshake. The specific ciphers, extensions, and parameters your client sends create a unique "fingerprint." Python's requests library has a TLS fingerprint that looks nothing like a real browser.
The fix: use curl_cffi, which impersonates real browser TLS fingerprints:
from curl_cffi import requests

# Impersonate Chrome 120
response = requests.get(
    "https://protected-site.com",
    impersonate="chrome120",
)
print(response.status_code)  # 200!
This one change gets you past roughly 90% of anti-bot systems. Most of them lean on TLS fingerprinting as the primary check.
2. HTTP/2 Fingerprinting
Similar to TLS fingerprinting, but at the HTTP protocol level. Anti-bot systems analyze your HTTP/2 settings frames, header order, and priority signals.
Standard Python HTTP libraries send HTTP/1.1 by default, or send HTTP/2 with settings that scream "I'm a bot."
curl_cffi handles this too. It sends HTTP/2 frames that match real browsers exactly.
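You can verify what fingerprint you're actually presenting before pointing a scraper at a target. Here's a minimal sketch using tls.peet.ws, a third-party fingerprint echo service; the endpoint and its JSON layout are assumptions that may change, so inspect the raw response if the keys come back empty:

import json

from curl_cffi import requests

# tls.peet.ws echoes back the TLS and HTTP/2 fingerprint it observes.
# The JSON keys below are assumptions about the service's current output;
# print resp.json() yourself if they are missing.
resp = requests.get("https://tls.peet.ws/api/all", impersonate="chrome120")
data = resp.json()
print(data.get("tls", {}).get("ja3"))                   # TLS fingerprint
print(data.get("http2", {}).get("akamai_fingerprint"))  # HTTP/2 fingerprint

Run it once with impersonation and once with a plain HTTP client, and compare the output to a real browser visiting the same page.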
3. JavaScript Challenges
Cloudflare Turnstile, DataDome's JS challenge, and similar systems require your client to execute JavaScript. They inject a script that:
1. Checks browser APIs (canvas, WebGL, fonts)
2. Measures mouse movement and timing
3. Generates a token that must be sent with subsequent requests
For complex challenges, use a stealth browser:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
        viewport={"width": 1920, "height": 1080},
    )
    page = context.new_page()

    # Navigate and let the challenge complete
    page.goto("https://protected-site.com")
    page.wait_for_load_state("networkidle")

    # Now extract your data
    content = page.content()
4. Behavioral Analysis
Advanced systems track:
- Request patterns: do you hit pages in an order no human would?
- Timing: are your requests suspiciously evenly spaced?
- Session behavior: do you visit the homepage before hitting product pages?
The simplest countermeasure for the timing signal is randomized delays:

import random
import time

def human_delay():
    """Random delay between 1-4 seconds"""
    time.sleep(random.uniform(1.0, 4.0))

# Between each request
human_delay()
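Randomized delays don't fix navigation order, though. Here's a sketch of a warm-up flow that visits the homepage before the pages you actually want; human_delay is the helper defined above, and the URLs and category path are hypothetical:

from curl_cffi import requests

session = requests.Session(impersonate="chrome120")

def warm_up(base_url):
    """Visit the homepage first so the session history looks human."""
    session.get(base_url)  # land on the homepage like a real visitor
    human_delay()
    session.get(f"{base_url}/products")  # hypothetical category page
    human_delay()

warm_up("https://protected-site.com")
response = session.get("https://protected-site.com/product/123")  # hypothetical target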
Anti-Bot Systems: A Quick Reference
Cloudflare
Detects via TLS fingerprinting, JavaScript challenges (Turnstile), rate limiting, and IP reputation. Difficulty: medium to hard.
curl_cffi with impersonate="chrome120" handles most Cloudflare-protected sites. For Turnstile challenges, you may need a headless browser for the initial challenge, then reuse the cf_clearance cookie.
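Here's a sketch of that handoff: solve the challenge once in a browser, lift the cf_clearance cookie, then continue with fast curl_cffi requests. The wait condition is site-dependent, and the clearance cookie is typically tied to your IP and user agent, so keep both consistent between the two clients:

from curl_cffi import requests
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://protected-site.com")
    page.wait_for_load_state("networkidle")  # let the challenge finish
    cookies = {c["name"]: c["value"] for c in page.context.cookies()}
    browser.close()

# Reuse the clearance token for lightweight requests
session = requests.Session(impersonate="chrome120")
response = session.get(
    "https://protected-site.com",
    cookies={"cf_clearance": cookies.get("cf_clearance", "")},
)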
DataDome
Detects via TLS fingerprinting, JavaScript fingerprinting, behavioral analysis, and device fingerprinting. Difficulty: hard.
Residential proxies + curl_cffi + proper headers. DataDome is aggressive about flagging datacenter IPs.
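A minimal sketch of routing curl_cffi through a residential proxy; the proxy URL and credentials are placeholders for whatever your provider issues:

from curl_cffi import requests

# Placeholder residential proxy endpoint
proxy = "http://user:pass@residential.example-provider.com:8000"

session = requests.Session(impersonate="chrome120")
response = session.get(
    "https://protected-site.com",
    proxies={"http": proxy, "https": proxy},
)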
PerimeterX (HUMAN)
Detects via JavaScript challenges, behavioral analysis, and sensor data collection. Difficulty: hard.
Stealth Playwright with realistic mouse movements. PerimeterX leans on behavioral signals more than most.
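Here's a sketch of feeding the behavioral sensors some mouse activity in Playwright. The waypoints are arbitrary, and real evasion usually needs more varied, curved paths:

import random
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://protected-site.com")

    # steps= makes Playwright interpolate intermediate positions,
    # so the cursor glides instead of teleporting
    for _ in range(5):
        page.mouse.move(
            random.randint(100, 800),
            random.randint(100, 600),
            steps=25,
        )
        page.wait_for_timeout(random.randint(200, 800))

    page.mouse.wheel(0, 400)  # scroll down like a reader
    content = page.content()
    browser.close()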
Akamai Bot Manager
Detects via TLS fingerprinting, HTTP/2 fingerprinting, sensor data, and cookie validation. Difficulty: very hard.
Full browser automation with residential proxies. Akamai's sensor data collection is the most sophisticated of the bunch.
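A sketch of combining the two, with Playwright launched through a residential proxy. The server address and credentials are placeholders:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://residential.example-provider.com:8000",  # placeholder
            "username": "user",
            "password": "pass",
        },
    )
    page = browser.new_page()
    page.goto("https://protected-site.com")
    page.wait_for_load_state("networkidle")
    content = page.content()
    browser.close()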
Essential Headers for Every Request
No matter which anti-bot system you're dealing with, always send realistic headers:
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Ch-Ua": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"macOS"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
}
Missing or inconsistent headers are among the easiest ways to get flagged. In particular, the User-Agent must match the browser you impersonate at the TLS layer: a Chrome TLS fingerprint paired with a Firefox User-Agent string is an immediate giveaway.
The Proxy Factor
Even with perfect TLS impersonation and headers, you'll get blocked if you send too many requests from the same IP. Proxies are non-negotiable for serious scraping:
- Datacenter proxies are cheap and fast, but many sites block known datacenter IP ranges
- Residential proxies cost more, but use real ISP IPs that look like regular users
- Mobile proxies are the hardest to block since mobile IPs are shared among many users
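Here's a simple rotation sketch with curl_cffi; the pool entries are placeholders for your provider's endpoints:

import random
from curl_cffi import requests

# Placeholder proxy pool
PROXIES = [
    "http://user:pass@proxy1.example-provider.com:8000",
    "http://user:pass@proxy2.example-provider.com:8000",
    "http://user:pass@proxy3.example-provider.com:8000",
]

def get_with_rotation(url):
    """Pick a random proxy from the pool for each request."""
    proxy = random.choice(PROXIES)
    return requests.get(
        url,
        impersonate="chrome120",
        proxies={"http": proxy, "https": proxy},
    )

response = get_with_rotation("https://protected-site.com")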
Putting It All Together
Here's a template that combines all the techniques above:
from curl_cffi import requests
import random
import time

session = requests.Session(impersonate="chrome120")

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def scrape_page(url):
    # Human-like pause before every request
    time.sleep(random.uniform(1.5, 3.5))
    response = session.get(url, headers=headers)
    if response.status_code == 403:
        print(f"Blocked on {url} - try rotating proxy")
        return None
    return response.text
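Usage is a plain loop over your target URLs (the paths here are hypothetical):

urls = [
    "https://protected-site.com/product/1",  # hypothetical targets
    "https://protected-site.com/product/2",
]
for url in urls:
    html = scrape_page(url)
    if html:
        print(f"Got {len(html)} bytes from {url}")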
This gets you past most anti-bot systems. For the toughest sites, combine it with browser automation and residential proxies.
Want to Go Deeper?
Chapter 11 of the Master Web Scraping course covers anti-bot evasion with hands-on exercises against real protected sites, including TLS impersonation, cookie extraction workflows, and stealth browser setups.