What Is Proxy Rotation? IP Management for Web Scraping

advanced

Proxy rotation is the practice of distributing web scraping requests across multiple IP addresses by cycling through a pool of proxy servers. This prevents any single IP from being rate-limited or blocked.

Why Proxies Matter for Scraping

When you scrape from a single IP address, the target site sees hundreds or thousands of requests coming from the same source. This is trivially easy to detect. The site will rate-limit you, serve CAPTCHAs, or outright block your IP. Sometimes the block is temporary (hours), sometimes permanent.

Proxy rotation solves this by distributing your requests across many different IP addresses. To the target site, your traffic looks like it comes from hundreds of different users in different locations. No single IP makes enough requests to trigger alarms.
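
The arithmetic behind this can be sketched in a few lines. The `per_ip_limit` value below is an assumed threshold chosen for illustration; real sites do not publish theirs, so treat the numbers as a rough planning aid rather than a rule.

```python
import math

def requests_per_ip(total_requests: int, pool_size: int) -> float:
    """Average number of requests each IP sends when load is spread evenly."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return total_requests / pool_size

def min_pool_size(total_requests: int, per_ip_limit: int) -> int:
    """Smallest pool that keeps every IP at or under per_ip_limit requests."""
    if per_ip_limit <= 0:
        raise ValueError("per_ip_limit must be positive")
    return math.ceil(total_requests / per_ip_limit)

# 10,000 requests through 200 proxies is 50 requests per IP.
print(requests_per_ip(10_000, 200))   # 50.0
# Staying under an assumed limit of 30 requests per IP needs 334 proxies.
print(min_pool_size(10_000, 30))      # 334
```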

Even beyond anti-bot detection, proxies serve other purposes:

  • Geographic targeting: Access region-locked content by routing through IPs in specific countries
  • Redundancy: If one IP gets blocked, your scraper keeps working through others
  • Speed: Parallel requests through different proxies can increase throughput
  • Anonymity: Your real IP is never exposed to the target site

Types of Proxies

There are four main types of proxies used in scraping, each with different characteristics and price points.

Datacenter Proxies

These IPs come from data centers (cloud providers like AWS, GCP, DigitalOcean). They are fast, cheap, and available in bulk. The downside: they are easy to identify. Anti-bot systems maintain lists of known datacenter IP ranges. If a site uses any serious protection, datacenter proxies will fail.
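
To see why datacenter IPs are easy to flag, here is a minimal sketch of a range lookup using Python's standard `ipaddress` module. The two ranges are placeholders taken from the reserved documentation blocks, not real cloud ranges, and `looks_like_datacenter` is a hypothetical helper name; a real anti-bot system would load published provider ranges and combine this with other signals.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks (RFC 5737).
# A real detector would load published cloud-provider ranges instead.
KNOWN_DATACENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def looks_like_datacenter(ip: str) -> bool:
    """Return True if ip falls inside any listed datacenter range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in KNOWN_DATACENTER_RANGES)

print(looks_like_datacenter("192.0.2.44"))    # True
print(looks_like_datacenter("203.0.113.9"))   # False
```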

Residential Proxies

These IPs belong to real internet service providers and are assigned to real homes. When you route traffic through a residential proxy, it looks like a request from a regular household. They are much harder to detect but slower and more expensive than datacenter proxies.

Mobile Proxies

These use IP addresses assigned by mobile carriers (4G/5G). Mobile IPs are shared among many users via carrier-grade NAT, so blocking a mobile IP would block thousands of legitimate users. Sites are therefore reluctant to block them, which makes mobile proxies the hardest type to shut out. They are also the most expensive option.

ISP/Static Residential Proxies

A hybrid: datacenter-hosted IPs registered to ISPs. You get the speed of datacenter proxies with the trust score of residential IPs. Good for persistent sessions where you need the same IP over time.

Proxy Comparison Table

| Type | Cost per GB | Speed | Detection Risk | IP Pool Size | Best For |
|---|---|---|---|---|---|
| Datacenter | $0.50-$2 | Very fast (1-10 ms) | High | Huge (millions) | Unprotected sites, high volume |
| Residential | $5-$15 | Medium (50-200 ms) | Low | Large (millions) | Anti-bot protected sites |
| ISP/Static | $10-$25 | Fast (10-50 ms) | Very low | Small (thousands) | Login sessions, account-based |
| Mobile | $20-$50+ | Variable (100-500 ms) | Lowest | Medium | Hardest targets (Nike, Ticketmaster) |
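
The per-GB prices translate into a job budget with simple arithmetic. The sketch below uses the upper-end prices from the table; the page count and the 500 KB average page size are assumptions for illustration, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(pages: int, avg_page_kb: float, price_per_gb: float) -> float:
    """Estimated bandwidth cost in dollars, using 1 GB = 1,000,000 KB."""
    gigabytes = pages * avg_page_kb / 1_000_000
    return gigabytes * price_per_gb

# 100,000 pages at an assumed 500 KB each is 50 GB of traffic.
for name, price in [("datacenter", 2.0), ("residential", 15.0), ("mobile", 50.0)]:
    print(f"{name}: ${estimate_cost(100_000, 500, price):.2f}")
# datacenter: $100.00
# residential: $750.00
# mobile: $2500.00
```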

Implementing Rotation with Requests

Basic Random Rotation

python
import requests
import random

proxy_list = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
    "http://user:pass@proxy4.example.com:8080",
    "http://user:pass@proxy5.example.com:8080",
]

def get_with_proxy(url, max_retries=3):
    """Make a request through a random proxy with retry logic."""
    for attempt in range(max_retries):
        proxy = random.choice(proxy_list)
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=15,
            )
            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                print(f"Rate limited via {proxy}, retrying...")
                continue
        except (requests.exceptions.ProxyError, requests.exceptions.Timeout):
            print(f"Proxy failed: {proxy}")
            continue
    return None

Round-Robin with Health Tracking

python
import requests
from itertools import cycle
from collections import defaultdict

class ProxyRotator:
    def __init__(self, proxies):
        self.proxies = proxies
        self.proxy_cycle = cycle(proxies)
        self.failures = defaultdict(int)
        self.max_failures = 3  # Remove proxy after 3 consecutive failures

    def get_proxy(self):
        """Get next healthy proxy in rotation."""
        for _ in range(len(self.proxies)):
            proxy = next(self.proxy_cycle)
            if self.failures[proxy] < self.max_failures:
                return proxy
        raise Exception("All proxies exhausted")

    def mark_success(self, proxy):
        self.failures[proxy] = 0

    def mark_failure(self, proxy):
        self.failures[proxy] += 1

    def request(self, url):
        proxy = self.get_proxy()
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=15,
            )
            self.mark_success(proxy)
            return response
        except Exception:
            self.mark_failure(proxy)
            return self.request(url)  # Retry with next proxy

# Usage
rotator = ProxyRotator(proxy_list)
for url in urls:
    response = rotator.request(url)
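
The rotator above drops a proxy for good once it reaches the failure limit. One possible variation, sketched here as an assumption rather than part of the original design, benches a failing proxy for a cooldown period and then lets it back into rotation. `CooldownRotator`, the `cooldown` parameter, and the injectable `clock` are all hypothetical names introduced for this example.

```python
import time
from itertools import cycle

class CooldownRotator:
    """Round-robin rotation that benches a failing proxy for a cooldown period."""

    def __init__(self, proxies, max_failures=3, cooldown=300.0, clock=time.monotonic):
        self.proxies = list(proxies)
        self._cycle = cycle(self.proxies)
        self.max_failures = max_failures
        self.cooldown = cooldown          # seconds a benched proxy sits out
        self.clock = clock                # injectable so the logic can be tested
        self.failures = {p: 0 for p in self.proxies}
        self.benched_until = {}           # proxy -> time it may be used again

    def get_proxy(self):
        """Return the next proxy in rotation that is not cooling down."""
        now = self.clock()
        for _ in range(len(self.proxies)):
            proxy = next(self._cycle)
            until = self.benched_until.get(proxy)
            if until is None or until <= now:
                return proxy
        raise RuntimeError("All proxies are cooling down")

    def mark_success(self, proxy):
        self.failures[proxy] = 0

    def mark_failure(self, proxy):
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures:
            self.benched_until[proxy] = self.clock() + self.cooldown
            self.failures[proxy] = 0      # fresh start once the cooldown ends
```

The interface matches `get_proxy`, `mark_success`, and `mark_failure` from the class above, so it could stand in for the failure-count check without changing the request loop.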

Implementing Rotation with Scrapy Middleware

Scrapy makes proxy rotation clean through downloader middleware:

python
# middlewares.py
import random

class RotatingProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.failed_proxies = set()

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist("PROXY_LIST")
        return cls(proxy_list)

    def process_request(self, request, spider):
        available = [p for p in self.proxies if p not in self.failed_proxies]
        if not available:
            self.failed_proxies.clear()  # Reset and try again
            available = self.proxies
        request.meta["proxy"] = random.choice(available)

    def process_response(self, request, response, spider):
        if response.status in [403, 429, 503]:
            proxy = request.meta.get("proxy")
            spider.logger.warning(f"Proxy blocked: {proxy}")
            self.failed_proxies.add(proxy)
            # Retry with a different proxy
            return request.replace(dont_filter=True)
        return response

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get("proxy")
        self.failed_proxies.add(proxy)
        return request.replace(dont_filter=True)

python
# settings.py
PROXY_LIST = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotatingProxyMiddleware": 350,
}
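
As written, the middleware above reschedules every blocked response with no upper bound, so a URL that is blocked on every proxy would be retried indefinitely. A minimal sketch of a retry cap follows. It is written against a plain dict so it runs without Scrapy; `register_retry`, the `proxy_retry_times` key, and the cap of 5 are assumptions made for this example, not Scrapy settings.

```python
MAX_PROXY_RETRIES = 5  # assumed cap, tune per project

def register_retry(meta: dict, max_retries: int = MAX_PROXY_RETRIES) -> bool:
    """Count one more proxy retry in meta; return True if a retry is still allowed."""
    retries = meta.get("proxy_retry_times", 0) + 1
    if retries > max_retries:
        return False
    meta["proxy_retry_times"] = retries
    return True

# Inside process_response, the check could guard the retry, for example:
#     if register_retry(request.meta):
#         return request.replace(dont_filter=True)
#     return response  # give up and pass the blocked response on

meta = {}
print([register_retry(meta, max_retries=2) for _ in range(3)])  # [True, True, False]
```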

Backconnect Proxies vs. Rotating Lists

There are two models for proxy rotation:

  • Self-managed rotation: You buy a list of proxy IPs and rotate through them yourself (the examples above). You have full control but must handle health checking, rotation logic, and replacing dead proxies.
  • Backconnect (gateway) proxies: You connect to a single gateway URL, and the provider rotates IPs on the backend. Each request automatically uses a different IP from the provider's pool.

python
# Backconnect proxy - same URL, different IP each request
proxy = "http://user:pass@gate.provider.com:7777"

for url in urls:
    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )
    # Each request goes through a different exit IP automatically

Backconnect proxies are simpler to use and give you access to much larger IP pools (often millions of IPs). The tradeoff is less control over which specific IPs you use.

Testing Proxies and Handling Failures

Before using proxies in production, validate them:

python
import requests
import concurrent.futures

def test_proxy(proxy, timeout=10):
    """Test if a proxy is working and measure its speed."""
    try:
        response = requests.get(
            "https://httpbin.org/ip",
            proxies={"http": proxy, "https": proxy},
            timeout=timeout,
        )
        if response.status_code == 200:
            exit_ip = response.json()["origin"]
            elapsed = response.elapsed.total_seconds()
            return {"proxy": proxy, "ip": exit_ip, "speed": elapsed, "working": True}
        # Treat non-200 responses as failures instead of returning None
        return {"proxy": proxy, "error": f"HTTP {response.status_code}", "working": False}
    except Exception as e:
        return {"proxy": proxy, "error": str(e), "working": False}

# Test all proxies in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(test_proxy, proxy_list))

working = [r for r in results if r["working"]]
print(f"{len(working)}/{len(proxy_list)} proxies working")

# Sort by speed
working.sort(key=lambda x: x["speed"])
for r in working:
    print(f" {r['ip']} - {r['speed']:.2f}s")

Free vs. Paid Proxy Providers

Free proxies (from lists like free-proxy-list.net) are tempting but almost always a bad idea. They are slow, unreliable, short-lived (often dead within hours), and potentially dangerous (the proxy operator can see your traffic). Never send authentication credentials through free proxies.

Paid providers give you reliable, fast proxies with support and SLAs. The major providers for scraping:

| Provider | Starting Price | Proxy Types | Key Feature |
|---|---|---|---|
| Bright Data | $5.04/GB residential | All types | Largest IP pool (72M+) |
| Oxylabs | $8/GB residential | All types | Enterprise-grade, fast |
| Smartproxy | $4.50/GB residential | Residential, datacenter | Good value for mid-scale |
| ScraperAPI | $29/mo (250K requests) | Managed rotation | API-based, handles rotation for you |
| IPRoyal | $1.75/GB residential | Residential, datacenter | Budget option |

For most scraping projects, a residential proxy plan at $5-10/GB is the sweet spot between cost and detection avoidance.
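
Per-request plans and per-GB plans are hard to compare directly, so a break-even calculation can help. The sketch below takes the $29 for 250K requests figure and a $5/GB residential price from the text above. It assumes the full request allowance is used and ignores failed requests, so it is a rough comparison only; `break_even_page_kb` is a hypothetical helper.

```python
def break_even_page_kb(plan_price: float, included_requests: int,
                       price_per_gb: float) -> float:
    """Average page size (KB) at which both pricing models cost the same per request.

    Uses 1 GB = 1,000,000 KB.
    """
    cost_per_request = plan_price / included_requests
    return cost_per_request / price_per_gb * 1_000_000

size_kb = break_even_page_kb(plan_price=29.0, included_requests=250_000, price_per_gb=5.0)
print(f"Break-even page size: {size_kb:.1f} KB")  # Break-even page size: 23.2 KB
```

Under these assumptions, pages that average more than about 23 KB cost less per request on the flat plan, while smaller responses (such as compact JSON from an API) favor per-GB pricing.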

Sticky Sessions and When You Need Them

Sticky sessions keep the same IP address across multiple requests. This is essential for:

  • Login flows: The site expects all requests in a session to come from the same IP
  • Multi-step forms: Submitting forms that span multiple pages
  • Shopping carts: Adding items and checking out
  • Any stateful interaction: Where the server ties your session to your IP
python
import requests

# Sticky session with a backconnect proxy (provider-specific syntax)
# Most providers use a session ID in the username
proxy = "http://user-session-abc123:pass@gate.provider.com:7777"

session = requests.Session()
session.proxies = {"http": proxy, "https": proxy}

# All requests in this session use the same exit IP
session.get("https://example.com/login")
session.post("https://example.com/login", data={"user": "...", "pass": "..."})
session.get("https://example.com/dashboard")  # Same IP as login

Without sticky sessions, your login request might come from IP-A, but the dashboard request comes from IP-B. The server sees an unauthenticated request from IP-B and redirects you to login.
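
When you need several independent sticky sessions, for example one per account, each needs its own session ID. The helper below is a sketch that follows the `user-session-<id>` username pattern from the example above. That pattern is provider-specific, so check your provider's documentation; `sticky_proxy_url` is a hypothetical name.

```python
import uuid

def sticky_proxy_url(user, password, host, port, session_id=None):
    """Build a gateway URL whose username carries a session ID.

    The "user-session-<id>" format is an assumption copied from the example;
    real providers each define their own syntax.
    """
    if session_id is None:
        # A new ID typically maps to a new exit IP on the provider side
        session_id = uuid.uuid4().hex[:8]
    return f"http://{user}-session-{session_id}:{password}@{host}:{port}"

print(sticky_proxy_url("user", "pass", "gate.provider.com", 7777, "abc123"))
# http://user-session-abc123:pass@gate.provider.com:7777
```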

Common Proxy Mistakes

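Leaking your real IP: The requests library raises an error when a proxy fails rather than silently connecting directly, but careless retry logic can still send the request without a proxy (for example, a fallback call that omits the proxies argument). Always handle proxy errors explicitly and never let a request proceed without a proxy.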
python
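# Bad: the fallback retries without a proxy and exposes your real IP
try:
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
except requests.exceptions.ProxyError:
    response = requests.get(url, timeout=10)  # direct connection, real IP leaks

# Good: on failure, switch to another proxy and never retry without one
try:
    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
except requests.exceptions.ProxyError:
    # Switch proxy, do NOT retry without proxy
    response = None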

DNS leaks: Your DNS queries might bypass the proxy and reveal your real IP. Use the proxy for DNS resolution too. With SOCKS5 proxies, use socks5h:// (the 'h' means DNS resolution happens on the proxy side).

Using the same proxy too frequently: Even with a large pool, hammering one proxy will get that IP flagged. Distribute requests evenly.

Not matching proxy geography to target: If you scrape a US-only site through a German proxy, the site might block you or serve different content. Use proxies in the same region as your target.
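
For the DNS leak point, a small guard can normalize SOCKS5 URLs before they reach your HTTP client. This is a minimal sketch: `force_remote_dns` is a hypothetical helper, the example hosts and port 1080 are placeholders, and using SOCKS proxies with requests requires the optional SOCKS extra (`requests[socks]`).

```python
def force_remote_dns(proxy_url: str) -> str:
    """Rewrite socks5:// to socks5h:// so hostnames resolve on the proxy side.

    Other schemes (http, https, socks5h) are returned unchanged.
    """
    prefix = "socks5://"
    if proxy_url.startswith(prefix):
        return "socks5h://" + proxy_url[len(prefix):]
    return proxy_url

print(force_remote_dns("socks5://user:pass@proxy1.example.com:1080"))
# socks5h://user:pass@proxy1.example.com:1080
print(force_remote_dns("http://user:pass@proxy1.example.com:8080"))
# http://user:pass@proxy1.example.com:8080
```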

Cost Optimization Strategies

Proxy costs can add up fast. Here are practical ways to minimize your spend:

  1. Use datacenter proxies where possible: If the site does not have anti-bot protection, datacenter proxies at $0.50-2/GB save significant money compared to residential at $5-15/GB.
  2. Cache aggressively: Do not re-scrape pages you have already fetched. Save raw HTML during development so you are not burning proxy bandwidth while refining your parsing logic.
  3. Block unnecessary resources: When using Playwright with proxies, block images, CSS, and fonts. These can account for 70-80% of bandwidth.
  4. Target the API, not the page: Check the Network tab in DevTools. If the site loads data from an API, hitting the API endpoint directly uses a fraction of the bandwidth compared to loading the full page.
  5. Use conditional requests: Send If-Modified-Since or If-None-Match headers. If the content has not changed, the server returns a 304 with no body.
  6. Optimize request frequency: Scrape during off-peak hours when rate limits may be more lenient. Batch your scraping runs rather than running continuously.
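
The conditional-request tip can be sketched as a small header builder. The shape of the `cached` dict (keys `etag` and `last_modified`, saved from an earlier response's ETag and Last-Modified headers) is an assumption for this example, and `conditional_headers` is a hypothetical helper.

```python
def conditional_headers(cached: dict) -> dict:
    """Build If-None-Match / If-Modified-Since headers from cached response metadata."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers

cached = {"etag": '"abc123"', "last_modified": "Wed, 21 Oct 2015 07:28:00 GMT"}
print(conditional_headers(cached))

# With requests, the result would be passed as headers=... on the next fetch.
# A 304 status then means the cached body is still valid and nothing was re-downloaded.
```

Not every server honors these headers, so it is worth checking for a 304 on a known-unchanged page before relying on the savings.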

Real-World Rotation Pattern

This pattern combines proxy rotation, user agent rotation, retry logic, and health tracking into a production-ready scraper:

python
import requests
import random
import time
from collections import defaultdict

class ProductionScraper:
    def __init__(self, proxies, max_retries=3, base_delay=1.0):
        self.proxies = proxies
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.proxy_failures = defaultdict(int)
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        ]

    def _get_proxy(self):
        healthy = [p for p in self.proxies if self.proxy_failures[p] < 5]
        if not healthy:
            self.proxy_failures.clear()
            healthy = self.proxies
        return random.choice(healthy)

    def scrape(self, url):
        for attempt in range(self.max_retries):
            proxy = self._get_proxy()
            headers = {"User-Agent": random.choice(self.user_agents)}

            try:
                response = requests.get(
                    url,
                    proxies={"http": proxy, "https": proxy},
                    headers=headers,
                    timeout=15,
                )

                if response.status_code == 200:
                    self.proxy_failures[proxy] = 0
                    return response
                elif response.status_code == 429:
                    self.proxy_failures[proxy] += 1
                    wait = self.base_delay * (2 ** attempt)
                    time.sleep(wait)
                elif response.status_code == 403:
                    self.proxy_failures[proxy] += 2
                    continue

            except Exception:
                self.proxy_failures[proxy] += 1
                continue

        return None

# Usage
scraper = ProductionScraper(proxy_list)
for url in urls:
    response = scraper.scrape(url)
    if response:
        # parse response...
        pass
    time.sleep(random.uniform(0.5, 2.0))
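
The scraper above waits `base_delay * (2 ** attempt)` seconds after a 429. The sketch below isolates that schedule so the numbers are visible, and adds two things that are not in the original: an upper cap (`max_delay`) and optional random jitter so that parallel workers do not all retry at the same instant. Both additions and the `backoff_delay` name are assumptions for this example.

```python
import random

def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0,
                  jitter: bool = True, rng=random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    if jitter:
        # Pick a random point between 0 and the computed delay
        delay = rng.uniform(0, delay)
    return delay

# Without jitter, the schedule doubles each attempt: 1s, 2s, 4s, 8s.
print([backoff_delay(a, jitter=False) for a in range(4)])  # [1.0, 2.0, 4.0, 8.0]
```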

Next Steps

  1. Start without proxies. Many sites do not need them if you scrape politely with delays.
  2. If you get blocked, try datacenter proxies first (cheapest option).
  3. If datacenter proxies get detected, upgrade to residential.
  4. Use a backconnect gateway to simplify rotation logic.
  5. Track your proxy costs per scrape to find optimization opportunities.
  6. Look into proxy integration with Scrapy middleware for large-scale projects.
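
For the cost-tracking step, a small counter is often enough to start with. In this sketch, `ProxyCostTracker` is a hypothetical class and the $10/GB price is an assumed example. Response body size is only an approximation of billed bandwidth, since providers may also count headers and request traffic.

```python
class ProxyCostTracker:
    """Accumulate transferred bytes and convert them to an estimated cost."""

    def __init__(self, price_per_gb: float):
        self.price_per_gb = price_per_gb
        self.bytes_used = 0
        self.requests_made = 0

    def record(self, num_bytes: int) -> None:
        """Record one request, e.g. record(len(response.content))."""
        self.bytes_used += num_bytes
        self.requests_made += 1

    @property
    def total_cost(self) -> float:
        # 1 GB = 1,000,000,000 bytes
        return self.bytes_used / 1_000_000_000 * self.price_per_gb

    @property
    def cost_per_request(self) -> float:
        return self.total_cost / self.requests_made if self.requests_made else 0.0

tracker = ProxyCostTracker(price_per_gb=10.0)
for size in (400_000, 600_000):
    tracker.record(size)
print(f"${tracker.total_cost:.4f} total, ${tracker.cost_per_request:.4f} per request")
# $0.0100 total, $0.0050 per request
```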
