Async Web Scraping: Speed Up Python Scraping with asyncio
Async scraping uses asynchronous programming (Python's asyncio) to send multiple HTTP requests concurrently instead of waiting for each response before starting the next. For I/O-bound jobs, where most time is spent waiting on the network, this commonly speeds up scraping by 10-50x.
Why Async?
With synchronous scraping, you wait for each page to download before requesting the next:
Page 1 (2s) → Page 2 (2s) → Page 3 (2s) = 6 seconds total
With async scraping, you request all pages simultaneously:
Page 1 (2s) ↘
Page 2 (2s) → All done in ~2 seconds
Page 3 (2s) ↗
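The timing diagram above is easy to reproduce. A minimal sketch, using `asyncio.sleep` as a stand-in for real downloads: three 0.2-second "pages" finish in roughly 0.2 seconds total, not 0.6.

```python
import asyncio
import time

async def fake_download(page, delay):
    # Simulate an I/O-bound page download with a non-blocking sleep.
    await asyncio.sleep(delay)
    return f"page {page}"

async def main():
    start = time.perf_counter()
    # All three "downloads" run concurrently, so total time ≈ the slowest one.
    results = await asyncio.gather(*(fake_download(i, 0.2) for i in range(1, 4)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")  # finishes in ~0.2s, not 0.6s
```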
Basic Async Scraping
```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, url):
    # Download one page and extract its <h1> title.
    async with session.get(url) as response:
        html = await response.text()
        soup = BeautifulSoup(html, "lxml")
        title = soup.select_one("h1").text
        return {"url": url, "title": title}

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(1, 101)]
    # One shared session reuses TCP connections across all requests.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    for result in results:
        print(result)

asyncio.run(main())
```
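One caveat with `asyncio.gather()`: by default, the first exception propagates immediately and you lose the other results. When scraping hundreds of URLs, one dead page shouldn't sink the batch; `return_exceptions=True` collects errors alongside results instead. A minimal sketch with the network call simulated and hypothetical URLs:

```python
import asyncio

async def fetch(url):
    # Stand-in for a real request; pretend URLs containing "bad" fail.
    if "bad" in url:
        raise ValueError(f"failed: {url}")
    return {"url": url}

async def main():
    urls = ["https://example.com/ok", "https://example.com/bad"]
    # return_exceptions=True: exceptions are returned in the results list
    # instead of being raised, so successful pages survive.
    results = await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return ok, failed

ok, failed = asyncio.run(main())
```

You can then retry or log the `failed` list separately instead of restarting the whole run.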
Rate Limiting with Semaphore
Don't blast a server with 1,000 concurrent requests. Use a semaphore:
```python
semaphore = asyncio.Semaphore(10)  # max 10 concurrent

async def fetch_limited(session, url):
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()
```
When to Use Async Scraping
- Scraping hundreds or thousands of pages from the same site
- Pages are mostly I/O-bound (waiting for server responses)
- You need to finish faster without adding more machines
When NOT to Use Async
- Small scraping jobs (under 50 pages) — not worth the complexity
- CPU-bound processing (use multiprocessing instead)
- Sites with strict rate limiting (async won't help if you're limited to 1 req/sec)
Async vs. Multiprocessing vs. Threading
| Approach | Best For | Overhead |
|---|---|---|
| Async (asyncio) | I/O-bound (HTTP requests) | Low |
| Threading | I/O-bound (simpler API) | Medium |
| Multiprocessing | CPU-bound (data processing) | High |
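To illustrate the threading row: `concurrent.futures.ThreadPoolExecutor` gets you concurrent I/O without restructuring code around `async`/`await`. A sketch with `time.sleep` standing in for a blocking `requests.get` call (real network I/O releases the GIL the same way):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a blocking requests.get(url) call.
    time.sleep(0.1)
    return {"url": url}

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
start = time.perf_counter()
# Five worker threads wait on I/O in parallel.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start
print(len(results), f"{elapsed:.2f}s")  # 5 pages in ~0.1s, not 0.5s
```

The trade-off in the table: each thread carries more overhead than a coroutine, so asyncio scales better into the thousands of concurrent requests.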