What Is Scrapy? Python Web Crawling Framework Explained
Scrapy is an open-source Python framework designed for web crawling and scraping at scale. It provides built-in support for following links, handling retries, managing concurrency, and exporting data through pipelines.
How Scrapy Works
Unlike BeautifulSoup (a parser library) or Playwright (a browser automation tool), Scrapy is a complete framework. It manages the entire scraping pipeline: scheduling requests, downloading pages, parsing responses, processing data, and exporting results. You define what to scrape and how. Scrapy handles concurrency, retries, rate limiting, and data flow automatically.
Scrapy is event-driven and built on Twisted, an asynchronous networking library, so it can keep many requests in flight from a single thread instead of spawning one thread per request. With the default limit of 16 concurrent requests, a single spider can crawl thousands of pages in minutes when the target site responds quickly.
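To make the "many requests, one thread" idea concrete, here is a minimal sketch using Python's asyncio rather than Twisted. It is an analogy only, not Scrapy's implementation: fake_fetch, crawl, and run_demo are hypothetical names, and the network call is simulated with a short sleep.

```python
import asyncio


async def fake_fetch(url, semaphore, stats):
    """Simulate one download; the semaphore caps how many run at once."""
    async with semaphore:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stands in for network latency
        stats["in_flight"] -= 1
        return f"<html>{url}</html>"


async def crawl(urls, limit=16):
    """Fetch all URLs concurrently on one thread, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    pages = await asyncio.gather(*(fake_fetch(u, semaphore, stats) for u in urls))
    return pages, stats["peak"]


def run_demo(n=50, limit=16):
    urls = [f"https://example.com/page/{i}" for i in range(n)]
    return asyncio.run(crawl(urls, limit))


if __name__ == "__main__":
    pages, peak = run_demo()
    print(f"fetched {len(pages)} pages, peak concurrency {peak}")
```

The limit plays the role that CONCURRENT_REQUESTS plays in Scrapy: work overlaps while each request waits on the network, but never beyond the configured cap.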
Architecture Deep Dive
Scrapy's architecture has six core components that work together in a pipeline:
1. Engine: The central coordinator. Manages data flow between all components.
2. Scheduler: Receives requests from the engine and queues them. Handles deduplication so you do not crawl the same URL twice.
3. Downloader: Fetches pages from the web. Handles HTTP connections, redirects, and cookies.
4. Spiders: Your code. Defines how to parse responses and what data to extract.
5. Item Pipelines: Process extracted items. Clean data, validate fields, remove duplicates, and save to storage.
6. Middlewares: Hooks into the request/response cycle. Two types: Downloader Middlewares (modify requests/responses) and Spider Middlewares (modify spider input/output).
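The way these components hand work to each other can be sketched in plain Python. This is a toy model under stated assumptions, not Scrapy's code: FAKE_SITE stands in for the web, and Scheduler, download, spider_parse, pipeline, and engine are hypothetical names chosen to mirror the list above. Middlewares are left out to keep it short.

```python
from collections import deque

# A tiny in-memory "website": each path has a title and outgoing links.
FAKE_SITE = {
    "/": {"links": ["/a", "/b"], "title": "Home"},
    "/a": {"links": ["/b", "/"], "title": "Page A"},
    "/b": {"links": ["/a"], "title": "Page B"},
}


class Scheduler:
    """Queues requests and drops URLs that were already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def download(url):
    """Downloader: turn a request into a response."""
    return {"url": url, **FAKE_SITE[url]}


def spider_parse(response):
    """Spider: yield one item (dict) and any follow-up requests (str)."""
    yield {"url": response["url"], "title": response["title"]}
    for link in response["links"]:
        yield link


def pipeline(item):
    """Item pipeline: post-process each extracted item."""
    item["title"] = item["title"].upper()
    return item


def engine(start_url):
    """Engine: move data between scheduler, downloader, spider, and pipeline."""
    scheduler = Scheduler()
    scheduler.enqueue(start_url)
    items, fetched = [], []
    while (url := scheduler.next_request()) is not None:
        response = download(url)
        fetched.append(url)
        for result in spider_parse(response):
            if isinstance(result, dict):
                items.append(pipeline(result))
            else:
                scheduler.enqueue(result)
    return items, fetched
```

Even though the pages link to each other in a cycle, each URL is fetched once because the scheduler filters duplicates before queuing.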
Creating a Project and First Spider
# Create a new Scrapy project
# Run in terminal:
# scrapy startproject myproject
# cd myproject
# scrapy genspider products example.com
This creates the following structure:
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            products.py
A basic spider:
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product-card"):
            yield {
                "name": product.css(".title::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
            }

        # Follow pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
Run it:
# Run from terminal:
# scrapy crawl products -o products.json
Selectors: CSS and XPath
Scrapy supports both CSS selectors and XPath. You can use whichever you prefer, or mix them.
# CSS selectors
response.css(".product-card") # All matching elements
response.css(".title::text").get() # Text content of first match
response.css(".title::text").getall() # Text content of all matches
response.css("a::attr(href)").get() # Attribute value
response.css(".price::text").re_first(r"[\d.]+") # Regex on text
# XPath selectors
response.xpath("//div[@class='product-card']")
response.xpath(".//h2/text()").get()
response.xpath("//a/@href").getall()
response.xpath("//span[contains(@class, 'price')]/text()").get()
CSS selectors are generally easier to read. XPath is more powerful for complex queries (like selecting by text content or navigating up the tree). Use CSS as your default and switch to XPath when CSS cannot express what you need.
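The two cases named above, selecting by text content and navigating up the tree, can be illustrated with the limited XPath subset in Python's standard library. This is only a stand-in: Scrapy's selectors support full XPath 1.0, whereas xml.etree.ElementTree needs well-formed markup and a narrower syntax. The HTML snippet and the helper names link_by_text and card_for_product are made up for the example.

```python
import xml.etree.ElementTree as ET

HTML = """
<div>
  <div class="product-card"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="product-card"><h2>Gadget</h2><span class="price">19.50</span></div>
  <a href="/page/2">Next</a>
  <a href="/help">Help</a>
</div>
"""

root = ET.fromstring(HTML)


def link_by_text(tree, text):
    """Select a link by its visible text, which CSS selectors cannot express."""
    link = tree.find(f".//a[.='{text}']")
    return link.get("href") if link is not None else None


def card_for_product(tree, name):
    """Find the heading with the given text, then step up to its parent card."""
    return tree.find(f".//h2[.='{name}']/..")


if __name__ == "__main__":
    print(link_by_text(root, "Next"))
    print(card_for_product(root, "Gadget").find("span").text)
```

In a spider, the same ideas would be written with response.xpath(), for example matching on text() or using the parent axis.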
Following Links and Pagination
Scrapy makes it easy to follow links and handle pagination:
import scrapy


# Named FullCrawlSpider to avoid shadowing Scrapy's built-in CrawlSpider class
class FullCrawlSpider(scrapy.Spider):
    name = "full_crawl"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Extract data from current page
        for card in response.css(".product-card"):
            # Follow link to detail page
            detail_url = card.css("a::attr(href)").get()
            if detail_url:
                yield response.follow(
                    detail_url,
                    callback=self.parse_detail,
                    meta={"name": card.css(".title::text").get()},
                )

        # Follow pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_detail(self, response):
        yield {
            "name": response.meta["name"],
            "description": response.css(".description::text").get(),
            "specs": response.css(".spec-table td::text").getall(),
            "url": response.url,
        }
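The response.follow() method resolves relative URLs against the current page automatically. The meta parameter lets you pass data between callbacks; Scrapy 1.7 and later also offer cb_kwargs, which delivers the values to the callback as keyword arguments.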
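What "handles relative URLs" means in practice can be shown with urljoin from the standard library, which applies the same resolution rules. This is a sketch of the behaviour, not Scrapy's code path; resolve_links is a hypothetical helper and the URLs are examples.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Resolve each href the way a browser would, relative to the page URL."""
    return [urljoin(page_url, href) for href in hrefs]


if __name__ == "__main__":
    page = "https://example.com/products/list?page=1"
    for url in resolve_links(page, ["/item/42", "detail/7", "?page=2", "../sale"]):
        print(url)
```

Root-relative, path-relative, query-only, and parent-directory links all resolve against the page they were found on, so a spider can pass raw href values straight to response.follow().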
Item Pipelines
Pipelines process items after extraction. Common uses: data cleaning, validation, deduplication, and storage.
# pipelines.py
import re
import sqlite3

from scrapy.exceptions import DropItem


class CleanPricePipeline:
    """Remove currency symbols and convert price to float."""

    def process_item(self, item, spider):
        if "price" in item and item["price"]:
            clean = re.sub(r"[^\d.]", "", item["price"])
            item["price"] = float(clean) if clean else 0.0
        return item


class DuplicatesPipeline:
    """Drop duplicate items based on URL."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url", "")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate: {url}")
        self.seen_urls.add(url)
        return item


class SaveToDBPipeline:
    """Save items to a SQLite database."""

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection, create the table
        self.conn = sqlite3.connect("products.db")
        self.cursor = self.conn.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products
            (name TEXT, price REAL, url TEXT UNIQUE)
        """)

    def process_item(self, item, spider):
        # UNIQUE on url plus INSERT OR IGNORE skips rows that are already stored
        self.cursor.execute(
            "INSERT OR IGNORE INTO products VALUES (?, ?, ?)",
            (item["name"], item["price"], item["url"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
Enable pipelines in settings.py:
# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.CleanPricePipeline": 100,
    "myproject.pipelines.DuplicatesPipeline": 200,
    "myproject.pipelines.SaveToDBPipeline": 300,
}
# Lower number = higher priority (runs first)
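The ordering rule can be sketched without Scrapy: sort the stages by their number, pass each item through them in turn, and stop processing an item as soon as a stage drops it. Everything here is illustrative. DropItem is a local stand-in for scrapy.exceptions.DropItem, the stages are plain functions rather than pipeline classes, and run_pipelines is a hypothetical helper.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


def clean_price(item):
    """Keep only digits and dots, then convert the price to a float."""
    digits = "".join(ch for ch in item["price"] if ch.isdigit() or ch == ".")
    item["price"] = float(digits) if digits else 0.0
    return item


def make_dedupe():
    """Build a stage that drops items whose URL was already seen."""
    seen = set()

    def dedupe(item):
        if item["url"] in seen:
            raise DropItem(f"Duplicate: {item['url']}")
        seen.add(item["url"])
        return item

    return dedupe


def run_pipelines(items, stages):
    """Pass each item through the stages in ascending order of their number."""
    ordered = [stage for stage, _ in sorted(stages.items(), key=lambda kv: kv[1])]
    kept, dropped = [], []
    for item in items:
        try:
            for stage in ordered:
                item = stage(item)
            kept.append(item)
        except DropItem as exc:
            dropped.append(str(exc))  # later stages never see a dropped item
    return kept, dropped


# Declared out of order on purpose: the numbers decide the order, not the dict.
STAGES = {make_dedupe(): 200, clean_price: 100}
ITEMS = [
    {"url": "/p/1", "price": "$19.99"},
    {"url": "/p/1", "price": "$19.99"},
    {"url": "/p/2", "price": "€5"},
]
```

Placing the duplicate filter before the database stage, as the settings above do, means dropped items never reach storage.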
Middleware for Proxies and User Agents
Downloader middlewares let you modify every request before it is sent. This is where you add proxy rotation and user agent randomization.
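# middlewares.py
import random


class RandomUserAgentMiddleware:
    # Abbreviated example strings; use complete, current User-Agent values in practice
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)


class RotatingProxyMiddleware:
    # Illustrative cap so a dead target cannot trigger endless retries
    max_proxy_retries = 3

    def __init__(self):
        self.proxies = [
            "http://user:pass@proxy1.example.com:8080",
            "http://user:pass@proxy2.example.com:8080",
            "http://user:pass@proxy3.example.com:8080",
        ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)

    def process_exception(self, request, exception, spider):
        # Retry with a randomly chosen proxy on failure, up to the cap
        retries = request.meta.get("proxy_retries", 0)
        if retries >= self.max_proxy_retries:
            return None  # give up; let other middlewares handle the exception
        # dont_filter=True keeps the duplicate filter from discarding the retry
        retry = request.replace(dont_filter=True)
        retry.meta["proxy"] = random.choice(self.proxies)
        retry.meta["proxy_retries"] = retries + 1
        return retry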
Enable in settings.py:
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "myproject.middlewares.RotatingProxyMiddleware": 350,
}
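The numbers decide where each middleware sits between the engine and the downloader: process_request hooks run in ascending order on the way out, and process_response hooks run in descending order on the way back. The sketch below models only that ordering. TracingMiddleware, run_chain, fake_download, and demo are hypothetical names, requests are plain dicts, and real middlewares can also short-circuit the chain by returning a response or a new request, which is omitted here.

```python
class TracingMiddleware:
    """Records when its hooks are called and tags the request with a header."""

    def __init__(self, name, order, log):
        self.name, self.order, self.log = name, order, log

    def process_request(self, request):
        self.log.append(f"request:{self.name}")
        request["headers"][f"X-{self.name}"] = "1"

    def process_response(self, request, response):
        self.log.append(f"response:{self.name}")


def fake_download(request):
    """Stand-in for the downloader."""
    return {"status": 200, "url": request["url"]}


def run_chain(middlewares, request, download):
    ordered = sorted(middlewares, key=lambda mw: mw.order)
    for mw in ordered:  # ascending order toward the downloader
        mw.process_request(request)
    response = download(request)
    for mw in reversed(ordered):  # descending order back toward the engine
        mw.process_response(request, response)
    return response


def demo():
    log = []
    middlewares = [
        TracingMiddleware("UserAgent", 400, log),
        TracingMiddleware("Proxy", 350, log),
    ]
    request = {"url": "https://example.com", "headers": {}}
    response = run_chain(middlewares, request, fake_download)
    return log, request, response
```

With the values from the settings above, the proxy middleware (350) sees each outgoing request before the user agent middleware (400).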
Settings for Concurrency, Delays, and Retries
Scrapy's settings control how aggressively the spider crawls. These are the most important ones:
# settings.py
# Concurrency
CONCURRENT_REQUESTS = 16 # Total concurrent requests (default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 8 # Max concurrent requests per domain
CONCURRENT_REQUESTS_PER_IP = 0 # Max per IP (0 = no limit)
# Delays
DOWNLOAD_DELAY = 1.0 # Seconds between requests to same domain
RANDOMIZE_DOWNLOAD_DELAY = True # Randomize delay (0.5x to 1.5x)
# Retries
RETRY_ENABLED = True
RETRY_TIMES = 3 # Max retries per request
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# Timeouts
DOWNLOAD_TIMEOUT = 30 # Seconds before timeout
# Respect robots.txt
ROBOTSTXT_OBEY = True
# Auto-throttle (adjusts speed based on server load)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Caching (saves responses for development)
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400 # 24 hours
The AUTOTHROTTLE settings are particularly useful. AutoThrottle measures response latency and raises the download delay when responses slow down (a sign the server is under load), then lowers it again as the server becomes responsive.
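A rough sketch of the adjustment rule, loosely following the algorithm described in Scrapy's AutoThrottle documentation: aim for a delay of latency divided by target concurrency, move toward it gradually, clamp the result, and never let an error response shorten the delay. The function names are hypothetical, the defaults mirror the example settings above, and the real extension may differ in detail between versions.

```python
def next_delay(current_delay, latency, status=200,
               target_concurrency=2.0, min_delay=1.0, max_delay=10.0):
    """Return the delay to use after observing one response."""
    target_delay = latency / target_concurrency
    # Move halfway toward the target, but never below the target itself
    new_delay = max(target_delay, (current_delay + target_delay) / 2.0)
    # Clamp between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses often return quickly; do not let them speed the crawl up
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


def simulate(latencies, start_delay=1.0):
    """Feed a series of latencies (seconds) through next_delay."""
    delays, delay = [], start_delay
    for latency in latencies:
        delay = next_delay(delay, latency)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print(simulate([4.0, 0.4, 40.0]))
```

A slow response pushes the delay up at once, while a fast one only pulls it down by half the gap, so the crawl backs off quickly and recovers cautiously.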
Exporting Data
Scrapy has built-in exporters for common formats:
# Command-line export
# scrapy crawl products -o products.json
# scrapy crawl products -o products.csv
# scrapy crawl products -o products.jsonl # JSON Lines (one JSON per line)
# Or configure in settings.py for automatic export
FEEDS = {
    "output/products.json": {
        "format": "json",
        "encoding": "utf-8",
        "overwrite": True,
    },
    "output/products.csv": {
        "format": "csv",
    },
}
For databases, use Item Pipelines (shown in the pipeline section above). Common targets: SQLite for local development, PostgreSQL for production, MongoDB for document-oriented data.
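The difference between the json and jsonlines formats matters when output from more than one run ends up in the same file. The sketch below uses only the standard json module to show it; export_json, export_jsonl, and read_jsonl are hypothetical helpers, not Scrapy exporters.

```python
import io
import json


def export_json(items):
    """One JSON array for the whole run."""
    return json.dumps(items, ensure_ascii=False)


def export_jsonl(items, stream):
    """One JSON object per line; safe to append to."""
    for item in items:
        stream.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Parse a JSON Lines stream one record at a time."""
    return [json.loads(line) for line in stream if line.strip()]


if __name__ == "__main__":
    run_1 = [{"name": "Widget", "price": 9.99}]
    run_2 = [{"name": "Gadget", "price": 19.5}]

    buffer = io.StringIO()
    export_jsonl(run_1, buffer)
    export_jsonl(run_2, buffer)  # appending a second run keeps the file valid
    buffer.seek(0)
    print(read_jsonl(buffer))

    try:
        json.loads(export_json(run_1) + export_json(run_2))
    except json.JSONDecodeError as exc:
        print("two appended JSON arrays do not parse:", exc.msg)
```

JSON Lines can also be read record by record, which helps with large crawls that would not fit comfortably in memory as a single array.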
scrapy-playwright Integration
For JavaScript-heavy sites, scrapy-playwright combines Scrapy's crawling framework with Playwright's browser rendering:
# pip install scrapy-playwright
# playwright install chromium
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"
# spider.py
import scrapy


class JSSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/spa-page",
            meta={"playwright": True, "playwright_include_page": True},
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        try:
            # You can interact with the page before parsing
            await page.wait_for_selector(".dynamic-content")
            # Re-read the DOM: response holds the HTML captured before this wait
            html = await page.content()
        finally:
            await page.close()  # always release the browser page

        # Extract from the rendered HTML
        for card in scrapy.Selector(text=html).css(".product-card"):
            yield {
                "name": card.css(".title::text").get(),
                "price": card.css(".price::text").get(),
            }
This gives you the best of both worlds: Scrapy's infrastructure (scheduling, pipelines, retries) with Playwright's JavaScript rendering.
Deployment Options
Scrapyd
Self-hosted daemon for running and managing spiders on your own server:
# pip install scrapyd scrapyd-client
# scrapyd (starts the daemon)
# scrapyd-deploy default -p myproject (deploys your project)
# curl http://localhost:6800/schedule.json -d project=myproject -d spider=products
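The curl call above is a form-encoded POST, so it can also be issued from Python. The sketch below only builds the request with the standard library, which keeps it runnable without a daemon; build_schedule_request is a hypothetical helper, and the host, port, project, and spider names are the ones from the commands above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(project, spider, base_url="http://localhost:6800"):
    """Build the same POST that the curl command sends to schedule.json."""
    body = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(f"{base_url}/schedule.json", data=body, method="POST")


if __name__ == "__main__":
    request = build_schedule_request("myproject", "products")
    print(request.get_method(), request.full_url, request.data)
```

Passing the result to urllib.request.urlopen() would send it, which only succeeds if a Scrapyd daemon is listening at that address.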
Scrapy Cloud (Zyte)
Managed platform by Scrapy's creators. Handles infrastructure, scheduling, and monitoring. Best for teams that do not want to manage servers.
Docker
Package your spider in a Docker container and run it anywhere:
# Dockerfile
# FROM python:3.11-slim
# WORKDIR /app
# COPY . .
# RUN pip install scrapy
# CMD ["scrapy", "crawl", "products", "-o", "output.json"]
Real-World Example: Crawling an E-Commerce Site
import re

import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader


def clean_price(value):
    """Extract the first number from a price string such as "$19.99"."""
    match = re.search(r"[\d.]+", value)
    return float(match.group()) if match else 0.0


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    category = scrapy.Field()
    rating = scrapy.Field()
    url = scrapy.Field()


class EcommerceSpider(scrapy.Spider):
    name = "ecommerce"
    start_urls = ["https://example.com/categories"]
    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 1.5,
        "FEEDS": {"products.jsonl": {"format": "jsonlines"}},
    }

    def parse(self, response):
        """Parse category listing page."""
        for cat_link in response.css(".category-list a::attr(href)").getall():
            yield response.follow(cat_link, self.parse_category)

    def parse_category(self, response):
        """Parse products in a category."""
        category = response.css("h1::text").get(default="Unknown")
        for card in response.css(".product-card"):
            loader = ItemLoader(item=ProductItem(), selector=card)
            loader.default_output_processor = TakeFirst()
            loader.add_css("name", ".title::text")
            loader.add_css("price", ".price::text", MapCompose(clean_price))
            loader.add_value("category", category)
            loader.add_css("rating", ".stars::attr(data-rating)")
            # Store absolute URLs so items stay usable outside the crawl
            loader.add_css("url", "a::attr(href)", MapCompose(response.urljoin))
            yield loader.load_item()

        # Pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse_category)
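The two processors used in the loader are easier to reason about once their behaviour is spelled out. The following is a simplified imitation in plain Python, not the itemloaders implementation: map_compose and take_first are hypothetical stand-ins for MapCompose and TakeFirst, and details such as flattening of iterable results are left out.

```python
import re


def clean_price(value):
    """Extract the first number from a price string."""
    match = re.search(r"[\d.]+", value)
    return float(match.group()) if match else 0.0


def map_compose(*functions):
    """Apply each function to every value in turn, discarding None results."""

    def process(values):
        for function in functions:
            results = (function(value) for value in values)
            values = [result for result in results if result is not None]
        return values

    return process


def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


if __name__ == "__main__":
    prices = map_compose(str.strip, clean_price)(["  $19.99 ", "call us"])
    print(prices, take_first(prices))
```

In the spider above, the input processor cleans every matched value, and TakeFirst as the default output processor collapses each field's list to a single value.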
When to Use Scrapy vs. Alternatives
| Scenario | Best Tool | Why |
|---|---|---|
| Crawling 10,000+ pages | Scrapy | Built-in concurrency, scheduling, pipelines |
| Simple one-off script | BeautifulSoup + requests | Less boilerplate, faster to write |
| JavaScript-rendered pages | Playwright (or scrapy-playwright) | Needs a browser engine |
| Login-protected site with JS | Playwright | Browser handles cookies and JS |
| Recurring production scraping | Scrapy | Built for deployment and monitoring |
| Quick data extraction | BeautifulSoup | Minimal setup |
Next Steps
1. Install Scrapy: pip install scrapy
2. Create a project: scrapy startproject myproject
3. Generate a spider: scrapy genspider products example.com
4. Run the Scrapy shell to test selectors: scrapy shell "https://example.com"
5. Build your first spider with pagination
6. Add an Item Pipeline to clean and store data
7. When you need JavaScript support, add scrapy-playwright