
What Is Scrapy? Python Web Crawling Framework Explained


Scrapy is an open-source Python framework designed for web crawling and scraping at scale. It provides built-in support for following links, handling retries, managing concurrency, and exporting data through pipelines.

How Scrapy Works

Unlike BeautifulSoup (a parser library) or Playwright (a browser automation tool), Scrapy is a complete framework. It manages the entire scraping pipeline: scheduling requests, downloading pages, parsing responses, processing data, and exporting results. You define what to scrape and how. Scrapy handles concurrency, retries, rate limiting, and data flow automatically.

Scrapy is event-driven and built on Twisted (an asynchronous networking library). This means it can handle many concurrent requests without threading. A single Scrapy spider can comfortably make 16+ simultaneous requests, crawling thousands of pages in minutes.
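
To make the event-driven idea concrete, here is a minimal sketch that uses Python's standard asyncio library instead of Twisted (an assumption, chosen so the example runs without extra installs). The `fake_fetch` coroutine, its 0.1 second delay, and the `run_demo` helper are hypothetical stand-ins for real network I/O; this is not how Scrapy itself is implemented.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network request: waits without blocking."""
    await asyncio.sleep(delay)
    return f"response from {url}"


async def crawl(urls, max_concurrency: int = 16):
    """Fetch all URLs on one thread, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return await fake_fetch(url)

    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


def run_demo():
    """Run 16 simulated requests and report how long they took in total."""
    urls = [f"https://example.com/page/{i}" for i in range(16)]
    started = time.perf_counter()
    responses = asyncio.run(crawl(urls))
    elapsed = time.perf_counter() - started
    return responses, elapsed


if __name__ == "__main__":
    responses, elapsed = run_demo()
    print(f"{len(responses)} responses in {elapsed:.2f}s")
```

Because the waits overlap, the 16 simulated requests should finish in roughly the time of one, not the 1.6 seconds they would take back to back.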

Architecture Deep Dive

Scrapy's architecture has six core components that work together in a pipeline:

  1. Engine: The central coordinator. Manages data flow between all components.
  2. Scheduler: Receives requests from the engine and queues them. Handles deduplication so you do not crawl the same URL twice.
  3. Downloader: Fetches pages from the web. Handles HTTP connections, redirects, and cookies.
  4. Spiders: Your code. Defines how to parse responses and what data to extract.
  5. Item Pipelines: Process extracted items. Clean data, validate fields, remove duplicates, and save to storage.
  6. Middlewares: Hooks into the request/response cycle. Two types: Downloader Middlewares (modify requests/responses) and Spider Middlewares (modify spider input/output).

The flow: Spider yields a Request > Engine sends it to Scheduler > Scheduler queues it > Engine asks Downloader to fetch it > Response goes through Downloader Middleware > Engine sends Response to Spider > Spider yields Items and new Requests > Items go through Item Pipeline > New Requests go back to Scheduler.
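
The flow above can be sketched as a toy loop in plain Python. Everything here is a hypothetical stand-in (`FAKE_SITE`, `download`, `spider_parse`, `pipeline`, `run_engine`); it only mirrors the order of the steps and the scheduler's deduplication, and leaves out middlewares and concurrency entirely.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (items on the page, links on the page)
FAKE_SITE = {
    "/page1": (["item-a", "item-b"], ["/page2", "/page1"]),
    "/page2": (["item-c"], ["/page1", "/page3"]),
    "/page3": ([], []),
}


def download(url):
    """Downloader stand-in: look the page up instead of doing HTTP."""
    return FAKE_SITE[url]


def spider_parse(response):
    """Spider stand-in: yield items first, then new requests (as URLs)."""
    items, links = response
    for item in items:
        yield ("item", item)
    for link in links:
        yield ("request", link)


def pipeline(item):
    """Item pipeline stand-in: normalise the item to upper case."""
    return item.upper()


def run_engine(start_url):
    """Engine stand-in: move requests and items between the components."""
    queue = deque([start_url])  # Scheduler queue
    seen = {start_url}          # Scheduler deduplication
    stored = []
    while queue:
        url = queue.popleft()          # Engine takes the next request
        response = download(url)       # Downloader fetches it
        for kind, value in spider_parse(response):
            if kind == "item":
                stored.append(pipeline(value))
            elif value not in seen:    # Duplicate requests are dropped
                seen.add(value)
                queue.append(value)
    return stored, seen
```

Calling `run_engine("/page1")` visits each of the three pages once, even though `/page1` is linked from two places.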

Creating a Project and First Spider

python
# Create a new Scrapy project
# Run in terminal:
# scrapy startproject myproject
# cd myproject
# scrapy genspider products example.com

This creates the following structure:

code
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            products.py

A basic spider:

python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product-card"):
            yield {
                "name": product.css(".title::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
            }

        # Follow pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it:

python
# Run from terminal:
# scrapy crawl products -o products.json

Selectors: CSS and XPath

Scrapy supports both CSS selectors and XPath. You can use whichever you prefer, or mix them.

python
# CSS selectors
response.css(".product-card")              # All matching elements
response.css(".title::text").get()          # Text content of first match
response.css(".title::text").getall()       # Text content of all matches
response.css("a::attr(href)").get()         # Attribute value
response.css(".price::text").re_first(r"[\d.]+")  # Regex on text

# XPath selectors
response.xpath("//div[@class='product-card']")
response.xpath(".//h2/text()").get()
response.xpath("//a/@href").getall()
response.xpath("//span[contains(@class, 'price')]/text()").get()

CSS selectors are generally easier to read. XPath is more powerful for complex queries (like selecting by text content or navigating up the tree). Use CSS as your default and switch to XPath when CSS cannot express what you need.

Following Links and Pagination

Scrapy makes it easy to follow links and handle pagination:

python
import scrapy

# Named FullCrawlSpider to avoid confusion with Scrapy's built-in CrawlSpider class
class FullCrawlSpider(scrapy.Spider):
    name = "full_crawl"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Extract data from current page
        for card in response.css(".product-card"):
            # Follow link to detail page
            detail_url = card.css("a::attr(href)").get()
            if detail_url:
                yield response.follow(
                    detail_url,
                    callback=self.parse_detail,
                    meta={"name": card.css(".title::text").get()},
                )

        # Follow pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_detail(self, response):
        yield {
            "name": response.meta["name"],
            "description": response.css(".description::text").get(),
            "specs": response.css(".spec-table td::text").getall(),
            "url": response.url,
        }

The response.follow() method handles relative URLs automatically. The meta parameter lets you pass data between callbacks.
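
To show what resolving a relative URL means, here is a small sketch using the standard library's `urljoin`, which applies the usual URL resolution rules. The `base` URL and the `resolve` helper are illustrative assumptions; in a spider, the base would normally be the URL of the response being parsed.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed
base = "https://example.com/products/page/2"


def resolve(href: str) -> str:
    """Resolve an href found on the page against the page's own URL."""
    return urljoin(base, href)


absolute_path = resolve("/item/42")       # Path starting at the site root
sibling_page = resolve("3")               # Relative to the current "directory"
query_only = resolve("?sort=price")       # Same page, new query string
other_site = resolve("https://other.example.com/x")  # Already absolute
```

Each kind of link, whether root-relative, sibling, query-only, or already absolute, ends up as a full URL that can be requested directly.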

Item Pipelines

Pipelines process items after extraction. Common uses: data cleaning, validation, deduplication, and storage.

python
# pipelines.py
import re
import sqlite3

from scrapy.exceptions import DropItem

class CleanPricePipeline:
    """Remove currency symbols and convert price to float."""

    def process_item(self, item, spider):
        if "price" in item and item["price"]:
            clean = re.sub(r"[^\d.]", "", item["price"])
            item["price"] = float(clean) if clean else 0.0
        return item

class DuplicatesPipeline:
    """Drop duplicate items based on URL."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url", "")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate: {url}")
        self.seen_urls.add(url)
        return item

class SaveToDBPipeline:
    """Save items to a SQLite database."""

    def open_spider(self, spider):
        # Called once when the spider starts
        self.conn = sqlite3.connect("products.db")
        self.cursor = self.conn.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products
            (name TEXT, price REAL, url TEXT UNIQUE)
        """)

    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT OR IGNORE INTO products VALUES (?, ?, ?)",
            (item["name"], item["price"], item["url"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()

Enable pipelines in settings.py:

python
# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.CleanPricePipeline": 100,
    "myproject.pipelines.DuplicatesPipeline": 200,
    "myproject.pipelines.SaveToDBPipeline": 300,
}
# Lower number = higher priority (runs first)
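
The ordering rule can be sketched in plain Python. This is only an illustration of "ascending order, stop on drop", not Scrapy's implementation: the `DropItem` class here is a local stand-in for `scrapy.exceptions.DropItem`, and `strip_name`, `require_price`, `PIPELINES`, and `run_pipelines` are hypothetical names.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


def strip_name(item):
    """Stage 100: trim whitespace around the name."""
    item["name"] = item["name"].strip()
    return item


def require_price(item):
    """Stage 200: discard items that have no price."""
    if item.get("price") is None:
        raise DropItem("missing price")
    return item


# Same shape as ITEM_PIPELINES: stage -> order number (deliberately unsorted)
PIPELINES = {require_price: 200, strip_name: 100}


def run_pipelines(item, pipelines):
    """Run stages in ascending order; return (item or None, stages that ran)."""
    ran = []
    for stage, _order in sorted(pipelines.items(), key=lambda pair: pair[1]):
        ran.append(stage.__name__)
        try:
            item = stage(item)
        except DropItem:
            return None, ran  # A dropped item skips all later stages
    return item, ran
```

Even though `require_price` is listed first in the dictionary, `strip_name` runs first because 100 sorts before 200.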

Middleware for Proxies and User Agents

Downloader middlewares let you modify every request before it is sent. This is where you add proxy rotation and user agent randomization.

python
# middlewares.py
import random

class RandomUserAgentMiddleware:
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)

class RotatingProxyMiddleware:
    max_proxy_retries = 3  # Illustrative cap to avoid endless retry loops

    def __init__(self):
        self.proxies = [
            "http://user:pass@proxy1.example.com:8080",
            "http://user:pass@proxy2.example.com:8080",
            "http://user:pass@proxy3.example.com:8080",
        ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)

    def process_exception(self, request, exception, spider):
        # Retry on failure; process_request picks a new proxy for the retry
        retries = request.meta.get("proxy_retries", 0)
        if retries >= self.max_proxy_retries:
            return None  # Give up and let Scrapy handle the exception
        # dont_filter=True keeps the duplicate filter from dropping the retry
        retry = request.replace(dont_filter=True)
        retry.meta["proxy_retries"] = retries + 1
        return retry

Enable in settings.py:

python
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "myproject.middlewares.RotatingProxyMiddleware": 350,
}

Settings for Concurrency, Delays, and Retries

Scrapy's settings control how aggressively the spider crawls. These are the most important ones:

python
# settings.py

# Concurrency
CONCURRENT_REQUESTS = 16             # Total concurrent requests (default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # Max concurrent requests per domain
CONCURRENT_REQUESTS_PER_IP = 0       # Max per IP (0 = no limit)

# Delays
DOWNLOAD_DELAY = 1.0                 # Seconds between requests to same domain
RANDOMIZE_DOWNLOAD_DELAY = True      # Randomize delay (0.5x to 1.5x)

# Retries
RETRY_ENABLED = True
RETRY_TIMES = 3                      # Max retries per request
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Timeouts
DOWNLOAD_TIMEOUT = 30                # Seconds before timeout

# Respect robots.txt
ROBOTSTXT_OBEY = True

# Auto-throttle (adjusts speed based on server load)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Caching (saves responses for development)
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400    # 24 hours

The AUTOTHROTTLE settings are particularly useful. They automatically slow down when the server is under load and speed up when it is responsive.
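
As a rough illustration of how that adjustment works, the sketch below follows the throttling rule as described in Scrapy's AutoThrottle documentation: the target delay is the response latency divided by the target concurrency, the next delay is the average of the previous delay and that target, and the result is clamped between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY. The function name `next_delay` and its parameters are hypothetical, and the real extension may differ in details.

```python
def next_delay(current_delay, latency, status=200,
               target_concurrency=2.0, min_delay=1.0, max_delay=10.0):
    """Return the delay to use after one response (simplified sketch).

    min_delay plays the role of DOWNLOAD_DELAY and max_delay the role of
    AUTOTHROTTLE_MAX_DELAY from the settings shown above.
    """
    # Slower responses produce a larger target delay
    target_delay = latency / target_concurrency
    # Move halfway from the current delay towards the target
    new_delay = (current_delay + target_delay) / 2.0
    # Keep the result within the configured bounds
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses are not allowed to make the crawl faster
    if status != 200 and new_delay < current_delay:
        return current_delay
    return new_delay


slowed = next_delay(1.0, latency=4.0)               # Slow server: delay grows
floored = next_delay(1.5, latency=0.4)              # Fast server: floor applies
held = next_delay(3.0, latency=0.4, status=503)     # Error: delay is kept
capped = next_delay(1.0, latency=40.0)              # Very slow: ceiling applies
```

With the settings shown above, a 4 second response raises a 1 second delay to 1.5 seconds, while a fast error response leaves the delay unchanged.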

Exporting Data

Scrapy has built-in exporters for common formats:

python
# Command-line export
# scrapy crawl products -o products.json
# scrapy crawl products -o products.csv
# scrapy crawl products -o products.jsonl  # JSON Lines (one JSON per line)

# Or configure in settings.py for automatic export
FEEDS = {
    "output/products.json": {
        "format": "json",
        "encoding": "utf-8",
        "overwrite": True,
    },
    "output/products.csv": {
        "format": "csv",
    },
}

For databases, use Item Pipelines (shown in the pipeline section above). Common targets: SQLite for local development, PostgreSQL for production, MongoDB for document-oriented data.
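
One practical reason to choose JSON Lines is that each line is a complete JSON document, so a consumer can process the file one record at a time. The sketch below reads such data with the standard library only; the `read_jsonl` helper and the two sample records are hypothetical, and an in-memory `StringIO` stands in for an exported `products.jsonl` file.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample of what a products.jsonl export might contain
sample = io.StringIO(
    '{"name": "Widget", "price": 9.5}\n'
    '{"name": "Gadget", "price": 20.0}\n'
)

products = list(read_jsonl(sample))
total = sum(p["price"] for p in products)
```

For a real export, replacing `sample` with `open("products.jsonl", encoding="utf-8")` should work the same way.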

scrapy-playwright Integration

For JavaScript-heavy sites, scrapy-playwright combines Scrapy's crawling framework with Playwright's browser rendering:

python
# pip install scrapy-playwright
# playwright install chromium

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# spider.py
import scrapy

class JSSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/spa-page",
            meta={"playwright": True, "playwright_include_page": True},
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        try:
            # You can interact with the page before parsing
            await page.wait_for_selector(".dynamic-content")
            # Re-read the HTML: response holds the snapshot taken before the wait
            html = await page.content()
        finally:
            # Always close the page so browser resources are released
            await page.close()

        # Extract from the rendered HTML
        for card in scrapy.Selector(text=html).css(".product-card"):
            yield {
                "name": card.css(".title::text").get(),
                "price": card.css(".price::text").get(),
            }

This gives you the best of both worlds: Scrapy's infrastructure (scheduling, pipelines, retries) with Playwright's JavaScript rendering.

Deployment Options

Scrapyd

Self-hosted daemon for running and managing spiders on your own server:

python
# pip install scrapyd scrapyd-client
# scrapyd  (starts the daemon)
# scrapyd-deploy default -p myproject  (deploys your project)
# curl http://localhost:6800/schedule.json -d project=myproject -d spider=products
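
The same scheduling call can be made from Python. This is a minimal sketch using only the standard library; the helper names `build_schedule_request` and `schedule` are hypothetical, and it assumes a Scrapyd daemon listening on localhost:6800 as in the curl command above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(project, spider, base_url="http://localhost:6800"):
    """Build the POST request that the curl command above sends."""
    data = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(f"{base_url}/schedule.json", data=data, method="POST")


def schedule(project, spider):
    """Send the request and return Scrapyd's JSON reply as a dict."""
    request = build_schedule_request(project, spider)
    with urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds the request; call schedule(...) when a daemon is running
    req = build_schedule_request("myproject", "products")
    print(req.get_method(), req.full_url, req.data)
```

Keeping request construction separate from sending makes it possible to inspect or test the call without a running daemon.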

Scrapy Cloud (Zyte)

Managed platform by Scrapy's creators. Handles infrastructure, scheduling, and monitoring. Best for teams that do not want to manage servers.

Docker

Package your spider in a Docker container and run it anywhere:

python
# Dockerfile
# FROM python:3.11-slim
# WORKDIR /app
# COPY . .
# RUN pip install scrapy
# CMD ["scrapy", "crawl", "products", "-o", "output.json"]

Real-World Example: Crawling an E-Commerce Site

python
import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
import re

def clean_price(value):
    match = re.search(r"[\d.]+", value)
    return float(match.group()) if match else 0.0

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    category = scrapy.Field()
    rating = scrapy.Field()
    url = scrapy.Field()

class EcommerceSpider(scrapy.Spider):
    name = "ecommerce"
    start_urls = ["https://example.com/categories"]

    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 1.5,
        "FEEDS": {"products.jsonl": {"format": "jsonlines"}},
    }

    def parse(self, response):
        """Parse category listing page."""
        for cat_link in response.css(".category-list a::attr(href)").getall():
            yield response.follow(cat_link, self.parse_category)

    def parse_category(self, response):
        """Parse products in a category."""
        category = response.css("h1::text").get("Unknown")

        for card in response.css(".product-card"):
            loader = ItemLoader(item=ProductItem(), selector=card)
            loader.default_output_processor = TakeFirst()

            loader.add_css("name", ".title::text")
            loader.add_css("price", ".price::text", MapCompose(clean_price))
            loader.add_value("category", category)
            loader.add_css("rating", ".stars::attr(data-rating)")
            loader.add_css("url", "a::attr(href)")

            yield loader.load_item()

        # Pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse_category)

When to Use Scrapy vs. Alternatives

Scenario | Best Tool | Why
Crawling 10,000+ pages | Scrapy | Built-in concurrency, scheduling, pipelines
Simple one-off script | BeautifulSoup + requests | Less boilerplate, faster to write
JavaScript-rendered pages | Playwright (or scrapy-playwright) | Needs a browser engine
Login-protected site with JS | Playwright | Browser handles cookies and JS
Recurring production scraping | Scrapy | Built for deployment and monitoring
Quick data extraction | BeautifulSoup | Minimal setup

Scrapy shines when you need structure, scale, and reliability. For small scripts, it adds unnecessary complexity. The sweet spot: if your scraping task involves more than a few hundred pages or needs to run on a schedule, Scrapy is probably the right choice.

Next Steps

  1. Install Scrapy: pip install scrapy
  2. Create a project: scrapy startproject myproject
  3. Generate a spider: scrapy genspider products example.com
  4. Run the Scrapy shell to test selectors: scrapy shell "https://example.com"
  5. Build your first spider with pagination
  6. Add an Item Pipeline to clean and store data
  7. When you need JavaScript support, add scrapy-playwright
