What Is Scrapy? Python Web Crawling Framework Explained

Scrapy is an open-source Python framework designed for web crawling and scraping at scale. It provides built-in support for following links, handling retries, managing concurrency, and exporting data through pipelines.

How Scrapy Works

Unlike BeautifulSoup (a parser library) or Playwright (a browser automation tool), Scrapy is a complete framework. It manages the entire scraping pipeline: scheduling requests, downloading pages, parsing responses, processing data, and exporting results. You define what to scrape and how. Scrapy handles concurrency, retries, rate limiting, and data flow automatically.

Scrapy is event-driven and built on Twisted, an asynchronous networking library, so it can keep many requests in flight without spawning a thread per request. With the default settings, a single spider runs up to 16 concurrent requests, which is enough to crawl thousands of pages in minutes on a responsive site.
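
To make the single-threaded concurrency model concrete, here is a minimal sketch. It uses the standard library's asyncio as a stand-in for Twisted (Scrapy itself runs on Twisted's reactor). The names `fake_fetch` and `crawl` are illustrative, not Scrapy APIs. No network calls are made; each "download" is a short sleep.

```python
import asyncio
import threading

MAX_IN_FLIGHT = 16  # mirrors Scrapy's default CONCURRENT_REQUESTS


async def fake_fetch(url, limiter, stats):
    """Pretend to download a page; the sleep stands in for network latency."""
    async with limiter:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        stats["threads"].add(threading.get_ident())
        await asyncio.sleep(0.01)  # control returns to the event loop here
        stats["in_flight"] -= 1
        return url, 200


async def crawl(urls):
    limiter = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"in_flight": 0, "peak": 0, "threads": set()}
    results = await asyncio.gather(*(fake_fetch(u, limiter, stats) for u in urls))
    return results, stats


urls = [f"https://example.com/page/{n}" for n in range(100)]
results, stats = asyncio.run(crawl(urls))
print(f"fetched {len(results)} pages, peak concurrency {stats['peak']}, "
      f"threads used: {len(stats['threads'])}")
```

Every simulated request runs on the same thread. The overlap comes from tasks yielding to the event loop while they wait, which is the same idea Scrapy relies on for real downloads.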

Architecture Deep Dive

Scrapy's architecture has six core components that work together in a pipeline:

1. Engine: The central coordinator. Manages data flow between all components.
2. Scheduler: Receives requests from the engine and queues them. Handles deduplication so you do not crawl the same URL twice.
3. Downloader: Fetches pages from the web. Handles HTTP connections, redirects, and cookies.
4. Spiders: Your code. Defines how to parse responses and what data to extract.
5. Item Pipelines: Process extracted items. Clean data, validate fields, remove duplicates, and save to storage.
6. Middlewares: Hooks into the request/response cycle. Two types: Downloader Middlewares (modify requests/responses) and Spider Middlewares (modify spider input/output).

The flow: Spider yields a Request > Engine sends it to Scheduler > Scheduler queues it > Engine asks Downloader to fetch it > Response goes through Downloader Middleware > Engine sends Response to Spider > Spider yields Items and new Requests > Items go through Item Pipeline > New Requests go back to Scheduler.
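
The flow above can be sketched as a toy, synchronous loop. This is not how Scrapy is implemented (the real engine is asynchronous and far more capable). `SITE`, `Scheduler`, `download`, `spider_parse`, and `run_engine` are hypothetical names chosen for illustration, and the "site" is an in-memory dictionary rather than real HTTP.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (items on the page, links on the page)
SITE = {
    "/": ([], ["/a", "/b"]),
    "/a": ([{"name": "Widget"}], ["/b", "/"]),
    "/b": ([{"name": "Gadget"}], ["/a"]),
}


class Scheduler:
    """Queues requests and drops URLs it has already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def download(url):
    """Downloader stand-in: look the page up instead of doing HTTP."""
    return {"url": url, "page": SITE[url]}


def spider_parse(response):
    """Spider callback: yield items and follow-up requests."""
    items, links = response["page"]
    for item in items:
        yield ("item", {**item, "url": response["url"]})
    for link in links:
        yield ("request", link)


def run_engine(start_url, pipeline):
    """Engine stand-in: move requests, responses, and items between parts."""
    scheduler = Scheduler()
    scheduler.enqueue(start_url)
    fetched = []
    while (url := scheduler.next_request()) is not None:
        response = download(url)
        fetched.append(url)
        for kind, value in spider_parse(response):
            if kind == "item":
                pipeline(value)          # items go to the pipeline
            else:
                scheduler.enqueue(value)  # new requests go back to the scheduler
    return fetched


stored = []
fetched = run_engine("/", stored.append)
print(fetched)
print(stored)
```

Although the pages link to each other in a cycle, each URL is fetched once because the scheduler filters requests it has already seen.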

Creating a Project and First Spider

python

# Create a new Scrapy project
# Run in terminal:
# scrapy startproject myproject
# cd myproject
# scrapy genspider products example.com

This creates the following structure:

code

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            products.py

A basic spider:

python

import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product-card"):
            yield {
                "name": product.css(".title::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
            }

        # Follow pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it:

python

# Run from terminal:
# scrapy crawl products -o products.json

Selectors: CSS and XPath

Scrapy supports both CSS selectors and XPath. You can use whichever you prefer, or mix them.

python

# CSS selectors
response.css(".product-card")              # All matching elements
response.css(".title::text").get()          # Text content of first match
response.css(".title::text").getall()       # Text content of all matches
response.css("a::attr(href)").get()         # Attribute value
response.css(".price::text").re_first(r"[\d.]+")  # Regex on text
# XPath selectors
response.xpath("//div[@class='product-card']")
response.xpath(".//h2/text()").get()
response.xpath("//a/@href").getall()
response.xpath("//span[contains(@class, 'price')]/text()").get()

CSS selectors are generally easier to read. XPath is more powerful for complex queries (like selecting by text content or navigating up the tree). Use CSS as your default and switch to XPath when CSS cannot express what you need.
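
As a rough illustration of a query that CSS cannot express, the sketch below selects elements by their text content. It uses the standard library's ElementTree, which implements only a small subset of XPath, so that it runs without dependencies. Scrapy's own selectors support full XPath 1.0. The markup in `PAGE` is hypothetical and deliberately well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; real pages usually need an HTML parser
PAGE = """
<div>
  <div class="product-card"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="product-card"><h2>Gadget</h2><span class="price">19.50</span></div>
  <a href="/page/1">Previous</a>
  <a href="/page/3">Next</a>
</div>
""".strip()

root = ET.fromstring(PAGE)

# Select a link by its text content rather than by class or position
next_href = root.find(".//a[.='Next']").get("href")

# Select the card whose <h2> child has a given text, then read its price
gadget_card = root.find(".//div[h2='Gadget']")
gadget_price = gadget_card.find("span[@class='price']").text

print(next_href, gadget_price)
```

In a Scrapy callback, the first query would look something like `response.xpath("//a[text()='Next']/@href").get()`.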

Following Links and Pagination

Scrapy makes it easy to follow links and handle pagination:

python

import scrapy


# Named FullCrawlSpider to avoid clashing with Scrapy's built-in CrawlSpider
class FullCrawlSpider(scrapy.Spider):
    name = "full_crawl"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Extract data from current page
        for card in response.css(".product-card"):
            # Follow link to detail page
            detail_url = card.css("a::attr(href)").get()
            if detail_url:
                yield response.follow(
                    detail_url,
                    callback=self.parse_detail,
                    meta={"name": card.css(".title::text").get()},
                )

        # Follow pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_detail(self, response):
        yield {
            "name": response.meta["name"],
            "description": response.css(".description::text").get(),
            "specs": response.css(".spec-table td::text").getall(),
            "url": response.url,
        }

The response.follow() method handles relative URLs automatically. The meta parameter lets you pass data between callbacks.
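
To show what "handles relative URLs" means in practice, the snippet below resolves a few kinds of href against a page URL with the standard library's `urljoin`, which Scrapy's `response.urljoin()` wraps. The URLs are made up for illustration.

```python
from urllib.parse import urljoin

page_url = "https://example.com/products/page/2?sort=price"

hrefs = [
    "/products/42",                     # root-relative
    "item/42",                          # relative to the current path
    "../about",                         # parent directory
    "?page=3",                          # query-only
    "https://cdn.example.com/img.png",  # already absolute
]

resolved = [urljoin(page_url, href) for href in hrefs]
for href, absolute in zip(hrefs, resolved):
    print(f"{href!r:40} -> {absolute}")
```

With `response.follow()` you can pass any of these href values directly and skip the manual join.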

Item Pipelines

Pipelines process items after extraction. Common uses: data cleaning, validation, deduplication, and storage.

python

# pipelines.py
import re
import sqlite3

from scrapy.exceptions import DropItem


class CleanPricePipeline:
    """Remove currency symbols and convert price to float."""

    def process_item(self, item, spider):
        if "price" in item and item["price"]:
            clean = re.sub(r"[^\d.]", "", item["price"])
            item["price"] = float(clean) if clean else 0.0
        return item


class DuplicatesPipeline:
    """Drop duplicate items based on URL."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url", "")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate: {url}")
        self.seen_urls.add(url)
        return item


class SaveToDBPipeline:
    """Save items to a SQLite database."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("products.db")
        self.cursor = self.conn.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products
            (name TEXT, price REAL, url TEXT UNIQUE)
        """)

    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT OR IGNORE INTO products VALUES (?, ?, ?)",
            (item["name"], item["price"], item["url"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

Enable pipelines in settings.py:

python

# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.CleanPricePipeline": 100,
    "myproject.pipelines.DuplicatesPipeline": 200,
    "myproject.pipelines.SaveToDBPipeline": 300,
}
# Lower number = higher priority (runs first)
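
The ordering rule can be illustrated with a toy runner. This is only a sketch of the idea, not Scrapy's pipeline manager. `DropItem` here is a local stand-in for `scrapy.exceptions.DropItem`, and `clean_price`, `make_dedupe`, and `run_pipelines` are hypothetical helpers written as plain functions.

```python
import re


class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


def clean_price(item):
    """Strip everything except digits and dots, then convert to float."""
    digits = re.sub(r"[^\d.]", "", item.get("price") or "")
    item["price"] = float(digits) if digits else 0.0
    return item


def make_dedupe():
    seen = set()

    def dedupe(item):
        if item["url"] in seen:
            raise DropItem(item["url"])
        seen.add(item["url"])
        return item

    return dedupe


def run_pipelines(items, pipelines):
    """Apply stages in ascending order of their number; stop on DropItem."""
    ordered = [stage for stage, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]
    kept = []
    for item in items:
        try:
            for stage in ordered:
                item = stage(item)
        except DropItem:
            continue  # a dropped item never reaches later stages
        kept.append(item)
    return kept


# The dict's own ordering is irrelevant; only the numbers matter
PIPELINES = {make_dedupe(): 200, clean_price: 100}
raw = [
    {"name": "Widget", "price": "$1,299.99", "url": "/w"},
    {"name": "Widget", "price": "$1,299.99", "url": "/w"},
    {"name": "Gadget", "price": None, "url": "/g"},
]
kept = run_pipelines(raw, PIPELINES)
print(kept)
```

The duplicate Widget is dropped at stage 200 after its price has been cleaned at stage 100, so it never reaches a later stage such as storage.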

Middleware for Proxies and User Agents

Downloader middlewares let you modify every request before it is sent. This is where you add proxy rotation and user agent randomization.

python

# middlewares.py
import random


class RandomUserAgentMiddleware:
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)


class RotatingProxyMiddleware:
    max_proxy_retries = 3  # illustrative cap so a dead target cannot loop forever

    def __init__(self):
        self.proxies = [
            "http://user:pass@proxy1.example.com:8080",
            "http://user:pass@proxy2.example.com:8080",
            "http://user:pass@proxy3.example.com:8080",
        ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)

    def process_exception(self, request, exception, spider):
        # Retry on failure; process_request picks a fresh proxy for the retry
        attempts = request.meta.get("proxy_retries", 0)
        if attempts >= self.max_proxy_retries:
            return None  # let the remaining middlewares handle the failure
        # dont_filter=True keeps the duplicate filter from dropping the retry
        retry = request.replace(dont_filter=True)
        retry.meta["proxy_retries"] = attempts + 1
        return retry

Enable in settings.py:

python

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "myproject.middlewares.RotatingProxyMiddleware": 350,
}

Settings for Concurrency, Delays, and Retries

Scrapy's settings control how aggressively the spider crawls. These are the most important ones:

python

# settings.py
# Concurrency
CONCURRENT_REQUESTS = 16          # Total concurrent requests (default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # Max concurrent requests per domain
CONCURRENT_REQUESTS_PER_IP = 0    # Max per IP (0 = no limit)
# Delays
DOWNLOAD_DELAY = 1.0              # Seconds between requests to same domain
RANDOMIZE_DOWNLOAD_DELAY = True   # Randomize delay (0.5x to 1.5x)
# Retries
RETRY_ENABLED = True
RETRY_TIMES = 3                   # Max retries per request
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# Timeouts
DOWNLOAD_TIMEOUT = 30             # Seconds before timeout
# Respect robots.txt
ROBOTSTXT_OBEY = True
# Auto-throttle (adjusts speed based on server load)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Caching (saves responses for development)
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400  # 24 hours

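The AUTOTHROTTLE settings are particularly useful. AutoThrottle adjusts the download delay based on observed response latency: it backs off when responses slow down and speeds up when the server is responsive, without going below DOWNLOAD_DELAY or above AUTOTHROTTLE_MAX_DELAY.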
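
The adjustment can be sketched as a small function. This follows the update rule described in Scrapy's AutoThrottle documentation as I understand it. The function name `next_delay` and the sample latencies are illustrative, and the real extension tracks the delay per download slot rather than in a single variable.

```python
def next_delay(current_delay, latency, status=200, *,
               target_concurrency=2.0, min_delay=1.0, max_delay=10.0):
    """Return the delay to use after one response has been observed."""
    # Delay that would keep roughly `target_concurrency` requests in flight
    target_delay = latency / target_concurrency
    # Move halfway toward the target, but never below the target itself
    new_delay = max(target_delay, (current_delay + target_delay) / 2.0)
    # Clamp to [DOWNLOAD_DELAY, AUTOTHROTTLE_MAX_DELAY]
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses are often fast; they must not lower the delay
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


delay = 1.0  # AUTOTHROTTLE_START_DELAY
history = []
for latency, status in [(4.0, 200), (0.5, 200), (0.1, 503), (40.0, 200)]:
    delay = next_delay(delay, latency, status)
    history.append(delay)
print(history)
```

The fast 503 response in the third step leaves the delay unchanged, and the very slow response in the last step is capped at the maximum delay.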

Exporting Data

Scrapy has built-in exporters for common formats:

python

# Command-line export
# scrapy crawl products -o products.json
# scrapy crawl products -o products.csv
# scrapy crawl products -o products.jsonl  # JSON Lines (one JSON object per line)

# Or configure in settings.py for automatic export
FEEDS = {
    "output/products.json": {
        "format": "json",
        "encoding": "utf-8",
        "overwrite": True,
    },
    "output/products.csv": {
        "format": "csv",
    },
}

For databases, use Item Pipelines (shown in the pipeline section above). Common targets: SQLite for local development, PostgreSQL for production, MongoDB for document-oriented data.
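
The difference between the JSON and JSON Lines formats is easiest to see side by side. The sketch below uses only the standard library and made-up items. It does not call Scrapy's exporters, whose exact whitespace may differ.

```python
import io
import json

items = [
    {"name": "Widget", "price": 9.99},
    {"name": "Gadget", "price": 19.5},
]

# JSON: one array holding every item; readers must parse the whole document
as_json = json.dumps(items)

# JSON Lines: one self-contained object per line; easy to append and stream
buffer = io.StringIO()
for item in items:
    buffer.write(json.dumps(item) + "\n")
as_jsonl = buffer.getvalue()

# Reading JSON Lines back one record at a time
streamed = [json.loads(line) for line in as_jsonl.splitlines()]

print(as_json)
print(as_jsonl, end="")
```

For long crawls, JSON Lines is usually the more forgiving choice, because a partially written file still contains complete records.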

scrapy-playwright Integration

For JavaScript-heavy sites, scrapy-playwright combines Scrapy's crawling framework with Playwright's browser rendering:

python

# pip install scrapy-playwright
# playwright install chromium

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# spider.py
import scrapy


class JSSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/spa-page",
            meta={"playwright": True, "playwright_include_page": True},
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        try:
            # You can interact with the page before parsing
            await page.wait_for_selector(".dynamic-content")
            # Re-read the DOM: `response` holds the snapshot taken before the wait
            html = await page.content()
        finally:
            # Always release the page, even if the wait times out
            await page.close()

        # Extract from the rendered HTML
        for card in scrapy.Selector(text=html).css(".product-card"):
            yield {
                "name": card.css(".title::text").get(),
                "price": card.css(".price::text").get(),
            }

This gives you the best of both worlds: Scrapy's infrastructure (scheduling, pipelines, retries) with Playwright's JavaScript rendering.

Deployment Options

Scrapyd

Self-hosted daemon for running and managing spiders on your own server:

python

# pip install scrapyd scrapyd-client
# scrapyd  (starts the daemon)
# scrapyd-deploy default -p myproject  (deploys your project)
# curl http://localhost:6800/schedule.json -d project=myproject -d spider=products

Scrapy Cloud (Zyte)

Managed platform by Scrapy's creators. Handles infrastructure, scheduling, and monitoring. Best for teams that do not want to manage servers.

Docker

Package your spider in a Docker container and run it anywhere:

python

# Dockerfile
# FROM python:3.11-slim
# WORKDIR /app
# COPY . .
# RUN pip install scrapy
# CMD ["scrapy", "crawl", "products", "-o", "output.json"]

Real-World Example: Crawling an E-Commerce Site

python

import re

import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader


def clean_price(value):
    match = re.search(r"[\d.]+", value)
    return float(match.group()) if match else 0.0


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    category = scrapy.Field()
    rating = scrapy.Field()
    url = scrapy.Field()


class EcommerceSpider(scrapy.Spider):
    name = "ecommerce"
    start_urls = ["https://example.com/categories"]

    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 1.5,
        "FEEDS": {"products.jsonl": {"format": "jsonlines"}},
    }

    def parse(self, response):
        """Parse category listing page."""
        for cat_link in response.css(".category-list a::attr(href)").getall():
            yield response.follow(cat_link, self.parse_category)

    def parse_category(self, response):
        """Parse products in a category."""
        category = response.css("h1::text").get("Unknown")

        for card in response.css(".product-card"):
            loader = ItemLoader(item=ProductItem(), selector=card)
            loader.default_output_processor = TakeFirst()

            loader.add_css("name", ".title::text")
            loader.add_css("price", ".price::text", MapCompose(clean_price))
            loader.add_value("category", category)
            loader.add_css("rating", ".stars::attr(data-rating)")
            # Store absolute URLs rather than the raw relative href
            loader.add_css("url", "a::attr(href)", MapCompose(response.urljoin))

            yield loader.load_item()

        # Pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse_category)

When to Use Scrapy vs. Alternatives

Scenario	Best Tool	Why
Crawling 10,000+ pages	Scrapy	Built-in concurrency, scheduling, pipelines
Simple one-off script	BeautifulSoup + requests	Less boilerplate, faster to write
JavaScript-rendered pages	Playwright (or scrapy-playwright)	Needs a browser engine
Login-protected site with JS	Playwright	Browser handles cookies and JS
Recurring production scraping	Scrapy	Built for deployment and monitoring
Quick data extraction	BeautifulSoup	Minimal setup

Scrapy shines when you need structure, scale, and reliability. For small scripts, it adds unnecessary complexity. The sweet spot: if your scraping task involves more than a few hundred pages or needs to run on a schedule, Scrapy is probably the right choice.

Next Steps

1. Install Scrapy: pip install scrapy
2. Create a project: scrapy startproject myproject
3. Generate a spider: scrapy genspider products example.com
4. Run the Scrapy shell to test selectors: scrapy shell "https://example.com"
5. Build your first spider with pagination
6. Add an Item Pipeline to clean and store data
7. When you need JavaScript support, add scrapy-playwright
