What Is Playwright? Browser Automation for Web Scraping


Playwright is a browser automation framework developed by Microsoft that controls real browsers (Chromium, Firefox, WebKit) programmatically. For web scraping, it is used to extract data from JavaScript-heavy websites whose content is not present in the initial HTML response.

Why Browser Automation for Scraping?

Many modern websites are Single Page Applications (SPAs) built with React, Vue, or Angular. When you fetch these pages with a plain HTTP client such as the requests library, you often get a nearly empty HTML shell. The actual data loads after JavaScript executes. Playwright solves this by running a real browser engine that processes JavaScript, renders the DOM, and gives you access to the fully loaded page.
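
To make the "empty shell" concrete, here is a minimal sketch using only the standard library. The SPA_SHELL string is a made-up stand-in for what a plain HTTP fetch of a React storefront might return, and count_elements is a hypothetical helper, not part of Playwright:

```python
from html.parser import HTMLParser

# Hypothetical response body from a React storefront before any JavaScript runs.
SPA_SHELL = """<!DOCTYPE html>
<html>
  <head><title>Shop</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>"""


class ClassCounter(HTMLParser):
    """Count elements that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_elements(html, css_class):
    parser = ClassCounter(css_class)
    parser.feed(html)
    return parser.count


# The shell contains a mount point and a script tag, but no product cards.
print(count_elements(SPA_SHELL, "product-card"))  # prints 0
```

The product cards only exist after the script referenced in the shell has run, which is the step a browser engine performs and a plain HTTP client does not.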

Beyond JavaScript rendering, browser automation lets you interact with pages the way a human does: clicking buttons, filling forms, scrolling to load more content, and handling login flows. If the data you need requires any of these interactions, Playwright is your tool.

Installation

Playwright requires both the Python package and browser binaries:

bash

# Install the Python package
pip install playwright

# Download browser binaries (Chromium, Firefox, WebKit)
playwright install

# Or install only Chromium (smaller download, most common for scraping)
playwright install chromium

The browser download is a one-time step per Playwright version. It installs actual browser binaries rather than only drivers, which is what older Selenium setups relied on. Each Playwright release is pinned to, and tested against, specific browser builds, which avoids the driver/browser version mismatches that can break other setups.

Sync vs. Async API

Playwright offers both synchronous and asynchronous APIs. The sync API is simpler and fine for most scraping scripts. The async API is better when you need to scrape multiple pages concurrently.

Synchronous API

python

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")
    page.wait_for_selector(".product-card")

    products = page.query_selector_all(".product-card")
    for product in products:
        name = product.query_selector(".title").inner_text()
        price = product.query_selector(".price").inner_text()
        print(f"{name}: {price}")

    browser.close()

Asynchronous API

python

import asyncio
from playwright.async_api import async_playwright


async def scrape():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com/products")
        await page.wait_for_selector(".product-card")

        products = await page.query_selector_all(".product-card")
        for product in products:
            name = await (await product.query_selector(".title")).inner_text()
            price = await (await product.query_selector(".price")).inner_text()
            print(f"{name}: {price}")

        await browser.close()


asyncio.run(scrape())

Use the sync API for simple scripts and the async API when you need to manage multiple browser contexts or pages simultaneously.
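
As a rough sketch of what "concurrently" can look like with the async API, the helper below caps how many pages are open at once with a semaphore. The names gather_limited, fetch_title, and scrape_titles are illustrative, not part of Playwright, and the limit of 3 is an arbitrary assumption:

```python
import asyncio


async def gather_limited(urls, fetch, limit=3):
    """Run fetch(url) for every URL with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))


async def fetch_title(browser, url):
    """Hypothetical fetcher: open a page, read its title, close the page."""
    page = await browser.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()


async def scrape_titles(urls, limit=3):
    # Imported here so the helpers above can be used without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            return await gather_limited(
                urls, lambda u: fetch_title(browser, u), limit
            )
        finally:
            await browser.close()
```

Calling asyncio.run(scrape_titles(["https://example.com/a", "https://example.com/b"])) would open the pages in one shared browser. Keeping the limit small matters because every open page costs memory and CPU.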

Core Scraping Patterns

goto, wait, and extract

The fundamental pattern: navigate to a page, wait for content to load, then extract data.

python

page.goto("https://example.com", wait_until="networkidle")
# wait_until options:
#   "commit" - return as soon as the response arrives and the document starts loading
#   "load" - wait for the load event (default)
#   "domcontentloaded" - wait for DOMContentLoaded
#   "networkidle" - wait until there have been no network connections for at least 500 ms
#       (convenient for SPAs, but it can stall on pages that poll in the background,
#        so pair it with, or replace it by, a wait for a specific element)

# Wait for a specific element
page.wait_for_selector(".product-card", timeout=10000)

# Extract from all matching elements
cards = page.query_selector_all(".product-card")
data = []
for card in cards:
    data.append({
        "name": card.query_selector(".title").inner_text(),
        "price": card.query_selector(".price").inner_text(),
        "link": card.get_attribute("href"),  # assumes the card itself is an <a> element
    })
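
The strings that inner_text() returns are display text, so they usually need cleaning before they are useful as data. Here is a small sketch of that post-processing step; parse_price is a hypothetical helper and it assumes US-style formatting (comma as thousands separator, dot as decimal point):

```python
import re
from decimal import Decimal


def parse_price(text):
    """Turn a display string such as '$1,299.00' into a Decimal, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    if not match:
        return None  # no number found, e.g. "Sold out"
    return Decimal(match.group().replace(",", ""))


print(parse_price("$1,299.00"))  # prints 1299.00
print(parse_price("Sold out"))   # prints None
```

Sites that use other number formats (for example "1.299,00 €") would need a different pattern.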

Using evaluate for Complex Extraction

When query_selector is not enough, run JavaScript directly in the page context:

python

# Extract data using JavaScript
data = page.evaluate("""
    () => {
        return Array.from(document.querySelectorAll('.product-card')).map(card => ({
            name: card.querySelector('.title')?.textContent?.trim(),
            price: card.querySelector('.price')?.textContent?.trim(),
            inStock: !card.classList.contains('out-of-stock')
        }));
    }
""")

Network Interception

This is one of Playwright's most powerful scraping features. Many SPAs fetch data from internal APIs via XHR or fetch requests. Instead of parsing the rendered HTML, you can intercept these API responses directly and get clean JSON.

python

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Capture API responses
    api_data = []

    def handle_response(response):
        if "/api/products" in response.url and response.status == 200:
            data = response.json()
            api_data.extend(data["items"])

    # Register the listener before navigating so no response is missed
    page.on("response", handle_response)

    page.goto("https://example.com/products")
    page.wait_for_load_state("networkidle")

    # api_data now contains the raw JSON from the API
    print(f"Captured {len(api_data)} products from API")
    for item in api_data:
        print(f"{item['name']}: {item['price']}")

    browser.close()

This approach is often better than HTML parsing because:

• API responses are structured JSON (no messy HTML to parse)
• Data might include fields not shown on the page
• It is more resilient to UI changes
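
A response handler tends to be easier to reuse and test when the filtering logic is separated from the browser. The sketch below shows one possible shape: make_collector is a hypothetical helper, and the "/api/products" path and "items" key are assumptions about the target site's API. The dry run uses a stand-in object instead of a real Playwright response:

```python
from types import SimpleNamespace


def make_collector(url_fragment, items_key="items"):
    """Return (handler, collected); pass handler to page.on("response", ...)."""
    collected = []

    def handler(response):
        if url_fragment not in response.url or response.status != 200:
            return
        try:
            payload = response.json()
        except Exception:
            return  # body was not JSON or could not be read; skip it
        if isinstance(payload, dict):
            collected.extend(payload.get(items_key, []))
        elif isinstance(payload, list):
            collected.extend(payload)

    return handler, collected


# Dry run with a stand-in response. With Playwright you would instead call
# page.on("response", handler) before page.goto(...).
handler, collected = make_collector("/api/products")
fake_response = SimpleNamespace(
    url="https://example.com/api/products?page=1",
    status=200,
    json=lambda: {"items": [{"name": "Widget", "price": 9.99}]},
)
handler(fake_response)
print(collected)
```

Swallowing JSON errors in the handler is a deliberate choice here: one malformed response should not abort the whole scrape.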

Handling Infinite Scroll and Lazy Loading

Many modern sites use infinite scroll instead of pagination. You need to scroll down to trigger loading of more content.

python

def scroll_to_bottom(page, max_scrolls=20, pause=1.5):
    """Scroll down until no new content loads."""
    previous_height = 0
    for i in range(max_scrolls):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(int(pause * 1000))

        current_height = page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            break  # No new content loaded
        previous_height = current_height
        print(f"Scroll {i+1}: page height = {current_height}")


# Usage
page.goto("https://example.com/feed")
scroll_to_bottom(page, max_scrolls=10)
items = page.query_selector_all(".feed-item")
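
Stopping on the first unchanged measurement can end the loop too early when a batch loads slowly. One possible variation, sketched here, tolerates a few stalled rounds before giving up. scroll_until_stable and its patience parameter are illustrative names, and the dry run uses a simulated feed rather than a browser:

```python
def scroll_until_stable(measure, scroll_once, max_rounds=20, patience=2):
    """Call scroll_once() until measure() stops growing.

    Stops after `patience` consecutive rounds without growth, or after
    `max_rounds` scrolls. Returns the number of scrolls performed.
    """
    best = measure()
    stalled = 0
    rounds = 0
    while rounds < max_rounds and stalled < patience:
        scroll_once()
        rounds += 1
        current = measure()
        if current > best:
            best = current
            stalled = 0
        else:
            stalled += 1
    return rounds


# Dry run against a simulated feed whose item count grows, pauses, then stops.
counts = [10, 20, 20, 30, 30, 30]
position = {"i": 0}


def fake_measure():
    return counts[position["i"]]


def fake_scroll():
    position["i"] = min(position["i"] + 1, len(counts) - 1)


rounds = scroll_until_stable(fake_measure, fake_scroll, max_rounds=10, patience=2)
print(rounds, fake_measure())
```

With a real page, measure could be lambda: page.locator(".feed-item").count() and scroll_once could call page.mouse.wheel(0, 4000) followed by page.wait_for_timeout(1500). Counting items rather than comparing page height is an assumption that the feed items share a stable selector.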

Taking Screenshots for Debugging

Screenshots are invaluable when your scraper does not find the expected elements.

python

from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

# Full page screenshot
page.screenshot(path="debug_full.png", full_page=True)

# Screenshot of a specific element
element = page.query_selector(".product-grid")
element.screenshot(path="debug_grid.png")

# Screenshot on error: catch only Playwright's timeout, not every exception
try:
    page.wait_for_selector(".product-card", timeout=5000)
except PlaywrightTimeoutError:
    page.screenshot(path="error_state.png")
    print("Element not found - check error_state.png")

Managing Browser Context and Cookies

Browser contexts let you run isolated sessions within a single browser instance. Each context has its own cookies, localStorage, and cache.

python

# Assumes this runs inside `with sync_playwright() as p:`
browser = p.chromium.launch()

# Create a context with custom settings
context = browser.new_context(
    viewport={"width": 1920, "height": 1080},
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    locale="en-US",
)

# Login and save cookies for reuse
page = context.new_page()
page.goto("https://example.com/login")
page.fill("#email", "user@example.com")
page.fill("#password", "password123")
page.click("button[type=submit]")
page.wait_for_url("**/dashboard")

# Save authentication state, then close only this context
# (closing the browser here would make the next new_context() call fail)
context.storage_state(path="auth.json")
context.close()

# Later: reuse the saved session in a fresh context
context = browser.new_context(storage_state="auth.json")
page = context.new_page()
page.goto("https://example.com/dashboard")  # Already logged in
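
A saved session eventually expires, so it can be worth checking the file before relying on it. The sketch below assumes the storage state JSON has a top-level "cookies" list whose entries carry an "expires" Unix timestamp in seconds, with -1 marking session cookies. That matches the layout Playwright writes as far as I am aware, but treat it as an assumption and inspect your own auth.json. live_cookies and load_state are hypothetical helpers:

```python
import json
import time


def live_cookies(state, now=None):
    """Return the cookies in a storage-state dict that have not expired."""
    now = time.time() if now is None else now
    live = []
    for cookie in state.get("cookies", []):
        expires = cookie.get("expires", -1)
        # expires == -1 is assumed to mark a session cookie with no fixed expiry.
        if expires == -1 or expires > now:
            live.append(cookie)
    return live


def load_state(path):
    """Read a storage-state file such as auth.json into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

If live_cookies(load_state("auth.json")) comes back empty, logging in again before scraping is probably safer than reusing the file.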

Stealth Mode and Anti-Detection

Headless browsers leave fingerprints that anti-bot systems detect. Common tells include the navigator.webdriver property, missing browser plugins, and specific HTTP header patterns.

python

# Basic stealth: launch in headed mode (slower but harder to detect)
browser = p.chromium.launch(headless=False)

# Use playwright-stealth for better evasion
# pip install playwright-stealth
# Note: stealth_sync is the entry point in playwright-stealth 1.x; newer releases
# changed the API, so check the documentation for the version you install.
from playwright_stealth import stealth_sync

page = browser.new_page()
stealth_sync(page)  # Patches common detection vectors
page.goto("https://example.com")

Additional anti-detection tips:

• Set a realistic viewport size (Playwright's default is 1280x720; common desktop sizes such as 1920x1080 blend in better)
• Set a real user agent string
• Add random delays between actions
• Move the mouse and scroll naturally before extracting data
• Use residential proxies for the browser connection
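
For the random delays mentioned above, a fixed sleep is easy to fingerprint, so a little jitter is a common approach. A minimal sketch follows; jittered_delay and pause are hypothetical helpers and the default timings are arbitrary:

```python
import random
import time


def jittered_delay(base=1.0, jitter=0.5, rng=random):
    """Pick a delay in seconds, uniformly from [base - jitter, base + jitter]."""
    low = max(0.0, base - jitter)  # never return a negative delay
    return rng.uniform(low, base + jitter)


def pause(base=1.0, jitter=0.5):
    """Sleep for a randomized interval between actions."""
    time.sleep(jittered_delay(base, jitter))
```

Calling pause(2.0, 0.5) between page actions waits somewhere between 1.5 and 2.5 seconds. Randomized timing alone is unlikely to defeat a serious anti-bot system; it only removes one obvious signal.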

Performance Optimization

Browser automation is inherently slower than HTTP requests. These techniques help:

python

# Block images, CSS, and fonts to speed up page loads
def block_resources(route):
    if route.request.resource_type in ["image", "stylesheet", "font", "media"]:
        route.abort()
    else:
        route.continue_()


page.route("**/*", block_resources)

# Or block specific domains (ad networks, analytics)
page.route("**/*google-analytics.com/**", lambda route: route.abort())
page.route("**/*doubleclick.net/**", lambda route: route.abort())
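
The two rules above (by resource type and by domain) can also be folded into one handler, which keeps the decision logic in a plain function that is easy to test. This is a sketch under assumptions: should_block and handle_route are hypothetical names, and the domain list is only illustrative:

```python
from urllib.parse import urlparse

BLOCKED_TYPES = {"image", "stylesheet", "font", "media"}
BLOCKED_DOMAINS = {"google-analytics.com", "doubleclick.net"}  # illustrative list


def should_block(resource_type, url):
    """Decide whether a request should be aborted."""
    if resource_type in BLOCKED_TYPES:
        return True
    host = urlparse(url).hostname or ""
    # Match the domain itself or any subdomain, but not lookalike hosts.
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def handle_route(route):
    """Route handler intended for page.route("**/*", handle_route)."""
    request = route.request
    if should_block(request.resource_type, request.url):
        route.abort()
    else:
        route.continue_()
```

Blocking stylesheets can change which elements are visible or laid out, so if a selector stops matching after you add blocking, try allowing CSS again first.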

python

# Reuse browser instances across pages (don't launch/close per page)
browser = p.chromium.launch()
page = browser.new_page()
for url in urls:
    page.goto(url, wait_until="domcontentloaded")
    # extract data...
    # No need to create a new page each time
browser.close()

Real-World Example: Scraping a JS-Heavy SPA

This complete example scrapes product data from a React-based e-commerce site, handling dynamic loading and pagination:

python

from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError
import json
import time


def scrape_spa_products(base_url, max_pages=10):
    products = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        )
        page = context.new_page()

        # Block unnecessary resources
        page.route("**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}",
                   lambda route: route.abort())

        for page_num in range(1, max_pages + 1):
            url = f"{base_url}?page={page_num}"
            page.goto(url, wait_until="networkidle")

            # Wait for product cards to render
            try:
                page.wait_for_selector(".product-card", timeout=8000)
            except PlaywrightTimeoutError:
                print(f"No products on page {page_num}, stopping")
                break

            # Extract using JavaScript for speed
            page_products = page.evaluate("""
                () => Array.from(document.querySelectorAll('.product-card')).map(card => ({
                    name: card.querySelector('.name')?.innerText,
                    price: card.querySelector('.price')?.innerText,
                    rating: card.querySelector('.stars')?.getAttribute('data-rating'),
                    url: card.querySelector('a')?.href,
                }))
            """)

            products.extend(page_products)
            print(f"Page {page_num}: {len(page_products)} products")
            time.sleep(1)  # Respectful delay

        browser.close()

    # Save results
    with open("products.json", "w") as f:
        json.dump(products, f, indent=2)

    return products


results = scrape_spa_products("https://example.com/shop")
print(f"Total: {len(results)} products scraped")
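
Paginated listings sometimes repeat items across pages, for example when new products are added while the scrape is running. If that applies to your target, a small post-processing step can remove the repeats. dedupe_by_key is a hypothetical helper, and using the product URL as the identity is an assumption:

```python
def dedupe_by_key(records, key="url"):
    """Keep the first record for each key value, preserving order.

    Records without the key are kept, since they cannot be compared.
    """
    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        if value is None:
            unique.append(record)
        elif value not in seen:
            seen.add(value)
            unique.append(record)
    return unique
```

For the example above this would be applied as products = dedupe_by_key(products) before writing products.json.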

Playwright vs. Selenium vs. Puppeteer

Feature	Playwright	Selenium	Puppeteer
Language support	Python, JS, C#, Java	Many (Python, Java, JS, C#, Ruby)	JavaScript/TypeScript only
Browser support	Chromium, Firefox, WebKit	Chrome, Firefox, Edge, Safari	Chrome/Chromium, Firefox
Auto-wait	Built-in	Manual waits needed	Partial
Speed	Fast	Slower	Fast
Network interception	Full support	Limited	Full support
Anti-detection	Good (with stealth plugin)	Poor (easily detected)	Good (with stealth plugin)
Setup complexity	Simple (bundled browsers)	Moderate (drivers resolved automatically since Selenium 4.6)	Simple
Community & docs	Growing fast	Largest (oldest tool)	Large (Node.js ecosystem)
Maintenance	Microsoft-backed	Selenium HQ	Google-backed
Best for	Python scraping, modern sites	Legacy projects, multi-language	Node.js projects

For Python web scraping in 2025, Playwright is usually the strongest choice: it combines speed, built-in waiting, and a good developer experience. Selenium is mainly worth considering if you are maintaining an existing codebase that already uses it.

Next Steps

1. Install Playwright: pip install playwright && playwright install chromium
2. Start with a simple script that navigates to a page and extracts text
3. Open the Network tab in DevTools to see what API calls the page makes. Try intercepting those instead of parsing HTML.
4. Add resource blocking to speed up your scripts
5. When you need to scale beyond single-browser scraping, look into running multiple browser contexts or integrating with Scrapy via scrapy-playwright
