Skip to main content

What Is Playwright? Browser Automation for Web Scraping

intermediate

Playwright is a browser automation framework developed by Microsoft that controls real browsers (Chromium, Firefox, WebKit) programmatically. For web scraping, it's used to extract data from JavaScript-heavy websites that don't render content in the initial HTML.

Why Browser Automation for Scraping?

Many modern websites are Single Page Applications (SPAs) built with React, Vue, or Angular. When you fetch these pages with requests, you get an empty HTML shell. The actual data loads after JavaScript executes. Playwright solves this by running a real browser engine that processes JavaScript, renders the DOM, and gives you access to the fully loaded page.

Beyond JavaScript rendering, browser automation lets you interact with pages the way a human does: clicking buttons, filling forms, scrolling to load more content, and handling login flows. If the data you need requires any of these interactions, Playwright is your tool.

Installation

Playwright requires both the Python package and browser binaries:

python
# Install the Python package
pip install playwright

# Download browser binaries (Chromium, Firefox, WebKit) playwright install

# Or install only Chromium (smaller download, most common for scraping) playwright install chromium

The browser download is a one-time step. It installs actual browser binaries (not just drivers like Selenium). This is why Playwright is more reliable: it bundles a specific browser version that is guaranteed to work with the library version.

Sync vs. Async API

Playwright offers both synchronous and asynchronous APIs. The sync API is simpler and fine for most scraping scripts. The async API is better when you need to scrape multiple pages concurrently.

Synchronous API

python
from playwright.sync_api import sync_playwright

with sync_playwright() as p: browser = p.chromium.launch(headless=True) page = browser.new_page() page.goto("https://example.com/products") page.wait_for_selector(".product-card")

products = page.query_selector_all(".product-card") for product in products: name = product.query_selector(".title").inner_text() price = product.query_selector(".price").inner_text() print(f"{name}: {price}")

browser.close()

Asynchronous API

python
import asyncio
from playwright.async_api import async_playwright

async def scrape(): async with async_playwright() as p: browser = await p.chromium.launch(headless=True) page = await browser.new_page() await page.goto("https://example.com/products") await page.wait_for_selector(".product-card")

products = await page.query_selector_all(".product-card") for product in products: name = await (await product.query_selector(".title")).inner_text() price = await (await product.query_selector(".price")).inner_text() print(f"{name}: {price}")

await browser.close()

asyncio.run(scrape())

Use the sync API for simple scripts and the async API when you need to manage multiple browser contexts or pages simultaneously.

Core Scraping Patterns

goto, wait, and extract

The fundamental pattern: navigate to a page, wait for content to load, then extract data.

python
page.goto("https://example.com", wait_until="networkidle")
# wait_until options:
#   "load" - wait for load event (default)
#   "domcontentloaded" - wait for DOMContentLoaded
#   "networkidle" - wait until no network requests for 500ms (most reliable for SPAs)

# Wait for a specific element page.wait_for_selector(".product-card", timeout=10000)

# Extract from all matching elements cards = page.query_selector_all(".product-card") data = [] for card in cards: data.append({ "name": card.query_selector(".title").inner_text(), "price": card.query_selector(".price").inner_text(), "link": card.get_attribute("href"), })

Using evaluate for Complex Extraction

When query_selector is not enough, run JavaScript directly in the page context:

python
# Extract data using JavaScript
data = page.evaluate("""
    () => {
        return Array.from(document.querySelectorAll('.product-card')).map(card => ({
            name: card.querySelector('.title')?.textContent?.trim(),
            price: card.querySelector('.price')?.textContent?.trim(),
            inStock: !card.classList.contains('out-of-stock')
        }));
    }
""")

Network Interception

This is one of Playwright's most powerful scraping features. Many SPAs fetch data from internal APIs via XHR or fetch requests. Instead of parsing the rendered HTML, you can intercept these API responses directly and get clean JSON.

python
from playwright.sync_api import sync_playwright
import json

with sync_playwright() as p: browser = p.chromium.launch() page = browser.new_page()

# Capture API responses api_data = []

def handle_response(response): if "/api/products" in response.url and response.status == 200: data = response.json() api_data.extend(data["items"])

page.on("response", handle_response)

page.goto("https://example.com/products") page.wait_for_load_state("networkidle")

# api_data now contains the raw JSON from the API print(f"Captured {len(api_data)} products from API") for item in api_data: print(f"{item['name']}: {item['price']}")

browser.close()

This approach is often better than HTML parsing because:

  • API responses are structured JSON (no messy HTML to parse)
  • Data might include fields not shown on the page
  • It is more resilient to UI changes

Handling Infinite Scroll and Lazy Loading

Many modern sites use infinite scroll instead of pagination. You need to scroll down to trigger loading of more content.

python
def scroll_to_bottom(page, max_scrolls=20, pause=1.5):
    """Scroll down until no new content loads."""
    previous_height = 0
    for i in range(max_scrolls):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(int(pause * 1000))

current_height = page.evaluate("document.body.scrollHeight") if current_height == previous_height: break # No new content loaded previous_height = current_height print(f"Scroll {i+1}: page height = {current_height}")

# Usage page.goto("https://example.com/feed") scroll_to_bottom(page, max_scrolls=10) items = page.query_selector_all(".feed-item")

Taking Screenshots for Debugging

Screenshots are invaluable when your scraper does not find the expected elements.

python
# Full page screenshot
page.screenshot(path="debug_full.png", full_page=True)

# Screenshot of a specific element element = page.query_selector(".product-grid") element.screenshot(path="debug_grid.png")

# Screenshot on error try: page.wait_for_selector(".product-card", timeout=5000) except: page.screenshot(path="error_state.png") print("Element not found - check error_state.png")

Managing Browser Context and Cookies

Browser contexts let you run isolated sessions within a single browser instance. Each context has its own cookies, localStorage, and cache.

python
browser = p.chromium.launch()

# Create a context with custom settings context = browser.new_context( viewport={"width": 1920, "height": 1080}, user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...", locale="en-US", )

# Login and save cookies for reuse page = context.new_page() page.goto("https://example.com/login") page.fill("#email", "user@example.com") page.fill("#password", "password123") page.click("button[type=submit]") page.wait_for_url("**/dashboard")

# Save authentication state context.storage_state(path="auth.json") browser.close()

# Later: reuse the saved session context = browser.new_context(storage_state="auth.json") page = context.new_page() page.goto("https://example.com/dashboard") # Already logged in

Stealth Mode and Anti-Detection

Headless browsers leave fingerprints that anti-bot systems detect. Common tells include the navigator.webdriver property, missing browser plugins, and specific HTTP header patterns.

python
# Basic stealth: launch with headed mode (slower but harder to detect)
browser = p.chromium.launch(headless=False)

# Use playwright-stealth for better evasion # pip install playwright-stealth from playwright_stealth import stealth_sync

page = browser.new_page() stealth_sync(page) # Patches common detection vectors

page.goto("https://example.com")

Additional anti-detection tips:

  • Set a realistic viewport size (not the default 800x600)
  • Set a real user agent string
  • Add random delays between actions
  • Move the mouse and scroll naturally before extracting data
  • Use residential proxies for the browser connection

Performance Optimization

Browser automation is inherently slower than HTTP requests. These techniques help:

python
# Block images, CSS, and fonts to speed up page loads
def block_resources(route):
    if route.request.resource_type in ["image", "stylesheet", "font", "media"]:
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_resources)

# Or block specific domains (ad networks, analytics) page.route("/*.google-analytics.com/", lambda route: route.abort()) page.route("/*.doubleclick.net/", lambda route: route.abort())

python
# Reuse browser instances across pages (don't launch/close per page)
browser = p.chromium.launch()
page = browser.new_page()

for url in urls: page.goto(url, wait_until="domcontentloaded") # extract data... # No need to create a new page each time

browser.close()

Real-World Example: Scraping a JS-Heavy SPA

This complete example scrapes product data from a React-based e-commerce site, handling dynamic loading and pagination:

python
from playwright.sync_api import sync_playwright
import json
import time

def scrape_spa_products(base_url, max_pages=10): products = []

with sync_playwright() as p: browser = p.chromium.launch(headless=True) context = browser.new_context( viewport={"width": 1920, "height": 1080}, user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" ) page = context.new_page()

# Block unnecessary resources page.route("**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}", lambda route: route.abort())

for page_num in range(1, max_pages + 1): url = f"{base_url}?page={page_num}" page.goto(url, wait_until="networkidle")

# Wait for product cards to render try: page.wait_for_selector(".product-card", timeout=8000) except: print(f"No products on page {page_num}, stopping") break

# Extract using JavaScript for speed page_products = page.evaluate(""" () => Array.from(document.querySelectorAll('.product-card')).map(card => ({ name: card.querySelector('.name')?.innerText, price: card.querySelector('.price')?.innerText, rating: card.querySelector('.stars')?.getAttribute('data-rating'), url: card.querySelector('a')?.href, })) """)

products.extend(page_products) print(f"Page {page_num}: {len(page_products)} products") time.sleep(1) # Respectful delay

browser.close()

# Save results with open("products.json", "w") as f: json.dump(products, f, indent=2)

return products

results = scrape_spa_products("https://example.com/shop") print(f"Total: {len(results)} products scraped")

Playwright vs. Selenium vs. Puppeteer

FeaturePlaywrightSeleniumPuppeteer
Language supportPython, JS, C#, JavaMany (Python, Java, JS, C#, Ruby)JavaScript/TypeScript only
Browser supportChromium, Firefox, WebKitChrome, Firefox, Edge, SafariChromium only
Auto-waitBuilt-inManual waits neededPartial
SpeedFastSlowerFast
Network interceptionFull supportLimitedFull support
Anti-detectionGood (with stealth plugin)Poor (easily detected)Good (with stealth plugin)
Setup complexitySimple (bundled browsers)Complex (separate drivers)Simple
Community & docsGrowing fastLargest (oldest tool)Large (Node.js ecosystem)
MaintenanceMicrosoft-backedSelenium HQGoogle-backed
Best forPython scraping, modern sitesLegacy projects, multi-languageNode.js projects
For Python web scraping in 2025, Playwright is the clear winner. It has the best combination of speed, reliability, and developer experience. Selenium is only worth considering if you are maintaining an existing codebase that already uses it.

Next Steps

  1. 1.Install Playwright: pip install playwright && playwright install chromium
  2. 2.Start with a simple script that navigates to a page and extracts text
  3. 3.Open the Network tab in DevTools to see what API calls the page makes. Try intercepting those instead of parsing HTML.
  4. 4.Add resource blocking to speed up your scripts
  5. 5.When you need to scale beyond single-browser scraping, look into running multiple browser contexts or integrating with Scrapy via scrapy-playwright

Learn Playwright hands-on

This glossary entry covers the basics. The Master Web Scraping course teaches you to use playwright in real projects across 16 in-depth chapters.

Get Instant Access — $19

$ need_help?

We're here for you