What Is Playwright? Browser Automation for Web Scraping
Playwright is a browser automation framework developed by Microsoft that controls real browsers (Chromium, Firefox, WebKit) programmatically. For web scraping, it's used to extract data from JavaScript-heavy websites that don't render content in the initial HTML.
Why Browser Automation for Scraping?
Many modern websites are Single Page Applications (SPAs) built with React, Vue, or Angular. When you fetch these pages with requests, you get an empty HTML shell. The actual data loads after JavaScript executes. Playwright solves this by running a real browser engine that processes JavaScript, renders the DOM, and gives you access to the fully loaded page.
Beyond JavaScript rendering, browser automation lets you interact with pages the way a human does: clicking buttons, filling forms, scrolling to load more content, and handling login flows. If the data you need requires any of these interactions, Playwright is your tool.
Installation
Playwright requires both the Python package and browser binaries:
# Install the Python package
pip install playwright
# Download browser binaries (Chromium, Firefox, WebKit)
playwright install
# Or install only Chromium (smaller download, most common for scraping)
playwright install chromium
The browser download is a one-time step. It installs actual browser binaries (not just drivers like Selenium). This is why Playwright is more reliable: it bundles a specific browser version that is guaranteed to work with the library version.
Sync vs. Async API
Playwright offers both synchronous and asynchronous APIs. The sync API is simpler and fine for most scraping scripts. The async API is better when you need to scrape multiple pages concurrently.
Synchronous API
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://example.com/products")
page.wait_for_selector(".product-card")
products = page.query_selector_all(".product-card")
for product in products:
name = product.query_selector(".title").inner_text()
price = product.query_selector(".price").inner_text()
print(f"{name}: {price}")
browser.close()
Asynchronous API
import asyncio
from playwright.async_api import async_playwright
async def scrape():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto("https://example.com/products")
await page.wait_for_selector(".product-card")
products = await page.query_selector_all(".product-card")
for product in products:
name = await (await product.query_selector(".title")).inner_text()
price = await (await product.query_selector(".price")).inner_text()
print(f"{name}: {price}")
await browser.close()
asyncio.run(scrape())
Use the sync API for simple scripts and the async API when you need to manage multiple browser contexts or pages simultaneously.
Core Scraping Patterns
goto, wait, and extract
The fundamental pattern: navigate to a page, wait for content to load, then extract data.
page.goto("https://example.com", wait_until="networkidle")
# wait_until options:
# "load" - wait for load event (default)
# "domcontentloaded" - wait for DOMContentLoaded
# "networkidle" - wait until no network requests for 500ms (most reliable for SPAs)
# Wait for a specific element
page.wait_for_selector(".product-card", timeout=10000)
# Extract from all matching elements
cards = page.query_selector_all(".product-card")
data = []
for card in cards:
data.append({
"name": card.query_selector(".title").inner_text(),
"price": card.query_selector(".price").inner_text(),
"link": card.get_attribute("href"),
})
Using evaluate for Complex Extraction
When query_selector is not enough, run JavaScript directly in the page context:
# Extract data using JavaScript
data = page.evaluate("""
() => {
return Array.from(document.querySelectorAll('.product-card')).map(card => ({
name: card.querySelector('.title')?.textContent?.trim(),
price: card.querySelector('.price')?.textContent?.trim(),
inStock: !card.classList.contains('out-of-stock')
}));
}
""")
Network Interception
This is one of Playwright's most powerful scraping features. Many SPAs fetch data from internal APIs via XHR or fetch requests. Instead of parsing the rendered HTML, you can intercept these API responses directly and get clean JSON.
from playwright.sync_api import sync_playwright
import json
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page()
# Capture API responses
api_data = []
def handle_response(response):
if "/api/products" in response.url and response.status == 200:
data = response.json()
api_data.extend(data["items"])
page.on("response", handle_response)
page.goto("https://example.com/products")
page.wait_for_load_state("networkidle")
# api_data now contains the raw JSON from the API
print(f"Captured {len(api_data)} products from API")
for item in api_data:
print(f"{item['name']}: {item['price']}")
browser.close()
This approach is often better than HTML parsing because:
- •API responses are structured JSON (no messy HTML to parse)
- •Data might include fields not shown on the page
- •It is more resilient to UI changes
Handling Infinite Scroll and Lazy Loading
Many modern sites use infinite scroll instead of pagination. You need to scroll down to trigger loading of more content.
def scroll_to_bottom(page, max_scrolls=20, pause=1.5):
"""Scroll down until no new content loads."""
previous_height = 0
for i in range(max_scrolls):
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(int(pause * 1000))
current_height = page.evaluate("document.body.scrollHeight")
if current_height == previous_height:
break # No new content loaded
previous_height = current_height
print(f"Scroll {i+1}: page height = {current_height}")
# Usage
page.goto("https://example.com/feed")
scroll_to_bottom(page, max_scrolls=10)
items = page.query_selector_all(".feed-item")
Taking Screenshots for Debugging
Screenshots are invaluable when your scraper does not find the expected elements.
# Full page screenshot
page.screenshot(path="debug_full.png", full_page=True)
# Screenshot of a specific element
element = page.query_selector(".product-grid")
element.screenshot(path="debug_grid.png")
# Screenshot on error
try:
page.wait_for_selector(".product-card", timeout=5000)
except:
page.screenshot(path="error_state.png")
print("Element not found - check error_state.png")
Managing Browser Context and Cookies
Browser contexts let you run isolated sessions within a single browser instance. Each context has its own cookies, localStorage, and cache.
browser = p.chromium.launch()
# Create a context with custom settings
context = browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
locale="en-US",
)
# Login and save cookies for reuse
page = context.new_page()
page.goto("https://example.com/login")
page.fill("#email", "user@example.com")
page.fill("#password", "password123")
page.click("button[type=submit]")
page.wait_for_url("**/dashboard")
# Save authentication state
context.storage_state(path="auth.json")
browser.close()
# Later: reuse the saved session
context = browser.new_context(storage_state="auth.json")
page = context.new_page()
page.goto("https://example.com/dashboard") # Already logged in
Stealth Mode and Anti-Detection
Headless browsers leave fingerprints that anti-bot systems detect. Common tells include the navigator.webdriver property, missing browser plugins, and specific HTTP header patterns.
# Basic stealth: launch with headed mode (slower but harder to detect)
browser = p.chromium.launch(headless=False)
# Use playwright-stealth for better evasion
# pip install playwright-stealth
from playwright_stealth import stealth_sync
page = browser.new_page()
stealth_sync(page) # Patches common detection vectors
page.goto("https://example.com")
Additional anti-detection tips:
- •Set a realistic viewport size (not the default 800x600)
- •Set a real user agent string
- •Add random delays between actions
- •Move the mouse and scroll naturally before extracting data
- •Use residential proxies for the browser connection
Performance Optimization
Browser automation is inherently slower than HTTP requests. These techniques help:
# Block images, CSS, and fonts to speed up page loads
def block_resources(route):
if route.request.resource_type in ["image", "stylesheet", "font", "media"]:
route.abort()
else:
route.continue_()
page.route("**/*", block_resources)
# Or block specific domains (ad networks, analytics)
page.route("/*.google-analytics.com/", lambda route: route.abort())
page.route("/*.doubleclick.net/", lambda route: route.abort())
# Reuse browser instances across pages (don't launch/close per page)
browser = p.chromium.launch()
page = browser.new_page()
for url in urls:
page.goto(url, wait_until="domcontentloaded")
# extract data...
# No need to create a new page each time
browser.close()
Real-World Example: Scraping a JS-Heavy SPA
This complete example scrapes product data from a React-based e-commerce site, handling dynamic loading and pagination:
from playwright.sync_api import sync_playwright
import json
import time
def scrape_spa_products(base_url, max_pages=10):
products = []
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
page = context.new_page()
# Block unnecessary resources
page.route("**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}",
lambda route: route.abort())
for page_num in range(1, max_pages + 1):
url = f"{base_url}?page={page_num}"
page.goto(url, wait_until="networkidle")
# Wait for product cards to render
try:
page.wait_for_selector(".product-card", timeout=8000)
except:
print(f"No products on page {page_num}, stopping")
break
# Extract using JavaScript for speed
page_products = page.evaluate("""
() => Array.from(document.querySelectorAll('.product-card')).map(card => ({
name: card.querySelector('.name')?.innerText,
price: card.querySelector('.price')?.innerText,
rating: card.querySelector('.stars')?.getAttribute('data-rating'),
url: card.querySelector('a')?.href,
}))
""")
products.extend(page_products)
print(f"Page {page_num}: {len(page_products)} products")
time.sleep(1) # Respectful delay
browser.close()
# Save results
with open("products.json", "w") as f:
json.dump(products, f, indent=2)
return products
results = scrape_spa_products("https://example.com/shop")
print(f"Total: {len(results)} products scraped")
Playwright vs. Selenium vs. Puppeteer
| Feature | Playwright | Selenium | Puppeteer |
|---|---|---|---|
| Language support | Python, JS, C#, Java | Many (Python, Java, JS, C#, Ruby) | JavaScript/TypeScript only |
| Browser support | Chromium, Firefox, WebKit | Chrome, Firefox, Edge, Safari | Chromium only |
| Auto-wait | Built-in | Manual waits needed | Partial |
| Speed | Fast | Slower | Fast |
| Network interception | Full support | Limited | Full support |
| Anti-detection | Good (with stealth plugin) | Poor (easily detected) | Good (with stealth plugin) |
| Setup complexity | Simple (bundled browsers) | Complex (separate drivers) | Simple |
| Community & docs | Growing fast | Largest (oldest tool) | Large (Node.js ecosystem) |
| Maintenance | Microsoft-backed | Selenium HQ | Google-backed |
| Best for | Python scraping, modern sites | Legacy projects, multi-language | Node.js projects |
Next Steps
- 1.Install Playwright:
pip install playwright && playwright install chromium - 2.Start with a simple script that navigates to a page and extracts text
- 3.Open the Network tab in DevTools to see what API calls the page makes. Try intercepting those instead of parsing HTML.
- 4.Add resource blocking to speed up your scripts
- 5.When you need to scale beyond single-browser scraping, look into running multiple browser contexts or integrating with Scrapy via scrapy-playwright