What Is API Scraping? Extracting Data from Hidden APIs
API scraping is the technique of identifying the internal APIs a website uses to load its data, then calling those APIs directly instead of parsing HTML. It is usually faster and more reliable than HTML parsing, and it returns structured data, typically JSON.
Why API Scraping Is Superior
Most modern websites fetch data from backend APIs using JavaScript. Instead of rendering the page and parsing HTML, you can call these APIs directly:
| Approach | Speed | Reliability | Data Format |
|---|---|---|---|
| HTML scraping | Slow | Breaks often | HTML (must be parsed) |
| API scraping | Fast | More stable | JSON (structured) |
| Headless browser | Slowest | Most fragile | Rendered HTML (must be parsed) |
How to Find Hidden APIs
1. Open DevTools (F12) → Network tab
2. Filter by XHR/Fetch requests
3. Browse the site normally and watch for API calls
4. Click on requests to see the URL, headers, and response
Found: GET https://api.example.com/v1/products?category=electronics&page=1
Response: {"products": [{"name": "Widget", "price": 9.99}, ...], "total": 250}
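When a page makes many requests, it can help to export the Network tab as a HAR file and filter it in code rather than clicking through entries one by one. The sketch below assumes the standard HAR layout (`log.entries[].request` and `.response.content.mimeType`); the function name `find_json_endpoints` and the hand-written `sample_har` fragment are illustrative, not taken from any real site.

```python
from urllib.parse import urlsplit, parse_qsl


def find_json_endpoints(har: dict) -> list[dict]:
    """Return one summary dict per HAR entry whose response is JSON."""
    found = []
    for entry in har["log"]["entries"]:
        mime = entry["response"]["content"].get("mimeType", "")
        if "json" not in mime.lower():
            continue  # skip HTML, images, scripts, and so on
        url = urlsplit(entry["request"]["url"])
        found.append({
            "method": entry["request"]["method"],
            "endpoint": f"{url.scheme}://{url.netloc}{url.path}",
            "params": dict(parse_qsl(url.query)),
            "status": entry["response"]["status"],
        })
    return found


# Tiny hand-written HAR fragment standing in for a real export (assumption).
sample_har = {"log": {"entries": [
    {"request": {"method": "GET", "url": "https://example.com/products"},
     "response": {"status": 200, "content": {"mimeType": "text/html"}}},
    {"request": {"method": "GET",
                 "url": "https://api.example.com/v1/products?category=electronics&page=1"},
     "response": {"status": 200, "content": {"mimeType": "application/json"}}},
]}}

for hit in find_json_endpoints(sample_har):
    print(hit["method"], hit["endpoint"], hit["params"])
# GET https://api.example.com/v1/products {'category': 'electronics', 'page': '1'}
```

For a real export you would load the file with `json.load` and pass the resulting dict to the same function. Splitting the URL into endpoint and parameters makes it easier to see which values are worth varying.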
Using the API Directly
import requests
headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "application/json",
    "Authorization": "Bearer eyJ...",  # if required
}
response = requests.get(
    "https://api.example.com/v1/products",
    params={"category": "electronics", "page": 1},
    headers=headers,
    timeout=10,  # avoid hanging on a stalled connection
)
response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body
data = response.json()
for product in data["products"]:
    print(f"{product['name']}: ${product['price']}")
Common API Patterns
- REST APIs: Standard endpoints with query parameters
- GraphQL: Single endpoint, query language in the POST body
- Paginated responses: `page`, `offset`, `cursor` parameters
- Authentication: Bearer tokens, API keys, session cookies
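As an illustration of the pagination pattern, here is a minimal sketch that walks a `page` parameter until the number of items seen reaches the `total` field from the sample response earlier. It assumes the API is 1-indexed and returns an empty `products` list past the last page; real APIs vary, and cursor- or offset-based ones need a different loop. `iter_products`, `fetch_with_requests`, and `fake_fetch` are hypothetical names.

```python
from typing import Callable, Iterator


def iter_products(fetch_page: Callable[[int], dict], max_pages: int = 100) -> Iterator[dict]:
    """Yield products page by page until `total` items have been seen."""
    seen = 0
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        items = data.get("products", [])
        if not items:
            break  # an empty page means we ran past the end
        yield from items
        seen += len(items)
        if seen >= data.get("total", float("inf")):
            break


def fetch_with_requests(page: int) -> dict:
    """Hypothetical fetcher for the example endpoint; needs `requests` installed."""
    import requests
    resp = requests.get(
        "https://api.example.com/v1/products",
        params={"category": "electronics", "page": page},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


def fake_fetch(page: int) -> dict:
    """Offline stand-in so the loop can be tried without network access."""
    catalog = [{"name": f"Widget {i}", "price": 9.99} for i in range(5)]
    start = (page - 1) * 2  # pretend the API serves 2 items per page
    return {"products": catalog[start:start + 2], "total": len(catalog)}


products = list(iter_products(fake_fetch))
print(len(products))  # 5
```

Passing the fetcher in as a callable keeps the paging logic separate from the HTTP details, so swapping `fake_fetch` for `fetch_with_requests` is the only change needed against a live endpoint. The `max_pages` cap guards against an API that never signals the end.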
Challenges
- APIs may require authentication tokens that expire
- Some APIs are rate-limited more aggressively than web pages
- API endpoints can change without notice
- Some sites encrypt or obfuscate API payloads
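The first two challenges can often be handled with a small retry wrapper. The sketch below assumes the API signals an expired token with HTTP 401 and rate limiting with HTTP 429, optionally accompanied by a `Retry-After` header in seconds; those are common conventions, not guarantees for any particular site. `call_with_retries`, `send`, and `get_token` are hypothetical names, and `send` is expected to return a `(status, headers, body)` tuple.

```python
import time
from typing import Callable


def call_with_retries(
    send: Callable[[str], tuple[int, dict, dict]],
    get_token: Callable[[], str],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `send(token)`; refresh the token on 401 and back off on 429."""
    token = get_token()
    for attempt in range(max_attempts):
        status, headers, body = send(token)
        if status == 200:
            return body
        if status == 401:
            token = get_token()  # token likely expired: fetch a new one
        elif status == 429:
            # Honour Retry-After when given in whole seconds, else back off exponentially.
            retry_after = headers.get("Retry-After", "")
            delay = float(retry_after) if retry_after.isdigit() else 2.0 ** attempt
            sleep(delay)
        else:
            raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError(f"gave up after {max_attempts} attempts")


# Scripted stand-in for a real HTTP call: rate limit, expired token, then success.
scripted = iter([
    (429, {"Retry-After": "1"}, {}),
    (401, {}, {}),
    (200, {}, {"products": [], "total": 0}),
])
tokens = iter(["old-token", "new-token"])
waits = []

result = call_with_retries(
    send=lambda token: next(scripted),
    get_token=lambda: next(tokens),
    sleep=waits.append,  # record delays instead of actually sleeping
)
print(result, waits)  # {'products': [], 'total': 0} [1.0]
```

In real use, `send` would wrap a `requests.get` call and return `(resp.status_code, resp.headers, resp.json())`, and `get_token` would repeat whatever login or token request the site's own frontend makes.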
Pro Tip: Playwright Network Interception
When APIs are hard to call directly, use Playwright to capture them:
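from playwright.sync_api import sync_playwright
responses = []
with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    # Collect every response whose URL contains "/api/"
    page.on("response", lambda r: responses.append(r) if "/api/" in r.url else None)
    page.goto("https://example.com/products")
    # responses now holds the matching API responses captured during page load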
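Once the responses are captured, the JSON bodies still have to be read out. The helper below is a rough sketch that relies only on the `url`, `status`, `headers`, and `json()` members of Playwright's sync `Response` objects; `collect_json` and the `FakeResponse` stub are hypothetical, and the stub exists only so the helper can be tried without launching a browser. With real Playwright responses, call it while the page is still open, since bodies may not be readable after the browser closes.

```python
def collect_json(responses) -> list[dict]:
    """Parse the JSON body of each successful captured response."""
    payloads = []
    for r in responses:
        content_type = r.headers.get("content-type", "")
        if r.status != 200 or "json" not in content_type:
            continue
        try:
            payloads.append({"url": r.url, "data": r.json()})
        except Exception:
            # Body unavailable or not valid JSON; skip rather than abort the run.
            continue
    return payloads


class FakeResponse:
    """Minimal stand-in mimicking the attributes used above (assumption)."""

    def __init__(self, url, status, content_type, body):
        self.url, self.status = url, status
        self.headers = {"content-type": content_type}
        self._body = body

    def json(self):
        return self._body


fakes = [
    FakeResponse("https://example.com/api/products", 200, "application/json",
                 {"products": [{"name": "Widget", "price": 9.99}], "total": 250}),
    FakeResponse("https://example.com/api/banner", 200, "text/html", None),
    FakeResponse("https://example.com/api/cart", 401, "application/json", {}),
]
payloads = collect_json(fakes)
print([p["url"] for p in payloads])  # ['https://example.com/api/products']
```

The captured URLs and payloads also show which endpoint and parameters to target if you later switch to calling the API directly.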