Scraping Paginated Websites: Complete Guide
Pagination is a web design pattern that splits large sets of content across multiple pages. When scraping, handling pagination means automatically navigating through all pages to collect the complete dataset.
Types of Pagination
1. URL-Based Pagination
The simplest type: page numbers or offsets appear directly in the URL.

```
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products?offset=0&limit=20
https://example.com/products?offset=20&limit=20
```

```python
import requests
from bs4 import BeautifulSoup

all_products = []
for page in range(1, 50):
    response = requests.get(f"https://example.com/products?page={page}")
    soup = BeautifulSoup(response.text, "lxml")
    products = soup.select(".product-card")
    if not products:
        break  # no more pages
    all_products.extend(products)
```
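The offset/limit URLs above follow the same pattern, except the loop advances by the page size and stops on a short batch. A minimal sketch, where `fetch_batch` is a hypothetical stand-in for the request-and-parse step:

```python
# Offset/limit pagination: advance the offset by the page size each
# iteration and stop when a short (or empty) batch comes back.
def scrape_offsets(fetch_batch, limit=20):
    items = []
    offset = 0
    while True:
        batch = fetch_batch(offset, limit)  # e.g. GET ...?offset={offset}&limit={limit}
        items.extend(batch)
        if len(batch) < limit:  # a short batch means this was the last page
            break
        offset += limit
    return items

# Simulated backend with 45 items: two full pages of 20, then a short page of 5.
DATA = list(range(45))
result = scrape_offsets(lambda off, lim: DATA[off:off + lim])
print(len(result))  # 45
```

Stopping on a short batch (rather than only an empty one) saves one wasted request per run, but only works if the API reliably fills every page except the last.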
2. Next-Button Pagination
Follow "Next" links until there are none. Note that the `href` may be relative, so resolve it against the current page's URL:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/products"
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    # Extract data...
    next_link = soup.select_one("a.next-page")
    # urljoin handles both absolute and relative hrefs
    url = urljoin(url, next_link["href"]) if next_link else None
```
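How relative "Next" hrefs resolve is worth seeing in isolation; `urllib.parse.urljoin` handles path-absolute, query-only, and fully absolute links the same way a browser would:

```python
from urllib.parse import urljoin

current = "https://example.com/products?page=2"

# A path-absolute href resolves against the site root:
print(urljoin(current, "/products?page=3"))  # https://example.com/products?page=3

# A query-only href keeps the current path and swaps the query string:
print(urljoin(current, "?page=3"))           # https://example.com/products?page=3

# An already-absolute href passes through unchanged:
print(urljoin(current, "https://example.com/products?page=3"))
```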
3. Infinite Scroll
Content loads as you scroll down, so it requires JavaScript rendering or API interception.

```python
# Usually these sites use an API endpoint.
# Check the Network tab in DevTools for the XHR/fetch calls.
import requests

page = 1
while True:
    response = requests.get(f"https://api.example.com/products?page={page}")
    data = response.json()
    if not data["results"]:
        break
    process(data["results"])
    page += 1
```
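Not every infinite-scroll API takes a page number; many chain pages with a cursor (a "next token" in each response). A minimal sketch of that variant, where `fetch_page` and the `next_cursor`/`results` field names are assumptions about a hypothetical API:

```python
def scrape_cursor(fetch_page):
    """Follow a cursor-style API until a response carries no next cursor."""
    items, cursor = [], None
    while True:
        payload = fetch_page(cursor)  # e.g. GET /products?cursor={cursor}
        items.extend(payload["results"])
        cursor = payload.get("next_cursor")
        if cursor is None:
            break
    return items

# Simulated API: three pages chained together by cursors.
PAGES = {
    None: {"results": ["a", "b"], "next_cursor": "c1"},
    "c1": {"results": ["c"], "next_cursor": "c2"},
    "c2": {"results": ["d"]},  # no next_cursor: last page
}
print(scrape_cursor(PAGES.__getitem__))  # ['a', 'b', 'c', 'd']
```

The cursor value is opaque to the scraper; the only contract is "pass back what the last response gave you," which is exactly what the Network tab will show the page itself doing.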
4. Load More Button
Similar to infinite scroll, but triggered by a button click. The same API-interception approach works.
Common Pagination Pitfalls
- Missing last pages: always verify you've actually reached the end, not just hit a failed request
- Duplicate data: some sites return the last page repeatedly; check for duplicates
- Rate limiting: paginated scraping means many requests; add delays between them
- Changing data: if the site updates while you scrape, you might miss or duplicate items
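The duplicate and rate-limit pitfalls above can both be handled inside the paging loop. A minimal sketch, where `fetch_page`, the per-request delay, and the `"id"` field used for deduplication are all assumptions:

```python
import time

def scrape_all(fetch_page, delay=1.0, max_pages=1000):
    """Page through results, skipping duplicates and pausing between requests."""
    seen, items = set(), []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        new = [item for item in batch if item["id"] not in seen]
        if not new:  # page repeated verbatim: we are likely past the real end
            break
        seen.update(item["id"] for item in new)
        items.extend(new)
        time.sleep(delay)  # be polite: throttle between requests
    return items

# Simulated site that serves its last page over and over.
PAGES = [[{"id": 1}, {"id": 2}], [{"id": 3}], [{"id": 3}], [{"id": 3}]]
result = scrape_all(lambda p: PAGES[p - 1] if p <= len(PAGES) else [], delay=0)
print([item["id"] for item in result])  # [1, 2, 3]
```

The `max_pages` cap is a cheap safety net against the "changing data" pitfall too: even if the site keeps producing fresh-looking pages, the run terminates.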