# How to Handle Pagination in Web Scraping (5 Patterns)
Pagination is one of the first real obstacles you'll hit when scraping. A single page only shows a slice of the data — the rest is spread across dozens or hundreds of pages. You need to handle this systematically.
Here are five pagination patterns you'll encounter and how to handle each one.
## Pattern 1: URL-Based Page Numbers

The simplest and most common pattern. The page number is right in the URL.

```
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products?page=3
```
```python
import requests
from bs4 import BeautifulSoup

def scrape_numbered_pages(base_url, max_pages=50):
    all_items = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")

        items = soup.select(".product-card")
        if not items:
            break  # No more results — stop

        for item in items:
            all_items.append({
                "name": item.select_one(".name").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
            })
        print(f"Page {page}: {len(items)} items")
    return all_items

products = scrape_numbered_pages("https://example.com/products")
```
The key is detecting the last page. Options: check for an empty result set, look for a "next" button that's disabled, or parse the total page count from the page.
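A minimal sketch of those checks, assuming hypothetical markup (the `.product-card` and `a.next-page` selectors and the "Page 3 of 27" label are placeholders — inspect the real site's HTML for the actual ones):

```python
import re
from bs4 import BeautifulSoup

def is_last_page(html):
    """Heuristic last-page check: empty results or a disabled next button."""
    soup = BeautifulSoup(html, "html.parser")
    # Option 1: no result cards at all
    if not soup.select(".product-card"):
        return True
    # Option 2: a "next" button that is present but disabled
    next_btn = soup.select_one("a.next-page")
    if next_btn and "disabled" in next_btn.get("class", []):
        return True
    return False

def total_pages(html):
    """Option 3: parse a 'Page 3 of 27' style label into a page count."""
    text = BeautifulSoup(html, "html.parser").get_text()
    match = re.search(r"of\s+(\d+)", text)
    return int(match.group(1)) if match else None
```

Pick whichever check the site actually supports; combining two of them makes the stop condition more robust.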
## Pattern 2: Next Button Following

Some sites don't use predictable URL patterns. Instead, each page has a "Next" link pointing to the next page. You follow the chain.
```python
import requests
from bs4 import BeautifulSoup

def scrape_with_next_button(start_url):
    all_items = []
    url = start_url
    while url:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")

        # Extract items from current page
        for item in soup.select(".product-card"):
            all_items.append({
                "name": item.select_one(".name").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
            })

        # Find the next page link
        next_link = soup.select_one("a.next-page")
        if next_link and next_link.get("href"):
            url = next_link["href"]
            # Handle relative URLs
            if url.startswith("/"):
                url = "https://example.com" + url
        else:
            url = None  # No next button — we're done
        print(f"Scraped {len(all_items)} items so far")
    return all_items
```
Watch out for relative URLs. The href might be /products?page=3 instead of a full URL.
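A robust way to normalize them is `urllib.parse.urljoin` from the standard library, which resolves an href against the URL of the page you just fetched — no hard-coded domain needed:

```python
from urllib.parse import urljoin

current_url = "https://example.com/products?page=2"

# Absolute paths and full URLs both resolve correctly
print(urljoin(current_url, "/products?page=3"))
# https://example.com/products?page=3
print(urljoin(current_url, "https://other.example.com/a"))
# https://other.example.com/a
```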
## Pattern 3: Infinite Scroll with Playwright

No page numbers, no next button — content loads as you scroll down. Instagram, Twitter, and many e-commerce sites use this.
```python
from playwright.sync_api import sync_playwright

def scrape_infinite_scroll(url, item_selector, max_items=200):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(item_selector)

        seen_count = 0
        no_change_count = 0
        while True:
            # Scroll to bottom
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(2000)

            current_count = page.evaluate(
                f"document.querySelectorAll('{item_selector}').length"
            )
            if current_count >= max_items:
                break
            if current_count == seen_count:
                no_change_count += 1
                if no_change_count >= 3:
                    break  # No new content after 3 attempts
            else:
                no_change_count = 0
                seen_count = current_count

        # Extract all items
        items = page.query_selector_all(item_selector)
        results = [item.inner_text() for item in items]
        browser.close()
        return results
```
The no_change_count check is important. Without it, your scraper will scroll forever on the last page.
## Pattern 4: Cursor-Based API Pagination

Modern APIs often use cursor-based pagination instead of page numbers. Each response includes a cursor token for the next batch.
```python
import requests

def scrape_cursor_api(api_url, max_items=500):
    all_items = []
    cursor = None
    while True:
        params = {"limit": 50}
        if cursor:
            params["cursor"] = cursor

        response = requests.get(api_url, params=params, timeout=10)
        data = response.json()

        items = data.get("results", [])
        all_items.extend(items)

        # Get the next cursor
        cursor = data.get("next_cursor")
        if not cursor or len(all_items) >= max_items:
            break
        print(f"Fetched {len(all_items)} items, next cursor: {cursor[:20]}...")
    return all_items

# Usage
products = scrape_cursor_api("https://api.example.com/v1/products")
```
To find cursor-based APIs, open Chrome DevTools, go to the Network tab, and watch the XHR requests as you interact with the page. You'll often see `cursor`, `after`, or `next_token` parameters.
## Pattern 5: Sitemap Crawling

If you need every page on a site, start with the sitemap. Most sites have one at `/sitemap.xml`.
```python
import requests
from bs4 import BeautifulSoup

def get_urls_from_sitemap(sitemap_url):
    response = requests.get(sitemap_url, timeout=10)
    soup = BeautifulSoup(response.text, "lxml-xml")

    # Check for sitemap index (links to other sitemaps)
    sitemap_tags = soup.find_all("sitemap")
    if sitemap_tags:
        all_urls = []
        for sitemap in sitemap_tags:
            loc = sitemap.find("loc").text
            all_urls.extend(get_urls_from_sitemap(loc))  # recursive
        return all_urls

    # Regular sitemap — extract URLs
    urls = [loc.text for loc in soup.find_all("loc")]
    return urls

# Get all product URLs
all_urls = get_urls_from_sitemap("https://example.com/sitemap.xml")
product_urls = [u for u in all_urls if "/products/" in u]
print(f"Found {len(product_urls)} product URLs")
```
Sitemaps give you the full list of pages upfront. No guessing about page numbers or next buttons.
## Detecting Pagination Type

Not sure which pattern a site uses? Here's a quick guide:
| Sign | Pagination Type |
|---|---|
| `?page=2` or `/page/2` in URL | URL-based numbers |
| "Next" or ">" button in HTML | Next button |
| Content loads on scroll | Infinite scroll |
| API returns `cursor` or `next_token` | Cursor-based |
| `/sitemap.xml` exists | Sitemap available |
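The first and last rows can be checked programmatically. Here's a probe sketch — it's network-dependent, and sites may soft-404 or redirect, so treat the results as hints rather than proof:

```python
import requests
from urllib.parse import urlsplit

def site_root(url):
    """Reduce any page URL to its scheme://host root."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

def probe_pagination(base_url):
    """Probe a site for pagination hints."""
    hints = {}
    try:
        r1 = requests.get(base_url, timeout=10)
        r2 = requests.get(base_url, params={"page": 2}, timeout=10)
        # A successful, distinct ?page=2 suggests URL-based numbering
        hints["url_numbered"] = r2.ok and r2.text != r1.text
    except requests.RequestException:
        hints["url_numbered"] = False
    try:
        r = requests.get(f"{site_root(base_url)}/sitemap.xml", timeout=10)
        hints["sitemap"] = r.ok and ("<urlset" in r.text or "<sitemapindex" in r.text)
    except requests.RequestException:
        hints["sitemap"] = False
    return hints
```

The other rows (next buttons, infinite scroll, cursor APIs) still need a quick look at the HTML or the Network tab.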
## Handling Edge Cases

### Duplicate Detection
When scraping across pages, duplicates can creep in — especially with infinite scroll or if the site reorders content.
```python
seen_ids = set()
unique_items = []
for item in all_scraped_items:
    item_id = item.get("id") or item.get("url")
    if item_id not in seen_ids:
        seen_ids.add(item_id)
        unique_items.append(item)
```
### Last Page Detection
Don't rely only on empty results. Some sites return the last page repeatedly instead of an empty page.
```python
previous_items = None
for page in range(1, 100):
    items = scrape_page(page)
    if items == previous_items:
        break  # Same content as last page — we've looped
    previous_items = items
```
### Rate Limiting Between Pages
Always add delays between pagination requests. A one-second delay is usually enough to avoid getting blocked.
```python
import time

for page in range(1, total_pages + 1):
    scrape_page(page)
    time.sleep(1)  # Be respectful
```
## What's Next
Pagination is fundamental, but it's just one piece. You'll also need to handle anti-bot detection, manage proxies for large-scale scraping, and clean the data once you've collected it.
The Master Web Scraping course covers all five patterns with real-world projects where you scrape actual websites.