How to Handle Pagination in Web Scraping (5 Patterns)
13 min read · by Nabeel


Tags: python · pagination · intermediate

Pagination is one of the first real obstacles you'll hit when scraping. A single page only shows a slice of the data — the rest is spread across dozens or hundreds of pages. You need to handle this systematically.

Here are five pagination patterns you'll encounter and how to handle each one.

Pattern 1: URL-Based Page Numbers

The simplest and most common pattern. The page number is right in the URL.

```
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products?page=3
```
```python
import requests
from bs4 import BeautifulSoup

def scrape_numbered_pages(base_url, max_pages=50):
    all_items = []

    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")

        items = soup.select(".product-card")
        if not items:
            break  # No more results — stop

        for item in items:
            all_items.append({
                "name": item.select_one(".name").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
            })

        print(f"Page {page}: {len(items)} items")

    return all_items

products = scrape_numbered_pages("https://example.com/products")
```

The key is detecting the last page. Options: check for an empty result set, look for a "next" button that's disabled, or parse the total page count from the page.
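One way to parse the total page count is to take the highest page number linked in the pagination widget. A minimal sketch using a regex, assuming links carry a `?page=N` query parameter — adjust the pattern to the site's actual markup:

```python
import re

def total_pages_from_html(html):
    """Return the highest page number linked in a pagination widget.

    Assumes pagination links look like <a href="?page=7">; real markup
    varies, so inspect the site's pagination HTML first.
    """
    numbers = [int(n) for n in re.findall(r"[?&]page=(\d+)", html)]
    return max(numbers) if numbers else 1

html = '<a href="?page=1">1</a> <a href="?page=2">2</a> <a href="?page=9">9</a>'
print(total_pages_from_html(html))  # → 9
```

With the total known up front, you can loop `range(1, total + 1)` instead of probing for an empty page.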

Pattern 2: Next Button Following

Some sites don't use predictable URL patterns. Instead, each page has a "Next" link pointing to the next page. You follow the chain.

```python
import requests
from bs4 import BeautifulSoup

def scrape_with_next_button(start_url):
    all_items = []
    url = start_url

    while url:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")

        # Extract items from current page
        for item in soup.select(".product-card"):
            all_items.append({
                "name": item.select_one(".name").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
            })

        # Find the next page link
        next_link = soup.select_one("a.next-page")
        if next_link and next_link.get("href"):
            url = next_link["href"]
            # Handle relative URLs
            if url.startswith("/"):
                url = "https://example.com" + url
        else:
            url = None  # No next button — we're done

        print(f"Scraped {len(all_items)} items so far")

    return all_items
```

Watch out for relative URLs. The href might be /products?page=3 instead of a full URL.
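Rather than hand-prefixing the domain, the standard library's `urllib.parse.urljoin` resolves both absolute paths and already-absolute URLs against the page you fetched them from:

```python
from urllib.parse import urljoin

current = "https://example.com/products?page=2"

# Absolute path: the domain is filled in from the current URL
print(urljoin(current, "/products?page=3"))
# → https://example.com/products?page=3

# Already-absolute URL: passed through unchanged
print(urljoin(current, "https://example.com/sale"))
# → https://example.com/sale
```

This also handles relative references like `products?page=3` correctly, which a simple `startswith("/")` check misses.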

Pattern 3: Infinite Scroll with Playwright

No page numbers, no next button — content loads as you scroll down. Instagram, Twitter, and many e-commerce sites use this.

```python
from playwright.sync_api import sync_playwright

def scrape_infinite_scroll(url, item_selector, max_items=200):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(item_selector)

        seen_count = 0
        no_change_count = 0

        while True:
            # Scroll to bottom
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(2000)

            current_count = page.evaluate(
                f"document.querySelectorAll('{item_selector}').length"
            )

            if current_count >= max_items:
                break

            if current_count == seen_count:
                no_change_count += 1
                if no_change_count >= 3:
                    break  # No new content after 3 attempts
            else:
                no_change_count = 0

            seen_count = current_count

        # Extract all items
        items = page.query_selector_all(item_selector)
        results = [item.inner_text() for item in items]

        browser.close()
        return results
```

The `no_change_count` check is important. Without it, your scraper will scroll forever once it reaches the end of the feed, since the item count stops growing but the loop has no other exit.

Pattern 4: Cursor-Based API Pagination

Modern APIs often use cursor-based pagination instead of page numbers. Each response includes a cursor token for the next batch.

```python
import requests

def scrape_cursor_api(api_url, max_items=500):
    all_items = []
    cursor = None

    while True:
        params = {"limit": 50}
        if cursor:
            params["cursor"] = cursor

        response = requests.get(api_url, params=params, timeout=10)
        data = response.json()

        items = data.get("results", [])
        all_items.extend(items)

        # Get the next cursor
        cursor = data.get("next_cursor")

        if not cursor or len(all_items) >= max_items:
            break

        print(f"Fetched {len(all_items)} items, next cursor: {cursor[:20]}...")

    return all_items

# Usage
products = scrape_cursor_api("https://api.example.com/v1/products")
```

To find cursor-based APIs, open Chrome DevTools, go to the Network tab (filtered to Fetch/XHR), and watch the requests as you interact with the page. You'll often see `cursor`, `after`, or `next_token` parameters in the query string or response body.

Pattern 5: Sitemap Crawling

If you need every page on a site, start with the sitemap. Most sites have one at /sitemap.xml.

```python
import requests
from bs4 import BeautifulSoup

def get_urls_from_sitemap(sitemap_url):
    response = requests.get(sitemap_url, timeout=10)
    soup = BeautifulSoup(response.text, "lxml-xml")

    # Check for a sitemap index (links to other sitemaps)
    sitemap_tags = soup.find_all("sitemap")
    if sitemap_tags:
        all_urls = []
        for sitemap in sitemap_tags:
            loc = sitemap.find("loc").text
            all_urls.extend(get_urls_from_sitemap(loc))  # recursive
        return all_urls

    # Regular sitemap — extract URLs
    urls = [loc.text for loc in soup.find_all("loc")]
    return urls

# Get all product URLs
all_urls = get_urls_from_sitemap("https://example.com/sitemap.xml")
product_urls = [u for u in all_urls if "/products/" in u]
print(f"Found {len(product_urls)} product URLs")
```

Sitemaps give you the full list of pages upfront. No guessing about page numbers or next buttons.

Detecting Pagination Type

Not sure which pattern a site uses? Here's a quick guide:

| Sign | Pagination type |
| --- | --- |
| `?page=2` or `/page/2` in URL | URL-based numbers |
| "Next" or ">" button in HTML | Next button |
| Content loads on scroll | Infinite scroll |
| API returns `cursor` or `next_token` | Cursor-based |
| `/sitemap.xml` exists | Sitemap available |

Start by checking the Network tab in DevTools. If the site makes API calls when you paginate, you can often skip the HTML entirely and hit the API directly.

Handling Edge Cases

Duplicate Detection

When scraping across pages, duplicates can creep in — especially with infinite scroll or if the site reorders content.

```python
seen_ids = set()
unique_items = []

for item in all_scraped_items:
    item_id = item.get("id") or item.get("url")
    if item_id not in seen_ids:
        seen_ids.add(item_id)
        unique_items.append(item)
```

Last Page Detection

Don't rely only on empty results. Some sites return the last page repeatedly instead of an empty page.

```python
previous_items = None

for page in range(1, 100):
    items = scrape_page(page)
    if items == previous_items:
        break  # Same content as last page — we've looped
    previous_items = items
```

Rate Limiting Between Pages

Always add delays between pagination requests. A one-second delay is usually enough to avoid getting blocked.

```python
import time

for page in range(1, total_pages + 1):
    scrape_page(page)
    time.sleep(1)  # Be respectful
```
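A fixed one-second interval works, but evenly spaced requests are easy for rate limiters to fingerprint. A small sketch that adds random jitter on top of a base delay (the helper name and values here are illustrative, not from the original code):

```python
import random
import time

def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for `base` seconds plus a random extra up to `jitter` seconds.

    Randomizing the interval makes the request pattern look less robotic
    than a strict one-request-per-second cadence.
    """
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Between page fetches:
# for page in range(1, total_pages + 1):
#     scrape_page(page)
#     polite_sleep()
```

Returning the chosen delay also makes the helper easy to log or test.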

What's Next

Pagination is fundamental, but it's just one piece. You'll also need to handle anti-bot detection, manage proxies for large-scale scraping, and clean the data once you've collected it.

The Master Web Scraping course covers all five patterns with real-world projects where you scrape actual websites.
