# How to Handle Pagination in Web Scraping (5 Patterns)
Pagination is one of the first real obstacles you'll hit when scraping. A single page only shows a slice of the data — the rest is spread across dozens or hundreds of pages. You need to handle this systematically.
Here are five pagination patterns you'll encounter and how to handle each one.
## Pattern 1: URL-Based Page Numbers

The simplest and most common pattern. The page number is right in the URL.

```
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products?page=3
```
```python
import requests
from bs4 import BeautifulSoup

def scrape_numbered_pages(base_url, max_pages=50):
    all_items = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")

        items = soup.select(".product-card")
        if not items:
            break  # No more results — stop

        for item in items:
            all_items.append({
                "name": item.select_one(".name").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
            })
        print(f"Page {page}: {len(items)} items")
    return all_items

products = scrape_numbered_pages("https://example.com/products")
```
The key is detecting the last page. Options: check for an empty result set, look for a "next" button that's disabled, or parse the total page count from the page.
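A minimal sketch of those checks, assuming hypothetical markup (the `.product-card` and `a.next-page` selectors and the "Page 3 of 27" label are placeholders — inspect the real site's HTML for the actual ones):

```python
import re
from bs4 import BeautifulSoup

def is_last_page(html):
    """Heuristic last-page check: empty results or a disabled next button."""
    soup = BeautifulSoup(html, "html.parser")
    # Option 1: no result cards at all
    if not soup.select(".product-card"):
        return True
    # Option 2: a "next" button that is present but disabled
    next_btn = soup.select_one("a.next-page")
    if next_btn and "disabled" in next_btn.get("class", []):
        return True
    return False

def total_pages(html):
    """Option 3: parse a 'Page 3 of 27' style label into a page count."""
    text = BeautifulSoup(html, "html.parser").get_text()
    match = re.search(r"of\s+(\d+)", text)
    return int(match.group(1)) if match else None
```

Pick whichever check the site actually supports; combining two of them makes the stop condition more robust.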
## Pattern 2: Next Button Following

Some sites don't use predictable URL patterns. Instead, each page has a "Next" link pointing to the next page. You follow the chain.
```python
import requests
from bs4 import BeautifulSoup

def scrape_with_next_button(start_url):
    all_items = []
    url = start_url
    while url:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")

        # Extract items from current page
        for item in soup.select(".product-card"):
            all_items.append({
                "name": item.select_one(".name").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
            })

        # Find the next page link
        next_link = soup.select_one("a.next-page")
        if next_link and next_link.get("href"):
            url = next_link["href"]
            # Handle relative URLs
            if url.startswith("/"):
                url = "https://example.com" + url
        else:
            url = None  # No next button — we're done
        print(f"Scraped {len(all_items)} items so far")
    return all_items
```
Watch out for relative URLs. The href might be /products?page=3 instead of a full URL.
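A robust way to normalize them is `urllib.parse.urljoin` from the standard library, which resolves an href against the URL of the page you just fetched — no hard-coded domain needed:

```python
from urllib.parse import urljoin

current_url = "https://example.com/products?page=2"

# Absolute paths and full URLs both resolve correctly
print(urljoin(current_url, "/products?page=3"))
# https://example.com/products?page=3
print(urljoin(current_url, "https://other.example.com/a"))
# https://other.example.com/a
```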
## Pattern 3: Infinite Scroll with Playwright

No page numbers, no next button — content loads as you scroll down. Instagram, Twitter, and many e-commerce sites use this.
```python
from playwright.sync_api import sync_playwright

def scrape_infinite_scroll(url, item_selector, max_items=200):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(item_selector)

        seen_count = 0
        no_change_count = 0
        while True:
            # Scroll to bottom
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(2000)

            current_count = page.evaluate(
                f"document.querySelectorAll('{item_selector}').length"
            )
            if current_count >= max_items:
                break
            if current_count == seen_count:
                no_change_count += 1
                if no_change_count >= 3:
                    break  # No new content after 3 attempts
            else:
                no_change_count = 0
                seen_count = current_count

        # Extract all items
        items = page.query_selector_all(item_selector)
        results = [item.inner_text() for item in items]
        browser.close()
        return results
```
The no_change_count check is important. Without it, your scraper will scroll forever on the last page.
## Pattern 4: Cursor-Based API Pagination

Modern APIs often use cursor-based pagination instead of page numbers. Each response includes a cursor token for the next batch.
```python
import requests

def scrape_cursor_api(api_url, max_items=500):
    all_items = []
    cursor = None
    while True:
        params = {"limit": 50}
        if cursor:
            params["cursor"] = cursor

        response = requests.get(api_url, params=params, timeout=10)
        data = response.json()

        items = data.get("results", [])
        all_items.extend(items)

        # Get the next cursor
        cursor = data.get("next_cursor")
        if not cursor or len(all_items) >= max_items:
            break
        print(f"Fetched {len(all_items)} items, next cursor: {cursor[:20]}...")
    return all_items

# Usage
products = scrape_cursor_api("https://api.example.com/v1/products")
```
To find cursor-based APIs, open Chrome DevTools, go to the Network tab, and watch the XHR requests as you interact with the page. You'll often see `cursor`, `after`, or `next_token` parameters.
## Pattern 5: Sitemap Crawling

If you need every page on a site, start with the sitemap. Most sites have one at `/sitemap.xml`.
```python
import requests
from bs4 import BeautifulSoup

def get_urls_from_sitemap(sitemap_url):
    response = requests.get(sitemap_url, timeout=10)
    soup = BeautifulSoup(response.text, "lxml-xml")

    # Check for sitemap index (links to other sitemaps)
    sitemap_tags = soup.find_all("sitemap")
    if sitemap_tags:
        all_urls = []
        for sitemap in sitemap_tags:
            loc = sitemap.find("loc").text
            all_urls.extend(get_urls_from_sitemap(loc))  # recursive
        return all_urls

    # Regular sitemap — extract URLs
    urls = [loc.text for loc in soup.find_all("loc")]
    return urls

# Get all product URLs
all_urls = get_urls_from_sitemap("https://example.com/sitemap.xml")
product_urls = [u for u in all_urls if "/products/" in u]
print(f"Found {len(product_urls)} product URLs")
```
Sitemaps give you the full list of pages upfront. No guessing about page numbers or next buttons.
## Detecting Pagination Type

Not sure which pattern a site uses? Here's a quick guide:
| Sign | Pagination Type |
|---|---|
| `?page=2` or `/page/2` in URL | URL-based numbers |
| "Next" or ">" button in HTML | Next button |
| Content loads on scroll | Infinite scroll |
| API returns `cursor` or `next_token` | Cursor-based |
| `/sitemap.xml` exists | Sitemap available |
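The first and last rows can be checked programmatically. Here's a probe sketch — it's network-dependent, and sites may soft-404 or redirect, so treat the results as hints rather than proof:

```python
import requests
from urllib.parse import urlsplit

def site_root(url):
    """Reduce any page URL to its scheme://host root."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

def probe_pagination(base_url):
    """Probe a site for pagination hints."""
    hints = {}
    try:
        r1 = requests.get(base_url, timeout=10)
        r2 = requests.get(base_url, params={"page": 2}, timeout=10)
        # A successful, distinct ?page=2 suggests URL-based numbering
        hints["url_numbered"] = r2.ok and r2.text != r1.text
    except requests.RequestException:
        hints["url_numbered"] = False
    try:
        r = requests.get(f"{site_root(base_url)}/sitemap.xml", timeout=10)
        hints["sitemap"] = r.ok and ("<urlset" in r.text or "<sitemapindex" in r.text)
    except requests.RequestException:
        hints["sitemap"] = False
    return hints
```

The other rows (next buttons, infinite scroll, cursor APIs) still need a quick look at the HTML or the Network tab.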
## Handling Edge Cases

### Duplicate Detection
When scraping across pages, duplicates can creep in — especially with infinite scroll or if the site reorders content.
```python
seen_ids = set()
unique_items = []
for item in all_scraped_items:
    item_id = item.get("id") or item.get("url")
    if item_id not in seen_ids:
        seen_ids.add(item_id)
        unique_items.append(item)
```
### Last Page Detection
Don't rely only on empty results. Some sites return the last page repeatedly instead of an empty page.
```python
previous_items = None
for page in range(1, 100):
    items = scrape_page(page)
    if items == previous_items:
        break  # Same content as last page — we've looped
    previous_items = items
```
### Rate Limiting Between Pages
Always add delays between pagination requests. A one-second delay is usually enough to avoid getting blocked.
```python
import time

for page in range(1, total_pages + 1):
    scrape_page(page)
    time.sleep(1)  # Be respectful
```
## What's Next
Pagination is fundamental, but it's just one piece. You'll also need to handle anti-bot detection, manage proxies for large-scale scraping, and clean the data once you've collected it.
The Master Web Scraping course covers all five patterns with real-world projects where you scrape actual websites.