How to Scrape E-Commerce Product Data with Python
15 min read · by Nabeel


Tags: python · ecommerce · project

E-commerce sites are one of the most common scraping targets. Price monitoring, competitive analysis, product research — it all starts with extracting product data reliably.

This guide walks through scraping product information from e-commerce sites, from inspecting the page to storing structured data.

What Data to Extract

A typical product scrape collects:

  • Product name and description
  • Price (current, original, discount percentage)
  • Ratings and review counts
  • Images (URLs, not the files themselves)
  • Specifications (size, weight, material, etc.)
  • Availability (in stock, out of stock)
  • SKU or product ID (for deduplication)
Define your data structure upfront:
```python
product = {
    "name": "",
    "price": 0.0,
    "original_price": 0.0,
    "rating": 0.0,
    "review_count": 0,
    "image_url": "",
    "specs": {},
    "in_stock": True,
    "url": "",
}
```
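If different pages return inconsistent fields, it helps to normalize every scraped dict against this schema before storing it. A minimal sketch; the `normalize_product` helper is illustrative, not part of any library:

```python
# Default values double as the canonical schema.
PRODUCT_SCHEMA = {
    "name": "",
    "price": 0.0,
    "original_price": 0.0,
    "rating": 0.0,
    "review_count": 0,
    "image_url": "",
    "specs": {},
    "in_stock": True,
    "url": "",
}

def normalize_product(raw):
    """Fill in missing keys with defaults and drop unexpected ones."""
    return {key: raw.get(key, default) for key, default in PRODUCT_SCHEMA.items()}
```

Running every record through a function like this keeps your output uniform even when a selector fails on one page.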

Inspecting Site Structure with DevTools

Before writing any code, spend five minutes in Chrome DevTools.

  1. Right-click on a product name and select "Inspect"
  2. Note the element tag and class names (e.g., an `<h1 class="product-title">`)
  3. Check the Network tab — does the page load data via an API call?
  4. Look at multiple products to confirm the structure is consistent
The Network tab is often the biggest shortcut. If the site loads product data from a JSON API, you can skip HTML parsing entirely and hit the API directly.
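For example, if the Network tab shows the page fetching a JSON payload, you can map that payload straight into your schema. The response shape below is hypothetical; inspect your target site's actual API response to see what fields it exposes:

```python
import json

# A captured response from a hypothetical product API endpoint
# (the structure varies by site -- check the Network tab for yours).
api_response = '''
{
  "id": "SKU-123",
  "title": "Trail Running Shoe",
  "price": {"current": 79.99, "original": 99.99},
  "rating": {"average": 4.3, "count": 212}
}
'''

def parse_api_product(raw):
    """Map a raw API payload onto the flat product schema."""
    data = json.loads(raw)
    return {
        "name": data["title"],
        "price": data["price"]["current"],
        "original_price": data["price"]["original"],
        "rating": data["rating"]["average"],
        "review_count": data["rating"]["count"],
    }
```

No HTML parsing, no brittle CSS selectors — just a dictionary lookup.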

Building the Scraper Step by Step

```python
import requests
from bs4 import BeautifulSoup
import time
import json

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
}


def scrape_product_page(url):
    """Scrape a single product page and return structured data."""
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    # Extract product details
    product = {
        "url": url,
        "name": extract_text(soup, "h1.product-title"),
        "price": extract_price(soup, ".current-price"),
        "original_price": extract_price(soup, ".original-price"),
        "rating": extract_float(soup, ".star-rating"),
        "review_count": extract_int(soup, ".review-count"),
        "image_url": extract_attr(soup, ".product-image img", "src"),
        "in_stock": "out of stock" not in soup.get_text().lower(),
    }

    return product


def extract_text(soup, selector):
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else ""


def extract_price(soup, selector):
    el = soup.select_one(selector)
    if not el:
        return 0.0
    text = el.get_text(strip=True)
    # Remove currency symbols and parse
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return 0.0


def extract_float(soup, selector):
    el = soup.select_one(selector)
    if not el:
        return 0.0
    try:
        return float(el.get_text(strip=True))
    except ValueError:
        return 0.0


def extract_int(soup, selector):
    el = soup.select_one(selector)
    if not el:
        return 0
    text = el.get_text(strip=True).replace(",", "")
    digits = "".join(c for c in text if c.isdigit())
    return int(digits) if digits else 0


def extract_attr(soup, selector, attr):
    el = soup.select_one(selector)
    return el.get(attr, "") if el else ""
```

Helper functions keep the main scraping logic clean and handle missing elements without crashing.
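Putting it together, a small driver loop can work through a list of product URLs with a polite delay and append each result as a JSON line. This is a sketch: `scrape_catalog` is a helper name invented here, and in practice you would pass in `scrape_product_page` from above as the scraper:

```python
import json
import time

def scrape_catalog(urls, scrape_fn, delay=1.0, out_path="products.jsonl"):
    """Scrape each URL and append results to a JSON Lines file.

    scrape_fn is the single-page scraper; it is injected so a failure
    on one URL logs an error instead of stopping the whole run.
    """
    results = []
    with open(out_path, "w", encoding="utf-8") as f:
        for url in urls:
            try:
                product = scrape_fn(url)
            except Exception as exc:
                print(f"Failed {url}: {exc}")
                continue
            f.write(json.dumps(product) + "\n")
            results.append(product)
            time.sleep(delay)  # be polite between requests
    return results
```

JSON Lines is a convenient on-disk format for scrapes: each record is appended independently, so a crash halfway through loses nothing already written.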

Handling Product Variants

Products often have multiple variants — different sizes, colors, or configurations. These are usually loaded via JavaScript or hidden in the page source.

```python
import json
import re

def extract_variants(soup):
    """Extract variant data from embedded JSON in the page."""
    # Many e-commerce sites embed product data in a script tag
    scripts = soup.select("script")
    for script in scripts:
        text = script.string or ""
        if "variants" in text or "productData" in text:
            # Try to extract the JSON object assigned to productData
            match = re.search(r'productData\s*=\s*({.*?});', text, re.DOTALL)
            if match:
                try:
                    data = json.loads(match.group(1))
                except json.JSONDecodeError:
                    continue
                return data.get("variants", [])
    return []
```
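As a quick sanity check of the regex approach, here is the same extraction run on a minimal embedded-script snippet. The `productData` variable name is just an example; match whatever assignment the target site actually uses:

```python
import json
import re

# A stripped-down version of what you might see inside a <script> tag.
script_text = """
window.productData = {"id": "SKU-9", "variants": [
    {"color": "black", "size": "M", "in_stock": true},
    {"color": "black", "size": "L", "in_stock": false}
]};
"""

def variants_from_script(text):
    """Pull the variants list out of an embedded productData assignment."""
    match = re.search(r'productData\s*=\s*({.*?});', text, re.DOTALL)
    if not match:
        return []
    return json.loads(match.group(1)).get("variants", [])
```

Note that the embedded object is JavaScript, not guaranteed JSON; this works when the site serializes it as valid JSON, which is common but worth verifying per site.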

Look for