How to Clean Scraped Data with Python and pandas
11 min read · by Nabeel

Tags: python, pandas, data-cleaning

You've scraped the data. Now what? Raw scraped data is almost always messy — inconsistent formatting, missing values, duplicates, and encoding issues. Cleaning it is not optional if you want usable results.

This guide covers practical techniques for cleaning scraped data using Python and pandas.

Why Cleaning Matters

Here's what raw scraped data typically looks like:

code
name,price,rating
"  iPhone 15 Pro  ", "$1,199.00 ", "4.5 out of 5"
"iPhone 15 Pro", "1199", "4.5/5"
"IPHONE 15 PRO ", "$1,199", ""
"Samsung Galaxy S24", "N/A", "4.3 stars"

Three entries for the same product, inconsistent price formats, missing ratings, and extra whitespace everywhere. Feed this into an analysis pipeline and you'll get garbage results.
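You can see the problem directly: a naive numeric conversion silently drops every formatted price. A quick sketch, using price strings like the ones in the sample above:

```python
import pandas as pd

# Naive conversion: currency symbols and thousands separators
# are not valid numbers, so they coerce to NaN
raw = pd.Series(["$1,199.00 ", "1199", "$1,199", "N/A"])
nums = pd.to_numeric(raw, errors="coerce")
# only the bare "1199" survives; everything else becomes NaN
```

Three of the four values are lost before any analysis even starts, which is exactly why the cleaning steps below matter.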

Common Data Issues

| Issue | Example | Fix |
|---|---|---|
| Extra whitespace | "  iPhone 15  " | .strip() |
| Inconsistent casing | "iphone" vs "IPHONE" | .str.lower() or .str.title() |
| Mixed formats | "$1,199" vs "1199" | Regex + type conversion |
| Missing values | "" or "N/A" | Replace with NaN, then handle |
| Duplicates | Same product twice | .drop_duplicates() |
| Encoding issues | "CafÃ©" instead of "Café" | Fix encoding at scrape time |
| HTML artifacts | "Price&amp;Cost" | html.unescape() |
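The html.unescape fix from the table is a one-liner from the standard library, no pandas required:

```python
import html

# HTML entities left over from scraping decode back to plain text
print(html.unescape("Price &amp; Cost"))  # Price & Cost
print(html.unescape("Caf&eacute;"))       # Café
```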

Setting Up pandas

python
import pandas as pd
import re

# Load your scraped data
df = pd.read_csv("raw_products.csv")

# Or create from a list of dicts (common when scraping)
products = [
    {"name": " iPhone 15 Pro ", "price": "$1,199.00", "rating": "4.5 out of 5"},
    {"name": "Samsung Galaxy S24", "price": "$899", "rating": "4.3 stars"},
]
df = pd.DataFrame(products)
print(df.info())

Cleaning Text Data

Whitespace and Case Normalization

python
# Strip leading/trailing whitespace from all string columns
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

# Normalize product names to title case
df["name"] = df["name"].str.title()

# Remove extra internal spaces ("iPhone  15" -> "iPhone 15")
df["name"] = df["name"].str.replace(r"\s+", " ", regex=True)
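A quick check of the combined effect, with two made-up messy variants. One caveat worth knowing: .str.title() lowercases interior capitals, so a brand spelling like "iPhone" becomes "Iphone" and may need a follow-up mapping if exact casing matters:

```python
import pandas as pd

names = pd.Series(["  iPhone   15  Pro ", "IPHONE 15 PRO"])
cleaned = names.str.strip().str.replace(r"\s+", " ", regex=True).str.title()
# both messy variants collapse to the same string: "Iphone 15 Pro"
```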

Cleaning HTML Artifacts

python
import html

# Fix HTML entities
df["name"] = df["name"].apply(lambda x: html.unescape(x) if isinstance(x, str) else x)

# Remove any remaining HTML tags
df["description"] = df["description"].str.replace(r"<[^>]+>", "", regex=True)

Normalizing Prices

Prices are one of the messiest fields. Different sites format them differently.

python
def clean_price(value):
    """Convert price strings to float."""
    if pd.isna(value) or value in ("", "N/A", "n/a", "-"):
        return None

    # Convert to string if needed
    value = str(value)

    # Remove currency symbols and thousands separators
    cleaned = re.sub(r"[^\d.]", "", value)

    try:
        return float(cleaned)
    except ValueError:
        return None

df["price"] = df["price"].apply(clean_price)
df["original_price"] = df["original_price"].apply(clean_price)

# Calculate discount percentage
df["discount_pct"] = ((df["original_price"] - df["price"]) / df["original_price"] * 100).round(1)
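One edge case worth guarding: if original_price is 0, the division produces infinity. A small sketch (column names and values assumed for illustration) that masks zero denominators so they come out as NaN instead:

```python
import pandas as pd

df = pd.DataFrame({"price": [999.0, 899.0], "original_price": [1199.0, 0.0]})

# Turn zero denominators into NaN before dividing, so the result
# is NaN rather than -inf for rows with a missing/zero original price
denom = df["original_price"].mask(df["original_price"] == 0)
df["discount_pct"] = ((denom - df["price"]) / denom * 100).round(1)
```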

Normalizing Dates

Scraped dates come in every format imaginable.

python
# Common date formats you'll encounter
date_samples = [
    "March 15, 2026",
    "15/03/2026",
    "2026-03-15",
    "3/15/26",
    "15 Mar 2026",
]

# pandas handles most formats automatically
df["date"] = pd.to_datetime(df["date"], format="mixed", dayfirst=False)

# For stubborn formats, specify explicitly
df["date"] = pd.to_datetime(df["date"], format="%B %d, %Y")

# Standardize to ISO format df["date_str"] = df["date"].dt.strftime("%Y-%m-%d")
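When some scraped dates are unparseable junk, errors="coerce" turns them into NaT instead of raising, so one bad row doesn't kill the whole pipeline. A small sketch with made-up values:

```python
import pandas as pd

dates = pd.Series(["March 15, 2026", "not a date"])
parsed = pd.to_datetime(dates, format="%B %d, %Y", errors="coerce")
# the valid entry parses; "not a date" becomes NaT
```

You can then drop or inspect the NaT rows with parsed.isna().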

Deduplication Strategies

Duplicates in scraped data come from two sources: scraping the same page twice, and the same product appearing in multiple categories.

Exact Duplicates

python
# Remove rows that are identical across all columns
df = df.drop_duplicates()
print(f"Rows after dedup: {len(df)}")

Fuzzy Duplicates

Same product with slightly different names or formatting:

python
# Deduplicate by a key column
df = df.drop_duplicates(subset=["product_url"], keep="first")

# Or by normalized name
df["name_normalized"] = df["name"].str.lower().str.strip()
df = df.drop_duplicates(subset=["name_normalized"], keep="first")
df = df.drop(columns=["name_normalized"])

For more advanced fuzzy matching, use the thefuzz library:

python
from thefuzz import fuzz

def find_duplicates(names, threshold=85):
    """Find fuzzy duplicate pairs."""
    duplicates = []
    for i, name1 in enumerate(names):
        for j, name2 in enumerate(names):
            if i < j and fuzz.ratio(name1.lower(), name2.lower()) >= threshold:
                duplicates.append((i, j, name1, name2))
    return duplicates

dupes = find_duplicates(df["name"].tolist())
for i, j, n1, n2 in dupes:
    print(f"Possible duplicate: '{n1}' <-> '{n2}'")

Handling Missing Values

python
# Replace common "missing" strings with NaN
missing_markers = ["", "N/A", "n/a", "-", "null", "None", "not available"]
df = df.replace(missing_markers, pd.NA)

# Check missing value counts
print(df.isna().sum())

# Fill missing ratings with median
df["rating"] = df["rating"].fillna(df["rating"].median())

# Drop rows missing critical fields
df = df.dropna(subset=["name", "price"])

print(f"Clean rows: {len(df)}")

Validating Data Types

After cleaning, make sure every column has the right type:

python
# Convert columns to proper types
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
df["review_count"] = pd.to_numeric(df["review_count"], errors="coerce").astype("Int64")
df["in_stock"] = df["in_stock"].astype(bool)

# Validate ranges
assert df["price"].min() >= 0, "Negative prices found"
assert df["rating"].between(0, 5).all(), "Ratings outside 0-5 range"

print(df.dtypes)

Exporting Clean Data

python
# CSV — universal, works with spreadsheets
df.to_csv("products_clean.csv", index=False, encoding="utf-8")

# JSON — preserves types better, good for APIs
df.to_json("products_clean.json", orient="records", indent=2)

# Parquet — best for large datasets, preserves types perfectly
df.to_parquet("products_clean.parquet", index=False)

| Format | Best For | File Size | Type Preservation |
|---|---|---|---|
| CSV | Spreadsheets, sharing | Large | Poor |
| JSON | APIs, web apps | Large | Good |
| Parquet | Analysis, large data | Small | Excellent |
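The type-preservation column is easy to verify yourself: round-trip a DataFrame through CSV and the datetime column comes back as plain strings. A sketch using an in-memory buffer (illustrative values):

```python
import io
import pandas as pd

df = pd.DataFrame({"price": [999.0], "date": pd.to_datetime(["2026-03-15"])})

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)

# the float survives, but the datetime column comes back as object
# (string) and has to be re-parsed after every CSV load
print(back.dtypes)
```

Parquet and JSON (with orient="records") avoid most of this re-parsing work, which is why they're preferred for anything beyond quick spreadsheet sharing.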

Real Example: Cleaning a Messy Product Dataset

Here's a complete pipeline that ties it all together:

python
import pandas as pd
import re
import html

def clean_product_data(raw_file, output_file):
    """Full cleaning pipeline for scraped product data."""
    df = pd.read_csv(raw_file)
    initial_count = len(df)

    # 1. Strip whitespace from all text columns
    text_cols = df.select_dtypes(include=["object"]).columns
    for col in text_cols:
        df[col] = df[col].str.strip()

    # 2. Replace missing markers
    df = df.replace(["", "N/A", "n/a", "-", "null"], pd.NA)

    # 3. Clean prices
    df["price"] = df["price"].apply(
        lambda x: float(re.sub(r"[^\d.]", "", str(x))) if pd.notna(x) else None
    )

    # 4. Clean ratings (extract number from "4.5 out of 5", "4.5/5", etc.)
    df["rating"] = df["rating"].apply(
        lambda x: float(re.search(r"[\d.]+", str(x)).group())
        if pd.notna(x) and re.search(r"[\d.]+", str(x))
        else None
    )

    # 5. Normalize names
    df["name"] = df["name"].str.title()
    df["name"] = df["name"].apply(
        lambda x: html.unescape(x) if isinstance(x, str) else x
    )

    # 6. Deduplicate
    df = df.drop_duplicates(subset=["name", "price"], keep="first")

    # 7. Drop rows missing essentials
    df = df.dropna(subset=["name", "price"])

    # 8. Validate
    df = df[df["price"] > 0]
    df = df[df["rating"].isna() | df["rating"].between(0, 5)]

    print(f"Cleaned: {initial_count} -> {len(df)} rows")
    df.to_csv(output_file, index=False)
    return df

clean_df = clean_product_data("raw_products.csv", "clean_products.csv")

What's Next

Clean data is the foundation for everything useful — analysis, monitoring, machine learning. The investment in a solid cleaning pipeline pays off every time you run your scraper.

The Master Web Scraping course covers data cleaning as part of every scraping project, so you build the habit from the start.

