# Storing Scraped Data: CSV, JSON, Databases & More
Data storage in web scraping refers to how and where you save the extracted data. The choice depends on the data size, structure, and how you plan to use it — from simple CSV files for small projects to databases for large-scale operations.
## Storage Options Compared
| Format | Best For | Max Scale | Query Support |
|---|---|---|---|
| CSV | Simple tabular data | ~1M rows | No (use pandas) |
| JSON | Nested/flexible data | ~100K records | No |
| SQLite | Medium projects | ~10M rows | Full SQL |
| PostgreSQL | Production systems | Unlimited | Full SQL |
| MongoDB | Varied/nested data | Unlimited | Query language |
## CSV — The Simple Default
```python
import csv

# `products` is assumed to be a list of dicts from your scraping loop,
# e.g. [{"name": "...", "price": 9.99, "url": "..."}].
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(products)
```
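CSV itself has no query support; once written, the file can be loaded back and filtered in plain Python (pandas is the usual choice for anything heavier). A minimal round-trip sketch, with made-up sample rows:

```python
import csv

# Hypothetical sample data standing in for scraped rows.
rows = [
    {"name": "Widget", "price": "9.99", "url": "https://example.com/w"},
    {"name": "Gadget", "price": "24.50", "url": "https://example.com/g"},
]
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(rows)

# Read back and filter. DictReader yields every field as a string,
# so cast numeric columns before comparing.
with open("products.csv", newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))
cheap = [r for r in loaded if float(r["price"]) < 20.0]
```

The string-typing of every column is the main CSV gotcha: numbers, booleans, and dates all need explicit conversion on read.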
## JSON — For Nested Data
```python
import json

# Dump the same `products` list; indent=2 keeps the file human-readable.
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(products, f, indent=2)
```
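Nesting is where JSON earns its keep: a product can carry sub-objects and lists that CSV would force you to flatten. A sketch with hypothetical fields (`specs` and `tags` are illustrative, not from a real schema):

```python
import json

products = [
    {
        "name": "Widget",
        "price": 9.99,
        "specs": {"color": "red", "weight_g": 120},  # nested dict
        "tags": ["sale", "new"],                      # list-valued field
    }
]
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(products, f, indent=2)

# Round-trip check: the nested structure survives intact.
with open("products.json", encoding="utf-8") as f:
    loaded = json.load(f)
```

In CSV the same record would need either extra columns per spec key or an ad-hoc delimiter inside a cell; JSON stores it losslessly.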
## SQLite — The Sweet Spot
```python
import sqlite3

conn = sqlite3.connect("scraping.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY,
        name TEXT, price REAL, url TEXT UNIQUE,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

# `name`, `price`, and `url` come from your scraping loop. The UNIQUE
# constraint on url plus INSERT OR IGNORE silently skips duplicates.
conn.execute(
    "INSERT OR IGNORE INTO products (name, price, url) VALUES (?, ?, ?)",
    (name, price, url),
)
conn.commit()
```
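The "Full SQL" column in the comparison table means ad-hoc queries come free once the data is in SQLite. A self-contained sketch using an in-memory database and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB, for illustration only
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, "
    "name TEXT, price REAL, url TEXT UNIQUE)"
)
conn.executemany(
    "INSERT INTO products (name, price, url) VALUES (?, ?, ?)",
    [("Widget", 9.99, "https://example.com/w"),
     ("Gadget", 24.50, "https://example.com/g")],
)
conn.commit()

# Ad-hoc query: items under 20, cheapest first -- no pandas required.
rows = conn.execute(
    "SELECT name, price FROM products WHERE price < 20 ORDER BY price"
).fetchall()
```

Filtering, sorting, joining, and aggregating all happen in the database rather than in Python, which matters once the table outgrows memory.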
## Best Practices
- **Always deduplicate**: use `UNIQUE` constraints or check before inserting.
- **Add timestamps**: record when each item was scraped.
- **Store raw + clean**: keep the original data alongside processed versions.
- **Batch inserts**: insert many rows at once instead of one by one.
- **Use UPSERT**: update existing records instead of failing on duplicates.
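The last two bullets can be sketched together in SQLite (3.24+, which supports `ON CONFLICT ... DO UPDATE`); the table mirrors the schema above, and the prices and URLs are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB for illustration
conn.execute("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        name TEXT, price REAL, url TEXT UNIQUE,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

# Batch insert: one executemany call instead of a Python-level loop.
batch = [
    ("Widget", 9.99, "https://example.com/w"),
    ("Gadget", 24.50, "https://example.com/g"),
]
conn.executemany(
    "INSERT INTO products (name, price, url) VALUES (?, ?, ?)", batch
)

# UPSERT: re-scraping the same URL updates price and timestamp
# instead of raising an IntegrityError on the UNIQUE url column.
conn.execute(
    """INSERT INTO products (name, price, url) VALUES (?, ?, ?)
       ON CONFLICT(url) DO UPDATE SET
           price = excluded.price,
           scraped_at = CURRENT_TIMESTAMP""",
    ("Widget", 8.49, "https://example.com/w"),
)
conn.commit()

price = conn.execute(
    "SELECT price FROM products WHERE url = ?", ("https://example.com/w",)
).fetchone()[0]
```

After the upsert the table still holds two rows, but the Widget's price reflects the latest scrape, which is usually what you want for price-tracking data.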