Web Scraping with Python in 2026: The Complete Beginner's Guide
Web scraping is the automated extraction of data from websites. If you want to build data products, automate research, or pick up freelance work, it's one of the most useful skills a developer can learn.
This guide covers everything you need to get started with Python web scraping, from setup to pulling real data.
Why Python for Web Scraping?
Python is the default language for scraping, and for practical reasons:
- Simple syntax — a working scraper takes about 10 lines of code
- Libraries like requests, BeautifulSoup, and Playwright do most of the work
- Massive community — whatever problem you hit, someone's already posted the answer on Stack Overflow
- pandas makes cleaning scraped data easy
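As a taste of that last point, rows scraped as dicts load straight into pandas for cleaning. A minimal sketch — the rows below are made-up examples, not real scraped output:

```python
import pandas as pd

# Hypothetical scraped rows: note the stray whitespace and missing author
rows = [
    {"author": "Albert Einstein", "text": "  Imagination is everything.  "},
    {"author": "Albert Einstein", "text": "Life is like riding a bicycle."},
    {"author": None, "text": "Unattributed quote"},
]

df = pd.DataFrame(rows)
df["text"] = df["text"].str.strip()   # clean stray whitespace
df = df.dropna(subset=["author"])     # drop rows missing an author
print(df["author"].value_counts())
```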
Setting Up Your Environment
Before writing any code, you need Python 3.10+ and a few libraries. Here's the quickest setup:
# Install Python (if you haven't already)
# macOS: brew install python
# Windows: download from python.org
# Create a project folder
mkdir my-scraper && cd my-scraper
# Create a virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install the essentials
pip install requests beautifulsoup4 lxml
That's it. Three libraries and you're ready to scrape.
Your First Scraper: Step by Step
Let's scrape a real website. We'll extract quotes from a practice site designed for scraping.
import requests
from bs4 import BeautifulSoup
# Step 1: Fetch the page
url = "https://quotes.toscrape.com"
response = requests.get(url)
# Step 2: Parse the HTML
soup = BeautifulSoup(response.text, "lxml")
# Step 3: Extract the data
quotes = soup.select(".quote")
for quote in quotes:
    text = quote.select_one(".text").get_text()
    author = quote.select_one(".author").get_text()
    print(f"{author}: {text}")
Run this and you'll see quotes printed to your terminal. That's web scraping in under 15 lines of Python.
Key Concepts
HTTP Requests
Every scraper starts by requesting a web page. The requests library handles this:
response = requests.get("https://example.com")
print(response.status_code) # 200 = success
print(response.text) # the HTML content
Common status codes you'll see:
- 200 — success
- 403 — forbidden (the site is blocking you)
- 404 — page not found
- 429 — too many requests (you're scraping too fast)
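How a scraper should react to each of these codes can be sketched as a small helper. The classify_status name and its action labels are our own choices for illustration, not part of requests:

```python
def classify_status(code: int) -> str:
    """Translate an HTTP status code into what the scraper should do next."""
    if code == 200:
        return "parse"      # success: go ahead and parse the HTML
    if code == 404:
        return "skip"       # page doesn't exist: record it and move on
    if code in (403, 429):
        return "back off"   # blocked or rate limited: slow down, maybe rotate IPs
    return "retry"          # anything else (e.g. 5xx): likely transient, try again
```

You would call this right after `response = requests.get(url)` with `response.status_code`.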
HTML Parsing with BeautifulSoup
BeautifulSoup turns messy HTML into a navigable tree structure. The two most useful methods:
# Find one element
title = soup.select_one("h1")
# Find all matching elements
links = soup.select("a.nav-link")
# Get text content
print(title.get_text())
# Get an attribute
for link in links:
    print(link["href"])
CSS Selectors
CSS selectors are how you tell BeautifulSoup which elements to extract. Here are the patterns you'll use 90% of the time:
| Selector | Matches |
|---|---|
| `div` | All `<div>` elements |
| `.price` | Elements with class "price" |
| `#main` | Element with id "main" |
| `div.card > h2` | `<h2>` elements directly inside a `<div class="card">` |
| `a[href]` | All `<a>` tags with an `href` attribute |
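To see what each pattern matches without fetching a page, you can run the selectors against a small inline snippet. The HTML below is made up for illustration; it uses Python's built-in "html.parser" so the sketch has no dependency beyond beautifulsoup4, though "lxml" (installed earlier) works the same way:

```python
from bs4 import BeautifulSoup

# Tiny made-up page exercising each selector from the table above
html = """
<div id="main">
  <div class="card"><h2>Widget</h2><span class="price">9.99</span></div>
  <a class="nav-link" href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select_one("#main")["id"])               # main
print(soup.select_one(".price").get_text())         # 9.99
print(soup.select_one("div.card > h2").get_text())  # Widget
print([a["href"] for a in soup.select("a[href]")])  # ['/about']
```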
Saving Your Scraped Data
Scraping is useless if you don't save the results. Here are the two most common formats:
CSV (for spreadsheets)
import csv
with open("quotes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Author", "Quote"])
    for quote in quotes:
        text = quote.select_one(".text").get_text()
        author = quote.select_one(".author").get_text()
        writer.writerow([author, text])
JSON (for APIs and databases)
import json
data = []
for quote in quotes:
    data.append({
        "author": quote.select_one(".author").get_text(),
        "text": quote.select_one(".text").get_text(),
    })

with open("quotes.json", "w") as f:
    json.dump(data, f, indent=2)
Common Mistakes Beginners Make
1. Not checking robots.txt. Always check example.com/robots.txt before scraping. It tells you which pages the site allows bots to access.
2. Scraping too fast. Add a delay between requests. A simple time.sleep(1) keeps you from overwhelming the server and getting blocked.
3. Not handling errors. Websites go down, pages change, requests fail. Wrap your scraping logic in try/except blocks.
4. Ignoring hidden APIs. Before reaching for BeautifulSoup, open Chrome DevTools (Network tab) and check if the site loads data via an API. Hitting the API directly is faster and more reliable.
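Mistakes 2 and 3 combine naturally into one small helper: delay between attempts, retry on failure. A sketch — the fetch name and retry policy here are our own choices, not a standard requests API:

```python
import time
import requests

def fetch(url, retries=3, delay=1.0):
    """Fetch a page politely: pause between attempts, retry on failure."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()   # turn 4xx/5xx into exceptions
            return response.text
        except requests.RequestException:
            time.sleep(delay)             # back off before the next attempt
    return None                           # give up after all retries
```

Calling `fetch(url)` in your main loop then replaces a bare `requests.get(url)`, and a `None` return tells you the page should be skipped or logged.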
What's Next?
This covers the basics. Real-world scraping gets harder. Here's what to learn next:
- Pagination — scraping across multiple pages
- Dynamic websites — handling JavaScript-rendered content with Playwright
- Anti-bot evasion — getting past Cloudflare and other detection systems
- Proxies — rotating IP addresses to avoid blocks
- Scaling — scraping thousands of pages with async Python