
Web Scraping Glossary

Every concept, tool, and technique you need to know — explained with real code examples. 29 terms from beginner to advanced.

Beginner

Web Scraping

beginner

Web scraping is the process of automatically extracting data from websites using code. Instead of manually copying information, a scraper fetches web pages and parses the HTML to pull out structured data.
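As a minimal sketch of that fetch-then-parse loop, the fetched page below is replaced by an inline HTML string (so the example runs offline) and Python's standard-library html.parser pulls out the article titles:

```python
from html.parser import HTMLParser

# A fetched page, replaced by an inline string so the sketch runs offline.
PAGE = """
<html><body>
  <h2 class="title">First article</h2>
  <h2 class="title">Second article</h2>
</body></html>
"""

class TitleScraper(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

scraper = TitleScraper()
scraper.feed(PAGE)
print(scraper.titles)   # ['First article', 'Second article']
```

In a real scraper, the fetch step would retrieve PAGE from a URL; the parse step stays the same.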


BeautifulSoup

beginner

BeautifulSoup is a Python library that makes it easy to parse HTML and XML documents. It creates a parse tree from page source code that you can navigate, search, and modify using Pythonic methods.
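A small sketch of searching and navigating the parse tree (assumes beautifulsoup4 is installed; the HTML snippet is invented):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<div id="products">
  <p class="name">Widget</p>
  <p class="price">$9.99</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Search the tree by tag name and attributes.
name = soup.find("p", class_="name").get_text()
# Navigate from one node to a sibling.
price = soup.find("p", class_="name").find_next_sibling("p").get_text()

print(name, price)   # Widget $9.99
```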


CSS Selector

beginner

A CSS selector is a pattern used to select and target specific HTML elements on a web page. In web scraping, CSS selectors are the primary way to locate the data you want to extract from a page's HTML structure.
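A few common selector patterns, shown here through BeautifulSoup's select() (assumes beautifulsoup4 is installed; the markup is invented):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<ul class="results">
  <li class="item"><a href="/a">Alpha</a></li>
  <li class="item featured"><a href="/b">Beta</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag + class selector
items = [li.get_text(strip=True) for li in soup.select("li.item")]
# Descendant + attribute selector
links = [a["href"] for a in soup.select("ul.results a[href]")]
# Compound class selector: only elements carrying BOTH classes
featured = [li.get_text(strip=True) for li in soup.select("li.item.featured")]

print(items, links, featured)
```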


HTTP Request

beginner

An HTTP request is a message sent from a client (your scraper) to a web server asking for a resource. In web scraping, you send HTTP requests to fetch web pages, then parse the response HTML to extract data.
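The anatomy of a request, sketched with the standard library's urllib. The request is built but never sent, so the example runs offline (example.com is a placeholder):

```python
from urllib.request import Request

# Build a GET request without sending it.
req = Request(
    "https://example.com/products?page=1",
    headers={"Accept": "text/html"},
)

print(req.get_method())          # GET
print(req.host)                  # example.com
print(req.get_header("Accept"))  # text/html
# Actually sending it would be: urllib.request.urlopen(req).read()
```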


HTML Parsing

beginner

HTML parsing is the process of taking raw HTML code and converting it into a structured tree that you can navigate and query programmatically. In web scraping, parsing is the step between fetching a page and extracting the specific data you need.
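To see the raw-text-to-tree step in isolation, here is a well-formed snippet parsed with the standard library's ElementTree. (Real pages are messier and need a lenient parser such as html.parser, lxml, or BeautifulSoup; the tree idea is the same.)

```python
import xml.etree.ElementTree as ET

# A well-formed snippet; real-world HTML usually needs a lenient parser.
html = "<div><h1>Title</h1><p>First</p><p>Second</p></div>"

root = ET.fromstring(html)                   # raw markup -> element tree
print(root.tag)                              # div
print([child.tag for child in root])         # ['h1', 'p', 'p']
print(root.find("h1").text)                  # Title
print([p.text for p in root.findall("p")])   # ['First', 'Second']
```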


Rate Limiting

beginner

Rate limiting is the practice of controlling the frequency of requests sent to a server. In web scraping, it means adding deliberate delays between requests to avoid overwhelming the target server or triggering anti-bot defenses.
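A minimal throttling sketch: pause between requests, with a little random jitter so the timing looks less robotic. The fetch itself is stubbed out so the example runs offline:

```python
import random
import time

def polite_get(urls, delay=1.0, jitter=0.5):
    """Visit URLs with a randomized pause between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Randomize the delay so request timing is less predictable.
            time.sleep(delay + random.uniform(0, jitter))
        results.append(f"fetched {url}")   # placeholder for a real fetch
    return results
```

With the defaults, a 100-page scrape takes a couple of minutes instead of hammering the server in seconds.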


robots.txt

beginner

robots.txt is a text file at the root of a website (e.g., example.com/robots.txt) that tells web crawlers and scrapers which pages or sections of the site they should or shouldn't access. It follows the Robots Exclusion Protocol.
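Python ships a parser for this protocol. Normally you would load the file from the site; here the rules are inlined so the sketch runs offline:

```python
from urllib.robotparser import RobotFileParser

# Content that would normally come from https://example.com/robots.txt
rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /private/",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("my-scraper", "https://example.com/products"))     # True
print(rp.can_fetch("my-scraper", "https://example.com/admin/users"))  # False
```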


Web Crawling

beginner

Web crawling is the automated process of systematically browsing the web by following links from page to page. While web scraping extracts data from specific pages, web crawling discovers and navigates to those pages in the first place.
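The core of a crawler is a breadth-first traversal with a "seen" set so no page is visited twice. Here an in-memory link graph stands in for real pages; a real crawler would fetch each URL and extract its links:

```python
from collections import deque

# Invented link graph standing in for real pages.
LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/"],
}

def crawl(start):
    """Breadth-first crawl: visit each reachable page exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)                 # a real crawler scrapes here
        for link in LINKS.get(page, []):
            if link not in seen:           # dedupe: never revisit a page
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))
```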


Pagination

beginner

Pagination is a web design pattern that splits large sets of content across multiple pages. When scraping, handling pagination means automatically navigating through all pages to collect the complete dataset.
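The usual loop: fetch a page, collect its items, and keep going until the site reports no next page. The page fetch is stubbed (a real one would request something like /items?page=N, a hypothetical URL):

```python
def fetch_page(page):
    """Stub for a real request to e.g. /items?page=N (hypothetical URL)."""
    data = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
    return data.get(page, []), (page + 1) in data   # items, has_next

def scrape_all():
    """Walk every page until there is no next page."""
    results = []
    page = 1
    while True:
        items, has_next = fetch_page(page)
        results.extend(items)
        if not has_next:
            break
        page += 1
    return results

print(scrape_all())   # ['a', 'b', 'c', 'd', 'e']
```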


User-Agent

beginner

A User-Agent is an HTTP header that identifies the client making a request — including the browser name, version, and operating system. Websites use User-Agent strings to serve different content and to detect automated scrapers.
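Setting the header with the standard library (the scraper name and contact address are invented; a descriptive User-Agent is politer than impersonating a browser, and some sites block the default library string):

```python
from urllib.request import Request

req = Request(
    "https://example.com/",
    headers={"User-Agent": "my-scraper/1.0 (contact@example.com)"},
)

# urllib normalizes header keys to capitalized form internally.
print(req.get_header("User-agent"))   # my-scraper/1.0 (contact@example.com)
```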


Data Storage

beginner

Data storage in web scraping refers to how and where you save the extracted data. The choice depends on the data size, structure, and how you plan to use it — from simple CSV files for small projects to databases for large-scale operations.
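Both ends of that spectrum in one sketch, using only the standard library (an in-memory buffer and an in-memory SQLite database so the example leaves no files behind):

```python
import csv
import io
import sqlite3

rows = [{"name": "Widget", "price": 9.99}, {"name": "Gadget", "price": 24.5}]

# Small project: CSV (in-memory buffer here; use open(...) for a real file).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Larger project: a database (in-memory SQLite for the sketch).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (name TEXT, price REAL)")
db.executemany("INSERT INTO products VALUES (:name, :price)", rows)
count = db.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)   # 2
```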


lxml

beginner

lxml is a high-performance Python library for processing XML and HTML. It provides both a Pythonic API and XPath/CSS selector support, and is among the fastest HTML parsers available in Python, making it a common choice for production web scraping.
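A short sketch of parsing and querying with lxml (assumes the lxml package is installed; the markup is invented):

```python
from lxml import etree  # pip install lxml

html = "<div><p class='intro'>Hello</p><p>World</p></div>"

# etree.HTML is lenient: it repairs broken markup and wraps fragments
# in <html><body> as needed.
tree = etree.HTML(html)

print(tree.xpath("//p/text()"))                  # ['Hello', 'World']
print(tree.xpath("//p[@class='intro']/text()"))  # ['Hello']
```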


DevTools

beginner

Browser DevTools (Developer Tools) is a built-in set of debugging and inspection tools in web browsers. For web scraping, DevTools lets you inspect page structure, find CSS selectors, monitor network requests, and test selectors before writing code.


Intermediate

Playwright

intermediate

Playwright is a browser automation framework developed by Microsoft that controls real browsers (Chromium, Firefox, WebKit) programmatically. For web scraping, it's used to extract data from JavaScript-heavy websites that don't render content in the initial HTML.


Scrapy

intermediate

Scrapy is an open-source Python framework designed for web crawling and scraping at scale. It provides built-in support for following links, handling retries, managing concurrency, and exporting data through pipelines.


XPath

intermediate

XPath (XML Path Language) is a query language for navigating and selecting elements in XML and HTML documents. It uses path-like expressions to traverse the document tree and can select elements based on their position, attributes, or text content.
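A few expressions in action. The standard library's ElementTree supports only a subset of XPath (lxml supports full XPath 1.0), but the subset covers the common cases:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<ul>"
    "<li class='item'>One</li>"
    "<li class='item'>Two</li>"
    "<li class='other'>Three</li>"
    "</ul>"
)

# All <li> elements anywhere under the root
all_items = [li.text for li in doc.findall(".//li")]
# Only those with a matching attribute
items = [li.text for li in doc.findall(".//li[@class='item']")]
# Positional predicate: the second <li>
second = doc.find(".//li[2]").text

print(all_items, items, second)
```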


Headless Browser

intermediate

A headless browser is a web browser that runs without a graphical user interface. It can load pages, execute JavaScript, and render the DOM just like a regular browser — but operates entirely in the background, controlled by code.


Selenium

intermediate

Selenium is an open-source browser automation framework originally built for testing web applications. It controls real browsers programmatically and has been widely used for web scraping, especially for JavaScript-heavy websites.


JavaScript Rendering

intermediate

JavaScript rendering is when a website uses JavaScript to dynamically load, generate, or modify page content after the initial HTML is delivered. This means the data you want isn't in the raw HTML — it's created by JavaScript running in the browser.
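One common shortcut: the data the JavaScript renders is often shipped inside the page as an embedded JSON blob, which you can extract without running a browser at all. The page and the window.__DATA__ variable below are a hypothetical example of the pattern:

```python
import json
import re

# A hypothetical page that ships its data as JSON in a <script> tag.
page = """
<html><body>
<div id="app"></div>
<script>window.__DATA__ = {"products": [{"name": "Widget", "price": 9.99}]};</script>
</body></html>
"""

match = re.search(r"window\.__DATA__ = (\{.*?\});", page, re.DOTALL)
data = json.loads(match.group(1))
print(data["products"][0]["name"])   # Widget
```

When no such blob exists, you need a real browser (see Playwright and Headless Browser).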


API Scraping

intermediate

API scraping is the technique of identifying the internal APIs that websites use to load data, then calling those APIs directly instead of parsing HTML. It's typically faster and more reliable than HTML parsing, and the response is already structured (usually JSON).
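Once you spot the endpoint in DevTools' Network tab, the "parsing" step is just JSON decoding. Here a hypothetical response payload stands in for the endpoint (e.g. /api/products?page=1, an invented URL):

```python
import json

# A hypothetical JSON response from an internal endpoint discovered
# in DevTools' Network tab.
raw = '{"items": [{"id": 1, "title": "Widget", "price": 9.99}], "next_page": 2}'

payload = json.loads(raw)

# No HTML parsing needed: the data is already structured.
titles = [item["title"] for item in payload["items"]]
next_page = payload["next_page"]
print(titles, next_page)   # ['Widget'] 2
```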


Data Cleaning

intermediate

Data cleaning in web scraping is the process of transforming raw, messy extracted data into consistent, accurate, and usable structured data. This includes removing whitespace, parsing values, handling missing data, and normalizing formats.
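All four of those steps on one messy record (the record itself is invented):

```python
def clean_product(raw):
    """Turn a messy scraped record into typed, consistent fields."""
    name = " ".join(raw["name"].split())                 # collapse whitespace
    price_text = raw.get("price", "").replace("$", "").replace(",", "")
    price = float(price_text) if price_text else None    # handle missing data
    in_stock = raw.get("stock", "").strip().lower() == "in stock"  # normalize
    return {"name": name, "price": price, "in_stock": in_stock}

raw = {"name": "  Super\n  Widget ", "price": "$1,299.00", "stock": " In Stock "}
print(clean_product(raw))
# {'name': 'Super Widget', 'price': 1299.0, 'in_stock': True}
```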


IP Ban

intermediate

An IP ban is when a website blocks all requests from a specific IP address, preventing that address from accessing any pages. In web scraping, IP bans are a common consequence of making too many requests or triggering anti-bot defenses.
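A common defensive pattern is to back off when the server starts rejecting you, before a temporary block becomes a ban. The responses below are stubbed status codes so the sketch runs offline (429 and 403 are typical "slow down / blocked" signals):

```python
import time

# Stubbed status codes standing in for real requests: two rejections,
# then success.
RESPONSES = iter([429, 429, 200])

def fetch_with_backoff(url, max_tries=5, base_delay=0.01):
    """Back off exponentially when the server signals it is blocking us."""
    for attempt in range(max_tries):
        status = next(RESPONSES)                 # a real fetch would go here
        if status not in (403, 429):
            return status
        time.sleep(base_delay * 2 ** attempt)    # 1x, 2x, 4x, ...
    raise RuntimeError(f"still blocked after {max_tries} tries: {url}")

status = fetch_with_backoff("https://example.com/products")
print(status)   # 200
```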


Session Cookie

intermediate

A session cookie is a small piece of data stored by the browser that identifies a user's session with a website. In web scraping, managing session cookies is essential for accessing pages behind login walls and maintaining authenticated state across multiple requests.
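The low-level mechanics, using the standard library: parse the Set-Cookie header a login response sends, then replay the value in the Cookie header on later requests (the cookie value here is invented):

```python
from http.cookies import SimpleCookie

# A Set-Cookie header as a login response might send it (invented value).
header = "sessionid=abc123; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(header)

session_id = cookie["sessionid"].value
# Replay the session on later requests via the Cookie request header.
cookie_header = f"sessionid={session_id}"
print(cookie_header)   # sessionid=abc123
```

In practice, requests.Session or http.cookiejar handles this round-trip automatically.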


Advanced

Anti-Bot Detection

advanced

Anti-bot detection refers to the systems and techniques websites use to identify and block automated traffic, including web scrapers. These range from simple checks like User-Agent validation to sophisticated browser fingerprinting and behavioral analysis.


Proxy Rotation

advanced

Proxy rotation is the practice of distributing web scraping requests across multiple IP addresses by cycling through a pool of proxy servers. This prevents any single IP from being rate-limited or blocked.
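The rotation logic itself is simple round-robin; the proxy addresses below are invented, and the fetch is stubbed since routing through them would need a real pool:

```python
from itertools import cycle

# Hypothetical proxy pool; real pools come from a proxy provider.
PROXIES = cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])

def fetch_via_next_proxy(url):
    proxy = next(PROXIES)   # round-robin: each request uses the next IP
    # A real fetch would route through `proxy`, e.g. with an
    # urllib ProxyHandler or requests' proxies= argument.
    return proxy

used = [fetch_via_next_proxy(f"https://example.com/page/{i}") for i in range(4)]
```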


Data Pipeline

advanced

A data pipeline in web scraping is a series of automated steps that process raw scraped data into clean, structured, and usable output. It typically includes extraction, cleaning, validation, transformation, and storage stages.
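Those stages as plain functions chained together (the extraction stage is stubbed; a real one would parse HTML):

```python
def extract(html):
    """Stub extraction stage; a real one would parse the HTML."""
    return [{"name": " Widget ", "price": "$9.99"}, {"name": "", "price": "bad"}]

def clean(record):
    return {"name": record["name"].strip(),
            "price": record["price"].lstrip("$")}

def validate(record):
    """Drop records that are unusable downstream."""
    try:
        float(record["price"])
    except ValueError:
        return False
    return bool(record["name"])

def run_pipeline(html):
    stored = []
    for record in extract(html):
        record = clean(record)
        if validate(record):
            stored.append(record)   # storage stage: list, DB insert, etc.
    return stored

print(run_pipeline("<html>...</html>"))
# [{'name': 'Widget', 'price': '9.99'}]
```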


CAPTCHA

advanced

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a challenge-response test used to determine whether a user is human. In web scraping, CAPTCHAs are one of the most common blocking mechanisms.


Async Scraping

advanced

Async scraping uses asynchronous programming (Python's asyncio) to send multiple HTTP requests concurrently instead of waiting for each one to complete before starting the next. This can speed up scraping by 10-50x.
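The shape of it with asyncio, where a short sleep simulates network latency so the sketch runs offline. Ten sequential 50 ms "fetches" would take half a second; concurrently they finish in roughly the time of the slowest one:

```python
import asyncio
import time

async def fetch(url, delay=0.05):
    """Simulated fetch: the sleep stands in for request latency."""
    await asyncio.sleep(delay)
    return f"body of {url}"

async def scrape_all(urls):
    # gather() runs every fetch concurrently; total time is roughly
    # one request's latency, not the sum of all of them.
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(10)]
start = time.monotonic()
bodies = asyncio.run(scrape_all(urls))
elapsed = time.monotonic() - start
```

For real requests, pair asyncio with an async HTTP client such as aiohttp or httpx.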


Residential Proxy

advanced

A residential proxy routes your web requests through a real residential IP address assigned by an ISP to a home user. Because the IP belongs to a real device and location, websites are far less likely to block it compared to datacenter IPs.


Want to learn all of this hands-on?

The Master Web Scraping course covers all these concepts in 16 in-depth chapters with real code examples you can run.

Get the Full Course — $19
