Beginner
Web Scraping
Web scraping is the process of automatically extracting data from websites using code. Instead of manually copying information, a scraper fetches web pages and parses the HTML to pull out structured data.
BeautifulSoup
BeautifulSoup is a Python library that makes it easy to parse HTML and XML documents. It creates a parse tree from page source code that you can navigate, search, and modify using Pythonic methods.
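A minimal sketch of that navigate-and-search workflow, assuming the third-party `beautifulsoup4` package is installed (the HTML string is made up for illustration):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item">Widget</li>
    <li class="item">Gadget</li>
  </ul>
</body></html>
"""

# Build a parse tree using Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

# Navigate (attribute access) and search (find_all) the tree
title = soup.h1.get_text()
items = [li.get_text() for li in soup.find_all("li", class_="item")]
print(title, items)   # Products ['Widget', 'Gadget']
```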
CSS Selector
A CSS selector is a pattern used to select and target specific HTML elements on a web page. In web scraping, CSS selectors are the primary way to locate the data you want to extract from a page's HTML structure.
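A few common selector patterns in action, sketched with BeautifulSoup's `select()`/`select_one()` (assumes the third-party `beautifulsoup4` package; the markup is invented):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = '<div id="main"><p class="price">$9.99</p><p class="price sale">$4.99</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Tag + class selector: every <p> with class "price"
all_prices = [p.get_text() for p in soup.select("p.price")]

# id selector plus descendant class selector
sale_price = soup.select_one("#main .sale").get_text()

print(all_prices, sale_price)   # ['$9.99', '$4.99'] $4.99
```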
HTTP Request
An HTTP request is a message sent from a client (your scraper) to a web server asking for a resource. In web scraping, you send HTTP requests to fetch web pages, then parse the response HTML to extract data.
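A sketch of building such a request with the standard library's `urllib` — the request object is constructed but not sent, since actually fetching it needs network access (the URL and header value are placeholders):

```python
import urllib.request

# Build a GET request without sending it
req = urllib.request.Request(
    "https://example.com/page",
    headers={"User-Agent": "my-scraper/1.0"},  # placeholder identifier
)
print(req.get_method())               # GET
print(req.get_header("User-agent"))   # my-scraper/1.0

# Actually sending it would look like this (requires network access):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     html = resp.read().decode("utf-8")
```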
HTML Parsing
HTML parsing is the process of taking raw HTML code and converting it into a structured tree that you can navigate and query programmatically. In web scraping, parsing is the step between fetching a page and extracting the specific data you need.
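Even without third-party libraries, Python's standard library can parse HTML. A small sketch using `html.parser` to pull every link out of a snippet:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p><a href="/a">A</a> and <a href="/b">B</a></p>')
print(parser.links)   # ['/a', '/b']
```

Libraries like BeautifulSoup and lxml build on this same idea but give you a full queryable tree instead of a stream of events.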
Rate Limiting
Rate limiting is the practice of controlling the frequency of requests sent to a server. In web scraping, it means adding deliberate delays between requests to avoid overwhelming the target server or triggering anti-bot defenses.
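The simplest form is a fixed pause between requests. A sketch, with a stub in place of the real HTTP call:

```python
import time

def polite_get(urls, delay=1.0, fetch=lambda u: f"<html for {u}>"):
    """Fetch each URL, sleeping `delay` seconds between requests.
    `fetch` is a stand-in here; swap in a real HTTP call."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)   # deliberate pause between requests
        results.append(fetch(url))
    return results
```

Production scrapers often go further: randomized delays, per-domain limits, and backing off when the server starts returning errors.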
robots.txt
robots.txt is a text file at the root of a website (e.g., example.com/robots.txt) that tells web crawlers and scrapers which pages or sections of the site they should or shouldn't access. It follows the Robots Exclusion Protocol.
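Python's standard library can check these rules for you. Normally you'd load the file over the network with `set_url()` and `read()`; this sketch parses an inline example instead so it runs offline:

```python
import urllib.robotparser

# Example rules; a real file lives at https://<site>/robots.txt
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("my-scraper", "https://example.com/public/page"))   # True
print(rp.can_fetch("my-scraper", "https://example.com/private/data"))  # False
```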
Web Crawling
Web crawling is the automated process of systematically browsing the web by following links from page to page. While web scraping extracts data from specific pages, web crawling discovers and navigates to those pages in the first place.
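At its core, a crawler is a graph traversal with a "seen" set so no page is fetched twice. A breadth-first sketch over a tiny in-memory "website" standing in for real fetching and link extraction:

```python
from collections import deque

# Stand-in site: page -> links it contains. A real crawler would
# fetch each URL and parse its links instead.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

def crawl(start):
    """Breadth-first crawl: visit every page reachable from `start`,
    never visiting the same URL twice."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))   # ['/', '/a', '/b', '/c']
```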
Pagination
Pagination is a web design pattern that splits large sets of content across multiple pages. When scraping, handling pagination means automatically navigating through all pages to collect the complete dataset.
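The usual loop: request page 1, 2, 3, … until a page comes back empty. A sketch with a stub standing in for the HTTP call:

```python
def fetch_page(page):
    """Stand-in for an HTTP call to something like /products?page=N.
    Returns an empty list once there are no more results."""
    data = {1: ["a", "b"], 2: ["c"], 3: []}
    return data.get(page, [])

def scrape_all_pages():
    """Keep requesting successive pages until one comes back empty."""
    items, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items

print(scrape_all_pages())   # ['a', 'b', 'c']
```

Other sites paginate with a "next" link or a cursor token instead of page numbers; the loop shape stays the same, only the stopping condition changes.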
User-Agent
A User-Agent is an HTTP header that identifies the client making a request — including the browser name, version, and operating system. Websites use User-Agent strings to serve different content and to detect automated scrapers.
Data Storage
Data storage in web scraping refers to how and where you save the extracted data. The choice depends on the data size, structure, and how you plan to use it — from simple CSV files for small projects to databases for large-scale operations.
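For small projects, writing scraped records to CSV with the standard library is enough. A sketch using an in-memory buffer (a real script would use `open("products.csv", "w", newline="")` instead; the rows are made up):

```python
import csv
import io

rows = [
    {"name": "Widget", "price": 9.99},
    {"name": "Gadget", "price": 4.99},
]

buf = io.StringIO()  # stands in for a real file
writer = csv.DictWriter(buf, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue())
```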
lxml
lxml is a high-performance Python library for processing XML and HTML. It provides both a Pythonic API and XPath/CSS selector support, and is among the fastest HTML parsers available in Python, making it a common choice for production web scraping.
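A minimal sketch, assuming the third-party `lxml` package is installed (the markup is invented):

```python
from lxml import html  # third-party: pip install lxml

doc = html.fromstring("<div><p class='x'>one</p><p>two</p></div>")

# Query the tree with XPath expressions
print(doc.xpath("//p[@class='x']/text()"))   # ['one']
print([p.text for p in doc.xpath("//p")])    # ['one', 'two']
```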
DevTools
Browser DevTools (Developer Tools) is a built-in set of debugging and inspection tools in web browsers. For web scraping, DevTools lets you inspect page structure, find CSS selectors, monitor network requests, and test selectors before writing code.
Learn moreIntermediate
Playwright
Playwright is a browser automation framework developed by Microsoft that controls real browsers (Chromium, Firefox, WebKit) programmatically. For web scraping, it's used to extract data from JavaScript-heavy websites that don't render content in the initial HTML.
Scrapy
Scrapy is an open-source Python framework designed for web crawling and scraping at scale. It provides built-in support for following links, handling retries, managing concurrency, and exporting data through pipelines.
XPath
XPath (XML Path Language) is a query language for navigating and selecting elements in XML and HTML documents. It uses path-like expressions to traverse the document tree and can select elements based on their position, attributes, or text content.
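Python's standard library `xml.etree.ElementTree` supports a useful subset of XPath (lxml supports full XPath 1.0), which is enough to show the path-and-predicate style. A sketch over an invented document:

```python
import xml.etree.ElementTree as ET

tree = ET.fromstring("""
<catalog>
  <book genre="fiction"><title>A</title></book>
  <book genre="tech"><title>B</title></book>
</catalog>
""")

# ".//book[@genre='tech']" = any descendant <book> whose genre attribute is "tech"
titles = [b.find("title").text for b in tree.findall(".//book[@genre='tech']")]
print(titles)   # ['B']
```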
Headless Browser
A headless browser is a web browser that runs without a graphical user interface. It can load pages, execute JavaScript, and render the DOM just like a regular browser — but operates entirely in the background, controlled by code.
Selenium
Selenium is an open-source browser automation framework originally built for testing web applications. It controls real browsers programmatically and has been widely used for web scraping, especially for JavaScript-heavy websites.
JavaScript Rendering
JavaScript rendering is when a website uses JavaScript to dynamically load, generate, or modify page content after the initial HTML is delivered. This means the data you want isn't in the raw HTML — it's created by JavaScript running in the browser.
API Scraping
API scraping is the technique of identifying and using the internal APIs that websites use to load data, then calling those APIs directly instead of parsing HTML. It's typically faster and more reliable than HTML parsing, and returns structured data (usually JSON).
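Once you've spotted such an endpoint in the DevTools Network tab, the "parsing" step is often just decoding JSON. A sketch with a made-up payload standing in for the endpoint's response (fetching it for real would be a single HTTP GET):

```python
import json

# Mimics the body a hypothetical /api/products endpoint might return
payload = '{"products": [{"name": "Widget", "price": 9.99}], "next_page": null}'

data = json.loads(payload)
names = [p["name"] for p in data["products"]]
print(names)   # ['Widget']
```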
Data Cleaning
Data cleaning in web scraping is the process of transforming raw, messy extracted data into consistent, accurate, and usable structured data. This includes removing whitespace, parsing values, handling missing data, and normalizing formats.
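A small sketch of those steps — stripping whitespace and turning a scraped price string into a number (the record is invented):

```python
import re

def clean_price(raw):
    """Turn scraped strings like '  $1,299.00 ' into a float."""
    digits = re.sub(r"[^\d.]", "", raw)   # drop currency symbols, commas, spaces
    return float(digits) if digits else None   # None for missing/unparseable

def clean_record(record):
    return {
        "name": record["name"].strip(),
        "price": clean_price(record["price"]),
    }

raw = {"name": "  Widget \n", "price": " $1,299.00 "}
print(clean_record(raw))   # {'name': 'Widget', 'price': 1299.0}
```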
IP Ban
An IP ban is when a website blocks all requests from a specific IP address, preventing that address from accessing any pages. In web scraping, IP bans are a common consequence of making too many requests or triggering anti-bot defenses.
Session Cookie
A session cookie is a small piece of data stored by the browser that identifies a user's session with a website. In web scraping, managing session cookies is essential for accessing pages behind login walls and maintaining authenticated state across multiple requests.
Learn moreAdvanced
Anti-Bot Detection
advancedAnti-bot detection refers to the systems and techniques websites use to identify and block automated traffic, including web scrapers. These range from simple checks like User-Agent validation to sophisticated browser fingerprinting and behavioral analysis.
Proxy Rotation
Proxy rotation is the practice of distributing web scraping requests across multiple IP addresses by cycling through a pool of proxy servers. This prevents any single IP from being rate-limited or blocked.
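The core mechanism is just round-robin cycling over a pool. A sketch with a hypothetical proxy list and a stub in place of the real HTTP call:

```python
import itertools

# Hypothetical pool; in practice these addresses come from a proxy provider
PROXIES = ["http://p1:8080", "http://p2:8080", "http://p3:8080"]
rotation = itertools.cycle(PROXIES)

def fetch_via_proxy(url):
    """Each request goes out through the next proxy in the pool."""
    proxy = next(rotation)
    # A real call might be: requests.get(url, proxies={"http": proxy, "https": proxy})
    return proxy

used = [fetch_via_proxy(f"https://example.com/{i}") for i in range(4)]
print(used)   # the fourth request cycles back to the first proxy
```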
Data Pipeline
A data pipeline in web scraping is a series of automated steps that process raw scraped data into clean, structured, and usable output. It typically includes extraction, cleaning, validation, transformation, and storage stages.
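A toy sketch of those stages chained together, each function standing in for a real stage (the records are invented; a storage stage would come last):

```python
def extract(raw_rows):
    # Stand-in for the parsing/extraction stage
    return [{"name": r["name"], "price": r["price"]} for r in raw_rows]

def clean(rows):
    # Normalize whitespace and convert price strings to floats
    return [{"name": r["name"].strip(), "price": float(r["price"].lstrip("$"))}
            for r in rows]

def validate(rows):
    # Drop records that fail basic sanity checks
    return [r for r in rows if r["name"] and r["price"] > 0]

def run_pipeline(raw):
    """Each stage consumes the previous stage's output."""
    return validate(clean(extract(raw)))

raw = [{"name": " Widget ", "price": "$9.99"}, {"name": "", "price": "$1.00"}]
print(run_pipeline(raw))   # [{'name': 'Widget', 'price': 9.99}]
```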
CAPTCHA
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a challenge-response test used to determine whether a user is human. In web scraping, CAPTCHAs are one of the most common blocking mechanisms.
Async Scraping
Async scraping uses asynchronous programming (e.g., Python's asyncio) to send multiple HTTP requests concurrently instead of waiting for each one to complete before starting the next. Because scraping is mostly waiting on network I/O, this can speed it up dramatically — often by an order of magnitude.
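A sketch of the concurrency pattern with the standard library, using `asyncio.sleep` to simulate network latency in place of a real async HTTP client (such as aiohttp or httpx):

```python
import asyncio

async def fetch(url):
    """Stand-in for a real async HTTP call."""
    await asyncio.sleep(0.05)   # simulated network latency
    return f"<html for {url}>"

async def scrape_all(urls):
    # All requests are in flight at once: total time is roughly
    # one request's latency, not the sum of all of them.
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(scrape_all([f"https://example.com/{i}" for i in range(10)]))
print(len(pages))   # 10
```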
Residential Proxy
A residential proxy routes your web requests through a real residential IP address assigned by an ISP to a home user. Because the IP belongs to a real device and location, websites are far less likely to block it compared to datacenter IPs.
Want to learn all of this hands-on?
The Master Web Scraping course covers all these concepts in 16 in-depth chapters with real code examples you can run.
Get the Full Course — $19