
What Is HTML Parsing? Extracting Data from Web Pages

beginner

HTML parsing is the process of taking raw HTML code and converting it into a structured tree that you can navigate and query programmatically. In web scraping, parsing is the step between fetching a page and extracting the specific data you need.

How HTML Parsing Works

Raw HTML is just a string of text. A parser reads this text and builds a tree structure in which each HTML tag becomes a node. Browsers call this tree the DOM; libraries such as BeautifulSoup build a comparable parse tree. You can then traverse the tree to find specific elements.

code
Raw HTML:  <div class="product"><h2>Widget</h2><span>$9.99</span></div>

Parse Tree:
div.product
├── h2 → "Widget"
└── span → "$9.99"
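
To make the string-to-tree step concrete, here is a minimal sketch that uses only Python's standard-library html.parser module. The names Node, TreeBuilder, and build_tree are hypothetical helpers invented for this illustration. The sketch assumes every tag is explicitly closed, so it performs none of the error recovery that real parsers provide.

```python
from html.parser import HTMLParser


class Node:
    """A minimal tree node: tag name, attributes, text, and children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Toy tree builder; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent  # climb back to the parent

    def handle_data(self, data):
        self.current.text += data  # attach text to the innermost open element


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


raw_html = '<div class="product"><h2>Widget</h2><span>$9.99</span></div>'
root = build_tree(raw_html)
div = root.children[0]
summary = [(child.tag, child.text) for child in div.children]

print(div.tag, div.attrs)
print(summary)
```

In practice you would rarely write this yourself. Libraries such as BeautifulSoup wrap a parser and give you the tree plus search methods.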

Python HTML Parsers

Parser       Library    Speed    Best For
html.parser  Built-in   Medium   Simple scripts, no install
lxml         lxml       Fastest  Production scraping
html5lib     html5lib   Slowest  Badly broken HTML
python
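from bs4 import BeautifulSoup

# Sample markup so the snippet runs on its own
html = '<div class="product"><h2>Widget</h2><span>$9.99</span></div>'

# Using different parsers (lxml and html5lib must be installed separately)
soup = BeautifulSoup(html, "html.parser")  # built-in
soup = BeautifulSoup(html, "lxml")         # fastest
soup = BeautifulSoup(html, "html5lib")     # most forgiving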
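
Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when a faster one is missing. The following is one possible sketch. The helper name make_soup and its preferred argument are assumptions made for this example, not part of BeautifulSoup. It relies on BeautifulSoup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


html = '<div class="product"><h2>Widget</h2><span>$9.99</span></div>'
soup, used = make_soup(html)
print(used, soup.h2.get_text())
```

Keep in mind that different parsers can build different trees from the same invalid markup, so a fallback like this is safest on reasonably clean HTML.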

Common Parsing Tasks

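  • Extract text: element.text or element.get_text(strip=True) (the latter trims surrounding whitespace)
  • Extract attributes: element.get("href") returns None if the attribute is missing; element["src"] raises KeyError
  • Navigate siblings: element.next_sibling, element.previous_sibling (these can return whitespace text nodes, not only tags)
  • Navigate parents: element.parent, element.find_parent("div")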
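
The tasks listed above can be tried together on a small snippet. This is an illustrative sketch: the markup and variable names are made up for the example, and the markup deliberately has no whitespace between tags so that sibling navigation lands on elements rather than on text nodes.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="product">'
    '<a href="/widget"><img src="widget.png"></a>'
    "<h2> Widget </h2>"
    "<span>$9.99</span>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# Extract text
raw_title = h2.text                    # keeps the surrounding spaces
title = h2.get_text(strip=True)        # whitespace trimmed

# Extract attributes
link = soup.find("a").get("href")      # .get() returns None if missing
tooltip = soup.find("a").get("title")  # attribute absent, so None
image = soup.find("img")["src"]        # [] raises KeyError if missing

# Navigate siblings
price = h2.next_sibling.get_text()
before = h2.previous_sibling.name

# Navigate parents
container = h2.find_parent("div")

print(title, link, image, price, before, container["class"])
```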

Handling Malformed HTML

Real-world HTML is often broken: closing tags are missing, elements are nested incorrectly, or the syntax is simply invalid. Good parsers handle this gracefully. lxml repairs most issues automatically. html5lib goes further and follows the same HTML5 parsing rules that browsers use, so the resulting tree matches what a browser would build.
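
A quick way to see this repair behavior is to feed the same broken markup to each parser you have installed and compare the results. The sketch below is illustrative only. The broken_html sample is invented, parsers that are not installed are skipped, and the exact repaired markup can differ from one parser to another.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The <h2> and <span> tags are never closed
broken_html = '<div class="product"><h2>Widget<span>$9.99</div>'

repaired = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken_html, parser)
    except FeatureNotFound:
        continue  # optional parser not installed; skip it
    repaired[parser] = str(soup)

for parser, markup in repaired.items():
    print(f"{parser:12} {markup}")

# Whichever parser is used, the data is still reachable after repair
span_text = BeautifulSoup(broken_html, "html.parser").find("span").get_text()
print(span_text)
```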

Tip: Parse Only What You Need

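For large pages, parsing the entire document is wasteful if you only need one section. SoupStrainer in BeautifulSoup restricts parsing to matching elements. It works with html.parser and lxml but not with html5lib, which always parses the whole document: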

python
from bs4 import BeautifulSoup, SoupStrainer

only_products = SoupStrainer(class_="product-card")
soup = BeautifulSoup(html, "lxml", parse_only=only_products)
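
To see what the strainer keeps and what it discards, here is a small self-contained sketch. The page markup is made up for the example, and it uses the built-in html.parser so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup, SoupStrainer

page = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <div class="product-card"><h2>Widget</h2><span>$9.99</span></div>
  <div class="product-card"><h2>Gadget</h2><span>$19.99</span></div>
  <footer>Contact us</footer>
</body></html>
"""

only_products = SoupStrainer(class_="product-card")
soup = BeautifulSoup(page, "html.parser", parse_only=only_products)

# Only the matching elements and their contents end up in the tree
names = [card.h2.get_text(strip=True) for card in soup.find_all("div")]
prices = [card.span.get_text(strip=True) for card in soup.find_all("div")]

print(names, prices)
print(soup.find("nav"), soup.find("footer"))
```

Elements outside the matched cards, such as the nav and footer here, never become part of the tree, which is where the memory and time savings come from.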
