What Is HTML Parsing? Extracting Data from Web Pages

beginner

HTML parsing is the process of taking raw HTML code and converting it into a structured tree that you can navigate and query programmatically. In web scraping, parsing is the step between fetching a page and extracting the specific data you need.

How HTML Parsing Works

Raw HTML is just a string of text. A parser reads this text and builds a tree structure, modeled on the browser's DOM (Document Object Model), in which each HTML element becomes a node and the text inside it becomes a child of that node. You can then traverse this tree to find specific elements.

code
Raw HTML:  <div class="product"><h2>Widget</h2><span>$9.99</span></div>

Parse Tree:
div.product
├── h2 → "Widget"
└── span → "$9.99"
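To make the tree-building step concrete, here is a minimal sketch that uses only Python's standard-library `html.parser.HTMLParser`. The `Node` and `TreeBuilder` classes and the `parse` function are hypothetical helpers written for this illustration, not part of any library. The sketch assumes well-formed HTML with no void elements such as `<br>`, and it does none of the error recovery a real parser performs.

```python
from html.parser import HTMLParser


class Node:
    """One element in the parse tree (hypothetical helper for illustration)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects, in document order
        self.text = ""      # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Turns parser events into a Node tree. Assumes well-formed input."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # An opening tag creates a child node and descends into it.
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # A closing tag climbs back up to the parent node.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


html = '<div class="product"><h2>Widget</h2><span>$9.99</span></div>'
root = parse(html)
div = root.children[0]
print(div.tag, div.attrs)                       # div {'class': 'product'}
print([(c.tag, c.text) for c in div.children])  # [('h2', 'Widget'), ('span', '$9.99')]
```

In practice you would rarely write this yourself. Libraries such as BeautifulSoup build a richer tree and add search methods on top of it.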

Python HTML Parsers

Parser       | Library   | Speed    | Best For
html.parser  | Built-in  | Medium   | Simple scripts, no install
lxml         | lxml      | Fastest  | Production scraping
html5lib     | html5lib  | Slowest  | Badly broken HTML
python
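from bs4 import BeautifulSoup

html = '<div class="product"><h2>Widget</h2><span>$9.99</span></div>'

# The second argument selects the parser
soup = BeautifulSoup(html, "html.parser")  # built-in, nothing to install
soup = BeautifulSoup(html, "lxml")         # fastest (requires: pip install lxml)
soup = BeautifulSoup(html, "html5lib")     # most forgiving (requires: pip install html5lib)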

Common Parsing Tasks

  • Extract text: element.text or element.get_text(strip=True)
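  • Extract attributes: element.get("href") returns None if the attribute is missing; element["src"] raises KeyError instead
  • Navigate siblings: element.next_sibling, element.previous_sibling (these can return whitespace text nodes; use element.find_next_sibling("tag") to jump to the next matching tag)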
  • Navigate parents: element.parent, element.find_parent("div")
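The tasks listed above can be sketched in one short script. The sample markup, including the `price` class and the file paths, is made up for illustration. The built-in `html.parser` is used so that nothing beyond BeautifulSoup needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
html = """
<div class="product">
  <h2> Widget </h2>
  <a href="/widget"><img src="/img/widget.png"></a>
  <span class="price">$9.99</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Extract text: strip=True trims surrounding whitespace
h2 = soup.find("h2")
title = h2.get_text(strip=True)

# Extract attributes: .get() is safe for optional attributes,
# while ["..."] raises KeyError if the attribute is absent
link = soup.find("a")
href = link.get("href")
missing = link.get("title")  # None, because <a> has no title attribute
src = soup.find("img")["src"]

# Navigate siblings: next_sibling here is the whitespace between the tags,
# whereas find_next_sibling("a") skips ahead to the next <a> tag
whitespace = h2.next_sibling
next_tag = h2.find_next_sibling("a")

# Navigate parents: climb from the price up to its enclosing <div>
price = soup.find("span", class_="price")
container = price.find_parent("div")

print(title, href, src, next_tag.name, container["class"])
```

Note that `container["class"]` is a list, because BeautifulSoup treats `class` as a multi-valued attribute.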

Handling Malformed HTML

Real-world HTML is often broken: closing tags are missing, elements are nested incorrectly, or the syntax is simply invalid. Good parsers handle this gracefully. lxml fixes most issues automatically. html5lib goes further: it implements the HTML5 parsing algorithm, so it repairs broken markup the same way a browser would.
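One way to see how the parsers differ is to feed the same broken markup to each of them and compare the repaired output. The snippet below is a sketch with made-up input. It assumes that lxml and html5lib may not be installed, so it skips any parser that BeautifulSoup reports as missing. The exact repaired markup depends on the parser and its version, so no expected output is shown.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical broken input: unclosed <li> and <p> tags
broken = "<ul><li>One<li>Two</ul><p>Unclosed paragraph"

repaired = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        # str(soup) serializes the tree the parser built
        repaired[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not installed
        continue

for parser, markup in repaired.items():
    print(f"{parser:12} {markup}")
```

Comparing the printed lines shows which parser adds missing wrapper elements and how each one closes the dangling tags.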

Tip: Parse Only What You Need

For large pages, parsing the entire document is wasteful if you only need one section. Use SoupStrainer in BeautifulSoup to parse only matching elements:

python
from bs4 import BeautifulSoup, SoupStrainer

only_products = SoupStrainer(class_="product-card")
soup = BeautifulSoup(html, "lxml", parse_only=only_products)
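To check what a strainer keeps, the following sketch runs the same idea on a small made-up page. It uses the built-in `html.parser` rather than lxml so that it runs without extra installs. The `nav` and `footer` elements are hypothetical and exist only to show that non-matching content is dropped.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: two product cards surrounded by unrelated markup
html = """
<nav><a href="/">Home</a></nav>
<div class="product-card"><h2>Widget</h2></div>
<div class="product-card"><h2>Gadget</h2></div>
<footer>Contact us</footer>
"""

only_products = SoupStrainer(class_="product-card")
soup = BeautifulSoup(html, "html.parser", parse_only=only_products)

# Only the matching elements and their descendants are in the tree
names = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(names)             # ['Widget', 'Gadget']
print(soup.find("nav"))  # None, because <nav> was never added to the tree
```

SoupStrainer has no effect with html5lib, which always parses the whole document, so pair it with html.parser or lxml.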

