What Is lxml? Fast XML and HTML Parsing in Python
lxml is a high-performance Python library for processing XML and HTML. It combines a Pythonic, ElementTree-style API with XPath support and CSS selectors (through the cssselect package), and it is among the fastest HTML parsers available in Python, which makes it a common choice for production web scraping.
Why lxml?
lxml is a Python binding for the C libraries libxml2 and libxslt, so parsing runs in compiled code and is significantly faster than Python's built-in html.parser. When scraping thousands of pages, that speed difference adds up.
| Parser | Speed (relative) | Handles Broken HTML | Install |
|---|---|---|---|
| html.parser | 1x (baseline) | Decent | Built-in |
| lxml | 5-10x faster | Good | pip install lxml |
| html5lib | 0.2x (slow) | Best | pip install html5lib |
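The relative speeds in the table depend on the machine and on the markup, so it is worth measuring on your own pages. The following is a rough sketch that times the standard library's HTMLParser against lxml on a synthetic page. The helper names (LinkCounter, count_links_stdlib, count_links_lxml) and the sample markup are made up for illustration, and the script compares the raw parsers rather than BeautifulSoup backends.

```python
import timeit
from html.parser import HTMLParser

from lxml import html as lxml_html


class LinkCounter(HTMLParser):
    """Counts <a> start tags using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_links_stdlib(markup):
    parser = LinkCounter()
    parser.feed(markup)
    parser.close()
    return parser.count


def count_links_lxml(markup):
    tree = lxml_html.fromstring(markup)
    return len(tree.xpath("//a"))


# Synthetic page: 2,000 list items, each containing one link (made-up data).
SAMPLE = "<html><body><ul>" + "".join(
    f'<li><a href="/item/{i}">Item {i}</a></li>' for i in range(2000)
) + "</ul></body></html>"

# Time 20 full parses with each approach and print the totals.
for name, fn in [("html.parser", count_links_stdlib), ("lxml", count_links_lxml)]:
    seconds = timeit.timeit(lambda: fn(SAMPLE), number=20)
    print(f"{name:12s} {seconds:.3f}s for 20 parses")
```

Treat the printed numbers as a local measurement only. Real pages with scripts, comments, and malformed markup can shift the ratio in either direction.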
Using lxml with BeautifulSoup
The most common pattern is to use lxml as the parser backend for BeautifulSoup:
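from bs4 import BeautifulSoup
# html is assumed to be a string (or bytes) holding the page markup
soup = BeautifulSoup(html, "lxml")  # select lxml as the parser backend
products = soup.select(".product-card")  # list of matching Tag objects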
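For a version that runs on its own, here is a minimal sketch with made-up markup. The class names title and price and the sample products are assumptions for illustration, and the second card deliberately leaves a p tag unclosed to show that the lxml backend still recovers a usable tree.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; note the unclosed <p> in the second card.
html = """
<div class="product-card"><h2 class="title">Widget</h2><span class="price">$9.99</span></div>
<div class="product-card"><h2 class="title">Gadget</h2><span class="price">$19.50</span><p>New
</div>
"""

soup = BeautifulSoup(html, "lxml")

products = []
for card in soup.select(".product-card"):
    title = card.select_one(".title")
    price = card.select_one(".price")
    products.append({
        # Guard against cards that lack one of the expected elements.
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    })

print(products)
```

Checking for None before calling get_text keeps one incomplete card from aborting the whole run, which matters once the input is real scraped pages rather than a tidy sample.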
Using lxml Directly
For maximum performance, use lxml's own API:
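from lxml import html
from lxml.cssselect import CSSSelector  # needs the separate cssselect package
# page_content is assumed to be a string (or bytes) holding the page markup
tree = html.fromstring(page_content)
# XPath: returns the text nodes as a list of strings
titles = tree.xpath('//h2[@class="title"]/text()')
prices = tree.xpath('//span[@class="price"]/text()')
# CSS selectors (via cssselect): returns a list of matching elements
selector = CSSSelector(".product-card .title")
elements = selector(tree)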
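One detail worth knowing about the XPath expressions above: @class="title" compares the entire attribute string, so it does not match an element written as class="title featured", whereas a CSS selector like .title would. The sketch below, using made-up markup, shows the difference and a common XPath idiom for matching a single class token.

```python
from lxml import html

# Made-up markup; the second heading carries two class names.
page_content = """
<div class="product-card"><h2 class="title">Widget</h2><span class="price">$9.99</span></div>
<div class="product-card"><h2 class="title featured">Gadget</h2><span class="price">$19.50</span></div>
"""

tree = html.fromstring(page_content)

# Exact string comparison: misses class="title featured".
exact = tree.xpath('//h2[@class="title"]/text()')

# Token match: pad the class list with spaces, then look for " title ".
token = tree.xpath(
    '//h2[contains(concat(" ", normalize-space(@class), " "), " title ")]/text()'
)

print(exact)  # ['Widget']
print(token)  # ['Widget', 'Gadget']
```

The padded contains() form avoids false positives such as class="subtitle", which a bare contains(@class, "title") would also match.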
When to Use lxml Directly vs. BeautifulSoup
- BeautifulSoup + lxml: When you want a friendly API and don't need maximum speed
- lxml directly: When parsing speed is critical (millions of pages) or you prefer XPath
Installation Note
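lxml is built on the C libraries libxml2 and libxslt. On most systems, pip install lxml works fine because pip downloads a prebuilt binary wheel and nothing needs to be compiled. If no wheel is available for your platform, pip builds lxml from source, and on some Linux systems you may need to install libxml2-dev and libxslt-dev (the Debian/Ubuntu package names) first.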
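After installing, a quick way to confirm that lxml imports correctly, and to see which library versions it is running against, is to print the version tuples exposed by lxml.etree. This is a small sketch, and version_string is a hypothetical helper for formatting only.

```python
from lxml import etree


def version_string(parts):
    """Join a version tuple such as (1, 2, 3) into '1.2.3'."""
    return ".".join(str(p) for p in parts)


print("lxml:   ", version_string(etree.LXML_VERSION))
print("libxml2:", version_string(etree.LIBXML_VERSION))
print("libxslt:", version_string(etree.LIBXSLT_VERSION))
```

If the import itself fails, the installation did not complete, which on Linux often points back to the missing development packages mentioned above.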