What Is robots.txt? Web Scraping Rules & Etiquette
robots.txt is a text file at the root of a website (e.g., example.com/robots.txt) that tells web crawlers and scrapers which pages or sections of the site they should or shouldn't access. It follows the Robots Exclusion Protocol.
How robots.txt Works
Most well-configured websites serve a robots.txt file at their root. It contains groups of rules for different bots:
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2
User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml
Reading robots.txt
- User-agent: * applies the rules that follow to all bots
- Disallow: /admin/ blocks /admin/ and anything under it
- Allow: / permits access to everything
- Crawl-delay: 2 asks bots to wait 2 seconds between requests
- Sitemap: gives the location of the XML sitemap
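The Disallow/Allow rules above are matched by URL-path prefix. A minimal sketch of that matching logic, assuming the RFC 9309 convention that the longest matching rule wins (with Allow winning ties) and ignoring wildcards and percent-encoding, so this is illustrative rather than a full parser:

```python
def is_allowed(path, rules):
    """rules: list of ("allow" | "disallow", prefix) pairs for one user-agent group."""
    best = ("allow", "")  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        # Longest matching prefix wins; Allow wins a tie
        if len(prefix) > len(best[1]) or (
            len(prefix) == len(best[1]) and directive == "allow"
        ):
            best = (directive, prefix)
    return best[0] == "allow"

rules = [("disallow", "/admin/"), ("disallow", "/private/")]
print(is_allowed("/admin/users", rules))  # False
print(is_allowed("/products", rules))     # True
```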
Should Scrapers Respect robots.txt?
Technically, robots.txt is advisory, not legally binding. However:
- Respect it for ethical scraping and to reduce legal risk
- Check it first, before scraping any site
- Honor Crawl-delay to be a good citizen
- Note that Google and other major search engines strictly follow robots.txt; for other bots, the rules are only suggestions
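Honoring Crawl-delay can be as simple as sleeping between requests. A sketch, where fetch is a placeholder for a real HTTP call and the delay value is assumed to come from the site's robots.txt:

```python
import time

CRAWL_DELAY = 2  # seconds, taken from the site's "Crawl-delay: 2"

def polite_crawl(urls, fetch=lambda url: url):
    results = []
    for url in urls:
        results.append(fetch(url))  # real HTTP request omitted
        time.sleep(CRAWL_DELAY)     # be a good citizen between requests
    return results
```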
Parsing robots.txt in Python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# Check whether a URL may be scraped under the rules for all bots ("*")
can_scrape = rp.can_fetch("*", "https://example.com/products")

# Crawl-delay for "*", or None if the file doesn't set one
crawl_delay = rp.crawl_delay("*")
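If the robots.txt content has already been fetched with another HTTP client, RobotFileParser can also be fed the lines directly via parse() instead of read(). A sketch with made-up rules:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed without any network request
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/private/data"))  # False
print(rp.crawl_delay("*"))  # 2
```

This is also handy for testing scraping logic offline.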
Common Patterns
- Most e-commerce sites block product API endpoints but allow product pages
- Social media sites typically disallow everything except public profiles
- News sites often allow crawling but set a Crawl-delay
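The first pattern can be illustrated with a hypothetical e-commerce robots.txt (shop.example.com and the paths are made up):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical e-commerce rules: API blocked, product pages allowed
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /api/",
    "Allow: /products/",
])

print(rp.can_fetch("*", "https://shop.example.com/api/v1/products"))  # False
print(rp.can_fetch("*", "https://shop.example.com/products/widget"))  # True
```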