What Is robots.txt? Web Scraping Rules & Etiquette
robots.txt is a text file at the root of a website (e.g., example.com/robots.txt) that tells web crawlers and scrapers which pages or sections of the site they should or shouldn't access. It follows the Robots Exclusion Protocol.
How robots.txt Works
Most well-configured websites serve a robots.txt file at their root. It contains groups of rules for different bots:
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2
User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml
Reading robots.txt
- User-agent: * applies the rules that follow to all bots
- Disallow: /admin/ blocks /admin/ and anything under it
- Allow: / permits access to everything
- Crawl-delay: 2 asks bots to wait 2 seconds between requests
- Sitemap: gives the location of the XML sitemap
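The Disallow/Allow rules above are matched by URL-path prefix. A minimal sketch of that matching logic, assuming the RFC 9309 convention that the longest matching rule wins (with Allow winning ties) and ignoring wildcards and percent-encoding, so this is illustrative rather than a full parser:

```python
def is_allowed(path, rules):
    """rules: list of ("allow" | "disallow", prefix) pairs for one user-agent group."""
    best = ("allow", "")  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        # Longest matching prefix wins; Allow wins a tie
        if len(prefix) > len(best[1]) or (
            len(prefix) == len(best[1]) and directive == "allow"
        ):
            best = (directive, prefix)
    return best[0] == "allow"

rules = [("disallow", "/admin/"), ("disallow", "/private/")]
print(is_allowed("/admin/users", rules))  # False
print(is_allowed("/products", rules))     # True
```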
Should Scrapers Respect robots.txt?
Technically, robots.txt is advisory, not legally binding. However:
- Respect it for ethical scraping and to reduce legal risk
- Check it first, before scraping any site
- Honor Crawl-delay to be a good citizen
- Note that Google and other major search engines strictly follow robots.txt; for other bots, the rules are only suggestions
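Honoring Crawl-delay can be as simple as sleeping between requests. A sketch, where fetch is a placeholder for a real HTTP call and the delay value is assumed to come from the site's robots.txt:

```python
import time

CRAWL_DELAY = 2  # seconds, taken from the site's "Crawl-delay: 2"

def polite_crawl(urls, fetch=lambda url: url):
    results = []
    for url in urls:
        results.append(fetch(url))  # real HTTP request omitted
        time.sleep(CRAWL_DELAY)     # be a good citizen between requests
    return results
```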
Parsing robots.txt in Python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# Check whether a URL may be scraped under the rules for all bots ("*")
can_scrape = rp.can_fetch("*", "https://example.com/products")

# Crawl-delay for "*", or None if the file doesn't set one
crawl_delay = rp.crawl_delay("*")
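If the robots.txt content has already been fetched with another HTTP client, RobotFileParser can also be fed the lines directly via parse() instead of read(). A sketch with made-up rules:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed without any network request
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/private/data"))  # False
print(rp.crawl_delay("*"))  # 2
```

This is also handy for testing scraping logic offline.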
Common Patterns
- Most e-commerce sites block product API endpoints but allow product pages
- Social media sites typically disallow everything except public profiles
- News sites often allow crawling but set a Crawl-delay
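The first pattern can be illustrated with a hypothetical e-commerce robots.txt (shop.example.com and the paths are made up):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical e-commerce rules: API blocked, product pages allowed
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /api/",
    "Allow: /products/",
])

print(rp.can_fetch("*", "https://shop.example.com/api/v1/products"))  # False
print(rp.can_fetch("*", "https://shop.example.com/products/widget"))  # True
```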