
What Is robots.txt? Web Scraping Rules & Etiquette

beginner

robots.txt is a text file at the root of a website (e.g., example.com/robots.txt) that tells web crawlers and scrapers which pages or sections of the site they should or shouldn't access. It follows the Robots Exclusion Protocol.

How robots.txt Works

Most well-configured websites serve a robots.txt file at their root. It contains groups of rules for different bots:

```
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Reading robots.txt

  • User-agent: * — rules for all bots
  • Disallow: /admin/ — don't access /admin/ or anything under it
  • Allow: / — can access everything
  • Crawl-delay: 2 — wait 2 seconds between requests (a non-standard directive; not every crawler honors it)
  • Sitemap: — location of the XML sitemap
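To see how these directives combine, the example file above can be fed straight into Python's stdlib parser. This is a small sketch; the URLs being checked are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2

User-agent: Googlebot
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Generic bots: blocked under /admin/, allowed elsewhere
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
print(rp.can_fetch("*", "https://example.com/products"))     # True

# Googlebot has its own group that allows everything
print(rp.can_fetch("Googlebot", "https://example.com/admin/users"))  # True

# Crawl-delay applies to the generic group only
print(rp.crawl_delay("*"))  # 2
```

Note that the Googlebot group completely replaces the `*` group for that bot; groups are matched per user agent, not merged.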

Should Scrapers Respect robots.txt?

Technically, robots.txt is advisory, not legally binding. However:
  • Respect it for ethical scraping and to avoid legal issues
  • Check it first before scraping any site
  • Follow Crawl-delay to be a good citizen
  • Note: major search engines such as Google strictly obey robots.txt; for all other bots the rules are only suggestions, which is exactly why respecting them voluntarily matters
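The etiquette above can be sketched as a small gate that filters URLs through the rules and sleeps between requests. The sample file, URLs, and fallback delay below are illustrative assumptions, not any real site's policy:

```python
import time
from urllib.robotparser import RobotFileParser

SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

rp = RobotFileParser()
rp.parse(SAMPLE.splitlines())

def polite_urls(urls, user_agent="*", default_delay=1.0):
    """Yield only the URLs robots.txt allows, pausing per Crawl-delay."""
    delay = rp.crawl_delay(user_agent) or default_delay
    for url in urls:
        if not rp.can_fetch(user_agent, url):
            continue  # disallowed: skip it entirely
        yield url
        time.sleep(delay)  # wait before the next request

for url in polite_urls(["https://example.com/blog",
                        "https://example.com/private/notes"]):
    pass  # fetch `url` here with your HTTP client of choice
```

In a real scraper you would also send an honest User-Agent header so the site can identify your bot.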

Parsing robots.txt in Python

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the file

# Check if you can scrape a URL
can_scrape = rp.can_fetch("*", "https://example.com/products")
crawl_delay = rp.crawl_delay("*")  # None if no Crawl-delay is set
```
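One subtlety worth handling: `read()` quietly maps a missing robots.txt (404) to "no restrictions", but a transport failure (DNS error, timeout) raises `URLError`. A hedged wrapper might look like this; treating an unreachable server as allow-all is a policy choice of this sketch, not part of the protocol:

```python
from urllib.error import URLError
from urllib.robotparser import RobotFileParser

def robots_check(url, robots_url, user_agent="*"):
    """Return (allowed, delay) for url according to robots_url.

    If robots.txt cannot be fetched at all, assume no stated
    restrictions (a common, but debatable, fallback).
    """
    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except URLError:
        return True, None  # server unreachable: no rules to apply
    return rp.can_fetch(user_agent, url), rp.crawl_delay(user_agent)
```

Call it once per site and cache the result; re-downloading robots.txt for every request defeats the point of being polite.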

Common Patterns

  • Most e-commerce sites block product API endpoints but allow product pages
  • Social media sites typically disallow everything except public profiles
  • News sites often allow crawling but set a Crawl-delay

Learn robots.txt hands-on

This glossary entry covers the basics. The Master Web Scraping course teaches you to use robots.txt in real projects across 16 in-depth chapters.
