Guide

robots.txt can express AI crawler preferences. It cannot make every bot obey.

Use robots.txt for cooperative crawler signals, a canonical policy page for business terms, and monitoring so you know when policy coverage changes.

The common policy: allow search, block training

Many site owners want search discovery and answer-time retrieval, but do not want model training, bulk dataset creation, or commercial scraping without permission. BotConsent starts from that posture because it is easy for business owners to understand. A robots.txt file expressing it might look like this:

# Allow search-style crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Block model training or broad AI ingestion unless licensed
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Policy: https://example.com/botconsent
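
One way to sanity-check a policy like this before publishing is to parse it locally and ask how each crawler token would be treated. The sketch below uses Python's standard `urllib.robotparser` on the directive lines from the example. The test URL is a placeholder and the helper name `check_policy` is made up for illustration. The standard-library parser's matching rules are simpler than those of production crawlers, so treat the result as a rough check, not a guarantee of how any given bot behaves.

```python
from urllib.robotparser import RobotFileParser

# Directive lines from the example policy, inlined so the check runs offline.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def check_policy(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return {agent: allowed?} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical URL, used only for illustration.
results = check_policy(
    ROBOTS_TXT,
    "https://example.com/articles/1",
    ["Googlebot", "Bingbot", "OAI-SearchBot", "GPTBot", "Google-Extended"],
)
for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

A check like this only confirms what the file says. Whether a crawler honors it is a separate question.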

OpenAI crawler tokens

OpenAI documents multiple user agents for different use cases, including GPTBot, OAI-SearchBot, and ChatGPT-User. Treat those distinctions carefully: a training bot, a search crawler, and a user-initiated retrieval agent can have different business implications.
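
To make the distinction concrete, the following sketch maps a User-Agent header to a purpose label by looking for the three tokens named above. The purpose labels, the function name `classify_user_agent`, and the sample header are assumptions made for illustration; real header strings and version numbers may differ. Because a User-Agent header is self-reported, a match indicates only a claimed identity.

```python
# Purpose labels are this sketch's own naming, not OpenAI terminology.
OPENAI_TOKEN_PURPOSE = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user-initiated retrieval",
}


def classify_user_agent(user_agent: str) -> str:
    """Map a User-Agent header to a purpose label by case-insensitive token match."""
    lowered = user_agent.lower()
    for token, purpose in OPENAI_TOKEN_PURPOSE.items():
        if token.lower() in lowered:
            return purpose
    return "unknown"


# Illustrative header value; real strings and version numbers may differ.
sample = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
print(classify_user_agent(sample))
```

A table like this could feed log reports or per-purpose rules, so that a decision about training traffic does not silently apply to search or user-initiated requests.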

Google-Extended

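Google documents Google-Extended as a product token that site owners can use to manage whether their sites help improve Gemini Apps and Vertex AI generative APIs. It is a robots.txt control token rather than a separate crawler: Google's documentation notes that it has no distinct HTTP user-agent string, and that it does not affect a site's inclusion or ranking in Google Search. That makes it a useful example of why AI bot policy should be more specific than simply "allow all" or "block all."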

Commercial scrapers and unknown bots

Some commercial scrapers use clear user-agent tokens. Others do not. robots.txt is most effective with good-faith operators that identify themselves and respect published rules. For unknown or evasive traffic, you may need infrastructure controls, rate limits, bot management, legal review, or contractual enforcement.

Add a canonical policy page

robots.txt is terse. A policy page can explain the business terms behind the directives:

Allowed uses: search indexing, link previews, or assistant retrieval.
Restricted uses: model training, dataset creation, bulk extraction, resale, or commercial scraping.
Attribution expectations: cite canonical URLs where content appears in answer experiences.
Paid access: publish a licensing inquiry path for bulk crawler access.

Use llms.txt as a summary, not a substitute

An llms.txt-style summary can give AI systems concise guidance, but it should not replace robots.txt or your legal terms. BotConsent treats it as a readable index pointing back to the canonical policy page.

Audit before you publish

A crawler-readiness audit can catch missing user-agent groups, contradictory rules, policy pages that do not match robots.txt, and missing licensing paths. That is why BotConsent offers the one-time audit before asking teams to commit to monitoring.
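
As a rough illustration of two of those checks, the sketch below looks for missing user-agent groups and for contradictory rules. It is not BotConsent's implementation. The function names are hypothetical, and "contradictory" is defined narrowly here as the same path appearing under both Allow and Disallow for one agent.

```python
def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current: list[str] = []  # agents named by the group being read
    in_rules = False         # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and current:
            in_rules = True
            for agent in current:
                groups[agent].append((field, value))
    return groups


def audit(robots_txt: str, expected_agents: list[str]) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    groups = parse_groups(robots_txt)
    findings = []
    for agent in expected_agents:
        if agent.lower() not in groups:
            findings.append(f"missing group: {agent}")
    for agent, rules in groups.items():
        allowed = {path for directive, path in rules if directive == "allow"}
        blocked = {path for directive, path in rules if directive == "disallow"}
        for path in sorted(allowed & blocked):
            if path:  # an empty path carries no restriction, so skip it
                findings.append(f"contradictory rules for {agent}: {path}")
    return findings


# Deliberately flawed sample: GPTBot is both blocked and allowed,
# and Google-Extended has no group at all.
flawed = """\
User-agent: GPTBot
Disallow: /
Allow: /

User-agent: Googlebot
Allow: /
"""
for finding in audit(flawed, ["GPTBot", "Googlebot", "Google-Extended"]):
    print(finding)
```

Comparing the policy page against robots.txt and checking for a licensing path depend on how those pages are written, so they are left out of this sketch.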


Official references

OpenAI crawler documentation
Google common crawlers and Google-Extended
Cloudflare AI Crawl Control