robots.txt can express AI crawler preferences. It cannot make every bot obey.
Use robots.txt for cooperative crawler signals, a canonical policy page for business terms, and monitoring so you know when policy coverage changes.
The common policy: allow search, block training
Many site owners want search discovery and answer-time retrieval, but do not want model training, bulk dataset creation, or commercial scraping without permission. BotConsent starts from that posture because it is easy for business owners to understand.
# Allow search-style crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: OAI-SearchBot
Allow: /
# Block model training or broad AI ingestion unless licensed
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
Policy: https://example.com/botconsent
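One way to sanity-check rules like these before publishing is to run them through Python's standard-library `urllib.robotparser`. The sketch below assumes the rules above and a made-up sample URL; `check_agents` and `SAMPLE_URL` are illustrative names, not part of BotConsent. Real crawlers may match rules somewhat differently from the standard-library parser, so treat the result as a smoke test rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this article, with blank lines between groups.
ROBOTS_TXT = """\
# Allow search-style crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Block model training or broad AI ingestion unless licensed
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Policy: https://example.com/botconsent
"""

# Hypothetical page used only for this check.
SAMPLE_URL = "https://example.com/articles/pricing"


def check_agents(robots_text, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines; lines with unknown fields
    # (such as "Policy:") are ignored by this parser.
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


decisions = check_agents(
    ROBOTS_TXT,
    ["Googlebot", "Bingbot", "OAI-SearchBot", "GPTBot", "Google-Extended"],
    SAMPLE_URL,
)

for agent, allowed in decisions.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With these rules, the three search-style agents come back allowed and the two training-related tokens come back blocked. An agent with no matching group and no `User-agent: *` fallback is treated as allowed, which is worth keeping in mind when deciding whether to add a default group.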
OpenAI crawler tokens
OpenAI documents multiple user agents for different use cases, including GPTBot, OAI-SearchBot, and ChatGPT-User. Treat those distinctions carefully: a training bot, a search crawler, and a user-initiated retrieval agent can have different business implications.
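To make the distinction concrete, here is a minimal sketch that maps the three tokens named above to a purpose label based on the request's User-Agent header. The `CrawlerPurpose` enum, the `classify_user_agent` helper, and the sample header strings are assumptions made for illustration; check OpenAI's current documentation for the exact strings it publishes. The header is self-declared, so this labels what a request claims to be, not what it is.

```python
from enum import Enum


class CrawlerPurpose(Enum):
    TRAINING = "training"
    SEARCH = "search"
    USER_INITIATED = "user-initiated retrieval"
    UNKNOWN = "unknown"


# Token-to-purpose mapping following the distinctions described above.
OPENAI_TOKENS = {
    "gptbot": CrawlerPurpose.TRAINING,
    "oai-searchbot": CrawlerPurpose.SEARCH,
    "chatgpt-user": CrawlerPurpose.USER_INITIATED,
}


def classify_user_agent(ua_header: str) -> CrawlerPurpose:
    """Label a User-Agent header by the first known token it contains."""
    lowered = ua_header.lower()
    for token, purpose in OPENAI_TOKENS.items():
        if token in lowered:
            return purpose
    return CrawlerPurpose.UNKNOWN


# Simplified, illustrative header values (not copied from any vendor docs).
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

for ua in samples:
    print(f"{classify_user_agent(ua).value:>26}  <-  {ua}")
```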
Google-Extended
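Google documents Google-Extended as a product token site owners can use to manage whether their sites help improve Gemini Apps and Vertex AI generative APIs. It is a robots.txt control token rather than a separate crawler: Google does not send requests with a Google-Extended user-agent string, so the token will not appear in server logs. It is also separate from ordinary search indexing, so blocking it does not remove a site from Google Search. That makes it a useful example of why AI bot policy should be more specific than simply "allow all" or "block all."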
Commercial scrapers and unknown bots
Some commercial scrapers use clear user-agent tokens. Others do not. robots.txt is most effective with good-faith operators that identify themselves and respect published rules. For unknown or evasive traffic, you may need infrastructure controls, rate limits, bot management, legal review, or contractual enforcement.
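As a rough illustration of where robots.txt stops and monitoring starts, the sketch below splits a list of User-Agent headers (for example, pulled from access logs) into agents that declare a known token and high-volume agents that do not. `KNOWN_TOKENS`, `summarize_traffic`, and the threshold are hypothetical choices for this example. Google-Extended is left out on purpose because it is a robots.txt control token and is not sent as a request header. Headers can be spoofed, so a result like this is a starting point for rate limits or bot management, not proof of identity.

```python
from collections import Counter

# Tokens this site has published rules for (assumed list for the example).
KNOWN_TOKENS = ("googlebot", "bingbot", "oai-searchbot", "gptbot", "chatgpt-user")


def summarize_traffic(user_agents, threshold=100):
    """Count requests per declared token and list busy undeclared agents.

    Returns (declared, suspects): a Counter keyed by known token, and a
    sorted list of raw User-Agent strings that matched no known token and
    sent at least `threshold` requests.
    """
    declared = Counter()
    undeclared = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        token = next((t for t in KNOWN_TOKENS if t in lowered), None)
        if token is not None:
            declared[token] += 1
        else:
            undeclared[ua] += 1
    suspects = sorted(ua for ua, count in undeclared.items() if count >= threshold)
    return declared, suspects


# Tiny made-up sample standing in for parsed access-log entries.
sample_log = (
    ["Mozilla/5.0 (compatible; GPTBot/1.0)"] * 3
    + ["python-requests/2.31"] * 5
    + ["Mozilla/5.0 (X11; Linux x86_64)"]
)

declared, suspects = summarize_traffic(sample_log, threshold=5)
print("declared:", dict(declared))
print("review for rate limiting:", suspects)
```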
Add a canonical policy page
robots.txt is terse. A policy page can explain the business terms behind the directives:
- Allowed uses: search indexing, link previews, or assistant retrieval.
- Restricted uses: model training, dataset creation, bulk extraction, resale, or commercial scraping.
- Attribution expectations: cite canonical URLs where content appears in answer experiences.
- Paid access: publish a licensing inquiry path for bulk crawler access.
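One way to keep the policy page and robots.txt from drifting apart is to generate both from a single structured record. The sketch below is an assumption about how that could look, not a description of how BotConsent works: `CrawlerPolicy`, its fields, and the contact address are hypothetical, and the `Policy:` line simply mirrors the example earlier in this article.

```python
from dataclasses import dataclass


@dataclass
class CrawlerPolicy:
    policy_url: str
    allowed_agents: list[str]
    restricted_agents: list[str]
    licensing_contact: str  # where bulk-access requests should go

    def to_robots_txt(self) -> str:
        """Render one robots.txt group per agent, then the policy line."""
        lines = []
        for agent in self.allowed_agents:
            lines += [f"User-agent: {agent}", "Allow: /", ""]
        for agent in self.restricted_agents:
            lines += [f"User-agent: {agent}", "Disallow: /", ""]
        lines.append(f"Policy: {self.policy_url}")
        return "\n".join(lines) + "\n"

    def to_policy_bullets(self) -> str:
        """Render the matching plain-text summary for the policy page."""
        return "\n".join([
            f"- Allowed crawlers: {', '.join(self.allowed_agents)}",
            f"- Restricted crawlers: {', '.join(self.restricted_agents)}",
            f"- Licensing inquiries: {self.licensing_contact}",
        ])


policy = CrawlerPolicy(
    policy_url="https://example.com/botconsent",
    allowed_agents=["Googlebot", "Bingbot", "OAI-SearchBot"],
    restricted_agents=["GPTBot", "Google-Extended"],
    licensing_contact="licensing@example.com",  # placeholder address
)

print(policy.to_robots_txt())
print(policy.to_policy_bullets())
```

Because both outputs come from the same lists, adding or removing a crawler updates the directives and the human-readable terms together.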
Use llms.txt as a summary, not a substitute
An llms.txt-style summary can give AI systems concise guidance, but it should not replace robots.txt or your legal terms. BotConsent treats it as a readable index pointing back to the canonical policy page.
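A minimal sketch of such a summary file follows, loosely based on the markdown layout used by the community llms.txt proposal (a title, a short blockquote summary, then sections of links). The `render_llms_txt` function, the section names, and the sample pages are assumptions for illustration; the point is only that the file links back to the canonical policy instead of restating it.

```python
def render_llms_txt(site_name, summary, policy_url, links=()):
    """Build an llms.txt-style markdown summary that points to the policy page.

    `links` is an iterable of (title, url) pairs for other key pages.
    """
    lines = [
        f"# {site_name}",
        "",
        f"> {summary}",
        "",
        "## Policy",
        "",
        f"- [Crawler and AI usage policy]({policy_url}): canonical terms; "
        "robots.txt remains the crawler-facing control",
    ]
    links = list(links)
    if links:
        lines += ["", "## Key pages", ""]
        lines += [f"- [{title}]({url})" for title, url in links]
    return "\n".join(lines) + "\n"


# Placeholder site details for the example.
llms_txt = render_llms_txt(
    site_name="Example Co",
    summary="Search and answer-time retrieval are welcome; model training requires a license.",
    policy_url="https://example.com/botconsent",
    links=[("Pricing", "https://example.com/pricing")],
)
print(llms_txt)
```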
Audit before you publish
A crawler-readiness audit can catch missing user-agent groups, contradictory rules, policy pages that do not match robots.txt, and missing licensing paths. That is why BotConsent offers the one-time audit before asking teams to commit to monitoring.
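To show the kind of checks such an audit might run, here is a simplified sketch covering three of the problems listed above: missing user-agent groups, rules that both allow and disallow the same path for one agent, and a robots.txt that never mentions the policy URL. It is not BotConsent's implementation. `parse_groups` and `audit_robots` are hypothetical helpers, and the parser handles only `User-agent`, `Allow`, and `Disallow` lines.

```python
def parse_groups(text):
    """Return {agent_lowercase: [(directive, path), ...]} from robots.txt text."""
    groups = {}
    current_agents = []
    in_rules = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def audit_robots(text, expected_agents, policy_url):
    """Return a list of human-readable findings; an empty list means no issues found."""
    findings = []
    groups = parse_groups(text)
    for agent in expected_agents:
        if agent.lower() not in groups:
            findings.append(f"missing user-agent group: {agent}")
    for agent, rules in groups.items():
        allowed = {path for directive, path in rules if directive == "allow"}
        disallowed = {path for directive, path in rules if directive == "disallow"}
        for path in sorted(allowed & disallowed):
            findings.append(f"contradictory rules for {agent}: {path!r} is allowed and disallowed")
    if policy_url not in text:
        findings.append(f"policy URL not referenced: {policy_url}")
    return findings


# A deliberately broken file for demonstration.
BROKEN = "User-agent: GPTBot\nAllow: /\nDisallow: /\n"

for finding in audit_robots(BROKEN, ["GPTBot", "Google-Extended"], "https://example.com/botconsent"):
    print("-", finding)
```

Checking that the policy page's wording matches the directives, or that a licensing path exists, would need the page content as well and is outside this sketch.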