Managing AI crawler access requires distinguishing between bots that train on your content, bots that retrieve it for live answers, and bots that index it for search. Each type offers a different value exchange, and the wrong blocking decision can either leak competitive value or cost you visibility. This article covers how to identify crawler types, evaluate risk across multiple enforcement layers, and implement access policies by content type.
Three types of crawlers, three different purposes
Not all bots requesting your content want the same thing. The type of crawler determines whether blocking it protects your content, costs you visibility, or has no impact at all.
Training crawlers collect content to update model weights. When GPTBot or ClaudeBot fetches your pages, that content may be incorporated into future model versions. The value extraction is permanent—the content becomes part of the model itself, usable indefinitely without further access to your site.
Retrieval crawlers fetch content at inference time to ground AI responses. When a user asks a question, these systems search for relevant content, retrieve it, and use it to inform the answer. This is the RAG (Retrieval-Augmented Generation) pattern. The content isn't absorbed into model weights—it's referenced in real-time, often with citation.
Search crawlers index content for traditional search results. Googlebot and Bingbot have operated this way for decades: crawl, index, rank, serve results with links back to your site.
| Crawler type | What happens to content | Value exchange |
|---|---|---|
| Training | Absorbed into model weights | One-time extraction; no ongoing attribution |
| Retrieval | Fetched at query time, cited | Per-query use; potential traffic via citation |
| Search | Indexed and linked in results | Ongoing traffic via click-through |
These boundaries aren't always clean. Some crawlers serve multiple purposes, and operators don't always disclose which. Google-Extended, for example, is specifically for AI training, distinct from Googlebot's search indexing. But not all operators make such distinctions.
The access-control stack
robots.txt is the most visible access control mechanism, but it's neither the only layer nor the most enforceable. Effective access control operates across multiple layers, each with different characteristics.
robots.txt: Advisory, crawler-declared, path-level. The crawler identifies itself and checks your robots.txt for permission. Compliance is voluntary. A crawler that ignores robots.txt faces no technical barrier.
HTTP layer: Authentication (401), authorisation (403), IP allowlists, signed URLs. HTTP (Hypertext Transfer Protocol) is how browsers and crawlers request content from servers, and controls at this layer are enforceable gates: if the request lacks valid credentials, the server returns an error, not the content. Hard paywalls and login requirements operate here.
Delivery layer: Edge logic at the CDN or bot management layer. Bot management services (such as Cloudflare Bot Management, Akamai Bot Manager, or Fastly) identify, rate-limit, or block traffic based on behavioural signals, IP reputation, or declared identity. This layer can act before requests reach your origin servers.
Content layer: What you actually serve. Partial responses, summaries, abstracts, or truncated feeds give crawlers something while withholding full content. Metered paywalls typically operate here—the page returns HTTP 200 but serves only a preview. API endpoints may serve structured data differently from rendered pages.
Licensing layer: Legal intent expressed through Terms of Service, licensing endpoints, ai.txt, or TDMRep declarations. These don't technically prevent access, but they establish legal grounding for enforcement actions.
robots.txt is not enforcement—a crawler can read it, ignore it, and fetch your content anyway. The only layers that technically prevent access are HTTP authentication and delivery-layer blocking. Everything else is a signal of intent.
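To make the difference between a signal and a gate concrete, here is a minimal sketch of HTTP-layer enforcement as a small WSGI application. The `/premium/` prefix, the `demo-token` credential and the `status_for` helper are all hypothetical names chosen for illustration; a real deployment would sit behind your existing authentication system.

```python
from wsgiref.util import setup_testing_defaults

PROTECTED_PREFIX = "/premium/"   # hypothetical path holding gated content
VALID_TOKENS = {"demo-token"}    # hypothetical credential store


def app(environ, start_response):
    """Serve content, but return 401 for protected paths without a known token."""
    path = environ.get("PATH_INFO", "/")
    auth = environ.get("HTTP_AUTHORIZATION", "")
    token = auth[len("Bearer "):] if auth.startswith("Bearer ") else None

    if path.startswith(PROTECTED_PREFIX) and token not in VALID_TOKENS:
        # The crawler never receives the content, whatever it declares itself to be.
        start_response("401 Unauthorized", [("WWW-Authenticate", "Bearer"),
                                            ("Content-Type", "text/plain")])
        return [b"Authentication required\n"]

    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Full content\n"]


def status_for(path, authorization=None):
    """Call the app in-process and return the status line it produced."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    if authorization:
        environ["HTTP_AUTHORIZATION"] = authorization
    captured = []
    app(environ, lambda status, headers: captured.append(status))
    return captured[0]
```

Unlike a robots.txt rule, this check runs on every request regardless of the declared user-agent, which is why the HTTP layer counts as enforcement rather than intent.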
Why this matters for SEO teams
Product and engineering teams often make access decisions without SEO involvement. APIs may serve content differently than web pages. Paywalls gate content that SEO teams expect to be indexed. Bot management rules may block crawlers that SEO wasn't consulted about.
Access control is a cross-functional concern. If you're responsible for organic visibility, you need visibility into what's happening at each layer—not just robots.txt.
Known AI crawlers and their stated purposes
The table below groups the major AI-related crawlers by operator. It reflects a significant shift since 2024: the leading operators now split their activity into separately controllable bots—training, user-triggered retrieval, and search—and most now publish a machine-readable IP-range file you can use to verify requests. This information changes frequently; verify against operator documentation before making blocking decisions.
Types follow the three-purpose model above: Training feeds model weights, Retrieval fetches pages at query time (usually user-triggered, often cited), and Search builds an index. Two tokens—Google-Extended and Applebot-Extended—aren't crawlers at all: they are robots.txt controls governing whether content fetched by the operator's main crawler may be used for AI training.
| Provider | User-agent | Type | Respects robots.txt | Verify (published IP ranges) |
|---|---|---|---|---|
| OpenAI | GPTBot | Training | Yes | gptbot.json |
| OpenAI | ChatGPT-User | Retrieval (user-triggered) | May not apply¹ | chatgpt-user.json |
| OpenAI | OAI-SearchBot | Search | Yes | searchbot.json |
| OpenAI | OAI-AdsBot | Ads (safety review) | Yes | adsbot.json |
| Anthropic | ClaudeBot | Training | Yes | bots.json |
| Anthropic | Claude-User | Retrieval (user-triggered) | Yes | bots.json |
| Anthropic | Claude-SearchBot | Search | Yes | bots.json |
| Google | Googlebot | Search | Yes | googlebot.json |
| Google | Google-Extended | AI-training control | Robots.txt token | Docs |
| Perplexity | PerplexityBot | Search (not for training) | Yes | perplexitybot.json |
| Perplexity | Perplexity-User | Retrieval (user-triggered) | Generally ignores¹ | perplexity-user.json |
| Apple | Applebot | Search | Yes | applebot.json |
| Apple | Applebot-Extended | AI-training control | Robots.txt token | Docs |
| Microsoft | Bingbot | Search (also feeds Copilot) | Yes | bingbot.json |
| DuckDuckGo | DuckAssistBot | Retrieval (cites sources; not for training) | Yes | duckassistbot.json |
| Common Crawl | CCBot | Training (open archive, widely reused) | Yes | None published |
| Meta | meta-externalagent | Training | Yes | None published |
| ByteDance | Bytespider | Training / search | Historically ignored | None published |
| Amazon | Amazonbot | AI + Alexa | Yes | None published |
| Cohere | cohere-ai | Training | — | None published |
¹ User-triggered fetchers behave differently from crawlers. Because a person initiates the request, some operators treat these like a browser: OpenAI notes robots.txt "may not apply" to ChatGPT-User, and Perplexity-User generally ignores robots.txt—whereas Anthropic's Claude-User and DuckDuckGo's DuckAssistBot do honour it. Don't assume a retrieval bot obeys robots.txt just because the training bot from the same operator does.
Last verified: July 2026. The published IP-range files linked above are the same lists Dave Smart's Real LLM Bot IP checker uses to confirm whether a request claiming to be an LLM bot genuinely originates from the operator—user-agent strings can be spoofed, IP ranges can't.
This table reflects documented information from operators. Some crawlers have limited or no public documentation. User-agent strings can be spoofed by any client. Presence in this table does not guarantee the crawler claiming that identity is legitimate.
Training vs retrieval: often unclear
The major operators now distinguish all three purposes with separate, individually addressable bots. OpenAI splits GPTBot (training), ChatGPT-User (browsing), and OAI-SearchBot (search); Anthropic mirrors this with ClaudeBot, Claude-User, and Claude-SearchBot. Google gates AI training behind the Google-Extended robots.txt token, separate from Googlebot's search crawl.
Others don't separate training from retrieval crawlers, or their documentation is ambiguous. When documentation is unclear, assume the crawler may use content for training unless explicitly stated otherwise.
The crawler identity problem
User-agent strings are cheap. Verification is not.
Any HTTP client can declare itself as "GPTBot" or "Googlebot" in the User-Agent header. Search engines invested years building verification norms. Google publishes IP ranges and supports reverse DNS verification. Bing does the same. When you see a request claiming to be Googlebot, you can verify it.
AI crawlers long lacked equivalent mechanisms, but that gap has narrowed sharply. Across 2025 and into 2026, OpenAI, Anthropic, Perplexity, Apple, and DuckDuckGo all began publishing machine-readable IP-range files (linked in the crawler table above), so you can now verify most major AI crawlers by IP, even though few yet support the reverse DNS lookups that Googlebot and Bingbot offer.
Verification asymmetry
| Crawler | Reverse DNS verification | Published IP ranges |
|---|---|---|
| Googlebot | Yes | Yes |
| Bingbot | Yes | Yes |
| Applebot | Yes (*.applebot.apple.com) | Yes |
| GPTBot / OpenAI bots | No | Yes |
| ClaudeBot / Claude bots | No | Yes |
| PerplexityBot / Perplexity-User | No | Yes |
| DuckAssistBot | No | Yes |
| CCBot | No | No |
| Bytespider | No | No |
The remaining gap is reverse DNS: most AI operators now publish IP ranges but don't yet support the reverse-DNS lookups that let you confirm an IP without downloading a list. For crawlers with neither mechanism—CCBot, Bytespider, and most training-only bots—you cannot reliably distinguish legitimate requests from spoofed ones using log analysis alone.
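Checking an IP against a published range file can be sketched in a few lines. The JSON layout assumed here (a `prefixes` list whose entries carry `ipv4Prefix` or `ipv6Prefix`) is the one Google uses for its crawler range files; other operators' files may be shaped differently, so confirm the format before relying on this. The sample data uses reserved documentation ranges, not real crawler addresses.

```python
import ipaddress
import json

# Illustrative sample only: documentation ranges, not real crawler IPs.
SAMPLE_RANGE_FILE = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""


def load_prefixes(range_file_text):
    """Parse a range file (assumed layout, see above) into network objects."""
    data = json.loads(range_file_text)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_ranges(ip, networks):
    """Return True if the request IP falls inside any published network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks = load_prefixes(SAMPLE_RANGE_FILE)
print(ip_in_ranges("192.0.2.10", networks))    # inside the sample IPv4 range
print(ip_in_ranges("198.51.100.7", networks))  # outside every sample range
```

In practice you would download the operator's file on a schedule and cache it, since the ranges change over time.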
Operational implications
False positives in blocking: If you block on user-agent string alone, you may block legitimate traffic that happens to match a pattern, while spoofed or undeclared traffic using a slightly different string passes straight through.
False negatives in detection: Traffic claiming to be something innocuous might actually be an undisclosed AI crawler. You can't know what you can't identify.
Over-blocking risks: Aggressive blocking of unverified "AI bot" traffic may inadvertently block legitimate retrieval systems that could drive traffic via citations. Visibility in AI systems depends partly on retrieval access.
What log analysis can and cannot reveal
Server logs show requests and their declared identity. They cannot confirm:
- Whether the declared identity is truthful
- Whether the crawler is training, retrieving, or both
- What happens to your content after it's fetched
See Log File Analysis for techniques to identify crawler traffic, with appropriate scepticism about unverifiable claims.
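As a starting point for that kind of analysis, the sketch below counts requests by declared AI user-agent in access logs. It assumes the common "combined" log format, where the user-agent is the last quoted field, and the token list is only a subset of the crawler table above. The counts describe what clients claim to be, not what they are.

```python
import re
from collections import Counter

# Subset of declared tokens from the crawler table; extend as needed.
AI_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User", "CCBot", "Bytespider",
]

# Assumes combined log format: client IP first, user-agent as the last quoted field.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<ua>[^"]*)"\s*$')


def count_declared_bots(lines):
    """Count requests per declared AI token (identity is NOT verified here)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        user_agent = match.group("ua").lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jul/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [10/Jul/2026:10:00:02 +0000] "GET /b HTTP/1.1" 200 734 "-" '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.2 - - [10/Jul/2026:10:00:05 +0000] "GET /c HTTP/1.1" 200 90 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(count_declared_bots(SAMPLE_LOG))
```

Pair each counted request with an IP check before treating the totals as real crawler activity.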
Verify before acting on crawler identity. For Googlebot, Bingbot, and Applebot, use reverse DNS. For OpenAI, Anthropic, Perplexity, and DuckDuckGo, match the request IP against the operator's published range file (linked in the crawler table above). Dave Smart's Real LLM Bot IP checker runs that lookup for you across the major LLM bots. For crawlers with neither mechanism, treat the user-agent as indicative, not definitive.
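For crawlers that support it, reverse DNS verification is a two-step check: resolve the IP to a hostname, confirm the hostname belongs to the operator's domain, then resolve that hostname forward and confirm it maps back to the same IP. The sketch below illustrates this with Python's `socket` module. The domain suffixes for Googlebot and Bingbot are taken from those operators' public guidance and the Applebot suffix from the table above; treat the list as an assumption to re-check. The `reverse` and `forward` parameters are hypothetical hooks added so the logic can be exercised without live DNS.

```python
import socket

# Assumed hostname suffixes per crawler; verify against operator documentation.
CRAWLER_DOMAINS = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
    "Applebot": (".applebot.apple.com",),
}


def _reverse(ip):
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    """Forward lookup: hostname to the set of addresses it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def verify_by_reverse_dns(ip, crawler, reverse=_reverse, forward=_forward):
    """Forward-confirmed reverse DNS check for a claimed crawler identity."""
    suffixes = CRAWLER_DOMAINS[crawler]
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not hostname.endswith(suffixes):
        return False  # hostname is not under the operator's domain
    try:
        # Anyone can set a PTR record on their own IP; the forward lookup
        # is what ties the hostname back to the operator.
        return ip in forward(hostname)
    except OSError:
        return False
```

The comparison is a plain string match, so IPv6 addresses may need normalising before use. DNS lookups are slow, so results are usually cached per IP rather than resolved on every request.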
Two separate risk axes
Decisions about AI crawler access often conflate two different questions:
- Will blocking this harm my search visibility?
- Will allowing this leak competitive or proprietary value?
Search visibility risk and extraction risk are orthogonal concerns with different answers depending on the crawler.
Search equity risk
Blocking crawlers that influence search rankings directly affects organic traffic. The impact of blocking Googlebot is obvious: your pages won't be indexed. But some AI-related crawlers have no relationship to search visibility.
Extraction risk
Allowing crawlers to access your content exposes it to potential use in training, which may:
- Reduce the need for users to visit your site (answers synthesised from your content)
- Enable competitors to benefit from models trained on your data
- Create no reciprocal value if the model doesn't cite sources
The risk matrix
| Crawler | Search equity risk if blocked | Extraction risk if allowed |
|---|---|---|
| Googlebot | High (no Google indexing) | Low (search index, not training) |
| Bingbot | Medium (Bing + Copilot) | Low (search index, not training) |
| Google-Extended | None (search unaffected) | Medium (Gemini training) |
| GPTBot | None (no search role) | Medium-High (model training) |
| OAI-SearchBot | Medium (ChatGPT search results) | Low (search index, not training) |
| ChatGPT-User | Low (no search role) | Low (user-triggered retrieval, may cite) |
| ClaudeBot | None | Medium-High (model training) |
| Claude-SearchBot | Medium (Claude search results) | Low (search index, not training) |
| PerplexityBot | Medium (Perplexity answers link out) | Low (search/citation, not training) |
| CCBot | None directly | High (archive widely reused for training) |
"Low search equity risk" doesn't mean zero consequence. Blocking retrieval crawlers may reduce your visibility in AI-generated answers, which affects brand exposure even if traditional search rankings are unaffected. See Visibility in LLMs and AI Overviews for how retrieval affects AI discoverability.
Using the matrix for decisions
For content where extraction risk is low (public marketing pages, general informational content), broad access makes sense. The discovery value outweighs extraction concerns.
For content where extraction risk is high (proprietary research, premium content, competitive intelligence), block training crawlers while considering whether retrieval access with citation provides acceptable value exchange.
For content where search equity is critical (core landing pages, product pages), tread carefully. Distinguish between crawlers that affect search (Googlebot, Bingbot) and those that don't (most AI training crawlers).
robots.txt for AI crawler management
robots.txt remains the primary mechanism for communicating access preferences to crawlers. Its limitations are significant, but it's the most widely supported signal.
Basic syntax
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
User-agent: Googlebot
Disallow:
This configuration blocks several AI training crawlers while explicitly allowing Googlebot for search indexing.
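One way to sanity-check a configuration like this before deploying it is Python's standard-library `urllib.robotparser`. The sketch below parses the rules from a string (with a blank line added between the two groups for readability) and asks what a compliant crawler would conclude. The `allowed` helper and the example.com URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Disallow:
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return what a compliant crawler with this user-agent would decide."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("GPTBot", "ClaudeBot", "Googlebot"):
    print(bot, allowed(bot, "https://example.com/article"))
```

This only models a crawler that chooses to comply; it says nothing about crawlers that ignore the file.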
Common configurations
Block all AI training, allow search:
# AI training crawlers - block
# (anthropic-ai and Claude-Web are Anthropic's deprecated tokens, kept for safety)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: meta-externalagent
User-agent: cohere-ai
Disallow: /
# Search crawlers - allow (index and link back to you)
User-agent: Googlebot
User-agent: Bingbot
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Applebot
Disallow:
# User-triggered retrieval - allow (fetched when a person asks; often cited)
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
User-agent: DuckAssistBot
Disallow:
Selective path blocking:
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
Allow: /marketing/
This allows AI crawlers to access marketing content while blocking everything else.
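The reason `Allow: /marketing/` wins over `Disallow: /` is the precedence rule in RFC 9309: the most specific (longest) matching path applies, with Allow preferred on a tie. A simplified sketch of that rule follows. It handles plain path prefixes only and ignores `*` and `$` wildcards, so treat it as an illustration rather than a full parser.

```python
def is_allowed(path, rules):
    """Apply longest-match precedence to (directive, prefix) rules.

    Simplified: plain prefixes only, no wildcard support.
    """
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        allow = directive.lower() == "allow"
        longer = len(prefix) > best_len
        tie_prefers_allow = len(prefix) == best_len and allow
        if longer or tie_prefers_allow:
            best_len, best_allow = len(prefix), allow
    return best_allow


# The selective path blocking group, in the order it is written.
RULES = [("disallow", "/"), ("allow", "/marketing/")]

print(is_allowed("/marketing/launch", RULES))
print(is_allowed("/research/q3-report", RULES))
```

Not every parser follows this rule. Python's built-in `urllib.robotparser`, for example, evaluates rules in file order, so it would treat `/marketing/` as blocked here. If you need to support order-sensitive parsers as well, listing the `Allow` line before the `Disallow` line is a common precaution.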
Limitations to remember
- Advisory only: Crawlers choose whether to comply
- Crawler-declared: You're trusting the user-agent string
- Path-level: Cannot distinguish by content type, only URL pattern
- No enforcement: No technical barrier prevents access if ignored
- No retroactive effect: Content already crawled remains in training data
Beyond robots.txt: licensing and intent signals
Several mechanisms aim to communicate AI-specific access preferences beyond robots.txt. None are universally adopted or enforced.
ai.txt
A proposed convention for declaring AI access preferences in a dedicated file. Illustrative format:
# ai.txt
User-Agent: *
Disallow-Training: /
Allow-Retrieval: /public/
Status: Proposal stage. No major AI operators have committed to honouring it. Some implementations exist, but adoption is limited.
llms.txt
A proposed convention placing a markdown file at /llms.txt containing site information optimised for LLM consumption—essentially a human-readable summary designed to help AI systems understand your site's purpose and content.
Status: Not an accepted standard. No major AI providers (OpenAI, Anthropic, Google) or search engines have committed to checking or respecting this file. Without operator adoption, the file is unlikely to be crawled or processed. Implementation effort is better directed toward proven technical improvements such as structured data, crawl efficiency, and content quality.
Content Signals (Cloudflare)
Cloudflare's Content Signals Policy extends robots.txt with directives specifying how content may be used after access. Unlike traditional robots.txt (which controls where crawlers can go), Content Signals declare what crawlers may do with content they've fetched.
Three signals are defined:
- search: Permission to build a search index and show links or snippets in results (traditional search behaviour)
- ai-input: Permission to use content as input for AI-generated answers (RAG, AI Overviews, chatbot responses)
- ai-train: Permission to use content to train or fine-tune AI models
Example configuration:
User-Agent: *
Content-Signal: search=yes, ai-train=no, ai-input=no
Allow: /
This allows traditional search indexing while prohibiting both model training and real-time AI answer generation.
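To show how a consumer might read that line, here is a small sketch that extracts the declared signals from robots.txt text. It is deliberately naive: it collects every `Content-Signal` line regardless of which user-agent group it belongs to, and `parse_content_signals` is a hypothetical helper, not part of any Cloudflare tooling.

```python
def parse_content_signals(robots_txt):
    """Return {signal: bool} for Content-Signal lines (user-agent groups ignored)."""
    signals = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, val = pair.partition("=")
            if eq:
                signals[key.strip().lower()] = val.strip().lower() == "yes"
    return signals


EXAMPLE = """\
User-Agent: *
Content-Signal: search=yes, ai-train=no, ai-input=no
Allow: /
"""

print(parse_content_signals(EXAMPLE))
```

A signal that is absent from the result has not been declared either way, which matches how the ai-input default is described below.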
The policy includes human-readable comments that frame the signals as an "express reservation of rights" under the EU's 2019 Copyright Directive, positioning them as legally significant declarations rather than mere requests.
Adoption: Cloudflare has deployed Content Signals across 3.8 million domains using its managed robots.txt feature, with defaults of search=yes and ai-train=no. The ai-input signal is left unset by default.
Status: Not a ratified protocol extension. Compliance is voluntary. Google has not confirmed whether it will respect Content Signals. The specification is released under CC0 licence to encourage broader adoption. Generate the policy at contentsignals.org.
Content Signals are preferences, not enforcement. Cloudflare recommends pairing them with WAF rules and bot management to block crawlers that ignore your declared preferences.
TDMRep (Text and Data Mining Reservation Protocol)
A W3C community group specification for declaring TDM (Text and Data Mining) permissions via HTTP headers or HTML meta tags.
<meta name="tdm-reservation" content="1">
Or via HTTP header:
TDM-Reservation: 1
Status: Defined specification with limited adoption. Some EU regulatory frameworks reference TDM rights, which may increase relevance over time.
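A quick way to confirm your pages actually emit the reservation is to check both places it can appear. The sketch below looks for the header first and falls back to the meta tag, using only the standard library. Giving the header precedence is an assumption made for this example, and `tdm_reserved` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _TdmMetaFinder(HTMLParser):
    """Record the content of a <meta name="tdm-reservation"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "tdm-reservation":
            self.value = attr.get("content")


def tdm_reserved(headers, html=""):
    """Return True if the response declares a TDM reservation of 1."""
    for name, value in headers.items():
        if name.lower() == "tdm-reservation":
            return value.strip() == "1"  # header takes precedence (assumption)
    finder = _TdmMetaFinder()
    finder.feed(html)
    return finder.value is not None and finder.value.strip() == "1"


print(tdm_reserved({"TDM-Reservation": "1"}))
print(tdm_reserved({}, '<html><head><meta name="tdm-reservation" content="1"></head></html>'))
```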
Terms of Service
Legal declarations that prohibit scraping, training, or specific uses of content. They are not machine-readable, but they establish legal grounding for enforcement actions.
Status: Widely used, difficult to enforce, requires legal action to remediate violations.
Licensing endpoints
Emerging pattern where publishers offer structured licensing terms for AI training use. Content may be available for free retrieval but require licensing for training inclusion.
Status: Early stage. Some publishers have announced licensing deals with AI operators; no standardised protocol exists.
ai.txt, llms.txt, Content Signals, TDMRep, and licensing protocols are not standards with guaranteed compliance. They signal intent. Operators may ignore them. Include these signals if you want to establish clear intent, but don't rely on them as enforcement mechanisms.
Hard limits of certainty
Certain things cannot be known with current tools and disclosures. Being explicit about these limits protects against false confidence.
You cannot audit whether previously crawled content influenced a model. If a crawler accessed your content before you blocked it, that content may already be in training data. No mechanism exists to verify inclusion or request removal from trained models.
You cannot verify whether a "block" prevented inclusion. robots.txt compliance is voluntary. A crawler may have read your robots.txt, ignored it, and fetched content anyway. You have no visibility into this.
You cannot reliably map crawlers to downstream features. When your content appears in an AI-generated answer, determining which crawl (training or retrieval) contributed is often impossible.
You cannot enforce ai.txt, llms.txt, or TDM signals. These signals declare intent. Whether any operator reads or respects them is opaque.
You cannot distinguish all training from retrieval crawlers. Some operators don't separate these functions. Some don't document their crawlers at all.
This uncertainty is structural, not a temporary gap. Access control decisions must account for these unknowns rather than assuming visibility that doesn't exist.
Practical implementation framework
Use the following framework to make access decisions based on content type and business model.
| Content type | Recommended approach | Rationale |
|---|---|---|
| Public marketing | Allow broadly | Discovery value exceeds extraction risk |
| Blog / Editorial | Allow broadly; consider blocking training | Attribution via retrieval has value; training doesn't |
| Product pages | Allow search; consider blocking training | Search visibility critical; training value unclear |
| Premium / Paywalled | Block training; gate at HTTP layer | Protect commercial value |
| Proprietary research | Block training; delivery-layer enforcement | High extraction risk warrants strong controls |
| User-generated content | Complex; review licensing terms | May have legal constraints on third-party use |
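The robots.txt portion of a framework like this can be generated from a simple policy map rather than maintained by hand. The sketch below is one way to do that. The section paths and the choice of which sections block training are hypothetical, and the bot list is a short subset of the crawler table.

```python
# Subset of training tokens from the crawler table above.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]

# Hypothetical site sections mapped to "block training crawlers?".
SECTION_POLICY = {
    "/marketing/": False,  # allow broadly
    "/blog/": True,        # allow retrieval and search, block training
    "/premium/": True,     # also needs an HTTP-layer gate
    "/research/": True,    # also needs delivery-layer enforcement
}


def build_training_group(bots, policy):
    """Build one robots.txt group that disallows the blocked sections."""
    lines = [f"User-agent: {bot}" for bot in bots]
    blocked = sorted(path for path, block in policy.items() if block)
    # An empty Disallow means "nothing is blocked" for this group.
    lines += [f"Disallow: {path}" for path in blocked] or ["Disallow:"]
    return "\n".join(lines) + "\n"


print(build_training_group(TRAINING_BOTS, SECTION_POLICY))
```

This covers only the advisory layer. For the premium and proprietary rows, the generated rules state intent while the actual protection still has to come from HTTP or delivery-layer controls.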
Implementation checklist
- Audit current state: What crawlers are accessing your site? (Log analysis)
- Classify content: Which sections have different risk profiles?
- Choose enforcement layers: robots.txt alone, or additional HTTP/delivery controls?
- Implement robots.txt: Block training crawlers; explicitly allow search crawlers
- Consider additional signals: TDMRep headers, Terms of Service updates
- Monitor and adjust: Track crawler behaviour post-implementation
Testing approach
After implementing blocks:
- Verify robots.txt is accessible and correctly formatted (use a robots.txt validator for syntax)
- Check server logs for crawler requests and responses
- Monitor for new/unknown crawler user-agents
- Review Search Console for unexpected indexing changes (if search crawlers were affected)
What to watch
AI crawler management and access control evolve rapidly. The following signals indicate where things may be heading.
Crawler identity practices: Will AI operators adopt verification norms similar to search engines? Standardised verification would enable more confident access control decisions.
Licensing-backed retrieval: Commercial models where content access is negotiated rather than scraped. If this becomes standard, the "block everything" approach may give way to selective licensing arrangements.
Search and RAG convergence: Google and others are blending traditional indexing with retrieval-augmented generation. The distinction between "search crawler" and "AI crawler" may become less clear as search itself incorporates AI synthesis.
Regulatory pressure: EU AI Act, copyright litigation outcomes, and TDM opt-out enforcement may force operators to respect declared preferences. Regulatory clarity would change the enforcement equation.
robots.txt evolution: Potential for new directives or extensions specific to AI use cases. The Robots Exclusion Protocol was formalised as RFC 9309 in 2022, but its core directives have barely changed in decades; pressure for AI-specific signals may drive evolution.
AI agents and tool access: Distinct from crawlers that fetch content for training or retrieval, AI agents can execute actions via APIs. Systems like ChatGPT plugins, Gemini extensions, or enterprise AI assistants may call your APIs to check inventory, retrieve pricing, or complete transactions. This introduces a different access question: not whether to allow content crawling, but whether to expose transactional capabilities. Documented APIs (OpenAPI specifications) that agents can discover and call may become a channel alongside traditional web traffic. Access control for agent tool-calling operates through API authentication and rate limiting rather than robots.txt.
FAQs
Does blocking GPTBot affect my visibility in ChatGPT?
No. GPTBot is training-only—blocking it keeps your content out of future model training but has no effect on ChatGPT's live features. Two other OpenAI bots drive those: ChatGPT-User fetches pages when a user asks ChatGPT to browse, and OAI-SearchBot indexes content for ChatGPT's search results. To appear in ChatGPT while staying out of training data, block GPTBot but allow ChatGPT-User and OAI-SearchBot.
What is Claude-SearchBot, and should I block it?
Claude-SearchBot is Anthropic's newest crawler—it indexes pages to improve the quality of Claude's search results. It is distinct from ClaudeBot (training) and Claude-User (fetches a page when a Claude user asks about it). Blocking Claude-SearchBot removes you from Claude's search results without changing your training-data decision, so treat it like any other search crawler rather than lumping it in with the training bot.
How do I verify a request is really from an AI crawler?
Match the request's IP address against the operator's published range file—OpenAI, Anthropic, Perplexity, Apple, and DuckDuckGo all publish one (linked in the crawler table above), and Googlebot, Bingbot, and Applebot also support reverse DNS. A user-agent string on its own proves nothing; anyone can send User-Agent: ClaudeBot. Dave Smart's Real LLM Bot IP checker runs the IP lookup across the major LLM bots for you.
Should I block Common Crawl (CCBot)?
Common Crawl's archive is widely used for AI training. Blocking CCBot reduces exposure to this pathway. However, Common Crawl also supports legitimate research and archival purposes. The decision depends on whether you value those uses versus the training exposure risk.
Key takeaways
- Three crawler types, three purposes: Training crawlers absorb content into models; retrieval crawlers fetch content at query time; search crawlers index for traditional results. Each warrants different access decisions.
- robots.txt is advisory, not enforcement: Compliance depends on crawler operators. The only technical enforcement occurs at the HTTP or delivery layer.
- Verification is asymmetric: Search engines invested years in verification norms. Most major AI operators now publish IP ranges, but few support reverse DNS, and some crawlers offer neither. User-agent strings can be spoofed; treat unverified identities with scepticism.
- Separate search risk from extraction risk: Blocking GPTBot has no search impact. Blocking Googlebot does. Evaluate each crawler against both dimensions.
- Uncertainty is structural: You cannot audit training data inclusion, verify robots.txt compliance, or know what happens to crawled content. Make decisions with these limits in mind.
Further reading
- OpenAI crawler documentation: Official documentation for GPTBot, ChatGPT-User, OAI-SearchBot, and OAI-AdsBot, with IP-range files
- Anthropic crawler documentation: Official documentation for ClaudeBot, Claude-User, and Claude-SearchBot, with a shared IP-range file
- Perplexity crawler documentation: Official documentation for PerplexityBot and Perplexity-User, with published IP ranges
- Apple Applebot documentation: Official documentation for Applebot and Applebot-Extended, including IP ranges and reverse DNS
- DuckDuckGo DuckAssistBot documentation: Official documentation for the DuckAssistBot retrieval crawler
- Google crawlers overview: Documentation distinguishing Googlebot from Google-Extended and other Google crawlers
- Real LLM Bot IP checker (Tame the Bots): Dave Smart's tool for verifying whether an IP belongs to a known LLM crawler using published ranges
- Cloudflare Content Signals Policy: Cloudflare's robots.txt extension for declaring AI training and input permissions
- Content Signals generator: Tool for generating Content Signals policy text for robots.txt
- Common Crawl CCBot information: Documentation for the Common Crawl web archive crawler
- TDMRep specification: W3C community group specification for Text and Data Mining reservation protocol
- robots.txt specification: Google's documentation on robots.txt syntax and behaviour