Three types of crawlers, three different purposes
Not all bots requesting your content want the same thing. Understanding what each crawler type does with your content is essential for making informed access decisions.
Training crawlers collect content to update model weights. When GPTBot or ClaudeBot fetches your pages, that content may be incorporated into future model versions. The value extraction is permanent—the content becomes part of the model itself, usable indefinitely without further access to your site.
Retrieval crawlers fetch content at inference time to ground AI responses. When a user asks a question, these systems search for relevant content, retrieve it, and use it to inform the answer. This is the RAG (Retrieval-Augmented Generation) pattern. The content isn't absorbed into model weights—it's referenced in real-time, often with citation.
Search crawlers index content for traditional search results. Googlebot and Bingbot have operated this way for decades: crawl, index, rank, serve results with links back to your site.
| Crawler type | What happens to your content | Value exchange |
|---|---|---|
| Training | Absorbed into model weights | One-time extraction; no ongoing attribution |
| Retrieval | Fetched at query time, cited | Per-query use; potential traffic via citation |
| Search | Indexed and linked in results | Ongoing traffic via click-through |
These boundaries aren't always clean. Some crawlers serve multiple purposes, and operators don't always disclose which. Google-Extended, for example, is specifically for AI training, distinct from Googlebot's search indexing. But not all operators make such distinctions.
The access-control stack
robots.txt is the most visible access control mechanism, but it's neither the only layer nor the most enforceable. Effective access control operates across multiple layers, each with different characteristics.
robots.txt: Advisory, crawler-declared, path-level. The crawler identifies itself and checks your robots.txt for permission. Compliance is voluntary. A crawler that ignores robots.txt faces no technical barrier.
HTTP layer: Authentication (401), authorisation (403), IP allowlists, signed URLs. These are enforceable gates—if the request lacks valid credentials, the server returns an error, not the content. Hard paywalls and login requirements operate here.
Delivery layer: Edge logic at the CDN or bot management layer. Cloudflare, Akamai, and similar services can identify, rate-limit, or block traffic based on behavioural signals, IP reputation, or declared identity. This layer can act before requests reach your origin servers.
Content layer: What you actually serve. Partial responses, summaries, abstracts, or truncated feeds give crawlers something while withholding full content. Metered paywalls typically operate here—the page returns HTTP 200 but serves only a preview. API endpoints may serve structured data differently from rendered pages.
Licensing layer: Legal intent expressed through Terms of Service, licensing endpoints, ai.txt, or TDMRep declarations. These don't technically prevent access, but they establish legal grounding for enforcement actions.
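To make the enforceable layers concrete, here is a minimal sketch of HTTP-layer blocking as Python WSGI middleware. The user-agent substrings are illustrative assumptions, not a complete list, and production setups would typically do this at the CDN or bot-management edge rather than in application code.

```python
# Minimal WSGI middleware sketch: HTTP-layer enforcement for declared
# AI crawlers. The user-agent substrings below are illustrative, not a
# complete or authoritative list.
BLOCKED_UA_SUBSTRINGS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

def block_ai_crawlers(app):
    """Wrap a WSGI app so requests from declared AI crawlers get a 403."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(token.lower() in ua.lower() for token in BLOCKED_UA_SUBSTRINGS):
            body = b"403 Forbidden: automated access not permitted\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware

def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

app = block_ai_crawlers(demo_app)
```

Note that this filters only declared identity: a crawler that lies in its User-Agent header sails through, which is why the verification problem below matters.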
Why this matters for SEO teams
Product and engineering teams often make access decisions without SEO involvement. APIs may serve content differently from web pages. Paywalls gate content that SEO teams expect to be indexed. Bot management rules may block crawlers that SEO wasn't consulted about.
Access control is a cross-functional concern. If you're responsible for organic visibility, you need visibility into what's happening at each layer—not just robots.txt.
Known AI crawlers and their stated purposes
The following table documents major AI-related crawlers, their operators, and stated purposes. This information changes frequently; verify against operator documentation before making blocking decisions.
| User-agent / Operator | Stated purpose | Verification |
|---|---|---|
| GPTBot / OpenAI | Training and retrieval for ChatGPT | Documented |
| ChatGPT-User / OpenAI | Real-time browsing for ChatGPT users | Documented |
| OAI-SearchBot / OpenAI | ChatGPT search feature | Documented |
| ClaudeBot / Anthropic | Training data collection | Limited documentation |
| anthropic-ai / Anthropic | Alternate identifier | Limited documentation |
| Google-Extended / Google | Gemini training control, separate from Search (a robots.txt token; crawling uses existing Google user agents) | Documented |
| Googlebot / Google | Search indexing only | Documented |
| Applebot-Extended / Apple | Apple Intelligence training | Documented |
| CCBot / Common Crawl | Open web archive (widely used for training) | Documented |
| PerplexityBot / Perplexity AI | Search and retrieval | Documented |
| Bytespider / ByteDance | Training and search | Limited documentation |
| meta-externalagent / Meta | AI training | Documented |
| Amazonbot / Amazon | Alexa and AI services | Documented |
| cohere-ai / Cohere | Training data collection | Limited documentation |
Last verified: December 2025
Training vs retrieval: often unclear
Some operators distinguish between training crawlers and retrieval crawlers. OpenAI, for example, separates GPTBot (training) from ChatGPT-User (browsing). Google separates Googlebot (search) from Google-Extended (AI training).
Others don't make this distinction, or their documentation is ambiguous. When documentation is unclear, assume the crawler may use content for training unless explicitly stated otherwise.
The crawler identity problem
User-agent strings are cheap. Verification is not.
Any HTTP client can declare itself as "GPTBot" or "Googlebot" in the User-Agent header. Search engines invested years building verification norms. Google publishes IP ranges and supports reverse DNS verification. Bing does the same. When you see a request claiming to be Googlebot, you can verify it.
AI crawler operators have not universally adopted these practices.
Verification asymmetry
| Crawler | Reverse DNS verification | Published IP ranges |
|---|---|---|
| Googlebot | Yes | Yes |
| Bingbot | Yes | Yes |
| GPTBot | Partial | Yes |
| ClaudeBot | No standard documented | No |
| CCBot | No | No |
| PerplexityBot | No standard documented | No |
For crawlers without verification mechanisms, you cannot reliably distinguish legitimate requests from spoofed ones using log analysis alone.
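Where verification is supported, the check is forward-confirmed reverse DNS: resolve the IP to a hostname, confirm the hostname belongs to an expected domain, then resolve that hostname back and confirm it matches the original IP. A sketch follows; the suffix table is a partial illustration, and the resolver functions are injectable so the logic can be exercised without network access.

```python
import socket

# Sketch of forward-confirmed reverse DNS, the verification norm that
# Google and Bing document for their crawlers. The suffix table is a
# partial illustration; resolvers are injectable for offline testing.
VERIFIED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}

def verify_crawler_ip(claimed_name, ip,
                      reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                      forward=socket.gethostbyname):
    """True only if reverse DNS of `ip` lands in an expected domain for
    the claimed crawler AND forward DNS of that hostname resolves back
    to the same IP."""
    suffixes = VERIFIED_SUFFIXES.get(claimed_name)
    if not suffixes:
        return False  # no documented verification mechanism
    try:
        hostname = reverse(ip)
    except OSError:
        return False
    if not hostname.endswith(suffixes):
        return False
    try:
        return forward(hostname) == ip
    except OSError:
        return False
```

For crawlers absent from the table, the function deliberately returns False: with no published verification mechanism, there is nothing to verify against.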
Operational implications
False positives in blocking: If you block on user-agent string alone, you may cut off legitimate traffic that happens to match a pattern.
False negatives in detection: You may miss spoofed traffic that uses a slightly different string, and traffic claiming to be something innocuous may actually be an undisclosed AI crawler. You can't know what you can't identify.
Over-blocking risks: Aggressive blocking of unverified "AI bot" traffic may inadvertently block legitimate retrieval systems that could drive traffic via citations. Visibility in AI systems depends partly on retrieval access.
What log analysis can and cannot reveal
Server logs show requests and their declared identity. They cannot confirm:
- Whether the declared identity is truthful
- Whether the crawler is training, retrieving, or both
- What happens to your content after it's fetched
See Log File Analysis for techniques to identify crawler traffic, with appropriate scepticism about unverifiable claims.
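As a starting point for that analysis, declared crawler identities can be tallied from combined-format access logs. A minimal sketch, with the crawler token list as an illustrative assumption; because identities are self-declared, treat the counts as claims, not verified facts.

```python
import re
from collections import Counter

# Sketch: tally declared AI-crawler user-agents in combined-format
# access logs. Identities are self-declared, so counts reflect what
# crawlers claim to be, not verified identity.
AI_CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot",
                     "Bytespider", "ChatGPT-User", "Amazonbot"]

# In the combined log format the user-agent is the final quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def tally_ai_crawlers(log_lines):
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        ua = match.group(1)
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in ua.lower():
                counts[token] += 1
    return counts
```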
Two separate risk axes
Decisions about AI crawler access often conflate two different questions:
- Will blocking this harm my search visibility?
- Will allowing this leak competitive or proprietary value?
These are orthogonal concerns with different answers depending on the crawler.
Search equity risk
Blocking crawlers that influence search rankings directly affects organic traffic. The impact of blocking Googlebot is obvious: your pages won't be indexed. But some AI-related crawlers have no relationship to search visibility.
Extraction risk
Allowing crawlers to access your content exposes it to potential use in training, which may:
- Reduce the need for users to visit your site (answers synthesised from your content)
- Enable competitors to benefit from models trained on your data
- Create no reciprocal value if the model doesn't cite sources
The risk matrix
| Crawler | Search equity risk if blocked | Extraction risk if allowed |
|---|---|---|
| Googlebot | High (no indexing) | Low (search indexing, not training) |
| Bingbot | Medium (Bing visibility) | Low (search indexing, not training) |
| Google-Extended | None (search unaffected) | Medium (Gemini training) |
| GPTBot | Low/None | Medium-High (model training) |
| ChatGPT-User | Low (no search impact) | Low (real-time retrieval, may cite) |
| ClaudeBot | None | Medium-High (model training) |
| CCBot | None directly | High (Common Crawl widely used for training) |
| PerplexityBot | Low | Low-Medium (retrieval with citation) |
Using the matrix for decisions
For content where extraction risk is low (public marketing pages, general informational content), broad access makes sense. The discovery value outweighs extraction concerns.
For content where extraction risk is high (proprietary research, premium content, competitive intelligence), block training crawlers while considering whether retrieval access with citation provides acceptable value exchange.
For content where search equity is critical (core landing pages, product pages), tread carefully. Distinguish between crawlers that affect search (Googlebot, Bingbot) and those that don't (most AI training crawlers).
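The matrix can be encoded as a small decision heuristic. The following sketch is illustrative only: the risk labels and thresholds are assumptions, and real decisions would weigh content type and business model as described above.

```python
# Toy encoding of the two-axis framework above. Risk labels and the
# decision thresholds are illustrative assumptions, not fixed rules.
def robots_policy(search_risk: str, extraction_risk: str) -> str:
    """Return a robots.txt stance for one crawler given its two risks."""
    if search_risk in ("high", "medium"):
        return "allow"      # blocking would cost search visibility
    if extraction_risk in ("high", "medium-high", "medium"):
        return "disallow"   # training exposure outweighs discovery value
    return "allow"          # low risk on both axes; keep discovery open

# Examples drawn from the matrix rows:
print(robots_policy("high", "low"))          # Googlebot-like: allow
print(robots_policy("none", "medium-high"))  # GPTBot-like: disallow
print(robots_policy("low", "low-medium"))    # PerplexityBot-like: allow
```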
robots.txt for AI crawler management
robots.txt remains the primary mechanism for communicating access preferences to crawlers. Its limitations are significant, but it's the most widely supported signal.
Basic syntax
```
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Disallow:
```
This configuration blocks several AI training crawlers while explicitly allowing Googlebot for search indexing. Consecutive User-agent lines form a single group that shares the directives following them.
Common configurations
Block all AI training, allow search:
```
# AI training crawlers - block
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Google-Extended
User-agent: Bytespider
User-agent: meta-externalagent
User-agent: cohere-ai
Disallow: /

# Search crawlers - allow
User-agent: Googlebot
User-agent: Bingbot
Disallow:

# Retrieval crawlers - allow (they cite sources)
User-agent: ChatGPT-User
User-agent: PerplexityBot
Disallow:
```
Selective path blocking:
```
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
Allow: /marketing/
```
This allows the listed crawlers to reach marketing content while blocking everything else for them. Note that Allow and its longest-match precedence are defined in RFC 9309, but not every crawler implements them consistently.
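A configuration like the ones above can be sanity-checked offline with Python's standard-library robots.txt parser before deployment. A minimal sketch:

```python
from urllib.robotparser import RobotFileParser

# Offline sanity check of a robots.txt draft. RobotFileParser shares a
# directive group across consecutive User-agent lines, matching the
# grouped configurations shown above.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/page"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/page"))  # True
```

One caveat: urllib.robotparser applies rules in file order rather than the longest-match precedence of RFC 9309, so treat it as a smoke test, not an emulator of any particular crawler.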
Limitations to remember
- Advisory only: Crawlers choose whether to comply
- Crawler-declared: You're trusting the user-agent string
- Path-level: Cannot distinguish by content type, only URL pattern
- No enforcement: No technical barrier prevents access if ignored
- No retroactive effect: Content already crawled remains in training data
Beyond robots.txt: licensing and intent signals
Several mechanisms aim to communicate AI-specific access preferences beyond robots.txt. None are universally adopted or enforced.
ai.txt
A proposed convention for declaring AI access preferences in a dedicated file. Example format:
```
# ai.txt
User-Agent: *
Disallow-Training: /
Allow-Retrieval: /public/
```
Status: Proposal stage. No major AI operators have committed to honouring it. Some implementations exist, but adoption is limited.
llms.txt
A proposed convention placing a markdown file at /llms.txt containing site information optimised for LLM consumption—essentially a human-readable summary designed to help AI systems understand your site's purpose and content.
Status: Not an accepted standard. No major AI providers (OpenAI, Anthropic, Google) or search engines have committed to checking or respecting this file. Without operator adoption, the file is unlikely to be crawled or processed. Implementation effort is better directed toward proven technical improvements such as structured data, crawl efficiency, and content quality.
Content Signals (Cloudflare)
Cloudflare's Content Signals Policy extends robots.txt with directives specifying how content may be used after access. Unlike traditional robots.txt (which controls where crawlers can go), Content Signals declare what crawlers may do with content they've fetched.
Three signals are defined:
- search: Permission to build a search index and show links or snippets in results (traditional search behaviour)
- ai-input: Permission to use content as input for AI-generated answers (RAG, AI Overviews, chatbot responses)
- ai-train: Permission to use content to train or fine-tune AI models
Example configuration:
```
User-Agent: *
Content-Signal: search=yes, ai-train=no, ai-input=no
Allow: /
```
This allows traditional search indexing while prohibiting both model training and real-time AI answer generation.
The policy includes human-readable comments that frame the signals as an "express reservation of rights" under the EU's 2019 Copyright Directive, positioning them as legally significant declarations rather than mere requests.
Adoption: Cloudflare has deployed Content Signals across 3.8 million domains using its managed robots.txt feature, with defaults of search=yes and ai-train=no. The ai-input signal is left unset by default.
Status: Not a ratified protocol extension. Compliance is voluntary. Google has not confirmed whether it will respect Content Signals. The specification is released under CC0 licence to encourage broader adoption. Generate the policy at contentsignals.org.
TDMRep (Text and Data Mining Reservation Protocol)
A W3C community group specification for declaring TDM (Text and Data Mining) permissions via HTTP headers or HTML meta tags.
```html
<meta name="tdm-reservation" content="1">
```
Or via HTTP header:
```
TDM-Reservation: 1
```
Status: Defined specification with limited adoption. Some EU regulatory frameworks reference TDM rights, which may increase relevance over time.
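Checking a page for either TDMRep signal can be done with the standard library. A sketch under stated assumptions: the input shapes (a headers dict and raw HTML string) are chosen for illustration, not prescribed by the specification.

```python
from html.parser import HTMLParser

# Sketch: detect a TDM reservation from response headers or an HTML
# <meta> tag, following the TDMRep forms shown above. The input shapes
# (headers dict, raw HTML string) are assumptions for illustration.
class _TDMMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name") == "tdm-reservation" and d.get("content") == "1":
                self.reserved = True

def tdm_reserved(headers, html=""):
    """True if the page declares a TDM reservation via header or meta tag."""
    if headers.get("TDM-Reservation") == "1":
        return True
    parser = _TDMMetaParser()
    parser.feed(html)
    return parser.reserved
```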
Terms of Service
Legal declarations that prohibit scraping, training, or specific uses of content. Not machine-readable, but establishes legal grounding for enforcement actions.
Status: Widely used, difficult to enforce, requires legal action to remediate violations.
Licensing endpoints
Emerging pattern where publishers offer structured licensing terms for AI training use. Content may be available for free retrieval but require licensing for training inclusion.
Status: Early stage. Some publishers have announced licensing deals with AI operators; no standardised protocol exists.
Hard limits of certainty
Certain things cannot be known with current tools and disclosures. Being explicit about these limits protects against false confidence.
You cannot audit whether previously crawled content influenced a model. If a crawler accessed your content before you blocked it, that content may already be in training data. No mechanism exists to verify inclusion or request removal from trained models.
You cannot verify whether a "block" prevented inclusion. robots.txt compliance is voluntary. A crawler may have read your robots.txt, ignored it, and fetched content anyway. You have no visibility into this.
You cannot reliably map crawlers to downstream features. When your content appears in an AI-generated answer, determining which crawl (training or retrieval) contributed is often impossible.
You cannot enforce ai.txt, llms.txt, or TDM signals. These signals declare intent. Whether any operator reads or respects them is opaque.
You cannot distinguish all training from retrieval crawlers. Some operators don't separate these functions. Some don't document their crawlers at all.
This uncertainty is structural, not a temporary gap. Access control decisions must account for these unknowns rather than assuming visibility that doesn't exist.
Practical implementation framework
Use the following framework to make access decisions based on content type and business model.
| Content type | Recommended approach | Rationale |
|---|---|---|
| Public marketing | Allow broadly | Discovery value exceeds extraction risk |
| Blog / Editorial | Allow broadly; consider blocking training | Attribution via retrieval has value; training doesn't |
| Product pages | Allow search; consider blocking training | Search visibility critical; training value unclear |
| Premium / Paywalled | Block training; gate at HTTP layer | Protect commercial value |
| Proprietary research | Block training; delivery-layer enforcement | High extraction risk warrants strong controls |
| User-generated content | Complex; review licensing terms | May have legal constraints on third-party use |
Implementation checklist
- Audit current state: What crawlers are accessing your site? (Log analysis)
- Classify content: Which sections have different risk profiles?
- Choose enforcement layers: robots.txt alone, or additional HTTP/delivery controls?
- Implement robots.txt: Block training crawlers; explicitly allow search crawlers
- Consider additional signals: TDMRep headers, Terms of Service updates
- Monitor and adjust: Track crawler behaviour post-implementation
Testing approach
After implementing blocks:
- Verify robots.txt is accessible and correctly formatted (use a robots.txt validator for syntax)
- Check server logs for crawler requests and responses
- Monitor for new/unknown crawler user-agents
- Review Search Console for unexpected indexing changes (if search crawlers were affected)
What to watch
This space evolves rapidly. The following signals indicate where things may be heading.
Crawler identity practices: Will AI operators adopt verification norms similar to search engines? Standardised verification would enable more confident access control decisions.
Licensing-backed retrieval: Commercial models where content access is negotiated rather than scraped. If this becomes standard, the "block everything" approach may give way to selective licensing arrangements.
Search and RAG convergence: Google and others are blending traditional indexing with retrieval-augmented generation. The distinction between "search crawler" and "AI crawler" may become less clear as search itself incorporates AI synthesis.
Regulatory pressure: EU AI Act, copyright litigation outcomes, and TDM opt-out enforcement may force operators to respect declared preferences. Regulatory clarity would change the enforcement equation.
robots.txt evolution: Potential for new directives or extensions specific to AI use cases. The robots.txt specification hasn't changed substantively in decades; pressure for AI-specific signals may drive evolution.
FAQs
Does blocking GPTBot affect my visibility in ChatGPT?
Blocking GPTBot prevents your content from being used in future model training. It does not block ChatGPT-User, which handles real-time browsing when users ask ChatGPT to search the web. For retrieval-based visibility, you may want to allow ChatGPT-User while blocking GPTBot.
If I block AI crawlers now, is my content already in their training data?
Possibly. If crawlers accessed your content before you implemented blocks, that content may already be incorporated into trained models. robots.txt blocks are not retroactive. There is no standard mechanism to request removal from already-trained models.
Should I block Common Crawl (CCBot)?
Common Crawl's archive is widely used for AI training. Blocking CCBot reduces exposure to this pathway. However, Common Crawl also supports legitimate research and archival purposes. The decision depends on whether you value those uses versus the training exposure risk.
How do I verify if a crawler is legitimate?
For crawlers with documented verification (Googlebot, Bingbot, GPTBot), perform reverse DNS lookup on the IP address, then forward DNS on the resulting hostname. It should resolve back to the original IP. For crawlers without published verification mechanisms, you cannot reliably verify identity.
Will blocking AI crawlers hurt my SEO?
Blocking crawlers that don't affect search indexing (GPTBot, ClaudeBot, Google-Extended, CCBot) has no direct impact on traditional SEO. Blocking Googlebot or Bingbot will harm search visibility. The key is distinguishing between search crawlers and AI training crawlers.
Key takeaways
- Three crawler types, three purposes: Training crawlers absorb content into models; retrieval crawlers fetch content at query time; search crawlers index for traditional results. Each warrants different access decisions.
- robots.txt is advisory, not enforcement: Compliance depends on crawler operators. The only technical enforcement occurs at the HTTP or delivery layer.
- Verification is asymmetric: Search engines invested years in verification norms. Most AI crawlers lack equivalent mechanisms. User-agent strings can be spoofed; treat unverified identities with scepticism.
- Separate search risk from extraction risk: Blocking GPTBot has no search impact. Blocking Googlebot does. Evaluate each crawler against both dimensions.
- Uncertainty is structural: You cannot audit training data inclusion, verify robots.txt compliance, or know what happens to crawled content. Make decisions with these limits in mind.
Further reading
- OpenAI crawler documentation: Official documentation for GPTBot, ChatGPT-User, and OAI-SearchBot
- Google crawlers overview: Documentation distinguishing Googlebot from Google-Extended and other Google crawlers
- Anthropic ClaudeBot documentation: Official documentation for ClaudeBot crawler identification
- Cloudflare Content Signals Policy: Cloudflare's robots.txt extension for declaring AI training and input permissions
- Content Signals generator: Tool for generating Content Signals policy text for robots.txt
- Common Crawl CCBot information: Documentation for the Common Crawl web archive crawler
- TDMRep specification: W3C community group specification for the Text and Data Mining reservation protocol
- robots.txt specification: Google's documentation on robots.txt syntax and behaviour