How does LLM retrieval differ from traditional search?
LLM-driven retrieval evaluates content by semantic meaning rather than term matching, and selects sources based on information gain rather than link authority. Where traditional search engines match queries against documents using lexical signals (TF-IDF, BM25) combined with authority metrics (PageRank), RAG systems encode both queries and content into high-dimensional vector representations, then identify relevant material through geometric proximity in semantic space.
This article examines RAG architecture and retrieval mechanisms in general—not the specifics of any single platform (ChatGPT, Gemini, Claude, Perplexity). While implementations vary, the underlying patterns are consistent across systems. Understanding these fundamentals is more durable than tracking individual platform behaviours, which change frequently.
For product teams and SEO practitioners, understanding these mechanisms is essential for maintaining visibility in AI systems as AI-mediated discovery grows.
From keywords to vectors
Vector embeddings represent words and phrases as coordinates in a multi-dimensional space. Content with similar meaning clusters together, regardless of the specific words used.
This addresses the vocabulary mismatch problem inherent in keyword search. However, it introduces new challenges. Embedding models encode relationships learned during training—if the model hasn't learned a specific conceptual relationship, semantically related content may appear distant in vector space and fail to be retrieved.
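As a rough sketch of how that proximity is computed (the embedding model named here is purely illustrative; any embedding model follows the same pattern), cosine similarity over normalised vectors is the standard measure:

```python
# Minimal sketch: semantic similarity via embeddings. Model choice is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "How to fix a leaking tap",
    "Best hiking trails in the Lake District",
    "Annual review of smartphone cameras",
]
query = "my kitchen faucet keeps dripping"

# With normalised vectors, the dot product equals cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)
scores = doc_vecs @ query_vec

# The "leaking tap" document should score highest despite sharing no keywords
# with the query: "tap" and "faucet" sit close together in vector space.
for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```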
The computational cost of exhaustive similarity search across large corpora is prohibitive. Production systems employ Approximate Nearest Neighbour (ANN) algorithms, typically Hierarchical Navigable Small World (HNSW) graphs, to trade marginal accuracy for substantial speed gains. This introduces non-determinism: the mathematically closest match may occasionally be missed if the graph traversal terminates prematurely. This variability is one reason AI visibility tracking has significant limitations.
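A minimal sketch of what that ANN layer looks like in practice, using the hnswlib library; index parameters such as M and ef are illustrative and would be tuned per corpus in production:

```python
# Sketch: approximate nearest-neighbour search with an HNSW index (hnswlib).
import hnswlib
import numpy as np

dim = 384                                                  # must match the embedding model's output size
vectors = np.random.rand(10_000, dim).astype("float32")    # stand-in for real document embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), M=16, ef_construction=200)
index.add_items(vectors, np.arange(len(vectors)))

index.set_ef(50)                                           # higher ef = better recall, slower queries
query = np.random.rand(dim).astype("float32")
labels, distances = index.knn_query(query, k=10)
# Because graph traversal is approximate, the true nearest neighbour can occasionally be missed.
```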
Hybrid retrieval and rank fusion
Semantic search alone has a weakness: it can miss content when exact terminology matters. A query for "iPhone 15 Pro Max" might retrieve content about smartphones generally, missing the specific product page. Brand names, model numbers, and technical identifiers don't always embed distinctively.
To address this, production RAG systems run two searches in parallel:
- Semantic search: Finds content that means similar things, even with different wording.
- Keyword search (BM25): Finds content containing the exact terms in the query.
The results are merged using a technique called Reciprocal Rank Fusion (RRF). Content appearing near the top of both lists gets prioritised—it's both semantically relevant and contains the right terminology. This is why including specific product names, model numbers, and industry terms in your content still matters, even in a semantic search world.
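A minimal sketch of RRF itself, assuming two ranked lists of document IDs have already been produced by the parallel searches; k=60 is the constant proposed in the original RRF paper:

```python
# Sketch: Reciprocal Rank Fusion over two ranked lists of document IDs.
from collections import defaultdict

def rrf(rankings, k=60):
    """Merge ranked lists; each document scores 1/(k + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_b", "doc_a", "doc_d"]   # from vector search
keyword  = ["doc_a", "doc_c", "doc_b"]   # from BM25
print(rrf([semantic, keyword]))          # doc_a and doc_b surface first: strong in both lists
```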
Query transformation
Raw user queries are often ambiguous or lack sufficient context for effective vector matching. RAG architectures address this through query transformation:
How query transformation works
Instead of embedding the raw user query directly, RAG systems first transform it to improve retrieval quality. The original query enters the system, but what gets vectorised and sent to the retrieval engine is a modified version designed to match relevant documents more effectively.
Query transformation isn't applied universally. Systems use a preliminary classifier to evaluate whether a query requires decomposition or can be processed as-is. Straightforward factual queries receive direct processing, while ambiguous or multi-dimensional requests trigger the transformation pipeline. This filtering step occurs before the main model evaluates the query, reducing computational overhead for simple requests.
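The sketch below shows where that routing step sits in the pipeline; real systems typically use a small learned classifier rather than a keyword heuristic, so treat this as illustrative only:

```python
# Illustrative routing step: decide whether a query needs transformation before retrieval.
# Real systems usually train a lightweight classifier; this heuristic just shows the shape.
COMPLEX_MARKERS = ("compare", " vs ", "difference between", "best", "how does", "why")

def needs_transformation(query: str) -> bool:
    q = query.lower()
    return len(q.split()) > 8 or any(marker in q for marker in COMPLEX_MARKERS)

print(needs_transformation("capital of France"))                         # False -> embed directly
print(needs_transformation("How does product A compare to product B?"))  # True  -> transform first
```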
Three main approaches exist:
Decomposition (Query Fan-out): Complex queries are split into simpler sub-queries that can be processed independently. A comparative question like "How does product A compare to product B?" becomes two separate retrieval tasks—one for product A, one for product B. The system retrieves documents for each sub-query, then synthesises the results into a unified answer. This prevents the system from searching for a single document that covers both topics, which may not exist.
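A hedged sketch of the fan-out step; `llm()` stands in for whatever completion call the system uses and is purely hypothetical:

```python
# Sketch of query fan-out. `llm()` is a hypothetical stand-in for an LLM completion call.
def llm(prompt: str) -> str:
    raise NotImplementedError("call your completion API of choice here")

def decompose(query: str) -> list[str]:
    prompt = (
        "Split the following question into independent sub-questions, one per line:\n"
        f"{query}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

# "How does product A compare to product B?" would typically yield something like:
#   ["What are the key features of product A?", "What are the key features of product B?"]
# Each sub-query is retrieved independently; the results are synthesised into one answer.
```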
Hypothetical Document Embeddings (HyDE): The LLM generates an idealised answer to the query first, then uses that generated answer for retrieval instead of the original question. If a user asks "What causes database deadlocks?", the system generates a hypothetical explanation of database deadlocks, embeds that explanation, and searches for documents similar to it. This shifts matching from "query-to-document" similarity to "answer-to-document" similarity, which often produces better results because the hypothetical answer uses terminology and structure closer to actual documentation.
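A sketch of the HyDE pattern under the same assumptions (hypothetical `llm()` call, illustrative embedding model):

```python
# Sketch of HyDE: retrieve with the embedding of a *hypothetical answer*, not the raw query.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def llm(prompt: str) -> str:
    raise NotImplementedError("call your completion API of choice here")

def hyde_vector(query: str):
    hypothetical = llm(f"Write a short, factual passage answering: {query}")
    # The index is searched with this vector instead of the query's own embedding.
    return embedder.encode(hypothetical, normalize_embeddings=True)
```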
Reasoning-Then-Embedding (LREM): Before embedding the query, the model performs a reasoning step to articulate the user's underlying intent. A query like "best laptop under £1000" gets expanded into "Looking for laptop recommendations with specifications and prices, focusing on models currently available for purchase under £1000". This explicit reasoning captures nuance that the raw query doesn't express, improving the precision of the embedding.
Re-ranking: the second filter
Initial retrieval is fast but imprecise—it typically returns 50–100 candidate documents. Re-ranking applies a more rigorous evaluation to narrow this down to the handful that will actually inform the response.
How re-ranking works
The initial search evaluates queries and documents separately, which is fast but misses nuance. Re-ranking evaluates them together, asking: "Given this specific query, how relevant is this specific document?" This catches relevance signals that broad similarity matching misses.
Each document receives a relevance score (typically 0 to 1). Documents scoring below a confidence threshold—often around 0.75—are discarded entirely. The system would rather use fewer sources than risk grounding on marginally relevant content.
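One common way to implement this stage is with a cross-encoder; the model name and the 0.75 cut-off below are illustrative rather than any specific platform's actual values:

```python
# Sketch: cross-encoder re-ranking of initial candidates with a relevance cut-off.
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], threshold: float = 0.75, top_k: int = 5):
    # Unlike the initial vector search, the cross-encoder reads query and document together.
    raw = reranker.predict([(query, doc) for doc in candidates])
    probs = 1 / (1 + np.exp(-raw))                       # squash raw scores into (0, 1)
    ranked = sorted(zip(candidates, probs), key=lambda x: -x[1])
    return [(doc, float(p)) for doc, p in ranked if p >= threshold][:top_k]
```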
Rather than supplying complete pages or brief SERP snippets, the grounding mechanism assembles targeted excerpts from source documents. Multiple relevant sections are extracted and concatenated, creating query-specific context that isolates the most pertinent information. This selective assembly balances specificity with conciseness—the model receives enough detail to ground its response without processing redundant or tangential content.
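A simple sketch of that assembly step, using a character budget as a stand-in for the token budget a real system would enforce:

```python
# Sketch: assemble a grounding context from re-ranked excerpts under a size budget.
def build_context(excerpts: list[tuple[str, float]], budget_chars: int = 4000) -> str:
    parts, used = [], 0
    for text, score in sorted(excerpts, key=lambda x: -x[1]):
        if used + len(text) > budget_chars:
            continue                      # skip excerpts that would overflow the budget
        parts.append(text)
        used += len(text)
    return "\n\n---\n\n".join(parts)      # concatenated, query-specific context for the generator
```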
Why this matters for content
Re-ranking is where topical precision pays off. A page that broadly covers a topic might pass initial retrieval, but a page that directly addresses the specific question scores higher in re-ranking. Content structured around clear questions and direct answers tends to perform better at this stage than comprehensive but unfocused pages.
Beyond ranking: rationale-based selection
Newer systems are moving beyond simple "find the most similar content" approaches. Instead of ranking by similarity, they select by reasoning.
The process works like this: before searching, the system generates a rationale—a statement of what evidence would be needed to answer the query properly. For a question like "What's the return policy for electronics?", the rationale might specify: "Need official policy document, specific to electronics category, with timeframes and conditions."
Retrieved content is then evaluated against this rationale, not just against the query. This approach selects content based on whether it actually answers the question, rather than whether it uses similar words. Research shows this reduces the amount of content retrieved while improving answer accuracy by over 33%.
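A hedged sketch of the idea: generate the rationale first, then score retrieved chunks against the rationale instead of the raw query (hypothetical `llm()` call, illustrative embedding model):

```python
# Sketch of rationale-based selection. `llm()` is a hypothetical completion call.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def llm(prompt: str) -> str:
    raise NotImplementedError("call your completion API of choice here")

def select_by_rationale(query: str, chunks: list[str], top_k: int = 3):
    rationale = llm(
        "Describe the evidence needed to answer this question "
        f"(source type, scope, required details): {query}"
    )
    # Score chunks against the rationale, not the raw query.
    r_vec = embedder.encode(rationale, normalize_embeddings=True)
    c_vecs = embedder.encode(chunks, normalize_embeddings=True)
    scores = c_vecs @ r_vec
    return sorted(zip(chunks, scores), key=lambda x: -x[1])[:top_k]
```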
How citations get attached
When an AI response cites your content, how did that citation decision get made? Two approaches exist:
- Cite after writing: The system generates an answer first, then searches for sources to back it up. This is prone to weak citations—sources get attached to claims they don't fully support.
- Cite while writing: The system only makes claims it can immediately ground in retrieved sources. If no source supports a statement, the statement doesn't get made.
Verification and correction
Some systems add a checking step after generation. The response is compared against cited sources, and citations that don't hold up are either replaced with better matches or removed. Claims without adequate support may be rewritten or cut entirely.
DeepMind's GopherCite takes this further: if retrieved evidence is insufficient to meet a confidence threshold, the system returns no answer rather than an unsupported one.
Information gain as a selection signal
Google's Information Gain patent describes a key determinant in source selection: the additional information a document provides beyond what other documents in the result set already cover.
In AI Overviews and RAG responses, the system seeks to synthesise complete answers. It prioritises sources offering complementary information. If Source A covers the basic definition, the system looks for Source B covering examples, statistics, or advanced nuance—not another source duplicating Source A.
Content that merely repeats consensus is mathematically redundant in vector space and is filtered during diversity/deduplication phases of generation. This elevates differentiation as a primary ranking signal.
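A rough sketch of that redundancy effect: a greedy filter that drops candidates too similar to sources already selected, assuming normalised embeddings and an illustrative similarity threshold:

```python
# Sketch: greedy diversity filter. Candidates that are near-duplicates of already-selected
# sources (cosine similarity above the threshold) add little information and are dropped.
import numpy as np

def diversity_filter(candidate_vecs: np.ndarray, max_sources: int = 5, threshold: float = 0.9):
    selected: list[int] = []
    for i in range(len(candidate_vecs)):          # candidates assumed pre-sorted by relevance
        vec = candidate_vecs[i]                   # and embeddings assumed normalised
        if all(float(vec @ candidate_vecs[j]) < threshold for j in selected):
            selected.append(i)                    # keep only sources that add new information
        if len(selected) == max_sources:
            break
    return selected
```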
Structured data and entity recognition
In traditional search, structured data has a well-defined role: enabling rich results, knowledge panels, and enhanced SERP features. Its role in generative AI systems is less clear.
What we know:
- RAG systems parse HTML to extract text for embedding. Cleaner, well-structured pages are easier to parse accurately.
- Entity recognition matters—systems use knowledge graphs to verify claims and identify authoritative sources. Content associated with recognised entities (brands, products, people) may receive trust signals.
What's unproven:
- Whether schema markup (FAQPage, HowTo, Article) directly influences RAG retrieval or citation selection. Testing by technical SEOs has produced inconclusive results.
- Whether structured data provides meaningful advantages beyond what clean HTML and clear content structure already offer.
The conservative position: implement structured data for its proven benefits in traditional search, but don't expect it to be a lever for AI visibility in the way it functions for rich results. Entity clarity and content quality remain the more reliable signals. See entity clarity optimisation for practical guidance.
Some RAG architectures incorporate reliability estimation, aggregating information across sources and detecting conflicts. Sources providing outlier claims without corroboration may be down-weighted—but this filtering happens at the content level, not the markup level.
Impact on traffic and visibility
The deployment of these mechanisms has precipitated measurable shifts in user behaviour and traffic patterns.
The zero-click reality
Pew Research data indicates that when an AI Overview is present, users click on citations within the summary only 1% of the time. Users end their search session after reading an AI summary 26% of the time, compared with 16% for standard result pages.
Gartner predicts a 25% drop in total search engine volume by 2026 due to migration toward chatbots and virtual agents—representing a structural shift in discovery behaviour rather than a temporary fluctuation.
Divergence from organic rankings
RAG systems use different relevance criteria than traditional organic algorithms. Where organic search evaluates full pages using link-based authority signals, generative systems evaluate semantic chunks using embedding similarity and information gain. A page ranking #1 organically may fail to appear in AI Overviews if its content isn't structured for chunk-level extraction or lacks differentiated information.
This divergence means that traditional rank tracking provides incomplete visibility data. Content can be highly visible in generative responses while ranking poorly in organic results, or vice versa.
As AI-mediated discovery grows, visibility metrics shift accordingly. Selection rate—the frequency with which models cite your content from the pool of retrieved candidates—emerges as a more relevant indicator than CTR for AI visibility. Unlike CTR, which tracks user behaviour, selection rate reflects algorithmic citation decisions. However, selection rate remains difficult to measure reliably given current tooling constraints. This represents a fundamental shift from measuring human engagement to measuring machine preference—with the caveat that the new metric is harder to observe.
| Metric | Traditional Search | Generative Search |
|---|---|---|
| Ranking unit | Full page (URL) | Semantic chunk / passage |
| Primary signal | Backlinks, keywords | Embeddings, information gain, entities |
| Selection logic | PageRank, authority metrics | Attention weights, rationale alignment |
| User behaviour | Scan → Click | Read summary → End session |
| Traffic outcome | High CTR (top positions) | Ultra-low CTR (<1%), brand impressions |
| Key metric | Click-through rate (CTR) | Selection rate (citation frequency) |
FAQs
Does domain authority still matter for LLM visibility?
Indirectly. Most RAG pipelines don't have built-in PageRank equivalents—a well-written post from a small site can be retrieved if it's topically precise. However, systems increasingly incorporate reliability estimation and source verification, which favour established entities. The practical effect: authority matters for citation selection, but it operates through trustworthiness weights rather than link-based signals.
How do I know if my content is being used by AI systems?
Tools and search engines are beginning to offer insights—Bing shows which sites were referenced in answers. Monitor brand mentions in AI Overviews, track referral traffic from AI-integrated surfaces, and compare your content's coverage against competitors who appear consistently in generative responses. However, significant measurement limitations exist—interpret tracking data cautiously.
Should I structure content differently for RAG systems?
Yes. RAG systems chunk documents into passages before embedding. Content that is well-sectioned with each section devoted to a single subtopic produces focused embeddings that are more likely to rank highly for specific queries. A coherent paragraph answering "What is X?" is more retrievable than a wall of text covering multiple topics.
Key takeaways
- Semantic relevance is the primary retrieval signal: Embeddings capture meaning, not keywords. Comprehensive topical coverage matters more than keyword density.
- Hybrid search combines semantic and lexical matching: Include important terminology naturally—exact keywords still help, particularly for proper nouns and technical identifiers.
- Chunk structure affects retrievability: Well-sectioned content with focused paragraphs produces more precise embeddings. Each section should answer a specific question or address a single concept.
- Information gain drives citation selection: Differentiated content that adds new data points or perspectives is prioritised over content that duplicates consensus.
- Visibility is decoupling from traffic: With <1% CTR on AI citations, the value of appearing in generative responses may shift toward brand impressions and authority signals rather than direct sessions.
Further reading
- Retrieval Augmented Generation (RAG) and Semantic Search for GPTs
  OpenAI's documentation on how GPTs use RAG to retrieve and ground responses in external knowledge
- Retrieval-Augmented Generation for Large Language Models: A Survey (arXiv)
  Comprehensive academic survey of RAG architectures and evaluation methods
- Better RAG 1: Advanced Basics (Hrishi Olickel)
  Practical engineering guide to RAG system design and retrieval optimisation
- Citation: A Key to Building Responsible LLMs (arXiv)
  Technical analysis of attribution mechanisms and citation accuracy metrics
- Google's Information Gain Patent (US20200349181A1)
  Primary source for understanding information gain as a ranking signal