How does LLM retrieval differ from traditional search?
LLM-driven retrieval evaluates content by semantic meaning rather than term matching, and selects sources based on information gain rather than link authority. Where traditional search engines match queries against documents using lexical signals (TF-IDF, BM25) combined with authority metrics (PageRank), RAG systems encode both queries and content into high-dimensional vector representations, then identify relevant material through geometric proximity in semantic space.
This article examines RAG architecture and retrieval mechanisms in general—not the specifics of any single platform (ChatGPT, Gemini, Claude, Perplexity). While implementations vary, the underlying patterns are consistent across systems. Understanding these fundamentals is more durable than tracking individual platform behaviours, which change frequently.
For product teams and SEO practitioners, understanding these mechanisms is essential for maintaining visibility in AI systems as AI-mediated discovery grows.
From keywords to vectors
Vector embeddings represent words and phrases as coordinates in a multi-dimensional space. Content with similar meaning clusters together, regardless of the specific words used.
This addresses the vocabulary mismatch problem inherent in keyword search. However, it introduces new challenges. Embedding models encode relationships learned during training—if the model hasn't learned a specific conceptual relationship, semantically related content may appear distant in vector space and fail to be retrieved.
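As a rough sketch of how that proximity is computed (the embedding model named here is purely illustrative; any embedding model follows the same pattern), cosine similarity over normalised vectors is the standard measure:

```python
# Minimal sketch: semantic similarity via embeddings. Model choice is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "How to fix a leaking tap",
    "Best hiking trails in the Lake District",
    "Annual review of smartphone cameras",
]
query = "my kitchen faucet keeps dripping"

# With normalised vectors, the dot product equals cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)
scores = doc_vecs @ query_vec

# The "leaking tap" document should score highest despite sharing no keywords
# with the query: "tap" and "faucet" sit close together in vector space.
for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```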
The computational cost of exhaustive similarity search across large corpora is prohibitive. Production systems employ Approximate Nearest Neighbour (ANN) algorithms, typically Hierarchical Navigable Small World (HNSW) graphs, to trade marginal accuracy for substantial speed gains. This introduces non-determinism: the mathematically closest match may occasionally be missed if the graph traversal terminates prematurely. This variability is one reason AI visibility tracking has significant limitations.
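A minimal sketch of what that ANN layer looks like in practice, using the hnswlib library; index parameters such as M and ef are illustrative and would be tuned per corpus in production:

```python
# Sketch: approximate nearest-neighbour search with an HNSW index (hnswlib).
import hnswlib
import numpy as np

dim = 384                                                  # must match the embedding model's output size
vectors = np.random.rand(10_000, dim).astype("float32")    # stand-in for real document embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), M=16, ef_construction=200)
index.add_items(vectors, np.arange(len(vectors)))

index.set_ef(50)                                           # higher ef = better recall, slower queries
query = np.random.rand(dim).astype("float32")
labels, distances = index.knn_query(query, k=10)
# Because graph traversal is approximate, the true nearest neighbour can occasionally be missed.
```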
Hybrid retrieval and rank fusion
Semantic search alone has a weakness: it can miss content when exact terminology matters. A query for "iPhone 15 Pro Max" might retrieve content about smartphones generally, missing the specific product page. Brand names, model numbers, and technical identifiers don't always embed distinctively.
To address this, production RAG systems run two searches in parallel:
- Semantic search: Finds content that means similar things, even with different wording.
- Keyword search (BM25): Finds content containing the exact terms in the query.
The results are merged using a technique called Reciprocal Rank Fusion (RRF). Content appearing near the top of both lists gets prioritised—it's both semantically relevant and contains the right terminology. This is why including specific product names, model numbers, and industry terms in your content still matters, even in a semantic search world.
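A minimal sketch of RRF itself, assuming two ranked lists of document IDs have already been produced by the parallel searches; k=60 is the constant proposed in the original RRF paper:

```python
# Sketch: Reciprocal Rank Fusion over two ranked lists of document IDs.
from collections import defaultdict

def rrf(rankings, k=60):
    """Merge ranked lists; each document scores 1/(k + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_b", "doc_a", "doc_d"]   # from vector search
keyword  = ["doc_a", "doc_c", "doc_b"]   # from BM25
print(rrf([semantic, keyword]))          # doc_a and doc_b surface first: strong in both lists
```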
Query transformation
Raw user queries are often ambiguous or lack sufficient context for effective vector matching. RAG architectures address this through query transformation:
How query transformation works
Instead of embedding the raw user query directly, RAG systems first transform it to improve retrieval quality. The original query enters the system, but what gets vectorised and sent to the retrieval engine is a modified version designed to match relevant documents more effectively.
Query transformation isn't applied universally. Systems use a preliminary classifier to evaluate whether a query requires decomposition or can be processed as-is. Straightforward factual queries receive direct processing, while ambiguous or multi-dimensional requests trigger the transformation pipeline. This filtering step occurs before the main model evaluates the query, reducing computational overhead for simple requests.
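The sketch below shows where that routing step sits in the pipeline; real systems typically use a small learned classifier rather than a keyword heuristic, so treat this as illustrative only:

```python
# Illustrative routing step: decide whether a query needs transformation before retrieval.
# Real systems usually train a lightweight classifier; this heuristic just shows the shape.
COMPLEX_MARKERS = ("compare", " vs ", "difference between", "best", "how does", "why")

def needs_transformation(query: str) -> bool:
    q = query.lower()
    return len(q.split()) > 8 or any(marker in q for marker in COMPLEX_MARKERS)

print(needs_transformation("capital of France"))                         # False -> embed directly
print(needs_transformation("How does product A compare to product B?"))  # True  -> transform first
```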
Three main approaches exist:
Decomposition (Query Fan-out): Complex queries are split into simpler sub-queries that can be processed independently. A comparative question like "How does product A compare to product B?" becomes two separate retrieval tasks—one for product A, one for product B. The system retrieves documents for each sub-query, then synthesises the results into a unified answer. This prevents the system from searching for a single document that covers both topics, which may not exist.
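A hedged sketch of the fan-out step; `llm()` stands in for whatever completion call the system uses and is purely hypothetical:

```python
# Sketch of query fan-out. `llm()` is a hypothetical stand-in for an LLM completion call.
def llm(prompt: str) -> str:
    raise NotImplementedError("call your completion API of choice here")

def decompose(query: str) -> list[str]:
    prompt = (
        "Split the following question into independent sub-questions, one per line:\n"
        f"{query}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

# "How does product A compare to product B?" would typically yield something like:
#   ["What are the key features of product A?", "What are the key features of product B?"]
# Each sub-query is retrieved independently; the results are synthesised into one answer.
```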
Hypothetical Document Embeddings (HyDE): The LLM generates an idealised answer to the query first, then uses that generated answer for retrieval instead of the original question. If a user asks "What causes database deadlocks?", the system generates a hypothetical explanation of database deadlocks, embeds that explanation, and searches for documents similar to it. This shifts matching from "query-to-document" similarity to "answer-to-document" similarity, which often produces better results because the hypothetical answer uses terminology and structure closer to actual documentation.
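A sketch of the HyDE pattern under the same assumptions (hypothetical `llm()` call, illustrative embedding model):

```python
# Sketch of HyDE: retrieve with the embedding of a *hypothetical answer*, not the raw query.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def llm(prompt: str) -> str:
    raise NotImplementedError("call your completion API of choice here")

def hyde_vector(query: str):
    hypothetical = llm(f"Write a short, factual passage answering: {query}")
    # The index is searched with this vector instead of the query's own embedding.
    return embedder.encode(hypothetical, normalize_embeddings=True)
```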
Reasoning-Then-Embedding (LREM): Before embedding the query, the model performs a reasoning step to articulate the user's underlying intent. A query like "best laptop under £1000" gets expanded into "Looking for laptop recommendations with specifications and prices, focusing on models currently available for purchase under £1000". This explicit reasoning captures nuance that the raw query doesn't express, improving the precision of the embedding.
Re-ranking: the second filter
Initial retrieval is fast but imprecise—it typically returns 50–100 candidate documents. Re-ranking applies a more rigorous evaluation to narrow this down to the handful that will actually inform the response.
How re-ranking works
The initial search evaluates queries and documents separately, which is fast but misses nuance. Re-ranking evaluates them together, asking: "Given this specific query, how relevant is this specific document?" This catches relevance signals that broad similarity matching misses.
Each document receives a relevance score (typically 0 to 1). Documents scoring below a confidence threshold—often around 0.75—are discarded entirely. The system would rather use fewer sources than risk grounding on marginally relevant content.
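One common way to implement this stage is with a cross-encoder; the model name and the 0.75 cut-off below are illustrative rather than any specific platform's actual values:

```python
# Sketch: cross-encoder re-ranking of initial candidates with a relevance cut-off.
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], threshold: float = 0.75, top_k: int = 5):
    # Unlike the initial vector search, the cross-encoder reads query and document together.
    raw = reranker.predict([(query, doc) for doc in candidates])
    probs = 1 / (1 + np.exp(-raw))                       # squash raw scores into (0, 1)
    ranked = sorted(zip(candidates, probs), key=lambda x: -x[1])
    return [(doc, float(p)) for doc, p in ranked if p >= threshold][:top_k]
```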
Rather than supplying complete pages or brief SERP snippets, the grounding mechanism assembles targeted excerpts from source documents. Multiple relevant sections are extracted and concatenated, creating query-specific context that isolates the most pertinent information. This selective assembly balances specificity with conciseness—the model receives enough detail to ground its response without processing redundant or tangential content.
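A simple sketch of that assembly step, using a character budget as a stand-in for the token budget a real system would enforce:

```python
# Sketch: assemble a grounding context from re-ranked excerpts under a size budget.
def build_context(excerpts: list[tuple[str, float]], budget_chars: int = 4000) -> str:
    parts, used = [], 0
    for text, score in sorted(excerpts, key=lambda x: -x[1]):
        if used + len(text) > budget_chars:
            continue                      # skip excerpts that would overflow the budget
        parts.append(text)
        used += len(text)
    return "\n\n---\n\n".join(parts)      # concatenated, query-specific context for the generator
```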
Why this matters for content
Re-ranking is where topical precision pays off. A page that broadly covers a topic might pass initial retrieval, but a page that directly addresses the specific question scores higher in re-ranking. Content structured around clear questions and direct answers tends to perform better at this stage than comprehensive but unfocused pages.
Beyond ranking: rationale-based selection
Newer systems are moving beyond simple "find the most similar content" approaches. Instead of ranking by similarity, they select by reasoning.
The process works like this: before searching, the system generates a rationale—a statement of what evidence would be needed to answer the query properly. For a question like "What's the return policy for electronics?", the rationale might specify: "Need official policy document, specific to electronics category, with timeframes and conditions."
Retrieved content is then evaluated against this rationale, not just against the query. This approach selects content based on whether it actually answers the question, rather than whether it uses similar words. Research shows this reduces the amount of content retrieved while improving answer accuracy by over 33%.
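A hedged sketch of the idea: generate the rationale first, then score retrieved chunks against the rationale instead of the raw query (hypothetical `llm()` call, illustrative embedding model):

```python
# Sketch of rationale-based selection. `llm()` is a hypothetical completion call.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def llm(prompt: str) -> str:
    raise NotImplementedError("call your completion API of choice here")

def select_by_rationale(query: str, chunks: list[str], top_k: int = 3):
    rationale = llm(
        "Describe the evidence needed to answer this question "
        f"(source type, scope, required details): {query}"
    )
    # Score chunks against the rationale, not the raw query.
    r_vec = embedder.encode(rationale, normalize_embeddings=True)
    c_vecs = embedder.encode(chunks, normalize_embeddings=True)
    scores = c_vecs @ r_vec
    return sorted(zip(chunks, scores), key=lambda x: -x[1])[:top_k]
```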
How citations get attached
When an AI response cites your content, how did that citation decision get made? Two approaches exist:
- Cite after writing: The system generates an answer first, then searches for sources to back it up. This is prone to weak citations—sources get attached to claims they don't fully support.
- Cite while writing: The system only makes claims it can immediately ground in retrieved sources. If no source supports a statement, the statement doesn't get made.
Verification and correction
Some systems add a checking step after generation. The response is compared against cited sources, and citations that don't hold up are either replaced with better matches or removed. Claims without adequate support may be rewritten or cut entirely.
DeepMind's GopherCite takes this further: if retrieved evidence is insufficient to meet a confidence threshold, the system returns no answer rather than an unsupported one.
Information gain as a selection signal
Google's Information Gain patent describes a key determinant in source selection: the additional information a document provides beyond what other documents in the result set already cover.
In AI Overviews and RAG responses, the system seeks to synthesise complete answers. It prioritises sources offering complementary information. If Source A covers the basic definition, the system looks for Source B covering examples, statistics, or advanced nuance—not another source duplicating Source A.
Content that merely repeats consensus is mathematically redundant in vector space and is filtered during diversity/deduplication phases of generation. This elevates differentiation as a primary ranking signal.
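A rough sketch of that redundancy effect: a greedy filter that drops candidates too similar to sources already selected, assuming normalised embeddings and an illustrative similarity threshold:

```python
# Sketch: greedy diversity filter. Candidates that are near-duplicates of already-selected
# sources (cosine similarity above the threshold) add little information and are dropped.
import numpy as np

def diversity_filter(candidate_vecs: np.ndarray, max_sources: int = 5, threshold: float = 0.9):
    selected: list[int] = []
    for i in range(len(candidate_vecs)):          # candidates assumed pre-sorted by relevance
        vec = candidate_vecs[i]                   # and embeddings assumed normalised
        if all(float(vec @ candidate_vecs[j]) < threshold for j in selected):
            selected.append(i)                    # keep only sources that add new information
        if len(selected) == max_sources:
            break
    return selected
```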
Structured data and entity recognition
In traditional search, structured data has a well-defined role: enabling rich results, knowledge panels, and enhanced SERP features. Its role in generative AI systems is less clear.
What we know:
- RAG systems parse HTML to extract text for embedding. Cleaner, well-structured pages are easier to parse accurately.
- Entity recognition matters—systems use knowledge graphs to verify claims and identify authoritative sources. Content associated with recognised entities (brands, products, people) may receive trust signals.
What's unproven:
- Whether schema markup (FAQPage, HowTo, Article) directly influences RAG retrieval or citation selection. Testing by technical SEOs has produced inconclusive results.
- Whether structured data provides meaningful advantages beyond what clean HTML and clear content structure already offer.
The conservative position: implement structured data for its proven benefits in traditional search, but don't expect it to be a lever for AI visibility in the way it functions for rich results. Entity clarity and content quality remain the more reliable signals. See entity clarity optimisation for practical guidance.
Some RAG architectures incorporate reliability estimation, aggregating information across sources and detecting conflicts. Sources providing outlier claims without corroboration may be down-weighted—but this filtering happens at the content level, not the markup level.
Impact on traffic and visibility
The deployment of these mechanisms has precipitated measurable shifts in user behaviour and traffic patterns.
The zero-click reality
Pew Research data indicates that when an AI Overview is present, users click on citations within the summary only 1% of the time. Users end their search session after reading an AI summary 26% of the time, compared with 16% for standard result pages.
Gartner predicts a 25% drop in total search engine volume by 2026 due to migration toward chatbots and virtual agents—representing a structural shift in discovery behaviour rather than a temporary fluctuation.
Divergence from organic rankings
RAG systems use different relevance criteria than traditional organic algorithms. Where organic search evaluates full pages using link-based authority signals, generative systems evaluate semantic chunks using embedding similarity and information gain. A page ranking #1 organically may fail to appear in AI Overviews if its content isn't structured for chunk-level extraction or lacks differentiated information.
This divergence means that traditional rank tracking provides incomplete visibility data. Content can be highly visible in generative responses while ranking poorly in organic results, or vice versa.
As AI-mediated discovery grows, visibility metrics shift accordingly. Selection rate—the frequency with which models cite your content from the pool of retrieved candidates—emerges as a more relevant indicator than CTR for AI visibility. Unlike CTR, which tracks user behaviour, selection rate reflects algorithmic citation decisions. However, selection rate remains difficult to measure reliably given current tooling constraints. This represents a fundamental shift from measuring human engagement to measuring machine preference—with the caveat that the new metric is harder to observe.
| Metric | Traditional Search | Generative Search |
|---|---|---|
| Ranking unit | Full page (URL) | Semantic chunk / passage |
| Primary signal | Backlinks, keywords | Embeddings, information gain, entities |
| Selection logic | PageRank, authority metrics | Attention weights, rationale alignment |
| User behaviour | Scan → Click | Read summary → End session |
| Traffic outcome | High CTR (top positions) | Ultra-low CTR (<1%), brand impressions |
| Key metric | Click-through rate (CTR) | Selection rate (citation frequency) |
FAQs
Does domain authority still matter for LLM visibility?
Indirectly. Most RAG pipelines don't have built-in PageRank equivalents—a well-written post from a small site can be retrieved if it's topically precise. However, systems increasingly incorporate reliability estimation and source verification, which favour established entities. The practical effect: authority matters for citation selection, but it operates through trustworthiness weights rather than link-based signals.
How do I know if my content is being used by AI systems?
Tools and search engines are beginning to offer insights—Bing shows which sites were referenced in answers. Monitor brand mentions in AI Overviews, track referral traffic from AI-integrated surfaces, and compare your content's coverage against competitors who appear consistently in generative responses. However, significant measurement limitations exist—interpret tracking data cautiously.
Should I structure content differently for RAG systems?
Yes. RAG systems chunk documents into passages before embedding. Content that is well-sectioned with each section devoted to a single subtopic produces focused embeddings that are more likely to rank highly for specific queries. A coherent paragraph answering "What is X?" is more retrievable than a wall of text covering multiple topics.
Key takeaways
- Semantic relevance is the primary retrieval signal: Embeddings capture meaning, not keywords. Comprehensive topical coverage matters more than keyword density.
- Hybrid search combines semantic and lexical matching: Include important terminology naturally—exact keywords still help, particularly for proper nouns and technical identifiers.
- Chunk structure affects retrievability: Well-sectioned content with focused paragraphs produces more precise embeddings. Each section should answer a specific question or address a single concept.
- Information gain drives citation selection: Differentiated content that adds new data points or perspectives is prioritised over content that duplicates consensus.
- Visibility is decoupling from traffic: With <1% CTR on AI citations, the value of appearing in generative responses may shift toward brand impressions and authority signals rather than direct sessions.
Further reading
- Retrieval Augmented Generation (RAG) and Semantic Search for GPTs
  OpenAI's documentation on how GPTs use RAG to retrieve and ground responses in external knowledge
- Retrieval-Augmented Generation for Large Language Models: A Survey (arXiv)
  Comprehensive academic survey of RAG architectures and evaluation methods
- Better RAG 1: Advanced Basics (Hrishi Olickel)
  Practical engineering guide to RAG system design and retrieval optimisation
- Citation: A Key to Building Responsible LLMs (arXiv)
  Technical analysis of attribution mechanisms and citation accuracy metrics
- Google's Information Gain Patent (US20200349181A1)
  Primary source for understanding information gain as a ranking signal