Retrieve Once Is Not Enough
A basic RAG pipeline performs a single retrieval step. The user asks a question, the system embeds the query, pulls the top-k chunks from a vector store, stuffs them into context, and asks the model to answer. When the retrieved context contains the answer, the system works. When it does not, the model either hallucinates or confidently says nothing useful.
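The single-shot pipeline is short enough to sketch in a few lines. Here `embed`, `store.search`, and `llm` are stand-ins for your embedding model, vector store client, and LLM call; the names are illustrative, not a specific library's API.

```python
# Minimal single-shot RAG: embed once, retrieve once, answer once.
def single_shot_rag(question, embed, store, llm, k=5):
    query_vec = embed(question)                       # one embedding of the raw question
    chunks = store.search(query_vec, top_k=k)         # one retrieval, no second chance
    context = "\n\n".join(c["text"] for c in chunks)  # stuff everything into the prompt
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```

If the top-k chunks miss the answer, nothing downstream can recover it.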
The failure mode is not the model. It is the architecture. A single retrieval has exactly one shot to surface relevant content, and the relevance score was computed against a query the model could not refine. Real questions rarely decompose cleanly on the first pass. A question about why a customer churned requires data about that customer, data about their usage, and data about similar customers. No single embedding captures that.
Agentic RAG fixes this by treating retrieval as a step the model can invoke repeatedly, conditionally, with parameters it chooses. The model reads the first retrieved chunks, decides what it still needs, issues a follow-up query, reads those results, and keeps going until it has enough to answer. That shift — from retrieval-as-preprocessing to retrieval-as-tool — is the whole architecture.
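The loop described above can be sketched as follows. `llm_step` is a hypothetical model call that inspects what has been gathered so far and returns either a follow-up query or a final answer; the action format is an assumption for illustration.

```python
# Retrieval-as-tool: the model decides whether to retrieve again or answer.
def agentic_rag(question, retrieve, llm_step, max_retrievals=4):
    gathered = []
    for _ in range(max_retrievals):
        action = llm_step(question, gathered)            # model reads what it has so far
        if action["type"] == "answer":
            return action["text"]
        gathered.extend(retrieve(action["query"]))       # model-chosen follow-up query
    # retrieval budget exhausted: force an answer from whatever was gathered
    return llm_step(question, gathered, force_answer=True)["text"]
```

The control flow now belongs to the model, bounded by the loop cap.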
What Changes in the Pipeline
Traditional RAG is a pipeline. Agentic RAG is a loop. That changes every component downstream.
The retriever is no longer called once per user query; it may be called five or ten times per turn. Latency per retrieval now compounds. An embedding lookup that took 80ms in the old pipeline needs to return in 30ms in the new one, because the model is going to do it six times before producing an answer. The vector store that was sufficient before may not be sufficient now.
The context window budget is no longer a fixed slot. Each retrieval adds to the conversation, and the model needs room for its own reasoning. We typically reserve 40% of the context window for retrieved content, 30% for the evolving conversation, and 30% for generation. Budgets are enforced by summarizing old retrievals once they are no longer load-bearing, not by truncation.
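One way to sketch that budget enforcement, with the 40/30/30 split from above and a hypothetical `summarize` hook. The compression strategy is an assumption; the point is that older retrievals are summarized rather than truncated.

```python
# Split a context window (in tokens) per the 40/30/30 budget described above.
def budget(window_tokens):
    return {
        "retrieved": window_tokens * 40 // 100,
        "conversation": window_tokens * 30 // 100,
        "generation": window_tokens * 30 // 100,
    }

# Keep the newest retrievals verbatim; summarize older ones once the running
# total would exceed the retrieval budget, dropping only what cannot fit at all.
def fit_retrievals(retrievals, limit, count_tokens, summarize):
    kept, used = [], 0
    for r in reversed(retrievals):          # walk newest-first
        cost = count_tokens(r)
        if used + cost > limit:
            r = summarize(r)                # compress instead of truncating
            cost = count_tokens(r)
            if used + cost > limit:
                break                       # even the summary does not fit
        kept.append(r)
        used += cost
    return list(reversed(kept))             # restore oldest-first order
```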
The ranking signal shifts. In classic RAG, you rank by semantic similarity to the query. In agentic RAG, the model might be issuing a query like "earnings calls from companies in the same sector with similar debt profiles" that no embedding was designed to serve. Hybrid retrieval — dense embeddings combined with keyword search and structured filters — becomes table stakes, not an optimization.
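A minimal sketch of that hybrid ranking: structured filters prune the candidate set, then dense similarity is fused with a keyword-overlap score. The weights and the crude term-overlap proxy (standing in for BM25) are illustrative, not tuned values.

```python
# Hybrid ranking: filter first, then blend dense and keyword signals.
def hybrid_rank(query_terms, query_vec, docs, filters, w_dense=0.6, w_kw=0.4):
    def passes(doc):
        return all(doc["meta"].get(k) == v for k, v in filters.items())

    def dense(doc):                     # cosine similarity, stand-in for the vector store
        a, b = query_vec, doc["vec"]
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5)
        return dot / norm if norm else 0.0

    def keyword(doc):                   # crude term overlap, stand-in for BM25
        terms = set(doc["text"].lower().split())
        return len(terms & set(query_terms)) / max(len(query_terms), 1)

    candidates = [d for d in docs if passes(d)]
    return sorted(candidates,
                  key=lambda d: w_dense * dense(d) + w_kw * keyword(d),
                  reverse=True)
```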
The Three Retrieval Tools
We typically expose three retrieval tools to an agentic RAG system and let the model pick. A single generic tool gets over-used for everything; too many tools dilutes the signal.
Semantic search for concept-driven questions. Dense embeddings, top-k with a relevance floor. The model reaches for this when the query is descriptive or thematic.
Structured query for fact-driven questions. A typed interface to the underlying data store that lets the model filter by fields it knows exist. Dates, categories, identifiers. The model reaches for this when the query contains specifics the embedding would wash out.
Cross-reference for relationship-driven questions. Given a set of entities from a prior retrieval, find everything connected to them. Graph traversal, join lookup, or a second semantic pass seeded with extracted entities. This is the tool most teams forget, and it is the one that makes multi-hop questions possible.
The model picks based on what it has retrieved so far. The first retrieval is usually semantic. The second and third often flip to structured or cross-reference once entities are known.
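The three tools might be exposed to the model as a tool list like the following. The names, parameters, and defaults are assumptions, modeled on the generic JSON-schema style most tool-calling APIs use.

```python
# Illustrative definitions for the three retrieval tools described above.
RETRIEVAL_TOOLS = [
    {
        "name": "semantic_search",
        "description": "Find passages by meaning. Use for descriptive or thematic questions.",
        "parameters": {
            "query": {"type": "string"},
            "top_k": {"type": "integer", "default": 5},
            "min_score": {"type": "number", "default": 0.7},  # relevance floor
        },
    },
    {
        "name": "structured_query",
        "description": "Filter records by known fields: dates, categories, identifiers.",
        "parameters": {
            "filters": {"type": "object"},
            "limit": {"type": "integer", "default": 20},
        },
    },
    {
        "name": "cross_reference",
        "description": "Given entities from a prior retrieval, find everything connected to them.",
        "parameters": {
            "entities": {"type": "array", "items": {"type": "string"}},
            "max_hops": {"type": "integer", "default": 2},
        },
    },
]
```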
Stopping Is the Hardest Part
An agentic RAG loop that does not know when to stop is worse than a single-shot retrieval. It burns tokens, adds latency, and often confuses itself by adding noisy content to its own context.
We enforce stopping three ways. A hard cap on retrieval calls per turn — typically four — that the model cannot override. A sufficiency prompt that runs after each retrieval asking "can you answer the user's question from what you have, yes or no, and if no, what specifically is missing?" And a confidence check on the final answer that triggers one last retrieval if the model's own uncertainty crosses a threshold.
The sufficiency prompt is the most valuable. It forces the model to articulate the gap between what it has and what it needs, which usually produces a better next query than letting the model free-form another retrieval. It also surfaces the cases where no amount of retrieval will answer the question, so the system can say "I don't have that data" instead of looping forever.
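The three checks compose into a small controller. Here `sufficiency` and `answer` are hypothetical model calls (the latter returning a draft plus a self-reported confidence); the hard cap is the only check the model cannot talk its way past.

```python
# Stopping logic: hard cap, sufficiency prompt, and a final confidence check.
def run_with_stopping(question, retrieve, sufficiency, answer,
                      hard_cap=4, confidence_floor=0.6):
    gathered = []
    for _ in range(hard_cap):                      # hard cap: non-overridable
        verdict = sufficiency(question, gathered)  # "can you answer? what's missing?"
        if verdict["sufficient"]:
            break
        if not verdict["missing"]:                 # no retrieval will close the gap
            return "I don't have that data."
        gathered.extend(retrieve(verdict["missing"]))
    draft, confidence = answer(question, gathered)
    if confidence < confidence_floor:              # one last retrieval on low confidence
        gathered.extend(retrieve(question))
        draft, confidence = answer(question, gathered)
    return draft
```

Note that the sufficiency verdict supplies the next query: the articulated gap drives the follow-up retrieval, rather than a free-form reformulation.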
Evaluation Gets Harder
Measuring agentic RAG is a different problem than measuring classic RAG. You cannot score a single retrieval against ground truth because there is no single retrieval. You have to evaluate the trajectory.
The metrics that matter: end-to-end answer quality scored by a judge model against reference answers, number of retrievals per turn (fewer is usually better if quality holds), and query diversity (is the model actually reformulating, or issuing the same query with minor perturbations?). The last one catches a failure mode that does not exist in single-shot RAG: an agent stuck in a retrieval loop, asking variations of the same thing and not progressing.
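The query-diversity check can be as simple as comparing successive queries within a turn. Token-set Jaccard similarity is a crude but serviceable proxy; the 0.8 threshold is an assumption to tune against your own traces.

```python
# Flag turns where the agent perturbs the same query instead of reformulating.
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def stuck_in_loop(queries, threshold=0.8):
    # any consecutive pair of near-identical queries suggests a retrieval loop
    return any(jaccard(q1, q2) >= threshold for q1, q2 in zip(queries, queries[1:]))
```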
Latency distribution matters more than median. A single-shot RAG has narrow latency. An agentic one has a long tail driven by how many retrievals the model chose. Operators need to see P95 and P99, not just averages, to know whether the system is usable at scale.
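A minimal tail-latency report, using the nearest-rank percentile method. Production systems would use a proper metrics library; this just shows what to surface.

```python
# Nearest-rank percentiles over per-turn latencies (each turn sums however
# many retrievals the model chose, so the distribution has a long tail).
def percentile(samples, p):
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

def latency_report(turn_latencies_ms):
    return {
        "median": percentile(turn_latencies_ms, 50),
        "p95": percentile(turn_latencies_ms, 95),
        "p99": percentile(turn_latencies_ms, 99),
    }
```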
When to Reach for This
Agentic RAG is the right tool when your content is large, your queries are compositional, and your users will tolerate 3-10 seconds of response time instead of 1-2. It is the wrong tool for instant search, for tightly bounded FAQs, or for any case where a single well-chunked document reliably answers the question.
We have built production systems both ways. The agentic version costs roughly 4-6x more per query in token and retrieval compute. The payoff is answer quality on the questions that classic RAG quietly fails on — the ones with multiple entities, the ones requiring cross-reference, the ones where the right answer requires assembling information from three places. If your product depends on those questions, the cost is easy to justify. If it does not, stay simple.
If you are wrestling with which shape fits your retrieval system, we scope these in an audit and return a prioritized recommendation. Agentic RAG is a real step up in capability. It is also a real step up in complexity, and a pipeline that did not need to be a loop is worse as a loop than it was as a pipeline.