An agentic loop wraps retrieval: the model decomposes the query, picks a source (or several), runs retrieval, evaluates the result, and decides whether to try again. Where one shot of retrieval was always going to be wrong, the agent earns its compute.
Best for: Heterogeneous corpora across multiple sources where one shot of retrieval cannot cover all the relevant material. Legal research, financial analysis, multi-document Q&A, enterprise search across email + docs + tickets.
Agentic RAG replaces the single retrieval shot with a planning loop. A planner LLM decomposes the query and picks sources; per-source retrievers fetch in parallel; a validator LLM checks whether the retrieved evidence answers the question; if not, the loop iterates with a refined plan.
Planner LLM reads the query, names sub-questions, and assigns each to a source. Outputs a structured plan: which retriever, what filter, expected evidence shape.
Source-specific retrievers run in parallel. Each may itself be advanced RAG (hybrid + rerank). Different sources can use different embeddings, different chunking, different filters.
Validator LLM checks: did we get what we asked for? Common rejection reasons: empty result, off-topic, contradicts another source, low confidence on extraction.
On weak validation, planner refines and re-plans. On strong validation, synthesizer LLM writes the final answer with citations spanning the sources.
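The four steps above can be sketched as one loop. This is a minimal illustration, not a reference implementation: `Step`, `agentic_rag`, and the callables passed in are hypothetical names, the planner, validator, and synthesizer would be LLM calls in practice, and answering from the last evidence once the iteration cap is hit is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Step:
    sub_question: str
    source: str  # key of the retriever that should handle this sub-question


def agentic_rag(query, plan, retrievers, validate, synthesize, max_iters=3):
    """Plan, retrieve in parallel, validate, and re-plan on a weak verdict.

    plan(query, feedback) -> list[Step]
    retrievers[source](sub_question) -> list of passages
    validate(query, evidence) -> "good" or a short rejection reason
    synthesize(query, evidence) -> final answer
    """
    feedback = []  # rejection reasons handed back to the planner
    evidence = {}
    for _ in range(max_iters):
        steps = plan(query, feedback)
        # Per-source retrievers run in parallel.
        with ThreadPoolExecutor() as pool:
            futures = [
                (s.source, pool.submit(retrievers[s.source], s.sub_question))
                for s in steps
            ]
            evidence = {}
            for source, future in futures:
                evidence.setdefault(source, []).extend(future.result())
        verdict = validate(query, evidence)
        if verdict == "good":
            return synthesize(query, evidence)
        feedback.append(verdict)  # weak validation: refine and re-plan
    # Assumption: at the cap, answer from the last evidence instead of looping on.
    return synthesize(query, evidence)
```

Passing the rejection reason back to the planner is what lets the next plan differ from the last one; without it, the loop would simply retry the same retrieval.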
Stack is mostly direct LLM calls plus per-source retrievers. The agentic loop runs in your application layer, which means tracing and budget control matter more than they do in single-shot RAG.
Planner and validator run on a mid-tier reasoning model. Cheap enough for multiple calls per query, strong enough for query decomposition and validation. We default to Claude Sonnet for cost.
Final-answer model is the higher-tier choice. Synthesises across cited evidence, maintains citation discipline.
Each source has its own retriever, often shape-matched: pgvector + BM25 for text, structured query for tables, dedicated vector store for code or images.
Every plan + retrieve + validate cycle traced with token cost. Hard per-query budget enforced so a runaway plan cannot spend the daily rate limit on one user.
Validator LLM returns a structured verdict (good / weak-because-X / empty), not a free-text rationale. Lets us measure failure modes and improve them surgically.
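A structured verdict could look like the sketch below. The enum members mirror the rejection reasons named earlier (empty, off-topic, contradicts another source, low confidence); the class names, the JSON field names `verdict` and `reason`, and the `failure_counts` helper are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    GOOD = "good"
    WEAK = "weak"
    EMPTY = "empty"


class WeakReason(Enum):
    OFF_TOPIC = "off_topic"
    CONTRADICTS_SOURCE = "contradicts_source"
    LOW_CONFIDENCE = "low_confidence"


@dataclass(frozen=True)
class ValidationResult:
    verdict: Verdict
    reason: Optional[WeakReason] = None

    def __post_init__(self):
        # A reason is required for WEAK and forbidden otherwise.
        if (self.verdict is Verdict.WEAK) != (self.reason is not None):
            raise ValueError("reason must be set for WEAK and only for WEAK")


def parse_verdict(raw: dict) -> ValidationResult:
    """Parse the validator's JSON object; unknown labels raise ValueError."""
    reason = raw.get("reason")
    return ValidationResult(
        Verdict(raw["verdict"]),
        WeakReason(reason) if reason else None,
    )


def failure_counts(results):
    """Count non-GOOD verdicts by reason so failure modes can be tracked."""
    return Counter(
        r.reason.value if r.reason else r.verdict.value
        for r in results
        if r.verdict is not Verdict.GOOD
    )
```

Because every rejection lands in a named bucket, a rise in one bucket (say `off_topic` for a single source) points at a specific retriever or rubric to fix.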
Agentic RAG is longer to build than Advanced because each source needs its own retriever and the loop needs careful budget control. We size it as a 12-week first build with two phases: source-by-source onboarding, then loop tuning.
List every corpus the agent can reach: structured DBs, doc stores, ticketing, email, code. For each, document the access pattern, latency, and per-call cost.
Build Advanced RAG or appropriate structured query per source. Each retriever exposes a consistent interface: (query, filter, top-K) returns (passage, score, citation).
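One way to express that shared interface is a `typing.Protocol`. The names `Hit`, `Retriever`, and `KeywordRetriever` are hypothetical, and the toy keyword scorer only stands in for a real hybrid-plus-rerank retriever; the `prefix` filter key is likewise invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Hit:
    passage: str
    score: float
    citation: str


class Retriever(Protocol):
    """Interface every per-source retriever is assumed to satisfy."""

    def retrieve(self, query: str, filter: Optional[dict], top_k: int) -> list: ...


class KeywordRetriever:
    """Toy in-memory retriever: score = occurrences of query terms in the passage."""

    def __init__(self, docs: dict):
        self.docs = docs  # citation id -> passage text

    def retrieve(self, query, filter=None, top_k=5):
        terms = query.lower().split()
        hits = []
        for citation, text in self.docs.items():
            # Hypothetical filter: keep only citations under a given prefix.
            if filter and not citation.startswith(filter.get("prefix", "")):
                continue
            words = text.lower().split()
            score = float(sum(words.count(t) for t in terms))
            if score > 0:
                hits.append(Hit(text, score, citation))
        return sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]
```

With every source behind the same call shape, the loop can treat a vector store, a SQL query, and a ticket search identically and the planner only needs to know source names.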
Planner output is structured (sub-questions, source assignments). Schema-validated so the loop never executes a malformed plan.
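A schema check of this kind can be written with the standard library alone. The sketch assumes the planner emits JSON of the form `{"steps": [{"sub_question": ..., "source": ...}]}`; that shape, the source names in `KNOWN_SOURCES`, and the function name `parse_plan` are illustrative, not a fixed format.

```python
import json
from dataclasses import dataclass

KNOWN_SOURCES = {"docs", "tickets", "email"}  # hypothetical source names


@dataclass(frozen=True)
class PlanStep:
    sub_question: str
    source: str


def parse_plan(raw: str) -> list:
    """Parse planner output; raise ValueError on anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc
    steps = data.get("steps") if isinstance(data, dict) else None
    if not isinstance(steps, list) or not steps:
        raise ValueError("plan must contain a non-empty 'steps' list")
    plan = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            raise ValueError(f"step {i}: expected an object")
        question, source = step.get("sub_question"), step.get("source")
        if not isinstance(question, str) or not question.strip():
            raise ValueError(f"step {i}: missing sub_question")
        if not isinstance(source, str) or source not in KNOWN_SOURCES:
            raise ValueError(f"step {i}: unknown source {source!r}")
        plan.append(PlanStep(question, source))
    return plan
```

Rejecting an unknown source at parse time means a hallucinated tool name becomes a re-plan request rather than a runtime error inside the retrieval step.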
Validator returns a named verdict so we can measure failure modes. Rubric is corpus-specific, eval-gated.
Final answer required to cite per claim, with citations spanning sources. Structured-output validation on the citation map.
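The citation-map check might be sketched as follows. It assumes the synthesizer returns a list of claims, each with a `citations` list of passage ids, and that the loop kept a map from passage id to source; those shapes, the function name, and the `min_sources=2` default are assumptions for illustration.

```python
def check_citations(claims, passage_sources, min_sources=2):
    """Return a list of problems; an empty list means the answer passes.

    claims: list of {"text": str, "citations": [passage_id, ...]}
    passage_sources: passage_id -> source name, for passages actually retrieved
    """
    problems = []
    sources_used = set()
    for i, claim in enumerate(claims):
        citations = claim.get("citations", [])
        if not citations:
            problems.append(f"claim {i} has no citation")
        for cid in citations:
            if cid not in passage_sources:
                # Cites something that was never retrieved in this run.
                problems.append(f"claim {i} cites unknown passage {cid!r}")
            else:
                sources_used.add(passage_sources[cid])
    if len(sources_used) < min_sources:
        problems.append(
            f"citations span {len(sources_used)} source(s), need {min_sources}"
        )
    return problems
```

Checking cited ids against the set of retrieved passages catches invented citations mechanically, before any LLM-as-judge grading is involved.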
Per-query token + dollar budget. Trace every loop iteration. Alert on runaway plans (more than N iterations or M dollars).
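A per-query budget can be as small as a counter that raises once a cap is crossed. In this sketch `QueryBudget`, `BudgetExceeded`, and the single blended `usd_per_1k_tokens` price are hypothetical; real pricing usually differs for input and output tokens and per model.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a query crosses its iteration or dollar cap."""


@dataclass
class QueryBudget:
    max_iterations: int       # the "N iterations" cap
    max_usd: float            # the "M dollars" cap
    usd_per_1k_tokens: float  # assumed blended price, for illustration only
    iterations: int = 0
    tokens: int = 0

    @property
    def spent_usd(self) -> float:
        return self.tokens / 1000 * self.usd_per_1k_tokens

    def charge(self, tokens: int) -> None:
        """Record one loop iteration and its token use; raise past either cap."""
        self.iterations += 1
        self.tokens += tokens
        if self.iterations > self.max_iterations:
            raise BudgetExceeded(f"{self.iterations} iterations > {self.max_iterations}")
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(f"${self.spent_usd:.4f} > ${self.max_usd:.4f}")
```

Calling `charge` after every plan, retrieve, and validate cycle gives the trace its cost column and gives alerting a single exception type to watch.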
Golden set evals capture the agent traces, not just final answers. Lets us catch silent regressions in planning quality even when final answers look correct.
Start with a tight per-query cap. Loosen as eval pass rates stabilise. Circuit breakers cut over to a simpler retrieve-once fallback if the agent loop fails closed.
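The cut-over to a retrieve-once fallback can be sketched with a simple consecutive-failure counter. `CircuitBreaker`, `answer`, and the threshold of 3 are hypothetical; a production breaker would normally also add a cool-down before trying the agent loop again, which this sketch omits.

```python
class CircuitBreaker:
    """Opens after N consecutive agent-loop failures (no cool-down in this sketch)."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def open(self):
        return self.consecutive_failures >= self.failure_threshold


def answer(query, agent_loop, retrieve_once, breaker):
    """Try the agent loop unless the breaker is open; otherwise fall back."""
    if not breaker.open:
        try:
            result = agent_loop(query)
            breaker.consecutive_failures = 0  # a success resets the count
            return result
        except Exception:
            # Includes budget overruns raised inside the loop.
            breaker.consecutive_failures += 1
    return retrieve_once(query)
```

The user still gets an answer from the simpler path on every failure, and once the breaker is open the agent loop stops consuming budget altogether.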
Agentic RAG is hardest to benchmark cleanly because cost varies per query with the number of loop iterations, so accuracy has to be reported alongside spend. Published numbers consistently show an accuracy lift over single-shot retrieval on multi-source and multi-hop benchmarks.
We eval Agentic RAG on a multi-source golden set where the expected answer cites at least two sources. Recall@K is computed per source. Final-answer quality is graded LLM-as-judge with a faithfulness check against the union of cited sources. Plan quality is graded separately (did the planner pick the right sources?), so we can isolate regressions in the planner vs. the synthesizer.
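Per-source Recall@K, as described above, is the fraction of a source's gold passages that appear in that source's top K results. A minimal sketch, with hypothetical function and argument names:

```python
def recall_at_k_per_source(retrieved, relevant, k):
    """Compute Recall@K separately for each source.

    retrieved: source -> ranked list of passage ids returned by that retriever
    relevant:  source -> set of gold passage ids for the query
    Sources with no gold passages are skipped, since recall is undefined there.
    """
    scores = {}
    for source, gold in relevant.items():
        if not gold:
            continue
        top_k = set(retrieved.get(source, [])[:k])
        scores[source] = len(top_k & gold) / len(gold)
    return scores


# Usage: one golden-set query whose expected answer cites two sources.
retrieved = {"docs": ["d1", "d9", "d2"], "tickets": ["t5"]}
relevant = {"docs": {"d1", "d2"}, "tickets": {"t1"}}
scores = recall_at_k_per_source(retrieved, relevant, k=2)
```

Keeping the score per source rather than pooled makes it visible when one retriever is dragging down an otherwise healthy agent.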
Agentic RAG is where the field is most active. LangGraph, LlamaIndex Agents, AutoGen, CrewAI all ship opinionated patterns. Production teams report two consistent themes: it works, and it costs.
The 2026 talk track in the field has moved from 'should we use agents for RAG' to 'how do we control cost and trace failure modes'. The shape that ships reliably is a constrained planner (named tools, structured output, hard budget) plus a validator with a rubric. The shape that doesn't is a free-form ReAct loop with no budget. Most published failure stories come from the latter.
We reach for Agentic RAG when the corpus is genuinely heterogeneous or the query distribution forces multi-source synthesis. We never reach for it as a shortcut around weak per-source retrieval.
The biggest mistake teams make is adopting Agentic RAG before they have a strong Advanced RAG baseline. If your per-source retrievers are weak, an agentic loop will just route between weak retrievers. Build Advanced first, prove retrieval quality per source, then add the agentic layer for orchestration across sources.
Alternatives: use Advanced RAG with query rewriting for queries that look complex but are actually addressable from one source. Use Adaptive RAG when the answer is 'pick a route per query' rather than 'plan, retrieve, validate, iterate'.
Related pattern: Per-source retrievers inside an agent should be advanced, not naive.
Related pattern: Pick-a-route cousin when the answer is which retriever, not how many.
Related pattern: Often a sub-tool of an agent: 'when the question is multi-hop, use graph'.