★ Agentic RAG · Primary pattern · production default · Eval-gated

AGENTIC RAG · TOOL-USING RETRIEVAL

Agentic RAG. The LLM decides what, where, and whether to retrieve.

An agentic loop wraps retrieval: the model decomposes the query, picks a source (or several), runs retrieval, evaluates the result, and decides whether to try again. Where one shot of retrieval was always going to be wrong, the agent earns its compute.

LLM API · Claude · OpenAI · Eval pipelines · Traces

Best for: Heterogeneous corpora · multi-source

Stack: Direct LLM · per-source retrievers

Latency: Higher · multi-step

Cost: Multiple LLM calls per query

[AT A GLANCE]

Best for: Heterogeneous corpora across multiple sources where one shot of retrieval cannot cover all the relevant material. Legal research, financial analysis, multi-document Q&A, enterprise search across email + docs + tickets.

Origin: ReAct (Yao et al., 2022) lineage; popularised for RAG 2023-2024

Year: 2023-2026

Complexity: Complex

Production stage: Mature

[THE PIPELINE]

Plan, retrieve, evaluate, repeat.

Agentic RAG replaces the single retrieval shot with a planning loop. A planner LLM decomposes the query and picks sources; per-source retrievers fetch in parallel; a validator LLM checks whether the retrieved evidence answers the question; if not, the loop iterates with a refined plan.

Plan

Planner LLM reads the query, names sub-questions, and assigns each to a source. Outputs a structured plan: which retriever, what filter, expected evidence shape.

Per-source retrieve (parallel)

Source-specific retrievers run in parallel. Each may itself be advanced RAG (hybrid + rerank). Different sources can use different embeddings, different chunking, different filters.

Validate

Validator LLM checks: did we get what we asked for? Common rejection reasons: empty result, off-topic, contradicts another source, low confidence on extraction.

Iterate or synthesize

On weak validation, planner refines and re-plans. On strong validation, synthesizer LLM writes the final answer with citations spanning the sources.
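
The four stages above can be sketched as a small loop. This is a minimal illustration, not a production implementation: the `Step` type, the four callable signatures, and the choice to synthesize from the last evidence set when iterations run out are all assumptions made for the example. In a real build the planner, validator, and synthesizer would wrap LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    source: str        # name of the retriever to call (hypothetical field)
    sub_question: str  # what to ask that retriever


# Hypothetical signatures for the pluggable stages.
Planner = Callable[[str, list[str]], list[Step]]           # (query, feedback) -> plan
Retriever = Callable[[str], list[str]]                     # sub-question -> passages
Validator = Callable[[str, list[str]], tuple[bool, str]]   # -> (ok, rejection reason)
Synthesizer = Callable[[str, list[str]], str]              # -> final answer


def agentic_answer(
    query: str,
    plan: Planner,
    retrievers: dict[str, Retriever],
    validate: Validator,
    synthesize: Synthesizer,
    max_iterations: int = 3,
) -> str:
    feedback: list[str] = []   # rejection reasons fed back to the planner
    evidence: list[str] = []
    for _ in range(max_iterations):
        steps = plan(query, feedback)
        # Per-source retrieval runs in parallel; map() preserves plan order.
        with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
            batches = list(
                pool.map(lambda s: retrievers[s.source](s.sub_question), steps)
            )
        evidence = [passage for batch in batches for passage in batch]
        ok, reason = validate(query, evidence)
        if ok:
            break
        feedback.append(reason)  # weak validation: refine and re-plan
    # Assumption: on exhaustion we still synthesize from the last evidence set.
    return synthesize(query, evidence)
```

Keeping each stage behind a plain callable makes it straightforward to stub the LLM calls in tests and to trace each iteration.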

[TECHNICAL STACK]

What we'd actually deploy.

The stack is mostly direct LLM calls plus per-source retrievers. The agentic loop runs in your application layer, which means tracing and budget control matter more than they do in single-shot RAG.

PLANNER + VALIDATOR LLM

Claude Sonnet or GPT-5.5

Mid-tier reasoning model. Cheap enough for multiple calls per query, strong enough for query decomposition and validation. We default to Claude Sonnet for cost.

SYNTHESIZER LLM

Claude Opus or GPT-5.5 (high effort)

Final-answer model is the higher-tier choice. Synthesises across cited evidence, maintains citation discipline.

PER-SOURCE RETRIEVERS

Advanced RAG per source

Each source has its own retriever, often shape-matched: pgvector + BM25 for text, structured query for tables, dedicated vector store for code or images.
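
One way to keep shape-matched retrievers interchangeable is a shared interface of the form (query, filter, top-K) returns (passage, score, citation). The sketch below is illustrative only: the `Hit` and `Retriever` names are assumptions, and `KeywordRetriever` is a toy in-memory stand-in for a real hybrid or structured retriever (it ignores the filter argument).

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class Hit:
    passage: str
    score: float
    citation: str


class Retriever(Protocol):
    """Interface every per-source retriever is assumed to satisfy."""

    def retrieve(self, query: str, filter: dict[str, Any], top_k: int) -> list[Hit]:
        ...


class KeywordRetriever:
    """Toy retriever: scores documents by the fraction of query terms present."""

    def __init__(self, source: str, docs: dict[str, str]) -> None:
        self._source = source
        self._docs = docs

    def retrieve(self, query: str, filter: dict[str, Any], top_k: int) -> list[Hit]:
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in self._docs.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                # Citation carries the source name so answers can cite across sources.
                hits.append(Hit(text, overlap / len(terms), f"{self._source}:{doc_id}"))
        hits.sort(key=lambda h: h.score, reverse=True)
        return hits[:top_k]
```

Because `Retriever` is a structural protocol, a pgvector-backed class and a SQL-backed class can both satisfy it without sharing a base class.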

TRACE + COST OBSERVABILITY

OpenTelemetry + per-query budget

Every plan + retrieve + validate cycle traced with token cost. Hard per-query budget enforced so a runaway plan cannot spend the daily rate limit on one user.
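
A hard per-query budget can be as small as a counter that raises once a cap is crossed. The sketch below is an assumption-laden illustration: the `QueryBudget` and `BudgetExceeded` names are hypothetical, and per-1k-token prices are passed in by the caller rather than taken from any real price list.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a query goes over its dollar or iteration cap."""


@dataclass
class QueryBudget:
    max_usd: float
    max_iterations: int
    spent_usd: float = 0.0
    iterations: int = 0

    def start_iteration(self) -> None:
        # Refuse to begin another plan/retrieve/validate cycle past the cap.
        if self.iterations >= self.max_iterations:
            raise BudgetExceeded(f"iteration cap of {self.max_iterations} reached")
        self.iterations += 1

    def charge(
        self,
        input_tokens: int,
        output_tokens: int,
        usd_per_1k_input: float,
        usd_per_1k_output: float,
    ) -> float:
        # Record the spend of a completed LLM call, then enforce the cap.
        cost = (input_tokens / 1000) * usd_per_1k_input
        cost += (output_tokens / 1000) * usd_per_1k_output
        self.spent_usd += cost
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.max_usd:.4f}"
            )
        return cost
```

The loop would call `start_iteration()` at the top of each cycle and `charge()` after every model call, attaching the returned cost to the trace span for that step.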

VALIDATION RUBRIC

Structured output + named failure modes

Validator LLM returns a structured verdict (good / weak-because-X / empty), not a free-text rationale. Lets us measure failure modes and improve them surgically.
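
A structured verdict might be parsed and tallied roughly as follows. The JSON field names and enum values are assumptions for illustration; the reason categories mirror the rejection reasons named in the pipeline description (off-topic, contradicts another source, low confidence on extraction).

```python
import json
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    GOOD = "good"
    WEAK = "weak"
    EMPTY = "empty"


class WeakReason(Enum):
    OFF_TOPIC = "off_topic"
    CONTRADICTS_SOURCE = "contradicts_source"
    LOW_CONFIDENCE = "low_confidence"


@dataclass(frozen=True)
class ValidatorResult:
    verdict: Verdict
    reason: Optional[WeakReason] = None


def parse_verdict(raw: str) -> ValidatorResult:
    """Parse the validator's JSON output; raise ValueError on anything off-schema."""
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    verdict = Verdict(data.get("verdict"))  # unknown value -> ValueError
    reason = data.get("reason")
    if verdict is Verdict.WEAK:
        if reason is None:
            raise ValueError("a weak verdict must name its reason")
        return ValidatorResult(verdict, WeakReason(reason))
    if reason is not None:
        raise ValueError("only weak verdicts carry a reason")
    return ValidatorResult(verdict)


def failure_modes(results: list[ValidatorResult]) -> Counter:
    """Count named failure modes so each one can be tracked over time."""
    return Counter(
        r.reason.value if r.reason else r.verdict.value
        for r in results
        if r.verdict is not Verdict.GOOD
    )
```

Because every rejection lands in a named bucket, a spike in one bucket (say, `off_topic` on a single source) points at a specific retriever or prompt rather than at the loop as a whole.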

[HOW WE DEPLOY]

Day one to live traffic.

Agentic RAG takes longer to build than Advanced RAG because each source needs its own retriever and the loop needs careful budget control. We size it as a 12-week first build with two phases: source-by-source onboarding, then loop tuning.

01 · Source inventory
List every corpus the agent can reach: structured DBs, doc stores, ticketing, email, code. For each, document the access pattern, latency, and per-call cost.
02 · Per-source retriever
Build Advanced RAG or appropriate structured query per source. Each retriever exposes a consistent interface: (query, filter, top-K) returns (passage, score, citation).
03 · Planner prompt + schema
Planner output is structured (sub-questions, source assignments). Schema-validated so the loop never executes a malformed plan.
04 · Validator prompt + rubric
Validator returns a named verdict so we can measure failure modes. Rubric is corpus-specific, eval-gated.
05 · Synthesizer prompt + citation map
Final answer required to cite per claim, with citations spanning sources. Structured-output validation on the citation map.
06 · Budget + trace
Per-query token + dollar budget. Trace every loop iteration. Alert on runaway plans (more than N iterations or M dollars).
07 · Eval set with agent traces
Golden set evals capture the agent traces, not just final answers. Lets us catch silent regressions in planning quality even when final answers look correct.
08 · Production rollout with circuit breakers
Start with a tight per-query cap. Loosen as eval pass rates stabilise. Circuit breakers cut over to a simpler retrieve-once fallback if the agent loop fails closed.
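
The circuit breaker in the final step could look something like the sketch below. It assumes, purely for illustration, that the agent loop is exposed as a generator yielding `(cost_usd, answer_or_None)` once per iteration; `answer_with_breaker` and its parameters are hypothetical names, with `max_iterations` and `max_usd` standing in for N and M.

```python
from typing import Callable, Iterator, Optional

# Hypothetical shape: one (cost in USD, answer or None) pair per loop iteration.
AgentSteps = Callable[[str], Iterator[tuple[float, Optional[str]]]]


def answer_with_breaker(
    query: str,
    agent_steps: AgentSteps,
    fallback: Callable[[str], str],
    max_iterations: int,
    max_usd: float,
) -> tuple[str, str]:
    """Return (answer, route), where route is "agent" or "fallback"."""
    spent = 0.0
    for iteration, (cost, answer) in enumerate(agent_steps(query), start=1):
        spent += cost
        if answer is not None:
            # The loop produced a validated answer; the money is already spent.
            return answer, "agent"
        if iteration >= max_iterations or spent >= max_usd:
            # Trip the breaker. The generator is lazy, so no further
            # iterations (and no further spend) happen after this point.
            break
    # Retrieve-once fallback, e.g. a single Advanced RAG call.
    return fallback(query), "fallback"
```

Returning the route alongside the answer makes it easy to chart how often production traffic falls back, which is a useful signal when deciding whether the per-query cap can be loosened.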

[ACCURACY + BENCHMARKS]

What the numbers say.

Agentic RAG is the hardest pattern to benchmark cleanly because an accuracy number only means something alongside the cost spent to reach it. Published numbers consistently show an accuracy lift over single-shot retrieval on multi-source and multi-hop benchmarks.

Accuracy gain on multi-source benchmarks vs Advanced RAG: +20-35% (2025-2026 reports)

Per-query token cost vs Advanced RAG: 2-5x (direct comparison)

Latency p95 (depends on iteration count): Variable (workload-dependent)

Faithfulness: Same eval gating as Advanced (Kensink default)

Our eval methodology

We evaluate Agentic RAG on a multi-source golden set where the expected answer cites at least two sources. Recall@K is computed per source. Final-answer quality is graded by an LLM-as-judge with a faithfulness check against the union of cited sources. Plan quality is graded separately (did the planner pick the right sources?), so we can isolate regressions in the planner vs. the synthesizer.
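
Per-source Recall@K can be computed along these lines. The example layout (a `relevant` map and a `retrieved` map keyed by source name) and the macro-averaging across examples are assumptions for the sketch, not a description of any particular eval harness.

```python
from collections import defaultdict


def recall_at_k_per_source(examples: list[dict], k: int) -> dict[str, float]:
    """Macro-averaged Recall@K for each source.

    Each example is assumed to look like:
      {"relevant":  {source: set of relevant doc ids},
       "retrieved": {source: ranked list of retrieved doc ids}}
    """
    per_source: dict[str, list[float]] = defaultdict(list)
    for example in examples:
        for source, relevant in example["relevant"].items():
            if not relevant:
                continue  # nothing to recall from this source for this query
            top_k = example["retrieved"].get(source, [])[:k]
            found = len(set(top_k) & set(relevant))
            per_source[source].append(found / len(relevant))
    # Average over the examples in which the source had relevant documents.
    return {source: sum(v) / len(v) for source, v in per_source.items()}
```

Scoring each source separately shows whether a drop in final-answer quality comes from one weak retriever or from the planner failing to route to a source at all (which shows up as zero recall on that source).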

[COMMUNITY FEEDBACK]

What practitioners report.

Agentic RAG is where the field is most active. LangGraph, LlamaIndex Agents, AutoGen, and CrewAI all ship opinionated patterns. Production teams report two consistent themes: it works, and it costs.

The 2026 talk track in the field has moved from 'should we use agents for RAG' to 'how do we control cost and trace failure modes'. The shape that ships reliably is a constrained planner (named tools, structured output, hard budget) plus a validator with a rubric. The shape that doesn't is a free-form ReAct loop with no budget. Most published failure stories come from the latter.

[COMMON PITFALLS]

No per-query budget. Cost can blow up by 10-100x on edge cases.
Free-form planner output instead of structured. Hard to validate, hard to debug, easy to derail.
Validator LLM grading itself. Use a different model from the planner if you can.
Treating the agent loop as an excuse not to do Advanced RAG well. The per-source retrievers still need to be good.
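
The free-form planner pitfall is usually addressed by validating the planner's output before the loop acts on it. A minimal sketch, assuming the planner emits JSON with a `steps` list whose items carry `sub_question`, `source`, and an optional `filter` (these field names are hypothetical):

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class PlanStep:
    sub_question: str
    source: str
    filter: dict[str, Any]


def validate_plan(raw: Any, known_sources: set[str]) -> list[PlanStep]:
    """Turn parsed planner JSON into typed steps, or raise ValueError."""
    if not isinstance(raw, dict) or not isinstance(raw.get("steps"), list):
        raise ValueError("plan must be an object with a 'steps' list")
    if not raw["steps"]:
        raise ValueError("plan has no steps")
    steps = []
    for i, item in enumerate(raw["steps"]):
        if not isinstance(item, dict):
            raise ValueError(f"step {i} is not an object")
        sub_question = item.get("sub_question")
        source = item.get("source")
        step_filter = item.get("filter", {})
        if not isinstance(sub_question, str) or not sub_question.strip():
            raise ValueError(f"step {i} has no sub_question")
        # Only named tools are executable; anything else is rejected outright.
        if not isinstance(source, str) or source not in known_sources:
            raise ValueError(f"step {i} names unknown source {source!r}")
        if not isinstance(step_filter, dict):
            raise ValueError(f"step {i} filter must be an object")
        steps.append(PlanStep(sub_question, source, step_filter))
    return steps
```

A rejected plan can be counted as a planner failure in the trace and retried or routed to the fallback, rather than being executed half-formed.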

[KENSINK LABS EVALUATION]

Our honest take.

We reach for Agentic RAG when the corpus is genuinely heterogeneous or the query distribution forces multi-source synthesis. We never reach for it as a shortcut around weak per-source retrieval.

The biggest mistake teams make is adopting Agentic RAG before they have a strong Advanced RAG baseline. If your per-source retrievers are weak, an agentic loop will just route between weak retrievers. Build Advanced first, prove retrieval quality per source, then add the agentic layer for orchestration across sources.

[WHEN WE REACH FOR IT]

Enterprise research across multiple corpora (docs + email + tickets + structured DB).
Legal and financial analysis where evidence must span multiple sources by design.
Complex Q&A where a single retrieval shot was demonstrably losing on the eval set.
Builds that already have strong Advanced RAG per source and now need to compose them.

What we'd substitute

Advanced RAG with query rewriting for queries that are 'complex' but actually addressable from one source. Adaptive RAG when the answer is 'pick a route per query' rather than 'plan, retrieve, validate, iterate'.

[RELATED PATTERNS]

Worth a look next.

Advanced RAG · Adaptive RAG · GraphRAG

[COMMON QUESTIONS]

What buyers ask before they sign.

When does Agentic RAG actually pay off?

When the eval set demonstrably shows single-shot retrieval losing because one source cannot answer the query, and the corpus is heterogeneous enough that the planner has meaningful choices to make. If both halves are not true, Advanced RAG with good query rewriting usually wins on cost.

How do we keep costs under control?

Hard per-query budget enforced in the agent loop. Structured planner output validated against a tool schema so the loop cannot execute malformed plans. Circuit breaker that cuts over to a simpler fallback on N iterations or M dollars. Plan quality eval-gated so regressions in planning are caught before they hit production traffic.

Do we need a framework like LangGraph?

No. Our default is direct LLM calls plus our own typed loop. Frameworks add abstraction we have to maintain through migrations. The agentic loop is small enough (a few hundred lines) to own outright.

What about latency?

Higher and more variable than Advanced RAG. We typically budget 2-5x the Advanced p95. For interactive use cases we cap iterations and degrade to a single retrieval shot if the budget exhausts. For asynchronous use cases (deep research, report generation) the higher latency is acceptable.

DIRECT RAG · APPLIED K

Bring the corpus. We'll bring the build.

Senior engineers, eval suite at handoff, full source ownership. We integrate against the model and the index the same way we integrate against Postgres. Sized to the work in front of you.
