Kensink Labs
Agentic RAG · Primary pattern · production default · Eval-gated
AGENTIC RAG · TOOL-USING RETRIEVAL

Agentic RAG. The LLM decides what, where, and whether to retrieve.

An agentic loop wraps retrieval: the model decomposes the query, picks one or more sources, runs retrieval, evaluates the result, and decides whether to try again. On queries where a single retrieval shot was always going to fall short, the agent earns its compute.

LLM API · Claude · OpenAI · Eval pipelines · Traces
Best for: Heterogeneous corpora · multi-source
Stack: Direct LLM · per-source retrievers
Latency: Higher · multi-step
Cost: Multiple LLM calls per query
[AT A GLANCE]

Best for: Heterogeneous corpora across multiple sources where one shot of retrieval cannot cover all the relevant material. Legal research, financial analysis, multi-document Q&A, enterprise search across email + docs + tickets.

Origin: ReAct (Yao et al., 2022) lineage; popularised for RAG 2023-2024
Year: 2023-2026
Complexity: Complex
Production stage: Mature
[THE PIPELINE]

Plan, retrieve, evaluate, repeat.

Agentic RAG replaces the single retrieval shot with a planning loop. A planner LLM decomposes the query and picks sources; per-source retrievers fetch in parallel; a validator LLM checks whether the retrieved evidence answers the question; if not, the loop iterates with a refined plan.

User query → Planner LLM → Source A / Source B / Source C retrievers (parallel) → Validator LLM (retry on weak) → Synthesizer LLM → Cited answer
01 · Plan

Planner LLM reads the query, names sub-questions, and assigns each to a source. Outputs a structured plan: which retriever, what filter, expected evidence shape.

02 · Per-source retrieve (parallel)

Source-specific retrievers run in parallel. Each may itself be advanced RAG (hybrid + rerank). Different sources can use different embeddings, different chunking, different filters.

03 · Validate

Validator LLM checks: did we get what we asked for? Common rejection reasons: empty result, off-topic, contradicts another source, low confidence on extraction.

04 · Iterate or synthesize

On weak validation, planner refines and re-plans. On strong validation, synthesizer LLM writes the final answer with citations spanning the sources.
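The four steps above can be sketched as a small loop. This is a minimal illustration, not the production implementation: the `agentic_rag` function, the `Passage` type, and all `toy_*` stand-ins are hypothetical names, and the planner, validator, and synthesizer are plain functions where a real build would make LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Passage:
    source: str
    text: str


def agentic_rag(query, planner, retrievers, validator, synthesizer, max_iterations=3):
    """Plan -> parallel retrieve -> validate -> iterate or synthesize."""
    feedback = []  # rejection reasons handed back to the planner
    evidence = []
    for _ in range(max_iterations):
        # Step 01: the planner maps each chosen source to a sub-question.
        plan = planner(query, feedback)
        # Step 02: one retrieval task per source, run in parallel.
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(retrievers[src], sub_q) for src, sub_q in plan.items()]
            evidence = [p for f in futures for p in f.result()]
        # Step 03: the validator returns "good" or a rejection reason.
        verdict = validator(query, evidence)
        if verdict == "good":
            break
        feedback.append(verdict)
    # Step 04: synthesize. This is also reached when the iteration cap is hit;
    # a real system would flag that case instead of answering silently.
    return synthesizer(query, evidence)


# Toy stand-ins (assumptions for illustration only).
def toy_planner(query, feedback):
    plan = {"docs": query}
    if feedback:  # re-plan: widen to a second source after a weak verdict
        plan["tickets"] = query
    return plan


toy_retrievers = {
    "docs": lambda q: [Passage("docs", "Refund policy: 30 days.")],
    "tickets": lambda q: [Passage("tickets", "Refund issued on day 12.")],
}


def toy_validator(query, evidence):
    sources = {p.source for p in evidence}
    return "good" if {"docs", "tickets"} <= sources else "weak: missing tickets"


def toy_synthesizer(query, evidence):
    return " ".join(f"{p.text} [{p.source}]" for p in evidence)


answer = agentic_rag("How are refunds handled?", toy_planner, toy_retrievers,
                     toy_validator, toy_synthesizer)
print(answer)
```

In this toy run the first plan only queries `docs`, the validator rejects it, and the second plan adds `tickets`, so the final answer cites both sources.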

[TECHNICAL STACK]

What we'd actually deploy.

The stack is mostly direct LLM calls plus per-source retrievers. The agentic loop runs in your application layer, so tracing and budget control matter more than they do in single-shot RAG.

PLANNER + VALIDATOR LLM

Claude Sonnet or GPT-5.5

Mid-tier reasoning model. Cheap enough for multiple calls per query, strong enough for query decomposition and validation. We default to Claude Sonnet for cost.

SYNTHESIZER LLM

Claude Opus or GPT-5.5 (high effort)

Final-answer model is the higher-tier choice. Synthesises across cited evidence, maintains citation discipline.

PER-SOURCE RETRIEVERS

Advanced RAG per source

Each source has its own retriever, often shape-matched: pgvector + BM25 for text, structured query for tables, dedicated vector store for code or images.
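One way to keep shape-matched retrievers interchangeable is a shared interface, following the (query, filter, top-K) returns (passage, score, citation) shape this page uses for per-source retrievers. The sketch below is an assumption-laden illustration: `Hit`, `Retriever`, and `KeywordRetriever` are hypothetical names, and the toy term-overlap scorer stands in for a real backend such as pgvector + BM25.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Hit:
    passage: str
    score: float
    citation: str


class Retriever(Protocol):
    """Interface every per-source retriever is assumed to satisfy."""

    def retrieve(self, query: str, filter: dict, top_k: int) -> list[Hit]: ...


class KeywordRetriever:
    """Toy in-memory retriever that scores by term overlap."""

    def __init__(self, source: str, docs: list[dict]):
        self.source = source
        self.docs = docs  # each doc has "id", "text", and optional metadata fields

    def retrieve(self, query: str, filter: dict, top_k: int) -> list[Hit]:
        terms = set(query.lower().split())
        hits = []
        for doc in self.docs:
            # Skip documents whose metadata does not match the filter.
            if any(doc.get(key) != value for key, value in filter.items()):
                continue
            overlap = len(terms & set(doc["text"].lower().split()))
            if overlap:
                citation = f"{self.source}:{doc['id']}"
                hits.append(Hit(doc["text"], overlap / len(terms), citation))
        return sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]


tickets = KeywordRetriever("tickets", [
    {"id": "t1", "text": "refund issued after outage", "team": "billing"},
    {"id": "t2", "text": "login outage resolved", "team": "infra"},
])
hits = tickets.retrieve("refund outage", {"team": "billing"}, top_k=5)
print(hits)
```

Because the planner only ever sees this interface, a source can swap its backend without touching the loop.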

TRACE + COST OBSERVABILITY

OpenTelemetry + per-query budget

Every plan + retrieve + validate cycle traced with token cost. Hard per-query budget enforced so a runaway plan cannot spend the daily rate limit on one user.
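A hard per-query budget can be a very small object that every cycle has to pass through. The sketch below is one possible shape, assuming a hypothetical `QueryBudget` class and illustrative caps; real token and dollar figures would come from the LLM API's usage data, and the `trace` list stands in for proper span export.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a query hits its token, dollar, or iteration cap."""


@dataclass
class QueryBudget:
    max_tokens: int
    max_usd: float
    max_iterations: int
    tokens: int = 0
    usd: float = 0.0
    iterations: int = 0

    def start_iteration(self) -> None:
        # Call once at the top of every plan + retrieve + validate cycle.
        if self.iterations >= self.max_iterations:
            raise BudgetExceeded(f"iteration cap {self.max_iterations} reached")
        self.iterations += 1

    def charge(self, tokens: int, usd: float) -> None:
        # Call after every LLM call with its measured cost.
        self.tokens += tokens
        self.usd += usd
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap {self.max_tokens} exceeded")
        if self.usd > self.max_usd:
            raise BudgetExceeded(f"dollar cap {self.max_usd} exceeded")


# Illustrative caps and costs only.
budget = QueryBudget(max_tokens=10_000, max_usd=0.25, max_iterations=3)
trace = []  # (iteration, cumulative tokens, cumulative dollars)
stop_reason = None
try:
    while True:
        budget.start_iteration()
        budget.charge(tokens=4_000, usd=0.10)  # pretend cost of one cycle
        trace.append((budget.iterations, budget.tokens, budget.usd))
except BudgetExceeded as exc:
    stop_reason = str(exc)

print(trace, stop_reason)
```

Raising an exception rather than returning a flag means a runaway plan cannot ignore the cap by accident; the caller decides whether to synthesize from partial evidence or fall back.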

VALIDATION RUBRIC

Structured output + named failure modes

Validator LLM returns a structured verdict (good / weak-because-X / empty), not a free-text rationale. Lets us measure failure modes and improve them surgically.
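A structured verdict is easiest to measure when it is parsed into named values and anything off-rubric is rejected. The following is a sketch under stated assumptions: the JSON field names (`verdict`, `reason`) and the enum names are hypothetical, and the weak reasons mirror the rejection reasons listed in the Validate step.

```python
import json
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    GOOD = "good"
    WEAK = "weak"
    EMPTY = "empty"


class WeakReason(Enum):
    OFF_TOPIC = "off_topic"
    CONTRADICTS_SOURCE = "contradicts_source"
    LOW_CONFIDENCE = "low_confidence"


@dataclass(frozen=True)
class ValidatorResult:
    verdict: Verdict
    reason: Optional[WeakReason] = None


def parse_verdict(raw: str) -> ValidatorResult:
    """Parse the validator's JSON output; raise ValueError on anything off-rubric."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("validator output must be a JSON object")
    verdict = Verdict(data.get("verdict"))  # unknown verdict raises ValueError
    raw_reason = data.get("reason")
    reason = WeakReason(raw_reason) if raw_reason is not None else None
    if (verdict is Verdict.WEAK) != (reason is not None):
        raise ValueError("'weak' needs a named reason; other verdicts must not carry one")
    return ValidatorResult(verdict, reason)


outputs = [
    '{"verdict": "good"}',
    '{"verdict": "weak", "reason": "off_topic"}',
    '{"verdict": "weak", "reason": "off_topic"}',
    '{"verdict": "empty"}',
]
results = [parse_verdict(o) for o in outputs]
# Tally named failure modes across queries.
failure_counts = Counter(r.reason.value for r in results if r.reason)
print(failure_counts)
```

Counting named reasons over an eval run is what makes it possible to see which failure mode dominates and target it.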

[HOW WE DEPLOY]

Day one to live traffic.

Agentic RAG takes longer to build than Advanced RAG because each source needs its own retriever and the loop needs careful budget control. We size it as a 12-week first build with two phases: source-by-source onboarding, then loop tuning.

  1. Source inventory

    List every corpus the agent can reach: structured DBs, doc stores, ticketing, email, code. For each, document the access pattern, latency, and per-call cost.

  2. Per-source retriever

    Build Advanced RAG or appropriate structured query per source. Each retriever exposes a consistent interface: (query, filter, top-K) returns (passage, score, citation).

  3. Planner prompt + schema

    Planner output is structured (sub-questions, source assignments). Schema-validated so the loop never executes a malformed plan.

  4. Validator prompt + rubric

    Validator returns a named verdict so we can measure failure modes. Rubric is corpus-specific, eval-gated.

  5. Synthesizer prompt + citation map

    Final answer required to cite per claim, with citations spanning sources. Structured-output validation on the citation map.

  6. Budget + trace

    Per-query token + dollar budget. Trace every loop iteration. Alert on runaway plans (more than N iterations or M dollars).

  7. Eval set with agent traces

    Golden set evals capture the agent traces, not just final answers. Lets us catch silent regressions in planning quality even when final answers look correct.

  8. Production rollout with circuit breakers

    Start with a tight per-query cap. Loosen as eval pass rates stabilise. Circuit breakers cut over to a simpler retrieve-once fallback if the agent loop fails closed.
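The schema validation in the planner step can be sketched as a strict parser that sits between the planner LLM and the loop. Everything named here is an assumption for illustration: the `steps`, `sub_question`, `source`, and `top_k` fields, the `ALLOWED_SOURCES` set, and the numeric limits are hypothetical, not a fixed schema.

```python
import json
from dataclasses import dataclass

ALLOWED_SOURCES = {"docs", "email", "tickets", "sql"}  # hypothetical source names


@dataclass(frozen=True)
class PlanStep:
    sub_question: str
    source: str
    top_k: int


def parse_plan(raw: str, max_steps: int = 6) -> list[PlanStep]:
    """Turn planner output into typed steps; raise ValueError on a malformed plan."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    steps = data.get("steps") if isinstance(data, dict) else None
    if not isinstance(steps, list) or not 1 <= len(steps) <= max_steps:
        raise ValueError(f"plan must contain 1 to {max_steps} steps")
    plan = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            raise ValueError(f"step {i} is not an object")
        sub_q = step.get("sub_question")
        source = step.get("source")
        top_k = step.get("top_k", 5)
        if not isinstance(sub_q, str) or not sub_q.strip():
            raise ValueError(f"step {i}: missing sub_question")
        if not isinstance(source, str) or source not in ALLOWED_SOURCES:
            raise ValueError(f"step {i}: unknown source {source!r}")
        if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 50:
            raise ValueError(f"step {i}: top_k out of range")
        plan.append(PlanStep(sub_q.strip(), source, top_k))
    return plan


raw_plan = json.dumps({"steps": [
    {"sub_question": "What does the contract say about renewal?", "source": "docs"},
    {"sub_question": "Did the customer raise renewal issues?", "source": "tickets", "top_k": 3},
]})
plan = parse_plan(raw_plan)
print(plan)
```

A rejected plan can be handed back to the planner as feedback or counted as a failed iteration; either way the retrievers never see it.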

[ACCURACY + BENCHMARKS]

What the numbers say.

Agentic RAG is the hardest pattern to benchmark cleanly because accuracy has to be weighed against a non-trivial cost dimension. Published numbers consistently show an accuracy lift over single-shot retrieval on multi-source and multi-hop benchmarks.

Accuracy gain on multi-source benchmarks vs Advanced RAG: +20-35% (2025-2026 reports)
Per-query token cost vs Advanced RAG: 2-5x (direct comparison)
Latency p95: Variable, depends on iteration count (workload-dependent)
Faithfulness: Same eval gating as Advanced (Kensink default)

Our eval methodology

We eval Agentic RAG on a multi-source golden set where the expected answer cites at least two sources. Recall@K is computed per source. Final-answer quality is graded by an LLM-as-judge with a faithfulness check against the union of cited sources. Plan quality is graded separately (did the planner pick the right sources?), so we can isolate regressions in the planner vs. the synthesizer.
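The per-source Recall@K and the separate plan-quality check can be computed with a few lines. This is a sketch with hypothetical function names and made-up document IDs; plan quality is approximated here as the fraction of expected sources the planner actually chose, which is one simple reading of "did the planner pick the right sources?".

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def per_source_recall(retrieved_by_source: dict, relevant_by_source: dict, k: int) -> dict:
    """Recall@K for each source in the golden example; unqueried sources score 0."""
    return {
        source: recall_at_k(retrieved_by_source.get(source, []), relevant, k)
        for source, relevant in relevant_by_source.items()
    }


def plan_source_recall(planned: set[str], expected: set[str]) -> float:
    """Share of expected sources that the planner actually selected."""
    if not expected:
        raise ValueError("expected sources must not be empty")
    return len(planned & expected) / len(expected)


# One golden example with illustrative IDs.
gold = {"docs": {"d1", "d2"}, "tickets": {"t7"}}
retrieved = {"docs": ["d1", "d9", "d2"], "tickets": ["t3", "t4"]}

recalls = per_source_recall(retrieved, gold, k=2)
plan_score = plan_source_recall(planned={"docs"}, expected=set(gold))
print(recalls, plan_score)
```

Keeping the two scores separate is what lets a low final-answer grade be traced to either a retriever miss or a planner that never queried the right source.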

[COMMUNITY FEEDBACK]

What practitioners report.

Agentic RAG is where the field is most active. LangGraph, LlamaIndex Agents, AutoGen, and CrewAI all ship opinionated patterns. Production teams report two consistent themes: it works, and it costs.

The 2026 talk track in the field has moved from 'should we use agents for RAG' to 'how do we control cost and trace failure modes'. The shape that ships reliably is a constrained planner (named tools, structured output, hard budget) plus a validator with a rubric. The shape that doesn't is a free-form ReAct loop with no budget. Most published failure stories come from the latter.

[COMMON PITFALLS]
  • No per-query budget. Cost can blow up by 10-100x on edge cases.
  • Free-form planner output instead of structured. Hard to validate, hard to debug, easy to derail.
  • Validator LLM grading itself. Use a different model from the planner if you can.
  • Treating the agent loop as an excuse not to do Advanced RAG well. The per-source retrievers still need to be good.
[KENSINK LABS EVALUATION]

Our honest take.

We reach for Agentic RAG when the corpus is genuinely heterogeneous or the query distribution forces multi-source synthesis. We never reach for it as a shortcut around weak per-source retrieval.

The biggest mistake teams make is adopting Agentic RAG before they have a strong Advanced RAG baseline. If your per-source retrievers are weak, an agentic loop will just route between weak retrievers. Build Advanced first, prove retrieval quality per source, then add the agentic layer for orchestration across sources.

[WHEN WE REACH FOR IT]
  • Enterprise research across multiple corpora (docs + email + tickets + structured DB).
  • Legal and financial analysis where evidence must span multiple sources by design.
  • Complex Q&A where a single retrieval shot was demonstrably losing on the eval set.
  • Builds that already have strong Advanced RAG per source and now need to compose them.
What we'd substitute

Advanced RAG with query rewriting for queries that are 'complex' but actually addressable from one source. Adaptive RAG when the answer is 'pick a route per query' rather than 'plan, retrieve, validate, iterate'.

[COMMON QUESTIONS]

What buyers ask before they sign.

When does Agentic RAG actually pay off?
When the eval set demonstrably shows single-shot retrieval losing because one source cannot answer the query, and the corpus is heterogeneous enough that the planner has meaningful choices to make. Unless both conditions hold, Advanced RAG with good query rewriting usually wins on cost.
How do we keep costs under control?
Hard per-query budget enforced in the agent loop. Structured planner output validated against a tool schema so the loop cannot execute malformed plans. Circuit breaker that cuts over to a simpler fallback on N iterations or M dollars. Plan quality eval-gated so regressions in planning are caught before they hit production traffic.
Do we need a framework like LangGraph?
No. Our default is direct LLM calls plus our own typed loop. Frameworks add abstraction we have to maintain through migrations. The agentic loop is small enough (a few hundred lines) to own outright.
What about latency?
Higher and more variable than Advanced RAG. We typically budget 2-5x the Advanced p95. For interactive use cases we cap iterations and degrade to a single retrieval shot if the budget exhausts. For asynchronous use cases (deep research, report generation) the higher latency is acceptable.
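Capping iterations and degrading to a single retrieval shot can be sketched as a thin wrapper around the loop. The names (`answer_with_fallback`, `agent_step`, `single_shot`) and the default caps are hypothetical; the `clock` parameter is there only so the deadline can be tested without waiting.

```python
import time


def answer_with_fallback(query, agent_step, single_shot, max_iterations=3,
                         deadline_s=8.0, clock=time.monotonic):
    """Run up to max_iterations agent steps, then degrade to one retrieval shot.

    agent_step(query, i) returns an answer, or None if it wants another iteration.
    Returns (answer, route), where route is "agentic" or "fallback".
    """
    start = clock()
    for i in range(max_iterations):
        if clock() - start > deadline_s:
            break  # latency budget exhausted
        try:
            answer = agent_step(query, i)
        except Exception:
            break  # loop failure: cut over instead of surfacing the error
        if answer is not None:
            return answer, "agentic"
    return single_shot(query), "fallback"


# Toy usage: an agent that never converges, so the iteration cap triggers the fallback.
def never_done(query, i):
    return None


def one_shot(query):
    return f"single-shot answer for: {query}"


result = answer_with_fallback("What changed in Q3?", never_done, one_shot)
print(result)
```

Returning the route alongside the answer makes the fallback rate a metric that can be traced and alerted on, rather than a silent quality drop.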
DIRECT RAG · APPLIED K

Bring the corpus. We'll bring the build.

Senior engineers, eval suite at handoff, full source ownership. We integrate against the model and the index the same way we integrate against Postgres. Sized to the work in front of you.