★ Corrective RAG (CRAG) · Primary pattern · production default · Eval-gated

CORRECTIVE RAG · POST-CHECK + WEB FALLBACK

Corrective RAG. When the retrieval is weak, fall back to web search.

Yan et al. (2024) introduced an explicit retrieval-evaluation step: after the initial retrieval, a lightweight evaluator rates the retrieved documents. Strong results go to generation; weak results trigger a fallback (web search or query rewrite). The corrective loop catches bad retrievals before they bleed through to bad answers.

Tavily · Brave Search · pgvector · Claude · Eval pipelines


Best for: Time-sensitive · long-tail queries
Stack: Base RAG + web fallback
Trigger: Retrieval evaluator confidence
Cost: Cheap on strong retrievals

[AT A GLANCE]

Best for: Knowledge bases with a long tail of queries the corpus does not cover well. Customer support over an evolving product, news, legal research, academic literature search. Anywhere a confident 'I do not know' is worse than reaching for web evidence.

Origin: Yan et al., Corrective RAG (2024)
Year: 2024-2026
Complexity: Medium
Production stage: Mature

[THE PIPELINE]

Retrieve, evaluate, fall back if weak.

CRAG runs base retrieval, then a lightweight evaluator scores the result. Strong results pass through to generation. Weak results trigger a corrective branch: query rewrite, web search, or escalation to a richer retrieval mode. The corrective branch's evidence is decomposed, refined, and merged with the original (if any).

Base retrieval

Run Advanced RAG over the primary corpus. Returns top-K passages plus their fusion + rerank scores.

Retrieval evaluator

A lightweight model rates the result on confidence (strong / mediocre / weak). The original paper fine-tuned T5-large for this; prompt-only mid-tier LLMs work in production.

Strong path: use

Pass through to generation with the original retrieved passages. No fallback work, no extra cost.

Weak path: web fallback

Trigger Tavily, Brave Search, or similar API. Retrieved web snippets go through decomposition (extract claims) and refinement (filter to query-relevant) before merging with any kept original passages.

Generate with the corrected context

Final answer prompt sees the union of strong original passages and refined web evidence. Citation discipline names each source.
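The five steps above can be sketched as a single control-flow function. This is a minimal sketch, not the reference implementation: `Passage`, `corrective_rag`, and the injected callables are hypothetical names, and the choice to keep original passages on a "mediocre" verdict but drop them on "weak" is an assumption about how "merged with the original (if any)" might be applied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    text: str
    source: str            # chunk id for internal evidence, URL for web evidence
    external: bool = False


def corrective_rag(
    query: str,
    retrieve: Callable[[str], List[Passage]],
    evaluate: Callable[[str, List[Passage]], str],  # "strong" | "mediocre" | "weak"
    web_search: Callable[[str], List[Passage]],
    refine: Callable[[str, List[Passage]], List[Passage]],
    generate: Callable[[str, List[Passage]], str],
) -> str:
    passages = retrieve(query)
    verdict = evaluate(query, passages)

    if verdict == "strong":
        # Strong path: no fallback work, no extra cost.
        context = passages
    else:
        # Corrective branch: fetch web evidence, then decompose and refine it.
        web_evidence = refine(query, web_search(query))
        # Assumption: originals are kept on "mediocre" and dropped on "weak".
        kept = passages if verdict == "mediocre" else []
        context = kept + web_evidence

    return generate(query, context)
```

Passing the stages in as callables keeps the branch logic testable without a live index, model, or search API.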

[TECHNICAL STACK]

What we'd actually deploy.

The CRAG stack is Advanced RAG plus a retrieval evaluator plus a web search API. The evaluator is the operational lever; the web search adds external evidence but also external cost and review complexity.

BASE RETRIEVAL

Advanced RAG (hybrid + rerank)

CRAG sits on top of normal retrieval. The corrective branch only fires when base retrieval is genuinely weak.

RETRIEVAL EVALUATOR

Claude Haiku or GPT-5.5 (low effort) for prompt-only

Cheap classifier-style call rates retrieval strength. Structured output: strong / mediocre / weak with rationale.
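One way the structured output might be requested and validated is sketched below. The prompt wording, the JSON shape, and the names `build_prompt`, `parse_verdict`, and `Verdict` are illustrative assumptions. The model call itself is left out, so nothing here depends on a particular vendor SDK.

```python
import json
from dataclasses import dataclass
from typing import List

LABELS = ("strong", "mediocre", "weak")


@dataclass
class Verdict:
    label: str
    rationale: str


def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble the evaluator prompt; the wording is a placeholder to calibrate."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        "Rate how well the passages answer the query.\n"
        'Reply with JSON only: {"label": "strong|mediocre|weak", "rationale": "..."}\n\n'
        f"Query: {query}\n\nPassages:\n{numbered}"
    )


def parse_verdict(raw: str, on_error: str = "weak") -> Verdict:
    """Validate the model's reply; malformed output maps to `on_error`."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Verdict(on_error, "unparseable evaluator output")
    if not isinstance(data, dict):
        return Verdict(on_error, "evaluator output was not a JSON object")
    label = str(data.get("label", "")).strip().lower()
    if label not in LABELS:
        return Verdict(on_error, f"unknown label: {label!r}")
    return Verdict(label, str(data.get("rationale", "")))
```

Defaulting malformed output to "weak" trades extra fallback cost for safety. A build worried about over-triggering could pass `on_error="mediocre"` instead.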

WEB SEARCH

Tavily, Brave Search, or Bing Search API

Managed web search APIs designed for LLM consumption. We pick per buyer constraint (data residency, content licensing, cost).
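Swapping providers per buyer constraint is easier behind one small interface. The sketch below shows one possible shape: `SearchProvider`, `WebResult`, `StaticProvider`, and `pick_provider` are hypothetical names, and no real vendor API is called. A Tavily, Brave, or Bing adapter would implement the same `search` method around its own client.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class WebResult:
    url: str
    title: str
    snippet: str


class SearchProvider(Protocol):
    def search(self, query: str, max_results: int = 5) -> List[WebResult]:
        ...


class StaticProvider:
    """Offline stand-in for tests; a real adapter would call the vendor API here."""

    def __init__(self, canned: Dict[str, List[WebResult]]):
        self._canned = canned

    def search(self, query: str, max_results: int = 5) -> List[WebResult]:
        return self._canned.get(query, [])[:max_results]


def pick_provider(name: str, registry: Dict[str, SearchProvider]) -> SearchProvider:
    """Resolve the provider configured for this buyer; fail loudly if unknown."""
    try:
        return registry[name]
    except KeyError:
        raise ValueError(f"no search provider registered as {name!r}") from None
```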

WEB DECOMPOSITION + REFINE

LLM-based

Web snippets get decomposed (extract claim-level statements) and refined (filter to query-relevant). Keeps the noise of raw web search from poisoning the generation prompt.
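The decompose-then-filter shape can be illustrated without a model call. In the sketch below, sentence splitting and keyword overlap stand in for the two LLM steps; this is a deliberate simplification, not the LLM-based method described above. The function names, the stop-word list, and the `min_overlap` threshold are all assumptions.

```python
import re
from typing import List, Set

_WORD = re.compile(r"[a-z0-9]+")
_STOP = {"the", "a", "an", "of", "to", "in", "is", "and", "for", "on",
         "how", "do", "i", "what"}


def _terms(text: str) -> Set[str]:
    return {w for w in _WORD.findall(text.lower()) if w not in _STOP}


def decompose(snippet: str) -> List[str]:
    """Split a snippet into sentence-level claims (stand-in for an LLM extractor)."""
    parts = re.split(r"(?<=[.!?])\s+", snippet.strip())
    return [p for p in parts if p]


def refine(query: str, snippets: List[str], min_overlap: int = 1) -> List[str]:
    """Keep only claims sharing at least `min_overlap` content terms with the query."""
    query_terms = _terms(query)
    kept = []
    for snippet in snippets:
        for claim in decompose(snippet):
            if len(query_terms & _terms(claim)) >= min_overlap:
                kept.append(claim)
    return kept
```

For example, `refine("reset API key", ["You can reset the API key in Settings. Our office is in Berlin."])` keeps the first sentence and drops the second.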

GENERATION

Claude or GPT via direct API

Citation-required prompt that handles both internal and external evidence. Internal evidence cited to chunk; external to URL.
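A sketch of how mixed evidence might be laid out for the generation prompt, so that internal passages carry a chunk reference and web evidence carries a URL. The `Evidence` type, the bracketed numbering, and the `internal` / `web` tags are illustrative assumptions, not a fixed format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    text: str
    ref: str        # chunk id for internal evidence, URL for external evidence
    external: bool


def build_context(evidence: List[Evidence]) -> str:
    """Number each item and tag its origin so the answer can cite it."""
    lines = []
    for i, ev in enumerate(evidence, 1):
        kind = "web" if ev.external else "internal"
        lines.append(f"[{i}] ({kind}: {ev.ref}) {ev.text}")
    return "\n".join(lines)
```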

CACHING

Web search result cache

Common weak queries get cached web results to amortise the cost of the corrective branch. Cache TTL aligned to corpus freshness needs.
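A minimal in-memory sketch of a TTL cache keyed on a normalized query. `WebResultCache` and `normalize` are hypothetical names. The choice of in-process storage is also an assumption; a shared store would need the same expiry logic.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a cache key."""
    return " ".join(query.lower().split())


class WebResultCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, query: str) -> Optional[List[str]]:
        key = normalize(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self._clock() - stored_at >= self._ttl:
            del self._store[key]  # expired: evict and report a miss
            return None
        return results

    def put(self, query: str, results: List[str]) -> None:
        self._store[normalize(query)] = (self._clock(), list(results))
```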

[HOW WE DEPLOY]

Day one to live traffic.

CRAG deploys in two phases: get the retrieval evaluator calibrated against the eval set, then plug in the corrective branch. The first phase is harder than it sounds; the second is mostly integration work.

01. Advanced RAG baseline: Ship the Advanced RAG pipeline first. The eval set tells you which queries actually need a corrective branch vs. which are just naturally weak.
02. Retrieval evaluator: Lightweight evaluator (LLM call or fine-tuned classifier) rates retrieval strength. Calibrate on the eval set: target precision so we do not trigger fallback on retrievals that were actually fine.
03. Web search integration: Tavily / Brave / Bing API. Rate limits, cost controls, content-licensing review handled at integration time.
04. Decomposition + refinement: LLM-based filter on web snippets. Extracts claim-level evidence, drops off-topic noise. Eval-gated against a held-out web-fallback set.
05. Caching + cost control: Common weak queries cached. Per-query cost budget. Circuit breaker on the corrective branch if the daily web search budget exhausts.
06. Eval set including web-fallback cases: Eval set explicitly includes queries that should trigger fallback. We grade both the evaluator's decision and the final answer quality with the corrective branch enabled.
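The circuit breaker in the caching and cost-control step could look something like the sketch below. `FallbackBudget` is a hypothetical name. Tracking spend in integer cents and resetting on calendar-day change are assumptions made to keep the example small.

```python
import datetime as dt
from typing import Optional


class FallbackBudget:
    """Refuses corrective-branch spend once the daily budget is used up."""

    def __init__(self, daily_limit_cents: int):
        self._limit = daily_limit_cents
        self._day: Optional[dt.date] = None
        self._spent = 0

    def try_spend(self, cost_cents: int, today: Optional[dt.date] = None) -> bool:
        today = today or dt.date.today()
        if today != self._day:
            # New day: reset the running total.
            self._day, self._spent = today, 0
        if self._spent + cost_cents > self._limit:
            return False  # breaker open: caller should skip the web fallback
        self._spent += cost_cents
        return True
```

When `try_spend` returns `False`, the caller can fall through to a cheaper corrective mode such as query rewrite, or answer from the original passages.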

[ACCURACY + BENCHMARKS]

What the numbers say.

CRAG is most useful where the eval set has a long tail of queries the corpus does not cover well. The headline metric is fallback precision (did the evaluator pick the right cases?) plus final answer quality on those cases.

+10-18%: Answer quality on long-tail queries (Original CRAG paper + 2025 reports)
~5%: Production queries that trigger fallback (Typical engagement)
Negligible: Added latency on the strong-retrieval path (By design)
Web search cost on triggered cases (Operational reality)

Our eval methodology

CRAG eval grades the evaluator's decision (precision and recall on 'should fall back'), the corrective branch quality (web evidence refinement), and the final answer (faithfulness, completeness). We split metrics so a regression in any one component is locatable without rebuilding the whole pipeline.
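The evaluator-decision part of that split reduces to ordinary precision and recall over the "should fall back" labels. A minimal sketch, assuming the eval set provides one boolean gold label per query; the function name and the returned keys are illustrative.

```python
from typing import Dict, Sequence


def fallback_metrics(
    should_fall_back: Sequence[bool],
    did_fall_back: Sequence[bool],
) -> Dict[str, float]:
    """Precision and recall of the evaluator's fallback decision, plus trigger rate."""
    if len(should_fall_back) != len(did_fall_back):
        raise ValueError("gold and predicted labels must have the same length")
    pairs = list(zip(should_fall_back, did_fall_back))
    tp = sum(1 for gold, pred in pairs if gold and pred)
    fp = sum(1 for gold, pred in pairs if not gold and pred)   # over-trigger
    fn = sum(1 for gold, pred in pairs if gold and not pred)   # under-trigger
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "trigger_rate": (tp + fp) / len(pairs) if pairs else 0.0,
    }
```

Low precision here corresponds to over-triggering and low recall to under-triggering, so the two numbers map directly onto the two calibration failures.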

[COMMUNITY FEEDBACK]

What practitioners report.

CRAG is becoming a standard add-on to Advanced RAG in 2026 production. The shape is small enough to add to an existing build without restructuring; the win is concrete; the cost on the strong-retrieval path is near zero.

Practitioners report the same trade we have seen: most queries do not fire the corrective branch, so the average cost addition is small. The queries that do fire get a meaningful answer-quality lift. The hard work is calibrating the evaluator so we do not over-trigger (paying for web search on queries the corpus actually answered fine) or under-trigger (missing weak retrievals that should have fallen back).

[COMMON PITFALLS]

Evaluator that over-triggers. Pays for web search on queries the corpus already answered well.
Web evidence not refined. Raw web snippets poison the generation prompt.
Content licensing not reviewed. Some web search APIs have constraints we cannot silently inherit.
No caching. Pays repeatedly for the same long-tail query.

[KENSINK LABS EVALUATION]

Our honest take.

We reach for CRAG when the corpus has a known long tail (time-sensitive content, evolving product docs, regulatory updates) and the buyer accepts the web-search dependency. The pattern is cheap on most queries and lifts answer quality on the long tail.

We have shipped CRAG on customer-support builds where the product docs lag the product. The corrective branch catches the newly-shipped feature questions before the docs are updated, with web evidence (often the company's own blog or release notes) cited in the answer. That bridges the docs-lag gap operationally without forcing the docs team to publish faster.

[WHEN WE REACH FOR IT]

Customer support over a fast-moving product where docs lag.
Time-sensitive workloads (news, market data, regulatory updates).
Long-tail knowledge bases with known coverage gaps.
Builds where 'I will check the web' is an acceptable fallback for the buyer.

What we'd substitute

Self-RAG when the failure mode is bad generation more than bad retrieval. Plain Advanced RAG when the long tail is not material to the workload.

[RELATED PATTERNS]

Worth a look next.

Related patterns: Self-RAG, Advanced RAG, Adaptive RAG.

[COMMON QUESTIONS]

What buyers ask before they sign.

Do we need a fine-tuned retrieval evaluator? No. The original CRAG paper fine-tuned T5-large; production deployments mostly use a cheap LLM call with a calibrated prompt. Both work; the cheap LLM call is simpler to maintain.
Which web search API? Tavily for LLM-friendly snippets; Brave for content-licensing transparency; Bing for raw breadth. We pick per buyer constraint. All three integrate cleanly behind a small abstraction.
What about content licensing on web fallbacks? Reviewed at integration time. Some buyers will not accept web search at all (data residency, compliance); for those we fall back to a different mode (query rewrite, escalation to human) instead. CRAG is the pattern, not specifically web search.
Self-RAG and CRAG together? Yes, sometimes. CRAG decides whether retrieval was worth keeping; Self-RAG decides whether the answer is worth shipping. The combination compounds latency and cost, and compounds quality on high-stakes workloads.

DIRECT RAG · APPLIED K

Bring the corpus. We'll bring the build.

Senior engineers, eval suite at handoff, full source ownership. We integrate against the model and the index the same way we integrate against Postgres. Sized to the work in front of you.

