Kensink Labs
Self-RAG · Primary pattern · production default · Eval-gated
SELF-RAG · SELF-EVALUATING RETRIEVAL

Self-RAG. The model decides if its own answer holds up.

Asai et al. (2023) trained models to emit reflection tokens that decide whether retrieval is needed, evaluate the retrieved passages, and self-critique the generated answer. In production we apply the same pattern with prompt-only Self-RAG on frontier models: the loop catches bad retrievals before the user sees a confident wrong answer.

Claude · OpenAI · Eval pipelines · Structured output
Best for: Vague queries · high-stakes accuracy
Stack: Direct LLM · structured critique
Latency: ~2x Advanced RAG p95
Quality lever: Catches confident wrong answers
[AT A GLANCE]

Best for: High-stakes accuracy contexts where confident wrong answers are worse than 'I do not know'. Clinical, financial, legal, regulatory. Also strong on vague or incomplete queries where retrieval quality is uncertain.

Origin: Asai et al., Self-RAG (2023)
Year: 2023-2026
Complexity: Medium
Production stage: Mature
[THE PIPELINE]

Retrieve, critique, decide, repeat.

Self-RAG runs the retrieval and generation steps with explicit self-critique nodes between them. After retrieval, the model rates the retrieved passages; if weak, it triggers a query rewrite or different retrieval. After generation, the model rates its own answer against the citations; if weak, it regenerates or declines to answer.

Query -> Retrieve -> Critique retrieval (LLM) -> (retry if weak) -> Generate -> Critique answer (LLM)
Strong: ship
Weak: regenerate or refuse
01

Retrieve

Run Advanced RAG to fetch top-K passages. Self-RAG sits on top of a normal retrieval pipeline; it does not replace hybrid + rerank.

02

Critique retrieval

A critique LLM grades each retrieved passage on relevance and supportiveness against the query. Structured output: per-passage verdict with rationale.
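One way to keep the per-passage verdict structured is a small schema with a strict parser. This is a minimal sketch, assuming the critique model is prompted to return a JSON array; the field names (`passage_id`, `relevance`, `supportiveness`, `rationale`) and the 0-to-1 score scale are illustrative assumptions, not a fixed format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PassageVerdict:
    passage_id: str
    relevance: float       # assumed scale: 0.0 (off-topic) to 1.0 (on-topic)
    supportiveness: float  # assumed scale: 0.0 (no support) to 1.0 (direct support)
    rationale: str


def parse_verdicts(raw: str) -> list[PassageVerdict]:
    """Parse the critique model's JSON output and reject malformed verdicts."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of verdicts")
    verdicts = []
    for item in items:
        verdict = PassageVerdict(
            passage_id=str(item["passage_id"]),
            relevance=float(item["relevance"]),
            supportiveness=float(item["supportiveness"]),
            rationale=str(item["rationale"]),
        )
        for score in (verdict.relevance, verdict.supportiveness):
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"score out of range for {verdict.passage_id}")
        # A verdict without a rationale cannot be audited, so treat it as invalid.
        if not verdict.rationale.strip():
            raise ValueError(f"missing rationale for {verdict.passage_id}")
        verdicts.append(verdict)
    return verdicts
```

Failing loudly on a malformed verdict keeps a silent parse error from being mistaken for a weak retrieval.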

03

Decide: generate, rewrite, or refuse

If the retrieval scores well, generate. If weak, trigger query rewrite and re-retrieve once. If still weak, refuse to answer with a calibrated 'I do not have enough evidence'.
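The decision step can be written as a small pure function, which makes it easy to unit-test separately from any model call. This is a sketch under assumptions: the rule "at least `min_strong` passages scoring `min_score` or higher" and the default values are hypothetical placeholders that would be tuned on the eval set.

```python
from enum import Enum


class Decision(Enum):
    GENERATE = "generate"
    REWRITE = "rewrite"
    REFUSE = "refuse"


def decide(
    passage_scores: list[float],
    retries_used: int,
    min_score: float = 0.6,   # hypothetical per-passage bar
    min_strong: int = 2,      # hypothetical count of strong passages required
    max_retries: int = 1,     # the text above allows one rewrite and re-retrieve
) -> Decision:
    """Map retrieval critique scores to generate, rewrite, or refuse."""
    strong = sum(1 for score in passage_scores if score >= min_score)
    if strong >= min_strong:
        return Decision.GENERATE
    if retries_used < max_retries:
        return Decision.REWRITE
    return Decision.REFUSE
```

For example, `decide([0.9, 0.1], retries_used=0)` asks for a rewrite, and the same scores with `retries_used=1` produce a refusal.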

04

Generate with citation-required prompt

Standard citation-required generation step. Self-RAG inherits whatever your generation discipline already is.

05

Critique answer

Second critique LLM grades the generated answer against the cited passages on faithfulness and completeness. Structured output: ship / regenerate / refuse with rationale.
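The five steps above can be sketched as one control loop. This is a minimal sketch, not a production implementation: each stage is passed in as a plain callable so the flow can be tested with stubs, and every name here (`self_rag`, `retrieval_is_strong`, `REFUSAL`, and so on) is hypothetical.

```python
from typing import Callable

REFUSAL = "I do not have enough evidence to answer that."


def self_rag(
    query: str,
    retrieve: Callable[[str], list[str]],
    retrieval_is_strong: Callable[[str, list[str]], bool],
    rewrite: Callable[[str], str],
    generate: Callable[[str, list[str]], str],
    critique_answer: Callable[[str, list[str]], str],  # "ship" | "regenerate" | "refuse"
    max_regenerations: int = 1,
) -> str:
    # Steps 1-2: retrieve, then critique the retrieved passages.
    passages = retrieve(query)
    if not retrieval_is_strong(query, passages):
        # Step 3: one rewrite and re-retrieve, then refuse if still weak.
        passages = retrieve(rewrite(query))
        if not retrieval_is_strong(query, passages):
            return REFUSAL
    # Steps 4-5: generate, critique the answer, regenerate a bounded number of times.
    for _ in range(max_regenerations + 1):
        answer = generate(query, passages)
        verdict = critique_answer(answer, passages)
        if verdict == "ship":
            return answer
        if verdict == "refuse":
            return REFUSAL
    # Regeneration budget exhausted without a "ship" verdict.
    return REFUSAL
```

Both loops are bounded, so the worst-case cost per query is known up front: two retrievals, two retrieval critiques, and `max_regenerations + 1` generate-and-critique rounds.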

[TECHNICAL STACK]

What we'd actually deploy.

Stack is Advanced RAG plus two critique LLM calls per query. The discipline is in keeping the critique outputs structured so we can measure and improve critique quality over time.

BASE RETRIEVAL

Advanced RAG (hybrid + rerank)

Self-RAG is a quality-check layer; the base retrieval still needs to be strong. Naive RAG plus self-critique is not a great combination.

RETRIEVAL CRITIQUE LLM

Claude Sonnet or GPT-5.5

Mid-tier model. Structured output (per-passage verdict + rationale) so we can measure critique quality. Cheaper than running Opus twice per query.

GENERATION LLM

Claude Opus or GPT-5.5 (high effort)

Final answer model. Citation-required prompt template.

ANSWER CRITIQUE LLM

Claude Sonnet or GPT-5.5

Same mid-tier model as the retrieval critique. We deliberately use a different model tier from the generator so that no model ends up grading its own output sympathetically.

REFUSAL CALIBRATION

Eval-gated thresholds

When to refuse vs regenerate vs ship is a calibration choice. We pick the threshold on a held-out eval set and gate it in CI; conservative tuning by default.
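Threshold selection on a held-out set can be sketched as follows. The assumptions here are ours for illustration: each eval record is a `(critique_score, is_faithful)` pair, an answer ships when its score meets the threshold, and the target `max_wrong_rate` is a hypothetical budget for shipped-but-unfaithful answers. The search walks down from the strictest threshold and stops at the first violation, which is one conservative reading of "tune conservatively".

```python
from typing import Optional


def wrong_rate(records: list[tuple[float, bool]], threshold: float) -> float:
    """Share of shipped answers (score >= threshold) that are unfaithful."""
    shipped = [faithful for score, faithful in records if score >= threshold]
    if not shipped:
        return 0.0
    return sum(1 for faithful in shipped if not faithful) / len(shipped)


def pick_threshold(
    records: list[tuple[float, bool]], max_wrong_rate: float
) -> Optional[float]:
    """Lowest threshold reachable from the top without exceeding the budget."""
    chosen = None
    for candidate in sorted({score for score, _ in records}, reverse=True):
        if wrong_rate(records, candidate) > max_wrong_rate:
            break
        chosen = candidate
    return chosen  # None means no threshold meets the budget: refuse everything


def ci_gate(
    records: list[tuple[float, bool]], threshold: float, max_wrong_rate: float
) -> bool:
    """True when the deployed threshold still meets the budget on the eval set."""
    return wrong_rate(records, threshold) <= max_wrong_rate
```

A CI job could call `ci_gate` with the checked-in threshold on every change to the critique prompt and fail the build when it returns `False`.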

[HOW WE DEPLOY]

Day one to live traffic.

Self-RAG adds ~50% to the development time of an Advanced RAG build because the critique prompts and refusal thresholds need separate calibration. It is worth it when the eval set shows that confident wrong answers are the dominant failure mode.

  1. Advanced RAG baseline

    Ship the Advanced RAG pipeline first. Measure on the eval set, with particular attention to confident wrong answers (where faithfulness is low but answer-quality is rated high by users).

  2. Retrieval critique prompt

    Critique LLM rates each retrieved passage on relevance + supportiveness. Structured output. Iterate on the prompt against weak retrieval cases from the eval set.

  3. Retrieval rewrite + retry loop

    On weak critique, trigger one query rewrite and re-retrieve. One retry, not infinite; the eval set decides whether more retries help or just add cost.

  4. Answer critique prompt

    Second critique LLM rates the generated answer on faithfulness to the citations. Structured output; calibrated against ground-truth faithfulness from the eval set.

  5. Refusal calibration

    Decide thresholds for ship / regenerate / refuse on a held-out set. Tune conservatively; a refusal is better than a confident wrong answer in the contexts that pick Self-RAG.

  6. Trace + cost observability

    Every critique call traced with cost. Refusal rate watched as a production metric. Drift alerts on critique-quality shifts.
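Per-call tracing and the refusal-rate metric from the last step can be sketched with plain dataclasses. This is illustrative only: the stage names, the per-1k-token prices passed to `record`, and the `outcome` values are assumptions, and a real deployment would likely send these records to whatever tracing backend is already in place.

```python
from dataclasses import dataclass, field


@dataclass
class CallTrace:
    stage: str          # e.g. "retrieval_critique", "generation", "answer_critique"
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float


@dataclass
class QueryTrace:
    query_id: str
    calls: list[CallTrace] = field(default_factory=list)
    outcome: str = "pending"  # later set to "ship" or "refuse"

    def record(self, stage: str, input_tokens: int, output_tokens: int,
               latency_s: float, usd_per_1k_in: float, usd_per_1k_out: float) -> None:
        # Prices are passed in by the caller; no real price list is assumed here.
        cost = (input_tokens * usd_per_1k_in + output_tokens * usd_per_1k_out) / 1000
        self.calls.append(CallTrace(stage, input_tokens, output_tokens, latency_s, cost))

    def cost_by_stage(self) -> dict[str, float]:
        totals: dict[str, float] = {}
        for call in self.calls:
            totals[call.stage] = totals.get(call.stage, 0.0) + call.cost_usd
        return totals


def refusal_rate(traces: list[QueryTrace]) -> float:
    """Share of finished queries that ended in a refusal."""
    finished = [t for t in traces if t.outcome != "pending"]
    if not finished:
        return 0.0
    return sum(1 for t in finished if t.outcome == "refuse") / len(finished)
```

Breaking cost down by stage makes it visible when the critique calls, rather than generation, are driving the bill.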

[ACCURACY + BENCHMARKS]

What the numbers say.

The original Self-RAG paper showed accuracy and faithfulness lifts on benchmarks. In production, the bigger win is reducing confident wrong answers, which the eval set quantifies as a faithfulness floor, not just an average.
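To make the floor-versus-average distinction concrete, here is a small sketch that reports both over per-answer faithfulness scores. Defining the floor as a low quantile (the 5th percentile by default) is our assumption for illustration; the page does not specify how the floor is computed.

```python
import statistics


def faithfulness_summary(scores: list[float], floor_quantile: float = 0.05) -> dict[str, float]:
    """Report the mean alongside a low-quantile floor and the worst case."""
    ordered = sorted(scores)
    # Nearest-rank style index into the sorted scores (rounds down).
    index = int(floor_quantile * (len(ordered) - 1))
    return {
        "mean": statistics.fmean(ordered),
        "floor": ordered[index],
        "min": ordered[0],
    }
```

Nineteen answers scoring 0.9 and one scoring 0.1 still average about 0.86, while the floor sits at 0.1; the average hides exactly the answer that matters in a high-stakes setting.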

+8% to +12% · Faithfulness vs Advanced RAG baseline · Multiple 2024-2026 reports
-30% to -50% · Confident-wrong-answer rate · Production engagements
~2x · Per-query token cost vs Advanced · Direct comparison
Refusal · New metric we now grade explicitly · Eval discipline

Our eval methodology

Self-RAG eval requires us to grade not just answer quality but refusal quality. A refusal on a question where ground truth exists is a regression; a refusal on a question where retrieval was genuinely thin is the system working. Our eval set splits these explicitly so both directions are measurable.
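The split described above can be sketched as a four-way bucketing of eval results. The field names and bucket labels are hypothetical; `answerable` stands for "ground truth exists for this question", and an answer given to an unanswerable question is counted as confident-wrong unless it was graded correct.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    answerable: bool       # ground truth exists for this question
    refused: bool          # the system declined to answer
    correct: bool = False  # only meaningful when the system answered


def bucket(result: EvalResult) -> str:
    if result.refused:
        # Refusing an answerable question is a regression; refusing a thin one is intended.
        return "over_refusal" if result.answerable else "correct_refusal"
    return "answered_correct" if result.correct else "confident_wrong"


def refusal_report(results: list[EvalResult]) -> Counter:
    """Count eval results per bucket so both failure directions are visible."""
    return Counter(bucket(r) for r in results)
```

Tracking `over_refusal` and `confident_wrong` side by side shows whether a threshold change merely traded one failure for the other.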

[COMMUNITY FEEDBACK]

What practitioners report.

Self-RAG sits in the family of quality-check loops alongside CRAG, both heavily explored in the 2024-2026 literature. Production teams use the same shape with prompt-only Self-RAG on frontier models rather than the originally proposed fine-tuned model.

The practitioner consensus has shifted: the value is not in the specific Self-RAG fine-tuned model, it is in the discipline of structured critique with calibrated refusal. Most production deployments use Claude or GPT for both the critique and the generation, with the critique prompt and refusal thresholds doing the work the original paper attributed to fine-tuning.

[COMMON PITFALLS]
  • Same model for generation and critique. The model will grade its own work sympathetically.
  • Free-text critique output. Hard to measure, hard to improve, easy to skip.
  • No refusal calibration. Either too eager to ship weak answers or too eager to refuse on easy ones.
  • Treating Self-RAG as a replacement for Advanced RAG. The base retrieval still needs to be strong.
[KENSINK LABS EVALUATION]

Our honest take.

We reach for Self-RAG when the cost of a confident wrong answer is materially higher than the cost of a refusal. Most consumer chat does not meet that bar; most clinical, legal, financial, and regulatory work does.

We have shipped Self-RAG inside contexts where the audit trail mattered as much as the answer. The structured critique output becomes part of the audit trail: not just 'here is the answer' but 'here is why we trusted it'. That auditable confidence is what justifies the ~2x cost over Advanced RAG.

[WHEN WE REACH FOR IT]
  • Clinical decision support where wrong answers can harm patients.
  • Financial analysis with regulatory exposure.
  • Legal research where citations must support claims defensibly.
  • Any context where the buyer values a calibrated 'I do not know' over a confident wrong guess.
What we'd substitute

Corrective RAG (CRAG) when the failure mode is bad retrieval more than bad generation. Plain Advanced RAG when refusal-versus-confident-wrong is not a meaningful distinction for the workload.

[COMMON QUESTIONS]

What buyers ask before they sign.

Do we need the original Self-RAG fine-tuned model?
No. Prompt-only Self-RAG on Claude or GPT reproduces the same retrieve, critique, generate, critique loop. The fine-tuned model is interesting research; production deployments use frontier models with structured critique prompts.
How does refusal calibration work?
We pick a threshold on the critique LLM's confidence score below which the answer is refused. The threshold is tuned on a held-out eval set with explicit refusal-quality grading. It is conservative by default; we would rather refuse than ship a confident wrong answer.
What about latency?
Roughly 2x Advanced RAG p95 because of the two extra LLM calls. For interactive contexts we sometimes parallelize the answer critique with downstream UI updates so the user-perceived latency is closer to Advanced RAG.
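One possible reading of that parallelization, sketched with `asyncio`: start the answer critique as a background task, stream the answer to the UI in the meantime, and retract it if the verdict is not "ship". Everything here is an assumption for illustration, including the stub functions, the `<retracted>` marker, and the product decision that the UI may show a provisional answer before the critique finishes.

```python
import asyncio


async def critique_answer(answer: str) -> str:
    await asyncio.sleep(0.05)  # stands in for the critique LLM call
    # Toy rule for the stub: an answer without a citation marker is refused.
    return "ship" if "[1]" in answer else "refuse"


async def stream_to_ui(answer: str, ui_events: list[str]) -> None:
    for word in answer.split():
        await asyncio.sleep(0.01)  # stands in for token-by-token rendering
        ui_events.append(word)


async def respond(answer: str, ui_events: list[str]) -> str:
    # Launch the critique first so it runs while the answer streams.
    critique = asyncio.create_task(critique_answer(answer))
    await stream_to_ui(answer, ui_events)
    verdict = await critique
    if verdict != "ship":
        ui_events.append("<retracted>")  # hypothetical signal for the UI to withdraw the answer
    return verdict
```

Calling `asyncio.run(respond("Paris [1]", events))` overlaps the two waits instead of summing them. In the high-stakes contexts this pattern targets, holding the answer until the verdict arrives may still be the safer choice.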
Self-RAG vs CRAG?
Self-RAG's distinguishing step is the critique of the generated answer; CRAG concentrates on critiquing the retrieved evidence. They are complementary; we have shipped builds with both, where CRAG decides whether retrieval is worth keeping and Self-RAG decides whether the answer is worth shipping.
DIRECT RAG · APPLIED K

Bring the corpus. We'll bring the build.

Senior engineers, eval suite at handoff, full source ownership. We integrate against the model and the index the same way we integrate against Postgres. Sized to the work in front of you.