Retrieve
Run Advanced RAG to fetch top-K passages. Self-RAG sits on top of a normal retrieval pipeline; it does not replace hybrid + rerank.
Asai et al. (2023) trained models to emit reflection tokens that decide whether retrieval is needed, evaluate the retrieved passages, and self-critique the generated answer. In production we apply the same pattern with prompt-only Self-RAG on frontier models: the loop catches bad retrievals before the user sees a confident wrong answer.
Best for: High-stakes accuracy contexts where confident wrong answers are worse than 'I do not know'. Clinical, financial, legal, regulatory. Also strong on vague or incomplete queries where retrieval quality is uncertain.
Self-RAG runs the retrieval and generation steps with explicit self-critique nodes between them. After retrieval, the model rates the retrieved passages; if weak, it triggers a query rewrite or different retrieval. After generation, the model rates its own answer against the citations; if weak, it regenerates or declines to answer.
Run Advanced RAG to fetch top-K passages. Self-RAG sits on top of a normal retrieval pipeline; it does not replace hybrid + rerank.
A critique LLM grades each retrieved passage on relevance and supportiveness against the query. Structured output: per-passage verdict with rationale.
If the retrieval scores well, generate. If weak, trigger query rewrite and re-retrieve once. If still weak, refuse to answer with a calibrated 'I do not have enough evidence'.
Standard citation-required generation step. Self-RAG inherits whatever your generation discipline already is.
Second critique LLM grades the generated answer against the cited passages on faithfulness and completeness. Structured output: ship / regenerate / refuse with rationale.
Stack is Advanced RAG plus two critique LLM calls per query. The discipline is in keeping the critique outputs structured so we can measure and improve critique quality over time.
Self-RAG is a quality-check layer; the base retrieval still needs to be strong. Naive RAG plus self-critique is not a great combination.
Mid-tier model. Structured output (per-passage verdict + rationale) so we can measure critique quality. Cheaper than running Opus twice per query.
Final answer model. Citation-required prompt template.
Same mid-tier model. We deliberately use a different size from the generator to avoid a model grading itself sympathetically.
When to refuse vs regenerate vs ship is a calibration choice. We pick the threshold on a held-out eval set and gate it in CI; conservative tuning by default.
Self-RAG adds ~50% to development time of an Advanced RAG build because the critique prompts and refusal thresholds need separate calibration. Worth it when the eval set shows confident wrong answers are the failure mode.
Ship the Advanced RAG pipeline first. Measure on the eval set, with particular attention to confident wrong answers (where faithfulness is low but answer-quality is rated high by users).
Critique LLM rates each retrieved passage on relevance + supportiveness. Structured output. Iterate on the prompt against weak retrieval cases from the eval set.
On weak critique, trigger one query rewrite and re-retrieve. One retry, not infinite; the eval set decides whether more retries help or just add cost.
Second critique LLM rates the generated answer on faithfulness to the citations. Structured output; calibrated against ground-truth faithfulness from the eval set.
Decide thresholds for ship / regenerate / refuse on a held-out set. Tune conservatively; a refusal is better than a confident wrong answer in the contexts that pick Self-RAG.
Every critique call traced with cost. Refusal rate watched as a production metric. Drift alerts on critique-quality shifts.
Original Self-RAG paper showed accuracy and faithfulness lifts on benchmarks. In production, the bigger win is reducing confident wrong answers, which the eval set quantifies as faithfulness floor not just average.
Self-RAG eval requires us to grade not just answer quality but refusal quality. A refusal on a question where ground truth exists is a regression; a refusal on a question where retrieval was genuinely thin is the system working. Our eval set splits these explicitly so both directions are measurable.
Self-RAG sits in the family of quality-check loops alongside CRAG, both heavily explored in the 2024-2026 literature. Production teams use the same shape with prompt-only Self-RAG on frontier models rather than the originally-proposed fine-tuned model.
The practitioner consensus has shifted: the value is not in the specific Self-RAG fine-tuned model, it is in the discipline of structured critique with calibrated refusal. Most production deployments use Claude or GPT for both the critique and the generation, with the critique prompt and refusal thresholds doing the work the original paper attributed to fine-tuning.
We reach for Self-RAG when the cost of a confident wrong answer is materially higher than the cost of a refusal. Most consumer chat does not meet that bar; most clinical, legal, financial, and regulatory work does.
We have shipped Self-RAG inside contexts where the audit trail mattered as much as the answer. The structured critique output becomes part of the audit trail: not just 'here is the answer' but 'here is why we trusted it'. That auditable confidence is what justifies the ~2x cost over Advanced RAG.
Corrective RAG (CRAG) when the failure mode is bad retrieval more than bad generation. Plain Advanced RAG when refusal-versus-confident-wrong is not a meaningful distinction for the workload.
Same family of quality-check loops; retrieval-focused vs Self-RAG's answer-focused.
Read playbookRelated patternSelf-RAG sits on top of Advanced; the base retrieval still needs to be strong.
Read playbookRelated patternAgentic loops often incorporate Self-RAG-style critique steps.
Read playbook