Query rewriting plus hybrid search (dense + BM25) fused with reciprocal-rank fusion, then a cross-encoder reranker on the top-K, then generation with citation discipline. The 2026 production consensus. Our default starting point on every engagement, and the shape used in the Affidavit Mapp build.
Best for: Most production RAG. Knowledge bases, support deflection, internal search, document Q&A, and any build where retrieval quality bounds the answer quality.
Each stage of advanced RAG earns its place against a measurable improvement. Adding query rewriting buys you ambiguous-query recall; hybrid catches the exact-term matches embeddings miss; RRF gives a defensible fusion; the reranker pays for itself in Recall@5 lift; citation discipline keeps the LLM honest.
Cheap LLM call rewrites the query for retrieval: expand abbreviations, add likely synonyms, decompose multi-clause questions. Optional step but lifts recall on noisy or ambiguous queries by 5-12 points.
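A minimal sketch of the rewrite step, assuming the model is reached through a plain callable that takes a prompt and returns text. The prompt wording, the JSON reply shape, `rewrite_query`, and `fake_llm` are all illustrative assumptions, not part of any particular vendor API. Because the step is optional, any failure falls back to the original query.

```python
import json
from typing import Callable, List

# Hypothetical prompt; the JSON reply shape is an assumption of this sketch.
REWRITE_PROMPT = (
    "Rewrite the user query for document retrieval. Expand abbreviations, "
    "add likely synonyms, and split multi-clause questions into sub-queries. "
    'Reply with JSON only: {{"queries": ["..."]}}\n\nQuery: {query}'
)


def rewrite_query(query: str, llm: Callable[[str], str], max_queries: int = 4) -> List[str]:
    """Return retrieval queries; fall back to the original query on any failure."""
    try:
        candidates = json.loads(llm(REWRITE_PROMPT.format(query=query)))["queries"]
    except Exception:  # the rewrite is optional, so never block retrieval on it
        return [query]
    if not isinstance(candidates, list):
        return [query]
    cleaned = [c.strip() for c in candidates if isinstance(c, str) and c.strip()]
    # Keep the original query first so a bad rewrite cannot remove it.
    seen, out = set(), []
    for q in [query, *cleaned]:
        if q.lower() not in seen:
            seen.add(q.lower())
            out.append(q)
    return out[:max_queries]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned rewrite."""
    return json.dumps({"queries": ["service level agreement uptime", "SLA penalty clause"]})
```

Each returned query would then be sent through retrieval, with the resulting ranked lists fused downstream.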
Dense (vector) and sparse (BM25) searches run in parallel over the same corpus. Dense catches semantic matches; sparse catches exact-term and rare-token matches embeddings miss. We use pgvector with HNSW for dense and Postgres FTS for sparse.
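The parallel execution can be sketched as follows, assuming each retriever is a callable that takes a query and a limit and returns ranked chunk ids. `hybrid_search`, `fake_dense`, and `fake_sparse` are hypothetical names; in a real build the two callables would issue the pgvector and full-text queries.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Retriever = Callable[[str, int], List[str]]  # (query, limit) -> ranked chunk ids


def hybrid_search(query: str, dense: Retriever, sparse: Retriever,
                  limit: int = 50) -> Tuple[List[str], List[str]]:
    """Run both retrievers concurrently and return their ranked lists, unfused."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense, query, limit)
        sparse_future = pool.submit(sparse, query, limit)
        return dense_future.result(), sparse_future.result()


# Hypothetical stand-ins for the real database-backed retrievers.
def fake_dense(query: str, limit: int) -> List[str]:
    return ["c7", "c2", "c9"][:limit]


def fake_sparse(query: str, limit: int) -> List[str]:
    return ["c2", "c4"][:limit]
```

Threads suit this case because both calls are I/O-bound database round trips; total latency is roughly that of the slower search rather than the sum of the two.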
Reciprocal-rank fusion (RRF) merges the two ranked lists into one using rank positions only. Defensible (no score normalisation games), well-studied, and good enough that fancier fusion rarely earns its complexity in production.
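For reference, a short sketch of reciprocal-rank fusion: each document scores the sum of 1 / (k + rank) over every list it appears in. The constant k = 60 is a commonly used default and an assumption here, not a value the text specifies; `rrf_fuse` is an illustrative name.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, limit: int = 50) -> List[str]:
    """Fuse ranked id lists with reciprocal-rank fusion and return the top `limit` ids."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered][:limit]


dense = ["c7", "c2", "c9"]
sparse = ["c2", "c4"]
fused = rrf_fuse([dense, sparse])  # c2 ranks first: it appears in both lists
```

Only rank positions enter the formula, which is why the dense and sparse scores never need to be put on a common scale.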
Top-50 from fusion goes through a cross-encoder reranker (Cohere Rerank v3 or BGE-reranker-v2-m3) which scores query-document pairs directly. Lifts Recall@5 from ~0.69 to ~0.82 on published benchmarks. The single biggest quality lever after fusion.
Top-K passages plus a citation-required prompt. The model must cite each claim to its source chunk; the UI renders inline citations. Users and auditors can verify every assertion.
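One way to sketch the citation-required prompt and a cheap post-check, assuming citations appear as bracketed chunk ids such as [c1] before the sentence-ending punctuation. The prompt wording and both function names are illustrative assumptions.

```python
import re
from typing import Dict, List

CITATION = re.compile(r"\[([^\[\]]+)\]")


def build_cited_prompt(question: str, chunks: Dict[str, str]) -> str:
    """Assemble a prompt that lists sources by id and demands a citation per sentence."""
    sources = "\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    return (
        "Answer using only the sources below. Every sentence must include the id "
        "of its supporting source in square brackets before the final punctuation. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


def uncited_sentences(answer: str, chunks: Dict[str, str]) -> List[str]:
    """Return sentences that cite nothing, or cite an id that was never supplied."""
    bad = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        ids = CITATION.findall(sentence)
        if not ids or any(cid not in chunks for cid in ids):
            bad.append(sentence)
    return bad
```

The check confirms only that a citation exists and points at a supplied chunk; whether the chunk actually supports the claim is a separate faithfulness question.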
Our default stack for advanced RAG is built around Postgres so the relational, vector, and lexical data sit in one place. We move to dedicated stores only at scale; see the by-scale playbook for the threshold.
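Picked per corpus. Cohere v3 leads on cross-lingual; OpenAI 3-large is the strongest single-language baseline. Cohere v3 emits 1024-dimensional vectors, which fit a pgvector HNSW index directly. OpenAI 3-large defaults to 3072 dimensions, above pgvector's 2,000-dimension HNSW limit for the vector type, so it has to be shortened via the API's dimensions parameter or stored as halfvec (indexable up to 4,000 dimensions).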
One database for text, embeddings, metadata, and access control. Matches dedicated VDBs up to ~1M vectors on equivalent compute. Add a dedicated store (Qdrant, Milvus, or Vespa) only when scale demands.
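Native lexical ranking that stands in for BM25. The built-in ts_rank functions do not use corpus-wide term statistics, so this approximates BM25 rather than reproducing it; because RRF consumes only rank order, the score scale itself does not matter. Skips the operational cost of running Elasticsearch alongside Postgres at small-to-mid scale.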
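Cohere for managed and supported; BGE for on-prem and cost-tuned. Both lift Recall@5 by roughly 15-20% relative over hybrid alone (the ~0.69 to ~0.82 move cited for published benchmarks is 13 points absolute, about 19% relative).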
Picked per workload, behind a vendor-neutral abstraction. Citation-required prompt template plus structured output validation for the citation map.
Every retrieval call traced (query, retrieved chunks, scores). Golden-set eval runs on every prompt or model change. Drift watched against a held-out production sample.
An eight-week sprint for a first production-grade Advanced RAG build, sized to a corpus under 1M chunks. Longer programs phase the same shape across multi-tenant scope, regulated review, and ongoing eval ops.
We characterise the source before we pick a chunker. Document types, average length, table density, structural cues, query distribution. Drives every downstream choice.
Hierarchical or contextual chunking depending on profile. Late chunking (Jina) for long-form when it pays off. Stored alongside the original passage and structural context for citation recovery.
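A rough sketch of hierarchical chunking, assuming the document has already been split into (heading, body) sections and that paragraphs are separated by blank lines. The `Chunk` fields, the id scheme, and the 400-character budget are assumptions for illustration; a production chunker would usually count tokens rather than characters.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    chunk_id: str
    section: str  # parent heading, kept as structural context for citation recovery
    text: str     # original passage, stored unmodified

    @property
    def embedding_text(self) -> str:
        """Text sent to the embedder: the passage prefixed with its section heading."""
        return f"{self.section}\n{self.text}"


def chunk_sections(doc_id: str, sections: List[Tuple[str, str]], max_chars: int = 400) -> List[Chunk]:
    """Pack whole paragraphs into chunks of at most max_chars, never crossing a section."""
    chunks: List[Chunk] = []
    for s_idx, (heading, body) in enumerate(sections):
        groups: List[str] = []
        buffer: List[str] = []
        for para in (p.strip() for p in body.split("\n\n")):
            if not para:
                continue
            if buffer and len("\n\n".join(buffer + [para])) > max_chars:
                groups.append("\n\n".join(buffer))
                buffer = [para]
            else:
                buffer.append(para)
        if buffer:
            groups.append("\n\n".join(buffer))
        # A single paragraph longer than max_chars is kept whole as its own chunk.
        for c_idx, text in enumerate(groups):
            chunks.append(Chunk(f"{doc_id}:{s_idx}:{c_idx}", heading, text))
    return chunks


example_chunks = chunk_sections(
    "d1", [("Refunds", "A" * 30 + "\n\n" + "B" * 30 + "\n\n" + "C" * 30)], max_chars=70
)
```

Keeping `text` separate from `embedding_text` means the stored passage can be quoted verbatim in a citation even though the embedder saw extra context.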
Embeddings computed in batches, written to pgvector with HNSW. GIN indexes built over tsvector columns for the lexical (BM25-style) side. Metadata filters indexed for tenant + ACL pushdown at query time.
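The indexing step might look like the sketch below. The table name, column names, 1024-wide vector column, and English text-search configuration are all assumptions; the DDL uses standard pgvector and Postgres syntax, and `batched` is a small helper for feeding the embedder in fixed-size groups.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Assumed schema: names are illustrative, and the vector width must match
# whichever embedding model is chosen.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id         bigserial PRIMARY KEY,
    tenant_id  uuid NOT NULL,
    body       text NOT NULL,
    embedding  vector(1024),
    tsv        tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX IF NOT EXISTS chunks_tsv_gin ON chunks USING gin (tsv);
CREATE INDEX IF NOT EXISTS chunks_tenant ON chunks (tenant_id);
"""


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items, for batch embedding calls."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch
```

The generated `tsv` column keeps the lexical index in step with `body` without a trigger, and the plain B-tree index on `tenant_id` supports the tenant filter at query time.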
Retrieval service runs vector and BM25 in parallel, fuses with RRF, returns top-50. Tenant-scoped at the SQL layer, never at the application layer.
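To illustrate tenant scoping at the SQL layer, here is a sketch of the two queries with the tenant filter written into the statement itself. It assumes a `chunks` table with `tenant_id`, `embedding`, and `tsv` columns and a psycopg-style driver that accepts named `%(name)s` parameters; all names are hypothetical.

```python
from typing import Dict, Sequence

# The tenant filter lives in the SQL, so no code path can retrieve across tenants.
DENSE_SQL = """
SELECT id FROM chunks
WHERE tenant_id = %(tenant_id)s
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(limit)s
"""

SPARSE_SQL = """
SELECT id FROM chunks
WHERE tenant_id = %(tenant_id)s
  AND tsv @@ websearch_to_tsquery('english', %(query)s)
ORDER BY ts_rank_cd(tsv, websearch_to_tsquery('english', %(query)s)) DESC
LIMIT %(limit)s
"""


def to_vector_literal(vec: Sequence[float]) -> str:
    """Format a vector in pgvector's text form, e.g. [0.500000,1.000000]."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"


def dense_params(tenant_id: str, query_vec: Sequence[float], limit: int = 50) -> Dict[str, object]:
    return {"tenant_id": tenant_id, "query_vec": to_vector_literal(query_vec), "limit": limit}


def sparse_params(tenant_id: str, query: str, limit: int = 50) -> Dict[str, object]:
    return {"tenant_id": tenant_id, "query": query, "limit": limit}
```

With such a driver, a call would look like `cursor.execute(DENSE_SQL, dense_params(tenant_id, vec))`. Binding values as parameters rather than interpolating them also keeps user-supplied query text out of the SQL string.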
Cross-encoder reranker called on top-50, returns top-K (typically 5-10). Latency budget enforced at this stage; degrade gracefully on slow rerank.
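A sketch of the latency budget with graceful degradation, assuming the reranker is wrapped in a callable that returns one score per candidate. The 300 ms budget, the function names, and both stand-in scorers are assumptions; on timeout or error the fused order is returned unchanged.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Scorer = Callable[[str, Sequence[str]], List[float]]  # one score per candidate

_pool = ThreadPoolExecutor(max_workers=4)


def rerank_with_budget(query: str, candidates: Sequence[str], scorer: Scorer,
                       top_k: int = 5, budget_s: float = 0.3) -> List[str]:
    """Return the top_k candidates by reranker score, or fused order if the budget is blown."""
    future = _pool.submit(scorer, query, candidates)
    try:
        scores = future.result(timeout=budget_s)
    except Exception:  # timeout or reranker failure: degrade to fusion order
        future.cancel()
        return list(candidates[:top_k])
    # sorted() is stable, so ties keep their original fused order.
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order[:top_k]]


# Hypothetical stand-ins for a real cross-encoder call.
def length_scorer(query: str, passages: Sequence[str]) -> List[float]:
    return [float(len(p)) for p in passages]


def slow_scorer(query: str, passages: Sequence[str]) -> List[float]:
    time.sleep(0.2)
    return [0.0] * len(passages)
```

Falling back to fused order means a slow reranker costs quality on that request rather than availability, and the fallback rate is itself worth tracking.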
Citation-required prompt. Structured-output validation on the citation map. Streaming generation with citation hints rendered as the response builds.
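Structured-output validation on the citation map could be sketched like this, assuming the model is asked to reply with JSON holding an `answer` string and a `citations` list of `chunk_id` and `quote` pairs. That schema is an assumption of the sketch, not a format the text prescribes.

```python
import json
from typing import Dict, List


def validate_citation_map(raw: str, chunks: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return ["missing string field 'answer'"]
    citations = payload.get("citations")
    if not isinstance(citations, list) or not citations:
        return ["missing or empty 'citations' list"]
    problems = []
    for i, citation in enumerate(citations):
        if not isinstance(citation, dict):
            problems.append(f"citation {i}: not an object")
            continue
        chunk_id, quote = citation.get("chunk_id"), citation.get("quote")
        if not isinstance(chunk_id, str) or chunk_id not in chunks:
            problems.append(f"citation {i}: unknown chunk_id {chunk_id!r}")
        elif not isinstance(quote, str) or not quote or quote not in chunks[chunk_id]:
            # The quote must appear verbatim in the chunk it claims to come from.
            problems.append(f"citation {i}: quote not found in chunk {chunk_id!r}")
    return problems
```

Requiring a verbatim quote is a deliberately strict choice: it rejects paraphrased evidence, but it makes the check cheap and deterministic.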
Golden set of 100-300 representative queries with expected citations. CI fails on Recall@5 or faithfulness regression. Ratchet over time as the prompt + model evolve.
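The Recall@5 gate and ratchet can be sketched as below, assuming the golden set maps each query to its set of expected chunk ids. The function names and the 0.01 tolerance are illustrative assumptions.

```python
from typing import Dict, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the expected chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("golden-set entry has no expected chunks")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: Dict[str, Sequence[str]], golden: Dict[str, Set[str]], k: int = 5) -> float:
    """Average Recall@k over every golden-set query; a missing run counts as zero."""
    return sum(recall_at_k(runs.get(q, []), golden[q], k) for q in golden) / len(golden)


def passes_gate(current: float, baseline: float, tolerance: float = 0.01) -> bool:
    """CI check: fail when the metric drops more than `tolerance` below the baseline."""
    return current >= baseline - tolerance


def ratchet(current: float, baseline: float) -> float:
    """After a passing run, raise the stored baseline but never lower it."""
    return max(current, baseline)


golden = {"q1": {"a", "b"}, "q2": {"c"}}
runs = {"q1": ["a", "x", "b"], "q2": ["y", "z"]}
score = mean_recall_at_k(runs, golden)  # q1 finds both, q2 finds none
```

The small tolerance absorbs run-to-run noise; the ratchet is what stops a series of individually tolerable drops from accumulating.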
Traces shipped to your SIEM with PII scrubbed at the proxy. Cost per call, p50/p95/p99 latency, eval pass rate over production traffic, drift alerts.
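As a small illustration of the observability numbers, the sketch below computes p50/p95/p99 from latency samples and redacts email addresses before a trace is shipped. Email redaction is only a stand-in for real PII scrubbing, which covers many more patterns; the function names are assumptions.

```python
import re
import statistics
from typing import Dict, Sequence

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub(text: str) -> str:
    """Redact email addresses; a real scrubber would handle far more PII types."""
    return EMAIL.sub("[email]", text)


def latency_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50, p95, and p99 using inclusive percentile interpolation."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Percentiles are reported instead of the mean because a handful of slow rerank or generation calls can dominate tail latency while barely moving the average.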
Published benchmarks plus our internal numbers from production engagements. The reranker is the single biggest quality lever; hybrid is what makes the reranker have something to rank.
Our eval suite combines a static golden set (200 representative queries, hand-graded citations) with an LLM-as-judge faithfulness check and a live sample taken weekly from production traffic. Recall@K and MRR gate the index; faithfulness and answer-quality gate the prompt plus model. We never gate ship on a single metric.
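A sketch of gating on several metrics at once, with MRR computed alongside. The metric names and floor values are hypothetical; the point is that the gate reports every failing metric rather than a single pass/fail number.

```python
from typing import Dict, List, Sequence, Set


def mean_reciprocal_rank(runs: Dict[str, Sequence[str]], golden: Dict[str, Set[str]]) -> float:
    """Average of 1 / rank of the first relevant result; 0 when none is retrieved."""
    total = 0.0
    for query, relevant in golden.items():
        for rank, doc_id in enumerate(runs.get(query, []), start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(golden)


def failing_metrics(metrics: Dict[str, float], floors: Dict[str, float]) -> List[str]:
    """Return the names of all metrics that are missing or below their floor."""
    return sorted(name for name, floor in floors.items()
                  if metrics.get(name, float("-inf")) < floor)


mrr = mean_reciprocal_rank({"q1": ["x", "a"], "q2": ["c"]}, {"q1": {"a"}, "q2": {"c"}})
blocked_by = failing_metrics(
    {"recall@5": 0.82, "mrr": 0.60, "faithfulness": 0.90},
    {"recall@5": 0.80, "mrr": 0.65, "faithfulness": 0.85},  # hypothetical floors
)
```

Treating a missing metric as a failure avoids silently shipping when one of the evaluators did not run.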
Advanced RAG is the 2026 production consensus. Almost every major RAG vendor (LangChain, LlamaIndex, Pinecone, Cohere, Weaviate, Qdrant) ships native hybrid retrieval plus rerank as the recommended path.
Practitioners report two consistent wins: the reranker almost always pays for its compute (Cohere's published numbers, third-party reproductions, Anthropic's contextual retrieval write-up), and hybrid (vector + BM25) catches the long tail of exact-term queries that destroy pure-vector recall. The 2026 talk track in the field is no longer whether to do hybrid + rerank but which embedding and which reranker.
We default to Advanced RAG on most engagements. It is what we describe to customers as 'production RAG' without any qualifiers. When a corpus or query shape pushes past its envelope, we add one of the primary patterns on top rather than replacing it.
The reason it stays the default: every part of the pipeline is independently measurable, debuggable, and replaceable. If retrieval quality drops, you can isolate whether the embeddings drifted, whether BM25 is contributing, whether the reranker is paying off, all without rebuilding the whole system. That kind of legibility is what makes a RAG build maintainable past month six.
GraphRAG when the answers genuinely require multi-hop reasoning across linked entities. Agentic RAG when queries are heterogeneous enough that one shot of retrieval was always going to be wrong. Both are additions to the Advanced pipeline rather than replacements for it.
Related pattern: what you add on top when queries vary enough to need decomposition.
Related pattern: what you add on top when answers depend on linked-entity reasoning.
Related pattern: post-check loop that lifts faithfulness in high-stakes contexts.