★ Advanced RAG · Primary pattern · production default · Eval-gated

ADVANCED RAG · 2026 PRODUCTION DEFAULT

Advanced RAG. Hybrid retrieval, rerank, citations.

Query rewriting plus hybrid search (dense + BM25) fused with reciprocal-rank fusion, then a cross-encoder reranker on the top-K, then generation with citation discipline. The 2026 production consensus. Our default starting point on every engagement, and the shape used in the Affidavit Mapp build.

pgvector · BM25 · Cohere · OpenAI · Eval pipelines

Default: Yes · most engagements start here

Stack: Postgres · pgvector · BM25

Latency adder: ~30ms p95 vs naive

Accuracy adder: +17 pts Recall@5 vs hybrid alone

[AT A GLANCE]

Best for: Most production RAG. Knowledge bases, support deflection, internal search, document Q&A, and any build where retrieval quality bounds the answer quality.

Origin: Multiple lines of work; consolidated in 2024-2026 production literature

Year: 2024-2026 consensus

Complexity: Medium

Production stage: Mature

[THE PIPELINE]

Five stages, all instrumented.

Each stage of advanced RAG earns its place against a measurable improvement. Adding query rewriting buys you ambiguous-query recall; hybrid catches the exact-term matches embeddings miss; RRF gives a defensible fusion; the reranker pays for itself in Recall@5 lift; citation discipline keeps the LLM honest.

Query rewrite

Cheap LLM call rewrites the query for retrieval: expand abbreviations, add likely synonyms, decompose multi-clause questions. Optional step but lifts recall on noisy or ambiguous queries by 5-12 points.
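A minimal sketch of the rewrite step, assuming the model is reachable through a plain callable. The prompt wording, the `rewrite_query` name, and the `llm` parameter are illustrative assumptions, not a fixed interface. Because the step is optional, any failure falls back to the original query.

```python
from typing import Callable, List

# Hypothetical prompt; tune the wording per corpus.
REWRITE_PROMPT = (
    "Rewrite the user question for document retrieval.\n"
    "Expand abbreviations, add likely synonyms, and split multi-clause "
    "questions into separate lines. Return one search query per line.\n\n"
    "Question: {question}"
)


def rewrite_query(
    question: str,
    llm: Callable[[str], str],  # assumption: takes a prompt, returns raw text
    max_queries: int = 4,
) -> List[str]:
    """Return the original question followed by up to max_queries rewrites."""
    try:
        raw = llm(REWRITE_PROMPT.format(question=question))
    except Exception:
        # Rewriting is optional: on failure, retrieve with the original query.
        return [question]

    rewrites: List[str] = []
    for line in raw.splitlines():
        candidate = line.strip(" -\t")  # drop bullet markers and padding
        is_new = candidate.lower() != question.lower() and candidate not in rewrites
        if candidate and is_new:
            rewrites.append(candidate)
    return [question] + rewrites[:max_queries]
```

Keeping the original question first means retrieval never gets worse than the un-rewritten baseline when the rewrites miss.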

Hybrid retrieval

Dense (vector) and sparse (BM25) searches run in parallel over the same corpus. Dense catches semantic matches; sparse catches exact-term and rare-token matches embeddings miss. We use pgvector with HNSW for dense and Postgres FTS for sparse.
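The parallel dense and sparse searches can be sketched as follows. The `chunks` table, its `embedding` and `tsv` columns, and the two search callables are assumptions for illustration. The SQL strings show the kind of query each callable might run (pgvector's `<=>` cosine-distance operator and Postgres `ts_rank_cd`); they are not a tuned production query.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical schema: chunks(id, body, embedding vector, tsv tsvector).
DENSE_SQL = """
SELECT id FROM chunks
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(limit)s
"""

SPARSE_SQL = """
SELECT id FROM chunks
WHERE tsv @@ websearch_to_tsquery('english', %(query)s)
ORDER BY ts_rank_cd(tsv, websearch_to_tsquery('english', %(query)s)) DESC
LIMIT %(limit)s
"""

# A search callable takes (query, limit) and returns ranked chunk ids.
Search = Callable[[str, int], List[str]]


def hybrid_retrieve(
    query: str,
    dense_search: Search,
    sparse_search: Search,
    limit: int = 50,
) -> Tuple[List[str], List[str]]:
    """Run both searches concurrently and return (dense_ids, sparse_ids)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense = pool.submit(dense_search, query, limit)
        sparse = pool.submit(sparse_search, query, limit)
        return dense.result(), sparse.result()
```

The two ranked lists are returned separately so the fusion step stays a pure function that can be tested on its own.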

Reciprocal-rank fusion

Merges the two ranked lists into one. Defensible (no score normalisation games), well-studied, and good enough that fancier fusion rarely earns its complexity in production.
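For reference, a small sketch of reciprocal-rank fusion: each document scores the sum of 1 / (k + rank) over the lists it appears in, with ranks starting at 1. The function name and the alphabetical tie-break are our own choices for illustration; k=60 is the commonly used default.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[str]],
    k: int = 60,
    limit: int = 50,
) -> List[str]:
    """Fuse ranked id lists; only ranks are used, never raw scores."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:limit]


dense_ids = ["a", "b", "c"]
sparse_ids = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_ids, sparse_ids])
# "a" (ranks 1 and 2) edges out "c" (ranks 3 and 1); both beat single-list hits.
```

Because only ranks enter the formula, cosine distances and lexical scores never need to be put on a common scale.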

Cross-encoder rerank

Top-50 from fusion goes through a cross-encoder reranker (Cohere Rerank v3 or BGE-reranker-v2-m3) which scores query-document pairs directly. Lifts Recall@5 from ~0.69 to ~0.82 on published benchmarks. The single biggest quality lever after fusion.
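A sketch of the rerank stage with the scorer injected, so the same code can sit in front of a managed API or a self-hosted cross-encoder. The `score_pairs` callable and the `rerank` signature are assumptions; a real integration would wrap whichever reranker is deployed behind that callable.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: the scorer takes (query, passages) and returns one relevance
# score per passage, higher meaning more relevant.
ScorePairs = Callable[[str, Sequence[str]], List[float]]


def rerank(
    query: str,
    candidates: Sequence[Tuple[str, str]],  # (chunk_id, text) in fused order
    score_pairs: ScorePairs,
    top_k: int = 5,
    pool_size: int = 50,
) -> List[Tuple[str, float]]:
    """Score the first pool_size candidates against the query, keep top_k."""
    pool = list(candidates)[:pool_size]
    scores = score_pairs(query, [text for _, text in pool])
    if len(scores) != len(pool):
        raise ValueError("scorer must return exactly one score per candidate")
    ranked = sorted(
        zip((chunk_id for chunk_id, _ in pool), scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]
```

Unlike the embedding stage, the scorer sees the query and each passage together, which is why it can reorder candidates that looked similar in vector space.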

Generation with citations

Top-K passages plus a citation-required prompt. The model must cite each claim to its source chunk; the UI renders inline citations. Users and auditors can verify every assertion.
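One way to express the citation-required prompt, as a sketch. The prompt wording, the bracketed `[chunk_id]` marker format, and both function names are assumptions chosen for illustration; the point is that every passage carries an id the model can cite and the UI can later resolve.

```python
import re
from typing import Dict, List


def build_cited_prompt(question: str, passages: Dict[str, str]) -> str:
    """Assemble a prompt whose sources are labelled with citable ids."""
    sources = "\n".join(f"[{cid}] {text}" for cid, text in passages.items())
    return (
        "Answer using only the sources below. After every claim, cite the "
        "supporting source id in square brackets, for example [c12]. If the "
        "sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


# Hypothetical marker format: ids made of letters, digits, "_" or "-".
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def cited_ids(answer: str) -> List[str]:
    """Return cited chunk ids in first-appearance order, without repeats."""
    seen: List[str] = []
    for cid in CITATION.findall(answer):
        if cid not in seen:
            seen.append(cid)
    return seen
```

The ordered id list is what a front end would need to render inline citations against the retrieved chunks.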

[TECHNICAL STACK]

What we'd actually deploy.

Our default stack for advanced RAG is built around Postgres so the relational, vector, and lexical data sit in one place. We move to dedicated stores only at scale; see the by-scale playbook for the threshold.

EMBEDDING MODEL

Cohere embed v3 (multilingual) or OpenAI text-embedding-3-large

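Picked per corpus. Cohere v3 leads on cross-lingual; OpenAI 3-large is the strongest single-language baseline. Cohere embed v3 produces 1024-dimensional vectors; text-embedding-3-large produces 3072 by default and can be shortened through the API's dimensions parameter, which matters because pgvector's HNSW index supports vector columns only up to 2,000 dimensions (halfvec up to 4,000).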

VECTOR STORE

PostgreSQL with pgvector + HNSW

One database for text, embeddings, metadata, and access control. Matches dedicated VDBs up to ~1M vectors on equivalent compute. Add a dedicated store (Qdrant, Milvus, or Vespa) only when scale demands.

LEXICAL SEARCH

Postgres full-text search (tsvector + GIN)

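Native lexical ranking via ts_rank or ts_rank_cd, which stands in for BM25 here. It is not true BM25: the built-in ranking functions use no corpus-wide statistics such as inverse document frequency. Skips the operational cost of running Elasticsearch alongside Postgres at small-to-mid scale.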
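As a rough illustration of the single-database layout, the schema below keeps text, embedding, and lexical index on one table. The table name, column names, tenant column, the 1024-dimension width, and the `english` text-search configuration are all assumptions; the width in particular must match whichever embedding model is chosen.

```python
# Hypothetical DDL for one table serving both dense and lexical search.
SCHEMA_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS chunks (
        id         bigserial PRIMARY KEY,
        tenant_id  text NOT NULL,
        body       text NOT NULL,
        embedding  vector(1024) NOT NULL,
        tsv        tsvector GENERATED ALWAYS AS
                       (to_tsvector('english', body)) STORED
    )
    """,
    # Approximate nearest-neighbour index for cosine distance (<=>).
    "CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw "
    "ON chunks USING hnsw (embedding vector_cosine_ops)",
    # Inverted index backing full-text matches on the generated column.
    "CREATE INDEX IF NOT EXISTS chunks_tsv_gin ON chunks USING gin (tsv)",
    # Supports tenant filters applied in SQL at query time.
    "CREATE INDEX IF NOT EXISTS chunks_tenant ON chunks (tenant_id)",
]


def apply_schema(cursor) -> None:
    """Run each statement on a DB-API style cursor (anything with execute)."""
    for statement in SCHEMA_STATEMENTS:
        cursor.execute(statement)
```

The generated `tsv` column keeps the lexical index in step with `body` without application code having to maintain it.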

RERANKER

Cohere Rerank v3 (managed) or BGE-reranker-v2-m3 (self-host)

Cohere for managed and supported; BGE for on-prem and cost-tuned. Both lift Recall@5 by 15-20 points over hybrid alone.

GENERATION

Claude (Anthropic) or GPT (OpenAI) via direct API

Picked per workload, behind a vendor-neutral abstraction. Citation-required prompt template plus structured output validation for the citation map.
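Structured-output validation for the citation map might look like the sketch below. The JSON shape (a `claims` list whose items carry `text` and `citations`) is a hypothetical contract invented for this example, not a vendor format; the check is that every claim cites at least one chunk that was actually retrieved.

```python
import json
from typing import List, Set


def validate_citation_map(raw_json: str, retrieved_ids: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    claims = payload.get("claims") if isinstance(payload, dict) else None
    if not isinstance(claims, list) or not claims:
        return ["missing or empty 'claims' list"]

    problems: List[str] = []
    for i, claim in enumerate(claims):
        text = claim.get("text") if isinstance(claim, dict) else None
        if not isinstance(text, str) or not text.strip():
            problems.append(f"claim {i}: missing text")
            continue
        cited = claim.get("citations")
        if not isinstance(cited, list) or not cited:
            problems.append(f"claim {i}: no citations")
            continue
        for cid in cited:
            # Reject ids the retriever never returned (invented citations).
            if not isinstance(cid, str) or cid not in retrieved_ids:
                problems.append(f"claim {i}: unknown source {cid!r}")
    return problems
```

A non-empty result can trigger a retry or a refusal, so an answer with unverifiable citations never reaches the user.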

OBSERVABILITY + EVAL

OpenTelemetry traces, golden-set eval gating CI

Every retrieval call traced (query, retrieved chunks, scores). Golden-set eval runs on every prompt or model change. Drift watched against a held-out production sample.
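To make "every retrieval call traced" concrete, here is a standard-library stand-in for the span data we would record. It is deliberately not the OpenTelemetry API; the `RetrievalTrace` and `Span` names and the attribute keys are assumptions, and a real build would emit the same fields through an OpenTelemetry tracer.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Span:
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    duration_ms: float = 0.0


@dataclass
class RetrievalTrace:
    query: str
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def stage(self, name: str, **attributes: Any) -> Iterator[Span]:
        """Time one pipeline stage and keep its attributes with the trace."""
        span = Span(name=name, attributes=dict(attributes))
        started = time.perf_counter()
        try:
            yield span
        finally:
            # Recorded even if the stage raises, so failures stay visible.
            span.duration_ms = (time.perf_counter() - started) * 1000.0
            self.spans.append(span)


trace = RetrievalTrace(query="refund policy for annual plans")
with trace.stage("hybrid_retrieve", limit=50) as span:
    span.attributes["chunk_ids"] = ["c1", "c2"]  # placeholder results
with trace.stage("rerank", top_k=5) as span:
    span.attributes["scores"] = [0.91, 0.40]  # placeholder scores
```

Recording chunk ids and scores per stage is what lets a later regression be pinned to retrieval, fusion, or rerank.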

[HOW WE DEPLOY]

Day one to live traffic.

An eight-week sprint for a first production-grade Advanced RAG build, sized to a corpus under 1M chunks. Longer programs phase the same shape across multi-tenant scope, regulated review, and ongoing eval ops.

01 · Corpus profiling
We characterise the source before we pick a chunker. Document types, average length, table density, structural cues, query distribution. Drives every downstream choice.

02 · Chunking pipeline
Hierarchical or contextual chunking depending on profile. Late chunking (Jina) for long-form when it pays off. Stored alongside the original passage and structural context for citation recovery.

03 · Index build
Embeddings computed in batches, written to pgvector with HNSW. tsvector indexes built for BM25. Metadata filters indexed for tenant + ACL pushdown at query time.

04 · Hybrid retriever + RRF
Retrieval service runs vector and BM25 in parallel, fuses with RRF, returns top-50. Tenant-scoped at the SQL layer, never at the application layer.

05 · Rerank stage
Cross-encoder reranker called on top-50, returns top-K (typically 5-10). Latency budget enforced at this stage; degrade gracefully on slow rerank.

06 · Generation + citation
Citation-required prompt. Structured-output validation on the citation map. Streaming generation with citation hints rendered as the response builds.

07 · Eval suite + CI gate
Golden set of 100-300 representative queries with expected citations. CI fails on Recall@5 or faithfulness regression. Ratchet over time as the prompt + model evolve.

08 · Production observability
Traces shipped to your SIEM with PII scrubbed at the proxy. Cost per call, p50/p95/p99 latency, eval pass rate over production traffic, drift alerts.
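The latency budget on the rerank stage (step 05) could be enforced along these lines. This is a sketch under stated assumptions: the 150 ms budget, the shared thread pool, and the `rerank` callable signature are illustrative, and the fallback shown is simply the fused order, which is one reasonable way to degrade.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, List, Sequence

# Shared pool so a slow rerank call does not block the request thread.
_RERANK_POOL = ThreadPoolExecutor(max_workers=4)


def rerank_within_budget(
    query: str,
    fused_ids: Sequence[str],
    rerank: Callable[[str, Sequence[str]], List[str]],  # returns reordered ids
    top_k: int = 5,
    budget_s: float = 0.15,  # hypothetical budget; set from your latency SLO
) -> List[str]:
    """Return reranked ids, or the fused order if rerank is slow or fails."""
    future = _RERANK_POOL.submit(rerank, query, fused_ids)
    try:
        return list(future.result(timeout=budget_s))[:top_k]
    except FutureTimeout:
        # cancel() cannot stop a call that is already running; its late
        # result is simply discarded.
        future.cancel()
        return list(fused_ids)[:top_k]
    except Exception:
        # Any reranker error degrades to fused order instead of failing.
        return list(fused_ids)[:top_k]
```

Falling back to fused order keeps the answer path alive; the trace should still record that the fallback fired so the eval pass rate can be read in that light.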

[ACCURACY + BENCHMARKS]

What the numbers say.

Published benchmarks plus our internal numbers from production engagements. The reranker is the single biggest quality lever; hybrid is what makes the reranker have something to rank.

+17 pts · Recall@5 vs hybrid-only (0.69 → 0.82) · Galileo, 2026

33-47% · Accuracy gain vs naive RAG · Multiple 2026 reports

~30ms · Added p95 latency vs naive · Our engagements

100% · Answers with inline citations at handoff · Kensink default

Our eval methodology

Our eval suite combines a static golden set (200 representative queries, hand-graded citations) with an LLM-as-judge faithfulness check and a live sample taken weekly from production traffic. Recall@K and MRR gate the index; faithfulness and answer quality gate the prompt and model. We never gate a release on a single metric.
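The index-side metrics can be computed as in the sketch below. It uses one common definition of Recall@K (the share of a query's expected chunks found in the top K) and the standard reciprocal rank of the first relevant hit; the function names, the data shapes, and the 0.01 tolerance in `gate` are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    runs: Dict[str, List[str]],   # query -> ranked retrieved ids
    golden: Dict[str, Set[str]],  # query -> hand-graded expected ids
    k: int = 5,
) -> Dict[str, float]:
    """Average both metrics over the golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    n = len(golden)
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in golden.items())
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in golden.items())
    return {"recall_at_k": recall / n, "mrr": mrr / n}


def gate(current: Dict[str, float], baseline: Dict[str, float],
         tolerance: float = 0.01) -> List[str]:
    """Return the metrics that regressed past tolerance; empty means pass."""
    return [m for m in baseline if current.get(m, 0.0) < baseline[m] - tolerance]
```

A CI job would fail the build whenever `gate` returns a non-empty list, and the baseline would be ratcheted up after each accepted improvement.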

[COMMUNITY FEEDBACK]

What practitioners report.

Advanced RAG is the 2026 production consensus. Almost every major RAG vendor (LangChain, LlamaIndex, Pinecone, Cohere, Weaviate, Qdrant) ships native hybrid retrieval plus rerank as the recommended path.

Practitioners report two consistent wins: the reranker almost always pays for its compute (Cohere's published numbers, third-party reproductions, Anthropic's contextual retrieval write-up), and hybrid (vector + BM25) catches the long tail of exact-term queries that destroy pure-vector recall. The 2026 talk track in the field is no longer whether to do hybrid + rerank; it is which embedding and which reranker.

[COMMON PITFALLS]

Skipping query rewriting on noisy or multi-clause queries. The cheap LLM call before retrieval is often the biggest single recall win.
Over-trusting RRF parameters. The default k=60 is good for most corpora; tuning past that rarely beats reranking gains.
Forgetting citation discipline in the prompt. Without it, the model will invent confident citations that point at the wrong chunk.
Treating the reranker as optional. The Recall@5 lift is too big to leave on the table.

[KENSINK LABS EVALUATION]

Our honest take.

We default to Advanced RAG on most engagements. It is what we describe to customers as 'production RAG' without any qualifiers. When a corpus or query shape pushes past its envelope, we add one of the primary patterns on top rather than replacing it.

The reason it stays the default: every part of the pipeline is independently measurable, debuggable, and replaceable. If retrieval quality drops, you can isolate whether the embeddings drifted, whether BM25 is contributing, whether the reranker is paying off, all without rebuilding the whole system. That kind of legibility is what makes a RAG build maintainable past month six.

[WHEN WE REACH FOR IT]

Production knowledge-base and document Q&A across small-to-mid corpora (under ~10M chunks).
Internal support deflection, where the eval bar is faithfulness against the source.
Compliance-adjacent search where every answer must carry a verifiable citation.
The first build of any RAG engagement, before deciding whether to add Graph or Agentic structure on top.

What we'd substitute

GraphRAG when the answers genuinely require multi-hop reasoning across linked entities. Agentic RAG when queries are heterogeneous enough that one shot of retrieval was always going to be wrong. Both are additions to the Advanced pipeline rather than replacements for it.

[RELATED PATTERNS]

Worth a look next.

Agentic RAG · GraphRAG · Corrective RAG (CRAG)

[COMMON QUESTIONS]

What buyers ask before they sign.

Why hybrid plus rerank instead of just better embeddings?
Embeddings have a recall ceiling on exact-term queries (codes, IDs, rare proper nouns) that no model fully closes. BM25 catches those. The reranker then scores the top-50 of the combined pool against the query directly, so the top-5 passages the model sees are the right ones. Each stage compensates for the previous one's blind spot.

How much does the reranker cost?
Cohere Rerank v3 is roughly $1 per 1000 calls and adds ~20-30ms p95. For most production RAG that is the cheapest quality lift you can buy. Self-hosted BGE-reranker-v2-m3 is free at the API level, but you operate a GPU.

Can we run advanced RAG without pgvector?
Yes. Qdrant, Weaviate, Milvus, Vespa, and Pinecone all support the same hybrid-plus-rerank shape. We default to pgvector because for most corpora it removes a system, not because the others are wrong. See the vector-databases page for the decision matrix.

How do we evaluate quality after launch?
A golden set of 200-300 queries with hand-graded expected citations, plus a live weekly sample from production with LLM-as-judge faithfulness. Both gate prompt and model changes; the live sample also watches for drift.
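As a quick worked example of the reranker pricing quoted above (roughly $1 per 1000 calls), the arithmetic below estimates a monthly bill. The traffic figure and the one-rerank-call-per-query assumption are hypothetical; substitute your own volume and the current price list.

```python
def monthly_rerank_cost(
    queries_per_day: int,
    price_per_1000_calls: float = 1.0,  # the rough figure quoted above
    days: int = 30,
) -> float:
    """Estimated monthly spend, assuming one rerank call per user query."""
    calls = queries_per_day * days
    return calls * price_per_1000_calls / 1000.0


# Hypothetical volume: 20,000 queries a day is 600,000 calls a month.
estimate = monthly_rerank_cost(20_000)  # 600.0 (dollars)
```

Pipelines that rerank once per rewritten sub-query rather than once per user question would multiply the call count accordingly.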

DIRECT RAG · APPLIED K

Bring the corpus. We'll bring the build.

Senior engineers, eval suite at handoff, full source ownership. We integrate against the model and the index the same way we integrate against Postgres. Sized to the work in front of you.
