★ Retrieval pipeline · Direct LLM · no framework · Production grade

RAG · RETRIEVAL PIPELINE · FOUR LAYERS

The four layers retrieval lives in. Embed. Chunk. Hybrid retrieve. Rerank.

The model is the smaller engineering problem. The real work is upstream: picking the embedding model against your real query distribution, chunking to preserve meaning, fusing dense and lexical results with RRF, and reranking the top-K with a cross-encoder. This is what 2026 production RAG actually looks like.

Cohere · OpenAI · BGE-M3 · pgvector · ColBERT · Eval pipelines

Start a conversation → · All RAG topics →

Layers

4 (Embed · Chunk · Retrieve · Rerank)

Default

Cohere v3 · pgvector · BM25 · RRF · Cohere Rerank v3

Recall@5 lift

+17% relative (≈12 pts absolute) from reranking alone

Discipline

Every layer eval-gated

[THE PIPELINE]

Every query, four retrieval stages, then generation.

The order matters. Each layer is a place where quality is won or lost. We measure each one separately, by name.

What we run on every production query.


01 Embed

Cohere v3 / OpenAI / BGE-M3

02 Chunk + index

Contextual + late chunking

03 Hybrid retrieve

pgvector + BM25 → RRF

04 Rerank

Cohere Rerank v3 / BGE / ColBERT

05 Generate + cite

LLM with citation discipline
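A minimal sketch of how the query-time stages compose, assuming each stage is passed in as a plain callable. Every name here (`answer_query`, `dense_search`, `lexical_search`, `fuse`, `rerank`, `generate`) is hypothetical. Stages 01 and 02 run at index time, so they don't appear in the query path.

```python
from typing import Callable, List, Sequence

def answer_query(
    query: str,
    dense_search: Callable[[str, int], List[str]],    # stage 03, vector side
    lexical_search: Callable[[str, int], List[str]],  # stage 03, BM25 side
    fuse: Callable[[Sequence[List[str]]], List[str]], # stage 03, e.g. RRF
    rerank: Callable[[str, List[str]], List[str]],    # stage 04
    generate: Callable[[str, List[str]], str],        # stage 05
    top_k: int = 20,
    top_n: int = 5,
) -> str:
    """Run retrieve -> fuse -> rerank -> generate for one query."""
    dense_hits = dense_search(query, top_k)
    lexical_hits = lexical_search(query, top_k)
    candidates = fuse([dense_hits, lexical_hits])
    passages = rerank(query, candidates)[:top_n]
    return generate(query, passages)
```

Keeping each stage behind its own callable is what makes it practical to measure and swap layers one at a time.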

01 · Embedding model selection.

Five embedding models cover the 2026 production landscape. Pick by your query distribution, your residency requirements, and your context length, not the leaderboard.

Model	Dimensions	Context length	Best for	Our take
Cohere embed v3 (closed source · API · multilingual)	1024	512 tokens	Multilingual production, balanced quality/cost	Default for new builds in 2026. Strong multilingual, predictable cost, ranks high on MTEB and holds up across our customer query distributions.
OpenAI text-embedding-3-large (closed source · API)	3072 (or 1024 dimensionality-reduced)	8191 tokens	Long-document embedding, high-quality English	Strong second choice, especially when long context per chunk matters. Watch the cost at scale.
BGE-M3 (open source · multi-vector · self-host)	1024 (dense) + sparse + multi-vector	8192 tokens	Self-hosted multilingual, multi-mode (dense + sparse + multi-vector) retrieval	When the data can't leave the VPC and we need dense + sparse + multi-vector in one model. Pair with BGE-reranker.
Jina v3 (open source · API + self-host)	1024	8192 tokens	Late chunking, balanced quality	Pick when late chunking is part of the design. Jina has the cleanest late-chunking story in 2026.
Voyage v3 (closed source · API)	1024	32k tokens	Long-context, technical domains	Strong on technical / code / scientific corpora. Cost-competitive with Cohere v3.

We always run a head-to-head on your real queries before committing. Leaderboard winners often lose on niche domains.
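A minimal sketch of that head-to-head, assuming you already have labelled queries and each candidate model's ranked results. The query IDs, document IDs and model names below are hypothetical placeholders, not real results.

```python
from typing import Dict, List, Set

def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def mean_recall_at_k(runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> float:
    """Average Recall@k over every labelled query."""
    scores = [recall_at_k(runs.get(qid, []), rel, k) for qid, rel in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical labels and rankings, for illustration only.
gold = {"q1": {"d1"}, "q2": {"d4", "d5"}}
runs_by_model = {
    "model_a": {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d9", "d5"]},
    "model_b": {"q1": ["d2", "d3", "d1"], "q2": ["d9", "d8", "d4"]},
}
scores = {name: mean_recall_at_k(runs, gold, k=2) for name, runs in runs_by_model.items()}
```

The same two functions can gate every later layer: rerun them after changing the chunker, the fusion step or the reranker, and compare.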

[02 · CHUNKING]

Six chunking strategies.

The 2024-2025 advances (late chunking, contextual retrieval) meaningfully changed the playing field. We default to recursive + contextual on most document-heavy corpora.

01 · Baseline

Fixed-size

Slide a fixed-token window over the doc. Simple, predictable, weak. Loses semantic boundaries. The fallback when we don't know better.
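A minimal sketch of the sliding window, assuming whitespace-split words stand in for real tokenizer tokens. The `overlap` parameter is an illustrative addition, not something the baseline requires.

```python
from typing import List

def fixed_size_chunks(tokens: List[str], size: int, overlap: int = 0) -> List[List[str]]:
    """Slide a window of `size` tokens over the input, stepping by size - overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks

# Assumption: whitespace tokens, not model tokens.
chunks = fixed_size_chunks("a b c d e f g".split(), size=3, overlap=1)
```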

02 · Default first pass

Recursive character

Split on document structure first (headings, paragraphs, sentences), back off to characters only if needed. Preserves natural boundaries.
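A rough sketch of the back-off, assuming plain text and a hypothetical separator ladder (blank line, newline, sentence end, space). Production splitters also handle headings and usually keep the separators, which this sketch drops.

```python
from typing import List, Sequence

def recursive_split(
    text: str,
    max_chars: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for piece in text.split(sep):
        candidate = piece if not buffer else buffer + sep + piece
        if len(candidate) <= max_chars:
            buffer = candidate  # keep merging small neighbours
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= max_chars:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks

text = "Intro paragraph.\n\nSecond paragraph is longer. It has two sentences."
chunks = recursive_split(text, max_chars=30)
```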

03 · When boundaries matter

Semantic chunking

Embed sentences, split where embeddings diverge. Better at keeping a single idea together. Costs more to build the index.
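One way to sketch "split where embeddings diverge": compare each sentence to the previous one and start a new chunk when cosine similarity drops below a threshold. The `embed` callable, the threshold and the 2-D toy vectors are all assumptions for illustration; a real build would call an embedding model here.

```python
import math
from typing import Callable, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float,
) -> List[List[str]]:
    """Group consecutive sentences; break where adjacent similarity < threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return chunks

# Toy vectors standing in for a real embedding model (assumption).
toy_vectors = {
    "Reset your password in settings.": (1.0, 0.0),
    "Password resets expire after an hour.": (0.9, 0.1),
    "Invoices are issued monthly.": (0.0, 1.0),
}
chunks = semantic_chunks(list(toy_vectors), toy_vectors.__getitem__, threshold=0.5)
```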

04 · When context matters

Parent-child / hierarchical

Embed small child chunks for precise retrieval, return larger parent chunks to the LLM for context. Best of both worlds.
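A minimal sketch of the bookkeeping, assuming children are fixed-size word windows and the vector search has already returned child indices. The function names and sample data are hypothetical.

```python
from typing import Dict, List, Tuple

def build_children(parents: Dict[str, str], child_size: int) -> List[Tuple[str, str]]:
    """Return (parent_id, child_text) pairs; children are windows of `child_size` words."""
    children: List[Tuple[str, str]] = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append((parent_id, " ".join(words[start:start + child_size])))
    return children

def parents_for_hits(
    child_hits: List[int],
    children: List[Tuple[str, str]],
    parents: Dict[str, str],
) -> List[str]:
    """Map retrieved child indices to parent texts, de-duplicated, in hit order."""
    seen = set()
    out: List[str] = []
    for idx in child_hits:
        parent_id = children[idx][0]
        if parent_id not in seen:
            seen.add(parent_id)
            out.append(parents[parent_id])
    return out

parents = {"p1": "alpha beta gamma delta", "p2": "epsilon zeta"}
children = build_children(parents, child_size=2)   # embed these
context = parents_for_hits([1, 0, 2], children, parents)  # send these to the LLM
```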

05 · 2024 advance

Late chunking (Jina)

Run the whole doc through a long-context embedding model, THEN pool the token-level embeddings chunk by chunk. Each chunk inherits the doc's context. ~5-10 pts retrieval gain on published evals.
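The pooling step can be sketched as below, assuming the long-context model has already produced one vector per token and the chunk boundaries are known as token spans. Mean pooling is an assumption here, and the tiny 2-D vectors are made up for illustration.

```python
from typing import List, Sequence, Tuple

def late_chunk(
    token_vectors: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool token embeddings over each [start, end) span to get chunk vectors."""
    pooled: List[List[float]] = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(vec[i] for vec in window) / len(window) for i in range(dim)])
    return pooled

# Hypothetical token embeddings from one forward pass over the whole document.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
chunk_vectors = late_chunk(token_vectors, spans=[(0, 2), (2, 4)])
```

Because every token vector was computed with the full document in view, each pooled chunk vector carries context from outside its own span.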

06 · 2024 advance

Contextual retrieval (Anthropic)

Use an LLM to add a one-sentence context preface to each chunk before embedding (e.g. "This chunk discusses Q3 2024 revenue from the ACME annual report"). Anthropic's published evals report a ~35% reduction in top-20 retrieval failures from contextual embeddings alone.
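A hedged sketch of the indexing-time step, with the LLM call injected as a plain `generate` callable. The prompt wording and function names are illustrative assumptions, not Anthropic's published prompt.

```python
from typing import Callable, List

# Illustrative prompt; tune the wording for your own corpus.
PROMPT_TEMPLATE = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write one sentence that situates this chunk within the document."
)

def contextualize_chunks(
    document: str,
    chunks: List[str],
    generate: Callable[[str], str],
) -> List[str]:
    """Prefix each chunk with LLM-written context; embed the returned strings."""
    contextualized: List[str] = []
    for chunk in chunks:
        prompt = PROMPT_TEMPLATE.format(document=document, chunk=chunk)
        context = generate(prompt).strip()
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized
```

The same contextualized strings can feed both the embedding index and the BM25 index, so the added context helps the lexical side too.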

[03 · HYBRID · 04 · RERANK]

Two stages do the heavy lifting.

Dense + lexical fused, then cross-encoder rerank. The 2026 default for every production build.

03 · Hybrid retrieve

Vector + BM25, fused with reciprocal-rank fusion.

Dense (pgvector with HNSW or a dedicated VDB) catches the semantic matches BM25 misses. BM25 catches the exact-term matches embeddings miss: product codes, names, error messages, citation IDs. RRF merges the two ranked lists using rank positions only, so the different score scales never need normalising. The fused top-K goes to stage 04.
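Reciprocal-rank fusion is small enough to sketch in full: each document scores the sum of 1 / (k + rank) over every list it appears in. The constant k = 60 is the commonly used default, and the document IDs below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    top_k: int = 10,
) -> List[str]:
    """Fuse ranked lists by summing 1 / (k + rank); only ranks are used, never raw scores."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_k]

dense_hits = ["d1", "d2", "d3"]  # e.g. from the vector index
bm25_hits = ["d3", "d1", "d4"]   # e.g. from the lexical index
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents that rank well in both lists rise to the top, while a document seen by only one retriever still survives into the candidate set.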

04 · Cross-encoder rerank

+17% relative Recall@5 (≈12 pts absolute) on the published benchmarks.

The top-K from stage 03 goes through a reranker: Cohere Rerank v3 by default, BGE-reranker as the open-source alternative, ColBERT when late interaction fits better. Cross-encoders see the query AND the document together (bi-encoders see them separately), so they catch nuance the first stage can't. The cost is ~30ms p95 latency and per-query API fees. Almost always worth it.
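The rerank step itself is a score-and-sort over (query, document) pairs. In this sketch the scorer is injected as a callable; `toy_overlap_score` is a hypothetical stand-in so the example runs on its own, and is NOT a cross-encoder. In production that callable would wrap a rerank API or a self-hosted model.

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair jointly and keep the best top_n."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query terms found in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0

top = rerank(
    "reset password",
    ["billing faq", "how to reset a password", "password policy"],
    toy_overlap_score,
    top_n=2,
)
```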

[WHAT YOU GET]

What lives at handoff.

+17%

Relative Recall@5 lift from rerank alone (≈12 pts absolute)

Hybrid

Dense + BM25 with RRF

Eval-gated

Each layer measured separately

Tuned

Embedding model to your queries

[COMMON QUESTIONS]

What buyers ask before they sign.

Why hybrid search instead of pure vector?

Pure dense embeddings miss exact-term matches (product codes, names, error messages, regulatory citations) where lexical wins. BM25 misses semantic matches ("how do I reset my password" vs "account recovery procedure") where dense wins. Hybrid catches both. Reciprocal-rank fusion (RRF) merges the two ranked lists by rank position, so the different score scales never need normalising. The 2026 production consensus is hybrid + rerank.

How much does reranking actually help?

Substantially. Published 2026 benchmarks show two-stage hybrid + cross-encoder rerank lifting Recall@5 from ~0.695 to ~0.816. That is roughly a 17% relative lift (about 12 points absolute), and it feeds directly into faithfulness gains downstream. The cost is ~30ms p95 latency and per-query rerank API fees. Almost always worth it.

Cohere Rerank vs BGE-reranker vs ColBERT?

Cohere Rerank v3 is the API-based default: fastest to integrate, excellent quality, multilingual. BGE-reranker is the open-source pick when latency or data residency matters and you can run a model in your VPC. ColBERT does late-interaction reranking inside the retrieval step itself; pick it when relevance matters more than latency.

When does contextual retrieval (Anthropic) earn the build?

When the corpus has many similar-looking documents that share keys but differ in context (annual reports across years, contracts across versions, customer tickets across products). The one-sentence LLM-generated context per chunk makes each chunk uniquely identifiable. Anthropic reported a ~35% reduction in retrieval failures from contextual embeddings; we typically see similar improvements on document-heavy corpora.

Do you use HyDE in production?

Selectively. HyDE (hypothetical document embeddings) helps in specialist domains where query and document vocabulary diverge: medical, legal, code. Always pair it with rerank to cut down hallucinated-hypothesis noise. We don't use it on broad-domain corpora where the hypothetical answer is more likely to be wrong than helpful.
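For reference, HyDE reduces to three calls: generate a hypothetical answer, embed it, search with that vector instead of the query's. A minimal sketch, assuming all three are injected callables; the names and the prompt wording are hypothetical.

```python
from typing import Callable, List, Sequence

def hyde_search(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    top_k: int = 10,
) -> List[str]:
    """Search with the embedding of an LLM-written hypothetical answer."""
    prompt = f"Write a short passage that answers the question:\n{query}"
    hypothetical = generate(prompt)
    # The hypothesis may be wrong; rerank the results against the real query.
    return search(embed(hypothetical), top_k)
```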


Bring the corpus. We will tune the pipeline.

Embedding model picked from your real queries, chunking strategy matched to your documents, hybrid + rerank evaluated layer by layer. Citations on every answer.
