The model is the smaller engineering problem. The real work is upstream: picking the embedding model from your real query distribution, chunking to preserve meaning, fusing dense and lexical with RRF, and reranking the top-K with a cross-encoder. This is what 2026 production RAG actually looks like.
The order matters. Each layer is a place where quality is won or lost, so we measure each layer separately and report it by name.
The pipeline has five stages, following 2026 consensus. Embedding and generation are the model-bound steps; the rerank step does most of the quality lift.
01 Embedding: Cohere v3 / OpenAI / BGE-M3
02 Chunking: Contextual + late chunking
03 Hybrid retrieval: pgvector + BM25 → RRF
04 Rerank: Cohere Rerank v3 / BGE / ColBERT
05 Generation: LLM with citation discipline
Five embedding models cover the 2026 production landscape. Pick by your query distribution, your residency requirements, and your context length, not the leaderboard.
| Model | Dimensions | Context length | Best for | Our take |
|---|---|---|---|---|
| Cohere embed v3 (closed source · API · multilingual) | 1024 | 512 tokens | Multilingual production, balanced quality/cost | Default for new builds in 2026. Strong multilingual, predictable cost, ranks high on MTEB and holds up across our customer query distributions. |
| OpenAI text-embedding-3-large (closed source · API) | 3072 (or 1024 dimensionality-reduced) | 8191 tokens | Long-document embedding, high-quality English | Strong second choice, especially when long context per chunk matters. Watch the cost at scale. |
| BGE-M3 (open source · multi-vector · self-host) | 1024 (dense) + sparse + multi-vector | 8192 tokens | Self-hosted multilingual, multimodal pairings | When the data can't leave the VPC and we need dense + sparse + multi-vector in one model. Pair with BGE-reranker. |
| Jina v3 (open source · API + self-host) | 1024 | 8192 tokens | Late chunking, balanced quality | Pick when late chunking is part of the design. Jina has the cleanest late-chunking story in 2026. |
| Voyage v3 (closed source · API) | 1024 | 32k tokens | Long-context, technical domains | Strong on technical / code / scientific corpora. Cost-competitive with Cohere v3. |
We always run a head-to-head on your real queries before committing. Leaderboard winners often lose on niche domains.
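A head-to-head of this kind can be as small as a recall@k loop over labelled queries. The sketch below is illustrative only: `recall_at_k`, `compare`, the query set, the relevance labels, and the two toy retrievers are all hypothetical stand-ins. In a real run, each retriever would query an index built with one candidate embedding model.

```python
from typing import Callable, Dict, List, Set, Tuple

SearchFn = Callable[[str], List[str]]  # query -> ranked list of doc ids


def recall_at_k(search_fn: SearchFn, labelled: Dict[str, Set[str]], k: int = 5) -> float:
    """Mean fraction of the relevant docs that appear in the top-k results."""
    total = 0.0
    for query, relevant in labelled.items():
        retrieved = set(search_fn(query)[:k])
        total += len(relevant & retrieved) / len(relevant)
    return total / len(labelled)


def compare(candidates: Dict[str, SearchFn],
            labelled: Dict[str, Set[str]],
            k: int = 5) -> List[Tuple[str, float]]:
    """Score every candidate retriever on the same queries, best first."""
    scores = {name: recall_at_k(fn, labelled, k) for name, fn in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical labelled queries: query -> ids of the docs that answer it.
labelled = {
    "reset password": {"doc-7"},
    "error E1042": {"doc-3", "doc-9"},
}

# Toy retrievers with canned results, standing in for two embedding models.
results_a = {"reset password": ["doc-7", "doc-1"], "error E1042": ["doc-2", "doc-3"]}
results_b = {"reset password": ["doc-1", "doc-7"], "error E1042": ["doc-3", "doc-9"]}

ranking = compare(
    {"model_a": lambda q: results_a[q], "model_b": lambda q: results_b[q]},
    labelled,
    k=2,
)
print(ranking)
```

The point of the design is that every candidate sees the same queries and the same labels, so the only variable is the embedding model.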
The 2024-2025 advances (late chunking, contextual retrieval) meaningfully changed the playing field. We default to recursive + contextual on most document-heavy corpora.
Fixed-size
Slide a fixed-token window over the doc. Simple, predictable, weak. Loses semantic boundaries. The fallback when we don't know better.
Recursive
Split on document structure first (headings, paragraphs, sentences), back off to characters only if needed. Preserves natural boundaries.
Semantic
Embed sentences, split where embeddings diverge. Better at keeping a single idea together. Costs more to build the index.
Parent-child
Embed small child chunks for precise retrieval, return larger parent chunks to the LLM for context. Best of both worlds.
Late chunking
Embed the whole doc with a long-context model, then pool the token embeddings per chunk. Each chunk inherits the doc's context. ~5-10 pts retrieval gain on published evals.
Contextual retrieval
Use an LLM to add a one-sentence context preface to each chunk before embedding (e.g. "This chunk discusses Q3 2024 revenue from the ACME annual report"). About 35% fewer failed retrievals on Anthropic's published evals.
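The recursive and contextual strategies described above combine naturally: split on structure first, then prefix each chunk with a short context line before embedding. The following is a minimal sketch under stated assumptions, not a production splitter. `recursive_split`, `add_context`, the separator order, and the character budget are illustrative choices, and `describe` is a hypothetical callable standing in for the LLM call that writes the preface.

```python
from typing import Callable, List, Sequence


def recursive_split(text: str, max_chars: int = 200,
                    separators: Sequence[str] = ("\n\n", "\n", ". ", " ")) -> List[str]:
    """Split on the coarsest separator available, backing off to finer ones."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: List[str] = []
        buf = ""
        for piece in text.split(sep):
            candidate = piece if not buf else buf + sep + piece
            if len(candidate) <= max_chars:
                buf = candidate  # keep merging neighbours while they fit
                continue
            if buf:
                chunks.append(buf)
            if len(piece) <= max_chars:
                buf = piece
            else:
                # Piece is still too big: recurse with the finer separators.
                buf = ""
                chunks.extend(recursive_split(piece, max_chars, separators[i + 1:]))
        if buf:
            chunks.append(buf)
        return [c for c in chunks if c.strip()]
    # No separator left: hard character split as the last resort.
    return [text[j:j + max_chars] for j in range(0, len(text), max_chars)]


def add_context(chunks: List[str], doc_title: str,
                describe: Callable[[str, str], str]) -> List[str]:
    """Prefix each chunk with a context line before it is embedded."""
    return [f"{describe(doc_title, chunk)}\n\n{chunk}" for chunk in chunks]


doc = ("Heading\n\nFirst paragraph here.\n\n"
       "Second paragraph is a bit longer than the first one.")
chunks = recursive_split(doc, max_chars=40)

# Stand-in for an LLM call; a real preface would summarise the chunk itself.
contextual = add_context(
    chunks, "ACME annual report",
    lambda title, chunk: f"This chunk is from the {title}.",
)
```

The chunk text returned to the LLM can stay as the original `chunks`; only the embedded text needs the preface.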
Dense + lexical fused, then cross-encoder rerank. The 2026 default for every production build.
Dense (pgvector with HNSW or a dedicated VDB) catches the semantic matches BM25 misses. BM25 catches the exact-term matches embeddings miss: product codes, names, error messages, citation IDs. RRF (reciprocal rank fusion) merges the two ranked lists using rank positions only, so there is no need to normalise their different score scales. Returns top-K to stage 04.
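RRF itself is only a few lines. The sketch below assumes each retriever hands back a ranked list of document ids. The function name `rrf_fuse` and the sample ids are illustrative, and `k=60` is the commonly used smoothing constant rather than a value taken from this pipeline.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[List[str]], k: int = 60, top_k: int = 10) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per doc."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Only ranks are used, so raw dense and BM25 scores never need rescaling.
    fused = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return fused[:top_k]


dense_hits = ["doc-a", "doc-b", "doc-c"]  # e.g. from the vector index
bm25_hits = ["doc-c", "doc-a", "doc-d"]   # e.g. from the lexical index

fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)  # ['doc-a', 'doc-c', 'doc-b', 'doc-d']
```

Documents that appear in both lists accumulate two contributions, which is why `doc-a` and `doc-c` outrank the single-list hits.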
The top-K from stage 03 goes through a cross-encoder (Cohere Rerank v3 by default, BGE-reranker as the open-source alternative, ColBERT for late interaction). Cross-encoders see the query AND the document together (bi-encoders see them separately), so they catch nuance the first stage can't. The cost is ~30ms of added p95 latency and per-query API fees. Almost always worth it.
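The shape of the rerank step can be sketched without committing to a vendor. In the example below, `rerank` and `score_pair` are hypothetical names. `toy_overlap_score` is a deliberately crude stand-in that counts shared words; it is not a cross-encoder, and a real build would replace it with a model or API call that reads the query and the document jointly.

```python
from typing import Callable, List


def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float], top_n: int = 3) -> List[str]:
    """Score every (query, doc) pair jointly and keep the best top_n docs."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)


candidates = [
    "How to reset a user password",
    "Admin console overview",
    "Reset the admin password from the CLI",
]

best = rerank("reset admin password", candidates, toy_overlap_score, top_n=2)
print(best)
```

Because the scorer is called once per candidate, latency grows with the size of the top-K passed in, which is the reason to rerank a short list rather than the whole index.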
Naive, Advanced, Modular, Agentic, GraphRAG, CRAG, Self-RAG. Seven named patterns with the decision tree for picking one.
pgvector, Qdrant, Milvus, Weaviate, Vespa, LanceDB, Pinecone. Honest 2026 comparison and our default.
Proven designs from under 100k chunks to over 1B. The architecture changes with the scale.
PDFs with tables and figures. Vision LLM extraction, ColPali, BGE-M3, court-ready citations.