Small (<100k chunks): Departmental knowledge, internal docs, single-domain support, legal/compliance Q&A.
RAG by corpus scale. Four playbooks. The architecture changes with the size.
A 50k-chunk legal-docs RAG and a 1B-chunk hyperscale search don't share an architecture. Same discipline, very different builds. These are the four named playbooks we ship, with the trade-offs at each transition and the case where each becomes the right answer.
The four-tier map.
Each tier has a recommended vector store and a default retrieval shape. The card colours rotate through the four brand gradients so the row reads as one continuum, not four siloed cards.
Mid (100k – 10M chunks): Multi-product support, multi-corpus knowledge bases, growing SaaS deployments.
Large (10M – 1B chunks): Enterprise document repositories, regulated industries, multi-tenant SaaS at scale.
Massive (1B+ chunks): Hyperscale search products, internet-scale knowledge, multi-region deployments.
Four named playbooks.
Each playbook names the database, the retrieval shape, the rerank, the chunking, and the engagement framing.
Small: <100k chunks
One Postgres holds the source documents, the embeddings, the metadata, and the access control. pgvector with HNSW serves the dense retrieval. Postgres full-text search (tsvector) serves the lexical side; its built-in ranking is ts_rank rather than true BM25, so BM25 proper needs an extension. Reciprocal-rank fusion runs in the app, the top-K goes through Cohere Rerank v3, and citations come back. Almost every legal / compliance / internal-knowledge RAG sits here.
- Database: Postgres + pgvector + HNSW
- Retrieval: pgvector dense + tsvector full-text (BM25 via extension) + RRF
- Rerank: Cohere Rerank v3 on top-50
- Embeddings: Cohere embed v3 or OpenAI text-embedding-3-large
- Operates inside the customer's existing IDP, VPC, and Postgres backups
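The fusion step in this playbook is small enough to sketch in full. The following is a minimal illustration of reciprocal-rank fusion, not our production code: the chunk IDs and the two result lists are made up, and `k=60` is the commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=50):
    """Fuse several best-first lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID so output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]

# Hypothetical inputs: chunk IDs ordered best-first by each retriever.
dense_hits = ["c7", "c2", "c9", "c4"]    # e.g. from the pgvector HNSW query
lexical_hits = ["c2", "c5", "c7", "c1"]  # e.g. from the full-text query

candidates = reciprocal_rank_fusion([dense_hits, lexical_hits], top_n=3)
print(candidates)
```

Chunks that both retrievers agree on rise to the top, which is why the fused list is a better input to the reranker than either list alone. RRF only needs ranks, so the dense and lexical scores never have to be put on a common scale.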
Mid: 100k – 10M chunks
pgvector still works at the bottom of this range with care, but reranking and contextual chunking become non-negotiable. By the top of the range we're typically on Qdrant for dense retrieval (faster p99, lighter memory, better payload filtering), with Postgres still holding the canonical source of truth. Contextual retrieval (Anthropic's chunk-prefix technique) is the chunking default: Anthropic's published evals report roughly 35% fewer failed retrievals from contextual embeddings alone, which translates directly into less reranker work.
- Database: Postgres source-of-truth + Qdrant for vector index
- Retrieval: Qdrant hybrid (native Query API, v1.10+) or Qdrant dense + Postgres full-text
- Rerank: Cohere Rerank v3 or BGE-reranker (on-prem option)
- Chunking: Recursive + contextual retrieval (Anthropic technique)
- Embeddings: Cohere v3, OpenAI 3-large, or BGE-M3 (residency-constrained)
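To make the chunk-prefix idea concrete, here is a minimal sketch under stated assumptions. `ContextualChunk`, `contextualize`, and `stub_describe` are hypothetical names; in a real build `describe` would be an LLM call that writes a sentence or two situating the chunk within its document, and the splitter would be recursive rather than the plain paragraph split used here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ContextualChunk:
    doc_id: str
    original: str      # shown to the LLM and used for citations
    indexed_text: str  # context prefix + original; this is what gets embedded and indexed

def contextualize(doc_id: str, document: str, chunks: List[str],
                  describe: Callable[[str, str], str]) -> List[ContextualChunk]:
    """Prefix each chunk with a short description of where it sits in the document."""
    out = []
    for chunk in chunks:
        context = describe(document, chunk).strip()
        out.append(ContextualChunk(doc_id, chunk, f"{context}\n\n{chunk}"))
    return out

def stub_describe(document: str, chunk: str) -> str:
    # Assumption: stands in for the LLM call. Here it only names the document title.
    title = document.splitlines()[0]
    return f"From '{title}'."

document = (
    "Master Services Agreement\n\n"
    "Termination requires 30 days notice.\n\n"
    "Fees are due net 45."
)
# Stand-in for a recursive splitter: one chunk per paragraph.
chunks = [p for p in document.split("\n\n") if p.strip()]
contextual = contextualize("msa-001", document, chunks, stub_describe)
print(contextual[1].indexed_text)
```

Keeping `original` separate from `indexed_text` means the prefix improves retrieval without leaking generated text into citations.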
Large: 10M – 1B chunks
Dedicated VDB territory. Milvus for pure vector workloads where billion-scale matters; Vespa when hybrid search + reranking + structured filtering need to run in one engine at sub-100ms across the whole stack. Multi-stage retrieval becomes the norm — first stage casts a wide net (recall@200), second stage tightens (rerank to top-20), the LLM sees the final cut. Hierarchical chunking (parent-child) helps the model see context while retrieval stays precise.
- Database: Milvus (vector-dominant) or Vespa (hybrid + rerank in-engine)
- Retrieval: Two-stage (recall-wide → precision-tight) with sharded indexes
- Chunking: Hierarchical (parent-child) for retrieval/context split
- Rerank: ColBERT late-interaction or BGE-reranker at scale
- Index updates: Streaming or batch, with eval gates on every release
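The two-stage shape combined with parent-child chunking can be sketched as follows. This is an illustration only: `first_stage` and `rerank_score` are hypothetical callables standing in for the sharded vector query and the reranker, and the IDs and scores are invented.

```python
from typing import Callable, Dict, List, Tuple

def two_stage_retrieve(
    query: str,
    first_stage: Callable[[str, int], List[str]],
    rerank_score: Callable[[str, str], float],
    child_to_parent: Dict[str, str],
    recall_k: int = 200,
    final_k: int = 20,
) -> Tuple[List[str], List[str]]:
    """Wide recall over child chunks, tight rerank, then expand to parent chunks."""
    # Stage 1: cast a wide net over small, precise child chunks.
    candidates = first_stage(query, recall_k)
    # Stage 2: rerank the candidates and keep only the best few.
    reranked = sorted(candidates, key=lambda cid: rerank_score(query, cid), reverse=True)[:final_k]
    # Context expansion: hand the LLM each parent once, in rerank order.
    parents, seen = [], set()
    for child_id in reranked:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(parent_id)
    return reranked, parents

# Hypothetical toy data.
child_to_parent = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
toy_scores = {"a2": 0.9, "c1": 0.7, "a1": 0.6, "b1": 0.1}

def first_stage(query: str, k: int) -> List[str]:
    return ["a1", "b1", "a2", "c1"][:k]

children, parents = two_stage_retrieve(
    "termination notice period",
    first_stage,
    lambda q, cid: toy_scores[cid],
    child_to_parent,
    recall_k=4,
    final_k=3,
)
print(children, parents)
```

Retrieval stays precise because matching happens on the small child chunks, while the model sees the larger parent sections those children belong to.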
Massive: 1B+ chunks
Yahoo / Bing / Spotify scale. The architecture stops being "a RAG" and becomes a search system. Distributed sharded indexes, query routing, approximate methods, dedicated inference fleets. Vespa is the open-source answer here; custom architectures show up in hyperscaler-internal systems. Cost per million queries becomes the dominant metric, not retrieval quality (which has to be table stakes by this scale). Eval and observability span the whole topology, not just the model call.
- Database: Vespa or custom (sharded, distributed, query-routed)
- Retrieval: Multi-stage with learned routing and approximate methods
- Inference: Dedicated GPU fleet for embeddings + rerank
- Observability: Query distribution, p50/p95/p99 per shard, drift alerts
- Engagement shape: Embedded team or ongoing partnership, not a sprint
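As a rough illustration of the per-shard latency view, the sketch below computes p50/p95/p99 per shard from raw samples using the nearest-rank method. The shard names and latencies are invented, and a production system would more likely use streaming histograms than sort every sample.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]

def latency_by_shard(samples: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group (shard, latency_ms) samples and report p50/p95/p99 for each shard."""
    by_shard = defaultdict(list)
    for shard, latency_ms in samples:
        by_shard[shard].append(latency_ms)
    report = {}
    for shard, values in by_shard.items():
        values.sort()
        report[shard] = {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
    return report

# Hypothetical samples: shard-0 has latencies 1..100 ms, shard-1 has four samples.
samples = [("shard-0", ms) for ms in range(1, 101)]
samples += [("shard-1", ms) for ms in (40, 10, 30, 20)]

report = latency_by_shard(samples)
print(report)
```

Reporting per shard rather than globally is the point: one slow shard can dominate tail latency for every fan-out query while barely moving the fleet-wide median.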
What you get.
What buyers ask before they sign.
- How do I know which tier I'm in?
- Count the chunks you'll actually have in production, not the documents. A typical document splits into 5-20 chunks depending on length. So 10k documents is roughly 50k-200k chunks (upper Small to low Mid). 1M documents is 5M-20M chunks (upper Mid to low Large). Latency targets and query throughput shift you up or down a tier.
- Can pgvector really hold 10M vectors?
- Yes, with HNSW and care. Index build time and memory grow, but recall stays high. We've shipped pgvector in the 5-10M range. Past 10M is where we start the Qdrant conversation in earnest: it's not that pgvector breaks, it's that operational headroom thins and the trade-off tips the other way.
- When does the architecture become a 'search system' instead of 'a RAG'?
- Roughly past 100M chunks, and definitely past 1B. At that scale, retrieval is the system. The LLM is one component of many. Query routing, learned aggregation, approximate methods, dedicated inference fleets, and cost-per-query economics become the dominant engineering concerns.
- What does the eval look like at each tier?
- Small: golden set of ~200 queries with expected citations, run on every PR. Mid: same plus drift detection on real production traffic. Large: same plus per-shard recall metrics and an adversarial set. Massive: continuous eval on a sampled traffic mirror, per-shard health, learned-component A/B tests. The discipline scales; the surface area grows.
- Can you tell me upfront which tier my build is?
- Yes, in about thirty minutes. We need to know your corpus shape (documents, average length, growth rate), your query distribution shape, your latency target, your residency requirements, and your team's ops capacity. Output is a named tier and a one-page architecture sketch.
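The chunk-count arithmetic from the first answer can be written down as a quick estimator. This is a sketch, not a sizing tool: the tier boundaries and the 5-20 chunks-per-document range come from this page, while the function names are ours for illustration, and latency and throughput targets can still move a build up or down a tier.

```python
from typing import Dict, Optional, Tuple

# (tier name, exclusive upper bound in chunks); None means unbounded.
TIERS: Tuple[Tuple[str, Optional[int]], ...] = (
    ("Small", 100_000),
    ("Mid", 10_000_000),
    ("Large", 1_000_000_000),
    ("Massive", None),
)

def tier_for(chunks: int) -> str:
    """Return the first tier whose upper bound is above the chunk count."""
    for name, upper in TIERS:
        if upper is None or chunks < upper:
            return name
    raise ValueError("unreachable: the last tier is unbounded")

def estimate(documents: int, chunks_per_doc: Tuple[int, int] = (5, 20)) -> Dict[str, tuple]:
    """Estimate the chunk range for a corpus and the tiers that range spans."""
    low = documents * chunks_per_doc[0]
    high = documents * chunks_per_doc[1]
    return {"chunks": (low, high), "tiers": (tier_for(low), tier_for(high))}

small_corpus = estimate(10_000)
large_corpus = estimate(1_000_000)
print(small_corpus)
print(large_corpus)
```

When the low and high estimates land in different tiers, measuring the real average chunks per document on a sample of the corpus is usually the fastest way to settle it.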
Worth a look next.
RAG architectures
Naive, Advanced, Modular, Agentic, GraphRAG, CRAG, Self-RAG. Seven named patterns with the decision tree for picking one.
Vector databases
pgvector, Qdrant, Milvus, Weaviate, Vespa, LanceDB, Pinecone. Honest 2026 comparison and our default.
Retrieval pipeline
Embeddings, chunking, hybrid search, reranking. The four layers retrieval quality lives or dies in.
Multimodal RAG
PDFs with tables and figures. Vision LLM extraction, ColPali, BGE-M3, court-ready citations.