★ RAG by corpus scale · Direct LLM · no framework · Production grade

RAG · BY CORPUS SCALE · 4 PLAYBOOKS

RAG by corpus scale. Four playbooks. The architecture changes with the size.

A 50k-chunk legal-docs RAG and a 1B-chunk hyperscale search don't share an architecture. Same discipline, very different builds. These are the four named playbooks we ship across, with the trade-offs at each transition and the case where each becomes the right answer.

pgvector · Qdrant · Milvus · Vespa · Cohere · Eval pipelines

Tiers: 4 (Small · Mid · Large · Massive)
Range: <100k chunks → 1B+ chunks
Default: pgvector at small/mid, dedicated at large/massive
Discipline: Eval on the actual scale, not the demo

The four-tier map.

Each tier has a recommended vector store and a default retrieval shape.

Tier · Small (<100k chunks)
Departmental knowledge, internal docs, single-domain support, legal/compliance Q&A.
Store: Postgres + pgvector
Retrieval: Hybrid (dense + BM25) + RRF + Cohere Rerank

Tier · Mid (100k to 10M chunks)
Multi-product support, multi-corpus knowledge bases, growing SaaS deployments.
Store: Qdrant (Postgres as SoT)
Retrieval: Hybrid + BGE-reranker + contextual chunking

Tier · Large (10M to 1B chunks)
Enterprise document repositories, regulated industries, multi-tenant SaaS at scale.
Store: Milvus or Vespa
Retrieval: Multi-stage (recall-wide → precision-tight) sharded

Tier · Massive (1B+ chunks)
Hyperscale search products, internet-scale knowledge, multi-region deployments.
Store: Vespa or custom
Retrieval: Distributed routed retrieval + GPU rerank fleet
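
The tier ranges in the map can be turned into a quick back-of-envelope check. The sketch below is illustrative only: the function names are hypothetical, the 5 to 20 chunks-per-document range is the rule of thumb quoted in the common questions further down, and treating each boundary as belonging to the higher tier is an assumption.

```python
def estimate_chunks(documents, chunks_per_doc=(5, 20)):
    """Return a (low, high) chunk-count estimate for a document count.

    chunks_per_doc is the page's rule of thumb, not a measured value.
    """
    low, high = chunks_per_doc
    return documents * low, documents * high


def tier_for(chunks):
    """Map a chunk count to one of the four tier names.

    Assumption: a count sitting exactly on a boundary goes to the higher tier.
    """
    if chunks < 100_000:
        return "Small"
    if chunks < 10_000_000:
        return "Mid"
    if chunks < 1_000_000_000:
        return "Large"
    return "Massive"


low, high = estimate_chunks(10_000)
print(low, high, tier_for(low), tier_for(high))
```

Because the estimate is a range, it is worth checking both ends: a corpus whose low and high estimates land in different tiers is a candidate for pricing the migration path upfront.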

[TIER BY TIER]

Four named playbooks.

Each playbook names the database, the retrieval shape, the rerank, the chunking, and the engagement framing.

01 · Tier · Small

<100k chunks

Engagement: Eight-week sprint

One Postgres instance holds the source documents, the embeddings, the metadata, and the access control. pgvector with an HNSW index serves dense retrieval. Postgres full-text search (tsvector) serves lexical retrieval; the built-in ts_rank scoring is not true BM25, so BM25 proper needs an extension. Reciprocal-rank fusion runs in the app, the fused top-K goes through Cohere Rerank v3, and the answer comes back with citations. Almost every legal, compliance, or internal-knowledge RAG sits here.

Database: Postgres + pgvector + HNSW
Retrieval: pgvector dense + tsvector lexical (BM25 via extension) + RRF
Rerank: Cohere Rerank v3 on top-50
Embeddings: Cohere embed v3 or OpenAI text-embedding-3-large
Operates inside the customer's existing IDP, VPC, and Postgres backups

Reference build · Affidavit Mapp (court-ready legal documents)

Pipeline: Query → Embed (Cohere v3) → pgvector HNSW + BM25 (tsvector), side by side → RRF fuse top-50 → Cohere Rerank v3 → LLM + citations
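
The in-app fusion step of this tier can be sketched in a few lines. This is a minimal illustration of reciprocal-rank fusion, not the reference build's code: `rrf_fuse` is a hypothetical name, the chunk ids are made up, and `k=60` is a commonly used smoothing constant that is assumed here rather than taken from the page.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=50):
    """Fuse best-first lists of chunk ids with reciprocal-rank fusion.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    k=60 is an assumed default; top_n=50 mirrors the top-50 handed to rerank.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]


dense_hits = ["c3", "c1", "c7"]    # hypothetical pgvector results
lexical_hits = ["c1", "c9", "c3"]  # hypothetical full-text results
fused = rrf_fuse([dense_hits, lexical_hits])
print(fused)
```

RRF only looks at ranks, so the dense distances and lexical scores never need to be put on a common scale, which is why it is a convenient default when the two retrievers live in the same database but score differently.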

02 · Tier · Mid

100k to 10M chunks

Engagement: Eight-week sprint or two-phase program

pgvector still works at the bottom of this range with care, but reranking and contextual chunking become non-negotiable. By the top of the range we're typically on Qdrant for dense retrieval (faster p99, lighter memory, better payload filtering), with Postgres still holding the canonical source of truth. Contextual retrieval (Anthropic's chunk-prefix technique) is the chunking default. Anthropic's published evals report a 35% reduction in top-20 retrieval failures from contextual embeddings alone, which means the reranker starts from a better candidate set.

Database: Postgres source-of-truth + Qdrant for vector index
Retrieval: Qdrant hybrid (native via the Query API, v1.10+) or Qdrant dense + Postgres BM25
Rerank: Cohere Rerank v3 or BGE-reranker (on-prem option)
Chunking: Recursive + contextual retrieval (Anthropic technique)
Embeddings: Cohere v3, OpenAI 3-large, or BGE-M3 (residency-constrained)

Pipeline: Query (HyDE optional) → Embed (BGE-M3) → Qdrant dense + Postgres BM25, side by side → RRF fuse top-100 → BGE-reranker top-20 → LLM + citations
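
The chunk-prefix idea behind contextual retrieval can be sketched as follows. This is an illustration under stated assumptions: `contextualize_chunks` and `stub_describe` are hypothetical names, and the stub stands in for what would normally be an LLM call that writes a sentence or two situating the chunk within its document.

```python
def contextualize_chunks(document, chunks, describe):
    """Prepend a short situating context to each chunk before indexing.

    describe(document, chunk) -> str is a hypothetical callable; in a real
    build it would typically be an LLM call. The original chunk text is kept
    separately so citations can still quote it verbatim.
    """
    records = []
    for index, chunk in enumerate(chunks):
        context = describe(document, chunk).strip()
        indexed_text = f"{context}\n\n{chunk}" if context else chunk
        records.append(
            {"chunk_index": index, "original": chunk, "indexed_text": indexed_text}
        )
    return records


def stub_describe(document, chunk):
    """Stand-in for the LLM call: uses the first line as a document title."""
    title = document.splitlines()[0]
    return f"From the document titled '{title}'."


doc = "Lease Agreement\nThe tenant pays rent monthly.\nNotice period is 60 days."
records = contextualize_chunks(doc, doc.splitlines()[1:], stub_describe)
print(records[1]["indexed_text"])
```

Both the embedding and the lexical index would be built from `indexed_text`, while `original` is what gets shown to the user and cited.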

03 · Tier · Large

10M to 1B chunks

Engagement: Multi-phase program (12-24 weeks)

Dedicated VDB territory. Milvus for pure vector workloads where billion-scale matters; Vespa when hybrid search + reranking + structured filtering need to run in one engine at sub-100ms across the whole stack. Multi-stage retrieval becomes the norm. The first stage casts a wide net (recall@200), the second tightens (rerank to top-20), and the LLM sees the final cut. Hierarchical chunking (parent-child) helps the model see context while retrieval stays precise.

Database: Milvus (vector-dominant) or Vespa (hybrid + rerank in-engine)
Retrieval: Two-stage (recall-wide → precision-tight) with sharded indexes
Chunking: Hierarchical (parent-child) for retrieval/context split
Rerank: ColBERT late-interaction or BGE-reranker at scale
Index updates: Streaming or batch, with eval gates on every release

Pipeline: Query rewrite → Embed + Plan → Vespa hybrid (sharded) → Top-200 candidates → ColBERT late-interaction → Top-20 to LLM → Generate + cite
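
The retrieval/context split that parent-child chunking provides can be sketched like this. It is a simplified illustration, not the engine-side implementation: the function names are hypothetical, children are cut by a fixed word count purely for brevity, and the ranked child ids are assumed to come from whatever retrieval and rerank stages sit upstream.

```python
def build_parent_child(sections, child_size=40):
    """Split each parent section into fixed-size child chunks (by words).

    sections: dict of parent_id -> full section text.
    Children are what gets embedded and searched; parents are what the
    model eventually reads.
    """
    children = []
    for parent_id, text in sections.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append(
                {
                    "child_id": f"{parent_id}#{start // child_size}",
                    "parent_id": parent_id,
                    "text": " ".join(words[start:start + child_size]),
                }
            )
    return children


def parents_for_hits(ranked_child_ids, children, sections, max_parents=20):
    """Map ranked child hits to unique parent sections, keeping rank order."""
    parent_of = {child["child_id"]: child["parent_id"] for child in children}
    seen, result = set(), []
    for child_id in ranked_child_ids:
        parent_id = parent_of[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            result.append((parent_id, sections[parent_id]))
        if len(result) == max_parents:
            break
    return result


sections = {"s1": "a b c d e f", "s2": "g h i"}
children = build_parent_child(sections, child_size=3)
context = parents_for_hits(["s1#1", "s2#0", "s1#0"], children, sections)
print(context)
```

Deduplicating on the parent id matters at this tier: several children of the same section often rank together, and sending the parent once keeps the top-20 budget for distinct context.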

04 · Tier · Massive

1B+ chunks

Engagement: Long-form program with platform team partnership

Yahoo / Bing / Spotify scale. The architecture stops being "a RAG" and becomes a search system. Distributed sharded indexes, query routing, approximate methods, dedicated inference fleets. Vespa is the open-source answer here; custom architectures show up in hyperscaler-internal systems. Cost-per-million-queries becomes the dominant metric, not retrieval quality (which has to be table-stakes by this scale). Eval and observability span the whole topology, not just the model call.

Database: Vespa or custom (sharded, distributed, query-routed)
Retrieval: Multi-stage with learned routing and approximate methods
Inference: Dedicated GPU fleet for embeddings + rerank
Observability: Query distribution, p50/p95/p99 per shard, drift alerts
Engagement shape: Embedded team or ongoing partnership, not a sprint

Pipeline: Query router → Shard A, Shard B, Shard N → Distributed hybrid retrieve → Learned aggregator → Rerank fleet (GPU) → Cached generation → Cite + return
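
The gather half of sharded retrieval can be sketched as a plain score merge. This is a deliberately simple stand-in for the learned aggregator named in the pipeline, not a description of it: `merge_shard_results` is a hypothetical name, the shard names and scores are made up, and the sketch assumes scores are comparable across shards, which in practice has to be engineered.

```python
import heapq


def merge_shard_results(shard_results, k=200):
    """Merge per-shard top-k lists into one global top-k by score.

    shard_results: dict of shard name -> list of (score, chunk_id).
    Assumes scores are comparable across shards. Returns a list of
    (score, chunk_id, shard) tuples, highest score first.
    """
    candidates = (
        (score, chunk_id, shard)
        for shard, hits in shard_results.items()
        for score, chunk_id in hits
    )
    return heapq.nlargest(k, candidates, key=lambda hit: hit[0])


shard_results = {
    "shard-a": [(0.9, "a1"), (0.4, "a2")],
    "shard-b": [(0.8, "b1"), (0.7, "b2")],
}
merged = merge_shard_results(shard_results, k=3)
print(merged)
```

Keeping the shard name on every hit is a small design choice that pays off in observability: per-shard contribution to the final candidate set is one of the signals that makes drift on a single shard visible.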

[WHAT YOU GET]

What you get.

Right tier: Named before commitment
Eval: On your real scale, not the demo
Headroom: Migration path priced upfront
Cited: Discipline across all four tiers

[COMMON QUESTIONS]

What buyers ask before they sign.

How do I know which tier I'm in?: Count the chunks you'll actually have in production, not the documents. A typical document splits into 5-20 chunks depending on length. So 10k documents is roughly 50k-200k chunks (Small, tipping into low Mid). 1M documents is 5M-20M chunks (Mid, tipping into low Large). Latency targets and query throughput shift you up or down a tier.
Can pgvector really hold 10M vectors?: Yes, with HNSW and care. Index build time and memory grow, but recall stays high. We've shipped pgvector at the 5-10M range. Past 10M is where we start the Qdrant conversation in earnest. pgvector doesn't break at that point, but operational headroom thins and the trade-off swings toward a dedicated store.
When does the architecture become a 'search system' instead of 'a RAG'?: Roughly past 100M chunks, and definitely past 1B. At that scale, retrieval is the system. The LLM is one component of many. Query routing, learned aggregation, approximate methods, dedicated inference fleets, and cost-per-query economics become the dominant engineering concerns.
What does the eval look like at each tier?: Small: golden set of ~200 queries with expected citations, run on every PR. Mid: same plus drift detection on real production traffic. Large: same plus per-shard recall metrics and an adversarial set. Massive: continuous eval on a sampled traffic mirror, per-shard health, learned-component A/B tests. The discipline scales; the surface area grows.
Can you tell me upfront which tier my build is?: Yes, in about thirty minutes. We need to know your corpus shape (documents, average length, growth rate), your query distribution shape, your latency target, your residency requirements, and your team's ops capacity. Output is a named tier and a one-page architecture sketch.
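
The Small-tier eval described above (a golden set with expected citations, run on every PR) can be sketched as a single metric function. This is a minimal illustration under stated assumptions: `citation_recall_at_k` is a hypothetical name, the golden-set record shape is invented for the example, and `retrieve` stands in for the real retrieval pipeline.

```python
def citation_recall_at_k(golden_set, retrieve, k=20):
    """Fraction of golden queries whose expected citations all appear in top-k.

    golden_set: list of {"query": str, "expected": [chunk_id, ...]}.
    retrieve(query) -> best-first list of chunk ids (hypothetical callable).
    Returns (score, failing_queries) so a CI job can report what regressed.
    """
    if not golden_set:
        raise ValueError("golden set is empty")
    passed, failures = 0, []
    for case in golden_set:
        retrieved = set(retrieve(case["query"])[:k])
        if set(case["expected"]) <= retrieved:
            passed += 1
        else:
            failures.append(case["query"])
    return passed / len(golden_set), failures


# Stubbed retrieval results, standing in for the real pipeline.
fake_index = {
    "notice period": ["c4", "c2", "c9"],
    "rent due date": ["c7", "c1"],
}
golden = [
    {"query": "notice period", "expected": ["c2"]},
    {"query": "rent due date", "expected": ["c3"]},
]
score, failing = citation_recall_at_k(golden, lambda q: fake_index[q], k=20)
print(score, failing)
```

In a PR gate, the returned score would be compared against a threshold agreed with the team (the threshold itself is not specified on this page), and the failing queries would be printed so a regression is traceable to specific cases.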

[RELATED RAG TOPICS]

Worth a look next.

01 · RAG

Tell us the corpus shape. We will name the tier.

Thirty-minute call. Output is a tier name, an architecture sketch, and a one-page summary you can take back to the team.

Worth a look next.

RAG architectures

Vector databases

Retrieval pipeline

Multimodal RAG
