Kensink Labs
RAG by corpus scale · Direct LLM, no framework · Production grade
RAG · BY CORPUS SCALE · 4 PLAYBOOKS

RAG by corpus scale. Four playbooks. The architecture changes with the size.

A 50k-chunk legal-docs RAG and a 1B-chunk hyperscale search don't share an architecture. Same discipline, very different builds. These are the four named playbooks we ship, with the trade-offs at each transition and the case where each becomes the right answer.

pgvector · Qdrant · Milvus · Vespa · Cohere · Eval pipelines
Tiers: 4 (Small · Mid · Large · Massive)
Range: <100k chunks → 1B+ chunks
Default: pgvector at Small and the low end of Mid, a dedicated store above that
Discipline: Eval on the actual scale, not the demo

The four-tier map.

Each tier has a recommended vector store and a default retrieval shape.

Tier · Small
<100k chunks

Departmental knowledge, internal docs, single-domain support, legal/compliance Q&A.

Store: Postgres + pgvector
Retrieval: Hybrid (dense + BM25) + RRF + Cohere Rerank

Tier · Mid
100k – 10M chunks

Multi-product support, multi-corpus knowledge bases, growing SaaS deployments.

Store: Qdrant (Postgres as source of truth)
Retrieval: Hybrid + BGE-reranker + contextual chunking

Tier · Large
10M – 1B chunks

Enterprise document repositories, regulated industries, multi-tenant SaaS at scale.

Store: Milvus or Vespa
Retrieval: Multi-stage (recall-wide → precision-tight), sharded

Tier · Massive
1B+ chunks

Hyperscale search products, internet-scale knowledge, multi-region deployments.

Store: Vespa or custom
Retrieval: Distributed routed retrieval + GPU rerank fleet
[TIER BY TIER]

Four named playbooks.

Each playbook names the database, the retrieval shape, the rerank, the chunking, and the engagement framing.

01 · Tier · Small

<100k chunks

Engagement: Eight-week sprint

One Postgres instance holds the source documents, the embeddings, the metadata, and the access control. pgvector with HNSW serves dense retrieval. Postgres full-text search over a tsvector column serves the lexical side (the built-in ts_rank is not BM25, so true BM25 scoring needs an extension). Reciprocal-rank fusion runs in the app, the fused top-K goes through Cohere Rerank v3, and the answer comes back with citations. Almost every legal, compliance, or internal-knowledge RAG sits here.

  • Database: Postgres + pgvector + HNSW
  • Retrieval: pgvector dense + Postgres lexical (tsvector, or BM25 via an extension) + RRF
  • Rerank: Cohere Rerank v3 on top-50
  • Embeddings: Cohere embed v3 or OpenAI text-embedding-3-large
  • Operates inside the customer's existing IDP, VPC, and Postgres backups
Reference build · Affidavit Mapp (court-ready legal documents)
Query → Embed (Cohere v3) → pgvector HNSW + lexical (tsvector) → RRF fuse top-50 → Cohere Rerank v3 → LLM + citations
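
The fusion step in this pipeline can be sketched in a few lines. This is a minimal illustration, not the reference build's code: `rrf_fuse` and the sample chunk ids are hypothetical, and `k=60` is the constant conventionally used for reciprocal-rank fusion, not a value stated here.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=50):
    """Fuse several ranked lists of chunk ids with reciprocal-rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every chunk it returned.
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id so output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


# Hypothetical results from the dense and lexical retrievers.
dense_hits = ["c7", "c2", "c9", "c4"]
lexical_hits = ["c2", "c5", "c7", "c1"]
fused = rrf_fuse([dense_hits, lexical_hits], top_n=3)
print(fused)
```

RRF only needs ranks, not scores, which is why it can combine a cosine-distance list and a lexical list without normalising either one. The fused list is what would be handed to the reranker.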
02 · Tier · Mid

100k – 10M chunks

Engagement: Eight-week sprint or two-phase program

pgvector still works at the bottom of this range with care, but reranking and contextual chunking become non-negotiable. By the top of the range we're typically on Qdrant for dense retrieval (faster p99, lighter memory, better payload filtering), with Postgres still holding the canonical source of truth. Contextual retrieval (Anthropic's chunk-prefix technique) is the chunking default: Anthropic's published evals report a 35% reduction in top-20 retrieval failures from contextual embeddings alone, which translates directly into less reranker work.

  • Database: Postgres source-of-truth + Qdrant for vector index
  • Retrieval: Qdrant hybrid (native via the Query API, v1.10+) or Qdrant dense + Postgres BM25
  • Rerank: Cohere Rerank v3 or BGE-reranker (on-prem option)
  • Chunking: Recursive + contextual retrieval (Anthropic technique)
  • Embeddings: Cohere v3, OpenAI 3-large, or BGE-M3 (residency-constrained)
Query (HyDE optional) → Embed (BGE-M3) → Qdrant dense + Postgres BM25 → RRF fuse top-100 → BGE-reranker top-20 → LLM + citations
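
The chunking default for this tier (recursive splitting plus a contextual prefix) might look roughly like the sketch below. Everything named here is illustrative: `recursive_split`, `contextualize`, and `stub_describe` are hypothetical helpers, and `stub_describe` stands in for the LLM call that would write a short description situating each chunk within its document.

```python
from typing import Callable, List, Sequence


def recursive_split(text: str, max_chars: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ")) -> List[str]:
    """Split on the coarsest separator first; recurse to finer ones for oversize pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: hard-cut at the size limit.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for piece in text.split(head):
        candidate = piece if not buffer else buffer + head + piece
        if len(candidate) <= max_chars:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= max_chars:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, rest))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks


def contextualize(document: str, chunks: List[str],
                  describe: Callable[[str, str], str]) -> List[str]:
    """Prefix each chunk with a short description of where it sits in the document."""
    return [f"{describe(document, chunk)}\n\n{chunk}" for chunk in chunks]


def stub_describe(document: str, chunk: str) -> str:
    # Placeholder for the LLM call (assumption): names the document by its first line.
    return f"From the document titled '{document.splitlines()[0]}'."


document = ("Refund policy\n\nRefunds are issued within 30 days.\n\n"
            "Shipping policy\n\nOrders ship in 2 days.")
chunks = recursive_split(document, max_chars=60)
indexed_texts = contextualize(document, chunks, stub_describe)
```

The prefixed text is what gets embedded and lexically indexed; the original chunk is what gets cited. Keeping the two separate means the prefix can be regenerated without touching the source of truth.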
03 · Tier · Large

10M – 1B chunks

Engagement: Multi-phase program (12-24 weeks)

Dedicated VDB territory. Milvus for pure vector workloads where billion-scale matters; Vespa when hybrid search + reranking + structured filtering need to run in one engine at sub-100ms across the whole stack. Multi-stage retrieval becomes the norm — first stage casts a wide net (recall@200), second stage tightens (rerank to top-20), the LLM sees the final cut. Hierarchical chunking (parent-child) helps the model see context while retrieval stays precise.

  • Database: Milvus (vector-dominant) or Vespa (hybrid + rerank in-engine)
  • Retrieval: Two-stage (recall-wide → precision-tight) with sharded indexes
  • Chunking: Hierarchical (parent-child) for retrieval/context split
  • Rerank: ColBERT late-interaction or BGE-reranker at scale
  • Index updates: Streaming or batch, with eval gates on every release
Query rewrite → Embed + Plan → Vespa hybrid (sharded) → Top-200 candidates → ColBERT late-interaction → Top-20 to LLM → Generate + cite
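
A toy version of the two-stage shape with parent-child chunks is sketched below. The scorers are stand-ins chosen so the example runs on its own: `token_overlap` plays the cheap first-stage retriever and `phrase_bonus` plays the expensive reranker (a ColBERT or BGE model in a real build). All function names and the sample data are hypothetical.

```python
def token_overlap(query, text):
    """Cheap first-stage score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def phrase_bonus(query, text):
    """Stand-in for a precise reranker: overlap plus a bonus for an exact phrase match."""
    return token_overlap(query, text) + (2 if query.lower() in text.lower() else 0)


def two_stage_retrieve(query, children, parents, recall_k=200, final_k=20):
    # Stage 1: cast a wide net over the small child chunks.
    candidates = sorted(children, key=lambda c: token_overlap(query, c["text"]),
                        reverse=True)[:recall_k]
    # Stage 2: rerank only the candidates with the more expensive scorer.
    reranked = sorted(candidates, key=lambda c: phrase_bonus(query, c["text"]),
                      reverse=True)[:final_k]
    # Expand children to their parents so the model sees surrounding context,
    # de-duplicating while preserving rank order.
    seen, context = set(), []
    for child in reranked:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            context.append(parents[child["parent_id"]])
    return context


parents = {
    "p1": "Section 4: Termination. Either party may terminate with 30 days notice. "
          "Fees are non-refundable.",
    "p2": "Section 9: Notices. Notices must be sent by email.",
}
children = [
    {"id": "c1", "parent_id": "p1", "text": "Either party may terminate with 30 days notice."},
    {"id": "c2", "parent_id": "p1", "text": "Fees are non-refundable."},
    {"id": "c3", "parent_id": "p2", "text": "Notices must be sent by email."},
]
context = two_stage_retrieve("terminate with 30 days notice", children, parents,
                             recall_k=2, final_k=2)
```

Retrieval matches on the small child chunks, while the model receives the larger parent passage, which is the retrieval/context split the hierarchical chunking bullet describes.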
04 · Tier · Massive

1B+ chunks

Engagement: Long-form program with platform team partnership

Yahoo / Bing / Spotify scale. The architecture stops being "a RAG" and becomes a search system. Distributed sharded indexes, query routing, approximate methods, dedicated inference fleets. Vespa is the open-source answer here; custom architectures show up in hyperscaler-internal systems. Cost-per-million-queries becomes the dominant metric, not retrieval quality (which has to be table-stakes by this scale). Eval and observability span the whole topology, not just the model call.

  • Database: Vespa or custom (sharded, distributed, query-routed)
  • Retrieval: Multi-stage with learned routing and approximate methods
  • Inference: Dedicated GPU fleet for embeddings + rerank
  • Observability: Query distribution, p50/p95/p99 per shard, drift alerts
  • Engagement shape: Embedded team or ongoing partnership, not a sprint
Query router → Shard A / Shard B / Shard N → Distributed hybrid retrieve → Learned aggregator → Rerank fleet (GPU) → Cached generation → Cite + return
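
The scatter-gather core of this topology can be illustrated with a small sketch. It is deliberately simplified: the router fans out to every shard, and the learned aggregator is replaced by a plain top-k merge that assumes scores are comparable across shards. `search_shard`, `scatter_gather`, and the in-memory shards are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query_terms, k):
    """Hypothetical per-shard search: score is the number of query terms found in the text."""
    scored = [(sum(term in text for term in query_terms), doc_id)
              for doc_id, text in shard.items()]
    return heapq.nlargest(k, scored)


def scatter_gather(shards, query_terms, k):
    # Scatter: query every shard in parallel, each returning its local top-k.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, query_terms, k), shards))
    # Gather: the global top-k is always contained in the union of local top-k lists.
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))


shards = [
    {"a1": "vector search at scale", "a2": "cooking pasta"},
    {"b1": "hybrid search and vector index", "b2": "search"},
]
top = scatter_gather(shards, ["vector", "search", "index"], k=2)
print(top)
```

Each shard only ships k results back, so network cost per query grows with the number of shards contacted, not with corpus size. That is one reason routing to a subset of shards matters for cost-per-query at this tier.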
[WHAT YOU GET]

What you get.

Right tier: Named before commitment
Eval: On your real scale, not the demo
Headroom: Migration path priced upfront
Cited: Discipline across all four tiers
[COMMON QUESTIONS]

What buyers ask before they sign.

How do I know which tier I'm in?
Count the chunks you'll actually have in production, not the documents. A typical document yields 5-20 chunks depending on length. So 10k documents is roughly 50k-200k chunks (Small, possibly crossing into low Mid). 1M documents is 5M-20M chunks (Mid, possibly crossing into low Large). Latency targets and query throughput can shift you up or down a tier.
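
As a rough worked version of that arithmetic, the sketch below maps a document count to a chunk range and a tier. The tier boundaries come from the four-tier map; the function names, and the choice to place an exact boundary value in the higher tier, are assumptions made for illustration.

```python
# (name, exclusive upper bound in chunks); None means unbounded.
TIERS = [
    ("Small", 100_000),
    ("Mid", 10_000_000),
    ("Large", 1_000_000_000),
    ("Massive", None),
]


def tier_for(chunks: int) -> str:
    """Return the tier whose range contains the given chunk count."""
    for name, upper in TIERS:
        if upper is None or chunks < upper:
            return name
    raise ValueError("unreachable: the last tier is unbounded")


def estimate(documents: int, chunks_per_doc=(5, 20)):
    """Estimate the low and high chunk counts and the tier each one lands in."""
    low = documents * chunks_per_doc[0]
    high = documents * chunks_per_doc[1]
    return low, high, tier_for(low), tier_for(high)


print(estimate(10_000))      # 10k documents
print(estimate(1_000_000))   # 1M documents
```

When the low and high estimates land in different tiers, the growth rate and latency target usually decide which playbook to start from.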
Can pgvector really hold 10M vectors?
Yes, with HNSW and care. Index build time and memory grow, but recall stays high. We've shipped pgvector in the 5-10M range. Approaching 10M is where we start the Qdrant conversation in earnest: it's not that pgvector breaks, it's that operational headroom thins and the trade-off swings.
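
To put a number on "memory grows", here is a back-of-envelope estimate of raw vector storage alone. It assumes 4-byte float values and a 1024-dimension embedding (both assumptions for illustration); the HNSW graph and Postgres row overhead come on top, so treat the result as a lower bound.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the vectors themselves, excluding any index structure."""
    return num_vectors * dimensions * bytes_per_value


size = raw_vector_bytes(10_000_000, 1024)
size_gib = size / 2**30
print(f"{size:,} bytes is about {size_gib:.1f} GiB")
```

At 10M vectors this is already tens of gigabytes before the index is counted, which is the practical meaning of thinning operational headroom on a shared Postgres instance.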
When does the architecture become a 'search system' instead of 'a RAG'?
Roughly past 100M chunks, and definitely past 1B. At that scale, retrieval is the system. The LLM is one component of many. Query routing, learned aggregation, approximate methods, dedicated inference fleets, and cost-per-query economics become the dominant engineering concerns.
What does the eval look like at each tier?
Small: golden set of ~200 queries with expected citations, run on every PR. Mid: same plus drift detection on real production traffic. Large: same plus per-shard recall metrics and an adversarial set. Massive: continuous eval on a sampled traffic mirror, per-shard health, learned-component A/B tests. The discipline scales; the surface area grows.
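
A minimal sketch of the Small-tier golden-set check follows, assuming each golden case lists the citations a correct answer must draw on. The `evaluate` function, the case format, and the stub retriever are all hypothetical; a real run would call the production retrieval path and fail the PR when recall drops below an agreed threshold.

```python
def evaluate(golden_set, retrieve, k=5):
    """Compute mean recall@k of expected citations and list queries that missed any."""
    recalls, failed = [], []
    for case in golden_set:
        retrieved = set(retrieve(case["query"])[:k])
        expected = set(case["expected_citations"])
        recall = len(retrieved & expected) / len(expected)
        recalls.append(recall)
        if recall < 1.0:
            failed.append(case["query"])
    return {"mean_recall": sum(recalls) / len(recalls), "failed": failed}


golden_set = [
    {"query": "q1", "expected_citations": ["d1", "d2"]},
    {"query": "q2", "expected_citations": ["d3"]},
]

# Stub retriever (assumption): canned results keyed by query.
canned = {"q1": ["d1", "d9", "d2"], "q2": ["d4", "d5"]}
report = evaluate(golden_set, lambda query: canned[query], k=5)
print(report)
```

The same report shape extends to the larger tiers by grouping cases per shard or per traffic segment; the `failed` list is what a reviewer reads first.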
Can you tell me upfront which tier my build is?
Yes, in about thirty minutes. We need to know your corpus shape (documents, average length, growth rate), your query distribution shape, your latency target, your residency requirements, and your team's ops capacity. Output is a named tier and a one-page architecture sketch.
DIRECT RAG · APPLIED K

Tell us the corpus shape. We will name the tier.

Thirty-minute call. Output is a tier name, an architecture sketch, and a one-page summary you can take back to the team.