★ Naive RAG · Specialised pattern · Eval-gated

NAIVE RAG · THE BASELINE EVERYONE STARTS WITH

Naive RAG. The fastest thing that can possibly work.

Embed the query, fetch the top-K most similar chunks, paste into the prompt, generate the answer. No rewriting, no fusion, no reranking, no citations. The baseline every RAG project starts from before earning its way to anything fancier.

pgvector · OpenAI · Cosine similarity

Build time: Hours, not weeks
Stack: Postgres + pgvector
Accuracy ceiling: Low on noisy or large corpora
Earns the build: Narrow scope · clean corpus

[AT A GLANCE]

Best for: Internal FAQ bots, narrow-scope chatbots, single-domain search over a small clean corpus. Always the right starting point to measure against; rarely the right endpoint.

Origin: Lewis et al., RAG (2020)
Year: 2020
Complexity: Simple
Production stage: Mature

[THE PIPELINE]

Embed, fetch, generate.

The naive pipeline is what fits on a slide. Embed the query with the same model used for the corpus, find the K nearest neighbours by cosine similarity, paste them into the prompt, generate. Every later RAG pattern is some refinement of one of these three steps.

Embed query

Same embedding model as the corpus. Output is a fixed-dimension vector (1536 for OpenAI text-embedding-3-small, 3072 for text-embedding-3-large, etc.).

Top-K vector search

Cosine similarity over the embedding index. HNSW or IVFFlat on pgvector at small scale; brute-force (exact) search on very small corpora. Returns the top-K passages (typical K is 3-10).

Generate

Paste passages into the prompt with a simple template ('Use the context below to answer the question'). No citation requirement, no fusion, no rerank.
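The three steps above can be sketched in plain Python. This is a minimal illustration, not a production build: `toy_embed` is a hypothetical stand-in for a real embedding API (a hashed bag of words), the search is brute-force rather than an index lookup, and `llm` is any callable you supply that maps a prompt string to an answer string.

```python
import hashlib
import math
import re
from typing import Callable

DIM = 1024  # toy dimension; a hosted embedding model would return e.g. 1536 floats


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding API: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Brute-force search: score every chunk against the query, keep the best k."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Simple template: no citation requirement, no fusion, no rerank."""
    context = "\n\n".join(passages)
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def naive_rag(question: str, chunks: list[str], llm: Callable[[str], str], k: int = 3) -> str:
    """Embed, fetch, generate. `llm` is supplied by the caller (assumption)."""
    return llm(build_prompt(question, top_k(question, chunks, k)))
```

In a real deployment the chunk embeddings would be computed once at index time and stored, rather than recomputed per query as this sketch does for brevity.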

[TECHNICAL STACK]

What we'd actually deploy.

Stack fits in a Postgres database and an LLM API. Often the right tool to prove the corpus is workable before investing in heavier patterns.

EMBEDDING MODEL

OpenAI text-embedding-3-small

Cheap, fast, good enough for the baseline. Move to a stronger embedding only after measuring the ceiling.

VECTOR STORE

PostgreSQL + pgvector

HNSW index at any reasonable scale. No separate system needed.

GENERATION

GPT-5.5 or Claude Sonnet

Mid-tier model is plenty for baseline. Upgrade only after the corpus quality bottlenecks have been addressed.
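For the vector store above, the schema and the top-K query might look roughly like the following. The table name `chunks`, the column names, and the index name are hypothetical; the `%(name)s` placeholders assume a driver with psycopg-style named parameters. The sketch only builds the SQL and the parameter dictionary, so it runs without a database connection.

```python
EMBEDDING_DIM = 1536  # matches text-embedding-3-small

# Schema: one row per chunk, with an HNSW index built for cosine distance.
SCHEMA_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector({EMBEDDING_DIM}) NOT NULL
);

CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# pgvector's <=> operator is cosine distance, so similarity is 1 - distance.
TOP_K_SQL = """
SELECT id, content, 1 - (embedding <=> %(query_vec)s::vector) AS cosine_similarity
FROM chunks
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values) -> str:
    """Format floats as pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def top_k_params(query_embedding, k: int = 5) -> dict:
    """Parameters to pass alongside TOP_K_SQL when executing the query."""
    if len(query_embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(query_embedding)}")
    return {"query_vec": to_vector_literal(query_embedding), "k": k}
```

Ordering by the distance operator directly, rather than by the computed similarity column, is what lets the planner use the HNSW index.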

[HOW WE DEPLOY]

Day one to live traffic.

We deploy Naive RAG in hours when the corpus is clean and small. The point is usually to measure: how far does the simplest possible thing get us, and where does it break?

01. Chunk the corpus
Fixed-size chunks (300-500 tokens) with a small overlap are the baseline. No fancy semantic chunking on the first pass.
02. Embed + index
Batch embedding, written to pgvector with an HNSW index.
03. Query + generate
A simple endpoint that embeds the query, fetches top-K, and calls the LLM.
04. Eval set baseline
The same golden set we will use to evaluate every subsequent pattern. Naive RAG's score on this set is the floor; every later pattern has to clear it.
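As an illustration of the chunking step, here is a minimal fixed-size chunker with overlap. It treats whitespace-separated words as "tokens", which is an assumption made to keep the sketch dependency-free; a real build would count tokens with the embedding model's own tokenizer. The default `size` and `overlap` values are illustrative picks within the 300-500 token range mentioned above.

```python
def chunk_tokens(tokens: list, size: int = 400, overlap: int = 50) -> list[list]:
    """Split a token list into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Whitespace words stand in for model tokens here (assumption)."""
    return [" ".join(window) for window in chunk_tokens(text.split(), size, overlap)]
```

The early `break` avoids emitting a trailing chunk that contains only tokens already covered by the previous window.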

[ACCURACY + BENCHMARKS]

What the numbers say.

Naive RAG sets the floor. Most reports show a 30-50% accuracy gap versus Advanced RAG on production benchmarks. It is useful as a baseline and rarely a destination.

Baseline: The floor every other pattern beats
Accuracy vs Advanced RAG: 30-50% lower
Added latency over a raw LLM call: <10ms
Build time: Hours

Our eval methodology

We always run Naive RAG first on the golden eval set. Numbers go into the build doc. Every subsequent pattern earns its complexity by beating these numbers measurably.
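A baseline comparison of this kind could be sketched as below. The scoring rule (an answer counts as correct if it contains an expected substring) and the `min_gain` threshold are assumptions chosen for illustration; the source does not specify how answers are graded. `answer_fn` is any callable that maps a question to an answer, so the same harness can score Naive RAG and every later pattern.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenExample:
    question: str
    expected: str  # substring a correct answer must contain (assumed scoring rule)


def accuracy(answer_fn: Callable[[str], str], golden: list[GoldenExample]) -> float:
    """Fraction of golden questions whose answer contains the expected substring."""
    if not golden:
        raise ValueError("golden set must not be empty")
    hits = sum(
        1 for ex in golden
        if ex.expected.lower() in answer_fn(ex.question).lower()
    )
    return hits / len(golden)


def earns_its_complexity(candidate: float, baseline: float, min_gain: float = 0.05) -> bool:
    """True if a later pattern beats the Naive baseline by at least `min_gain` (hypothetical threshold)."""
    return candidate - baseline >= min_gain
```

Running both pipelines through `accuracy` on the same golden set and recording the two scores gives the before-and-after numbers for the build doc.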

[COMMUNITY FEEDBACK]

What practitioners report.

Naive RAG is the entry point in every tutorial and the punchline in every production retrospective. It is what most ChatGPT plugins shipped; it is also what most production teams replace within a quarter.

The consensus is the same in every retrospective: it works on toy corpora and breaks on real ones. The two failure modes are universal. First, recall falls off as the corpus grows because pure-vector search misses exact-term and rare-token matches. Second, the LLM hallucinates confidently on top of whatever was retrieved, with no citation discipline to make the error visible.

[COMMON PITFALLS]

Treating it as a destination. It is a baseline.
No eval set. Without numbers, you cannot tell when it has stopped being good enough.
Fixed-size chunking on a corpus where chunking shape matters (tables, code, transcripts).
No citations. The model will invent confident wrong answers and you will not know.

[KENSINK LABS EVALUATION]

Our honest take.

We deploy Naive RAG to measure, not to ship. It is the first thing we build in every engagement; it is almost never the thing we hand over.

There is exactly one production context where we have shipped Naive RAG and walked away: an internal FAQ bot over fewer than a thousand chunks where the eval set showed Naive scoring within 5% of Advanced on every question. The buyer accepted the gap for the operational simplicity. That kind of corpus is rare; most production builds need at least hybrid plus rerank.

[WHEN WE REACH FOR IT]

Day-one baseline on every engagement. We measure here first.
Toy and internal demos to validate the corpus is shaped reasonably.
Genuinely small clean corpora where the eval shows nothing fancier is needed.

What we'd substitute

Advanced RAG once the eval set shows Naive falling short, which is almost always. Add hybrid + rerank + citation discipline and the gap closes immediately.

[RELATED PATTERNS]

Worth a look next.

Advanced RAG

Simple RAG with memory

Modular RAG

[COMMON QUESTIONS]

What buyers ask before they sign.

Why bother building Naive RAG if it is just a baseline?

Because the baseline is the only honest measurement. Without it, you cannot prove the more complex patterns are earning their cost. Every Kensink engagement starts here.

How small does a corpus need to be for Naive to be enough?

Less about size than shape. Clean prose, well-keyed documents, narrow query distribution. Usually fewer than ~100k chunks. The eval set is the deciding number, not the chunk count.

Naive RAG plus better embeddings?

Pays off a little. Pays off less than adding BM25 plus a reranker. The biggest wins are in fusion plus rerank, not in pushing embeddings to the frontier.

DIRECT RAG · APPLIED K

Bring the corpus. We'll bring the build.

Senior engineers, eval suite at handoff, full source ownership. We integrate against the model and the index the same way we integrate against Postgres. Sized to the work in front of you.

