Embed query
Same embedding model as the corpus. Output is a fixed-dim vector (1536 for OpenAI 3-small, 2048 for 3-large, etc.).
Embed the query, fetch the top-K most similar chunks, paste into the prompt, generate the answer. No rewriting, no fusion, no reranking, no citations. The baseline every RAG project starts from before earning its way to anything fancier.
Best for: Internal FAQ bots, narrow-scope chatbots, single-domain search over a small clean corpus. Always the right starting point to measure against; rarely the right endpoint.
The naive pipeline is what fits on a slide. Embed the query with the same model used for the corpus, find the K nearest neighbours by cosine similarity, paste them into the prompt, generate. Every later RAG pattern is some refinement of one of these three steps.
Same embedding model as the corpus. Output is a fixed-dim vector (1536 for OpenAI 3-small, 2048 for 3-large, etc.).
Cosine similarity over the embedding index. HNSW or IVF on pgvector at small scale; brute-force on very small corpora. Returns top-K passages (typical K is 3-10).
Paste passages into the prompt with a simple template ('Use the context below to answer the question'). No citation requirement, no fusion, no rerank.
Stack fits in a Postgres database and an LLM API. Often the right tool to prove the corpus is workable before investing in heavier patterns.
Cheap, fast, good enough for the baseline. Move to a stronger embedding only after measuring the ceiling.
HNSW index at any reasonable scale. No separate system needed.
Mid-tier model is plenty for baseline. Upgrade only after the corpus quality bottlenecks have been addressed.
We deploy Naive RAG in hours when the corpus is clean and small. The point is usually to measure: how far does the simplest possible thing get us, and where does it break?
Fixed-size chunks (300-500 tokens) with small overlap is the baseline. No fancy semantic chunking on the first pass.
Batch embedding, written to pgvector with an HNSW index.
Simple endpoint that embeds the query, fetches top-K, calls the LLM.
Same golden set we will use to evaluate every subsequent pattern. Naive RAG's score on this set is the floor; every later pattern has to clear it.
Naive RAG sets the floor. Most reports show 30-50% accuracy gap vs Advanced RAG on production benchmarks. Useful as a baseline; rarely a destination.
We always run Naive RAG first on the golden eval set. Numbers go into the build doc. Every subsequent pattern earns its complexity by beating these numbers measurably.
Naive RAG is the entry point in every tutorial and the punchline in every production retrospective. It is what most ChatGPT plugins shipped; it is also what most production teams replace within a quarter.
The consensus is the same in every retrospective: it works on toy corpora and breaks on real ones. The two failure modes are universal. First, recall falls off as the corpus grows because pure-vector search misses exact-term and rare-token matches. Second, the LLM hallucinates confidently on top of whatever was retrieved, with no citation discipline to make the error visible.
We deploy Naive RAG to measure, not to ship. It is the first thing we build in every engagement; it is almost never the thing we hand over.
There is exactly one production context where we have shipped Naive RAG and walked away: an internal FAQ bot over fewer than a thousand chunks where the eval set showed Naive scoring within 5% of Advanced on every question. The buyer accepted the gap for the operational simplicity. That kind of corpus is rare; most production builds need at least hybrid plus rerank.
Advanced RAG once the eval set shows Naive falling short, which is almost always. Add hybrid + rerank + citation discipline and the gap closes immediately.
The pattern Naive turns into the moment the eval set demands more.
Read playbookRelated patternNaive plus a conversation buffer; the next-smallest step up.
Read playbookRelated patternThe engineering posture that lets you replace Naive's parts one at a time.
Read playbook