Inspired by speculative execution in CPUs. While the current query runs, a lightweight predictor anticipates likely follow-ups and pre-fetches retrieval results for them in the background. When the user asks a predicted question, retrieval is already done.
Best for: Real-time chat where conversation flow matters. Customer support where common follow-ups are predictable. Interactive product tours and onboarding flows.
Speculative RAG runs in parallel with the active answer. A predictor LLM looks at the current query and conversation buffer, guesses likely follow-ups, and triggers pre-fetch retrieval for the top 1-3 candidates. The pre-fetched results are cached; if the user's next query matches, retrieval is already done.
Standard Advanced RAG runs for the current query. User-facing latency is unchanged.
Cheap LLM call predicts top-N likely follow-ups based on the query and conversation context. Output is structured: ranked candidates plus confidence.
Background retrieval runs for the top 1-3 predicted follow-ups. Results cached with a short TTL.
When the next query arrives, fuzzy match against cached predictions. Cache hit means retrieval is already done; cache miss means the active pipeline runs normally.
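The four steps above can be sketched as a small asyncio loop. This is a minimal illustration under stated assumptions, not a reference implementation: `retrieve` and `predict_followups` are hypothetical stand-ins for the existing retrieval path and the cheap predictor call, the cache is a plain dict, and an exact match on normalised text stands in for the fuzzy match.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Candidate:
    query: str
    confidence: float


def normalise(text: str) -> str:
    # Crude normalisation; a stand-in for embedding-based fuzzy matching.
    return " ".join(text.lower().split()).strip(" ?!.")


async def retrieve(query: str) -> list[str]:
    # Hypothetical stand-in for the existing Advanced RAG retrieval.
    await asyncio.sleep(0.01)
    return [f"doc for: {query}"]


async def predict_followups(query: str, history: list[str]) -> list[Candidate]:
    # Hypothetical stand-in for the cheap predictor LLM call.
    # Structured output: candidates plus confidence.
    return [
        Candidate("how do I reset my password", 0.8),
        Candidate("how do I change my email", 0.5),
        Candidate("how do I delete my account", 0.2),
    ]


async def speculate(query: str, history: list[str], cache: dict, top_n: int = 2) -> None:
    # Predict, keep the top-N by confidence, and pre-fetch them concurrently.
    candidates = await predict_followups(query, history)
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)[:top_n]
    results = await asyncio.gather(*(retrieve(c.query) for c in ranked))
    for cand, docs in zip(ranked, results):
        cache[normalise(cand.query)] = docs


async def handle_turn(query: str, history: list[str], cache: dict):
    # Cache hit: retrieval is already done. Cache miss: run the active path.
    docs = cache.get(normalise(query))
    hit = docs is not None
    if not hit:
        docs = await retrieve(query)
    # Speculation for the next turn runs in the background; the caller keeps
    # the task reference so it is not garbage-collected mid-flight.
    task = asyncio.create_task(speculate(query, history, cache))
    return docs, hit, task
```

The active path never awaits the speculation task, which is what keeps user-facing latency unchanged on a miss.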
The stack is base RAG plus a predictor and a cache. Predictor accuracy determines whether the speculative work pays off.
Cheap call predicts likely follow-ups. Structured output: ranked candidates plus confidence.
Pre-fetch uses the same retrieval as the active path. Just runs ahead of time on predicted queries.
Short TTL (minutes), per-session scoped. Wrong predictions expire without taking up budget.
Next query embedded and compared against cached predictions. Above threshold means cache hit; below means cache miss.
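A session-scoped cache with a short TTL and threshold matching might look like the sketch below. It assumes embeddings are already computed elsewhere and passed in as plain lists of floats; the class name `SpeculativeCache`, the 300-second TTL, and the 0.9 cosine threshold are illustrative assumptions, not values from this playbook.

```python
import math
import time
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    embedding: list[float]
    docs: list[str]
    expires_at: float


@dataclass
class SpeculativeCache:
    ttl_seconds: float = 300.0   # assumed default: "minutes", per the text
    threshold: float = 0.9       # assumed default: must be calibrated
    _sessions: dict = field(default_factory=dict)

    def put(self, session_id: str, embedding: list[float], docs: list[str], now=None) -> None:
        now = time.monotonic() if now is None else now
        entry = Entry(embedding, docs, now + self.ttl_seconds)
        self._sessions.setdefault(session_id, []).append(entry)

    def get(self, session_id: str, embedding: list[float], now=None):
        now = time.monotonic() if now is None else now
        # Drop expired entries so wrong predictions stop taking up space.
        live = [e for e in self._sessions.get(session_id, []) if e.expires_at > now]
        self._sessions[session_id] = live
        best = max(live, key=lambda e: cosine(e.embedding, embedding), default=None)
        if best is not None and cosine(best.embedding, embedding) >= self.threshold:
            return best.docs   # above threshold: cache hit
        return None            # below threshold or empty: cache miss
```

A linear scan is enough here because each session holds only a handful of predictions at a time.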
Speculative RAG deploys as an optimisation on top of an existing chat-RAG. The predictor calibration is the work; the rest is engineering plumbing.
From production logs, identify which follow-ups are common after which queries. The signal is whether prediction is feasible at all.
Lightweight LLM predicts top-N follow-ups. Calibrated against the production-derived follow-up patterns.
Hard cap on pre-fetch retrievals per session. Avoid burning budget on speculative work for users who do not follow up.
Cache stores retrieval results keyed by predicted query embeddings. Next-query fuzzy match returns hits above a calibrated threshold.
Cache hit rate watched as the key metric. If it drops, predictor quality has drifted.
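Two of the deployment steps above, the per-session cap and the hit-rate watch, are simple enough to sketch directly. The class names, the cap of 10 pre-fetches, the 500-event window, and the 0.3 alert level are all hypothetical placeholders to be replaced with calibrated values.

```python
from collections import defaultdict, deque


class PrefetchBudget:
    """Hard cap on speculative retrievals per session."""

    def __init__(self, max_per_session: int = 10):
        self.max_per_session = max_per_session
        self._used = defaultdict(int)

    def try_spend(self, session_id: str, n: int = 1) -> bool:
        # Refuse the whole request if it would exceed the cap.
        if self._used[session_id] + n > self.max_per_session:
            return False
        self._used[session_id] += n
        return True


class HitRateMonitor:
    """Rolling cache hit rate over the most recent lookups."""

    def __init__(self, window: int = 500, alert_below: float = 0.3):
        self._events = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, hit: bool) -> None:
        self._events.append(bool(hit))

    @property
    def hit_rate(self) -> float:
        return sum(self._events) / len(self._events) if self._events else 0.0

    def drifted(self) -> bool:
        # Only alert once the window is full, to avoid noisy early readings.
        full = len(self._events) == self._events.maxlen
        return full and self.hit_rate < self.alert_below
```

Calling `try_spend` before each pre-fetch and `record` after each cache lookup is enough to wire both into the request path.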
Speculative RAG does not change answer accuracy; it changes perceived latency on follow-ups. The metrics that matter are cache hit rate and time-to-first-token on cached queries.
Speculative RAG eval grades the predictor (precision on follow-up prediction) and the user-perceived latency improvement (on cache hits vs no speculation). Accuracy is held constant.
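The two eval quantities can be computed from logged turns roughly as follows. The record shape (`predicted`, `matched`) and the function names are assumptions made for illustration; precision here means the share of issued predictions that the user's next query actually matched.

```python
from statistics import median


def predictor_metrics(turns: list[dict]) -> tuple[float, float]:
    """Return (precision, hit_rate) for logged turns.

    Each turn is assumed to look like:
        {"predicted": ["q1", "q2"], "matched": "q1" or None}
    """
    issued = sum(len(t["predicted"]) for t in turns)
    hits = sum(1 for t in turns if t["matched"] is not None)
    precision = hits / issued if issued else 0.0
    hit_rate = hits / len(turns) if turns else 0.0
    return precision, hit_rate


def latency_gain_ms(ttft_hit_ms: list[float], ttft_baseline_ms: list[float]) -> float:
    # Median time-to-first-token saved on cache hits versus no speculation.
    return median(ttft_baseline_ms) - median(ttft_hit_ms)
```

Precision tracks how much speculative retrieval is wasted, while hit rate tracks how often a user actually feels the speed-up, so the two are worth reporting separately.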
Speculative RAG is an emerging pattern: of research interest, and deployed in production where interactive latency matters, but rarely seen outside chat workloads.
The practitioner consensus is that the value depends entirely on predictor accuracy. Highly conversational workloads with predictable follow-up patterns (customer support over a mature product, structured onboarding flows) see meaningful wins. Open-ended workloads (research, technical Q&A) see less benefit because follow-ups are less predictable.
We reach for Speculative RAG on interactive chat workloads where perceived latency on follow-ups materially affects the experience. We do not reach for it on cost-sensitive backend workloads.
Speculative RAG is a narrow pattern but a strong one when the workload fits. Real-time chat where users habitually follow up benefits. Asynchronous workloads (report generation, batch processing) do not. How the workload itself classifies, interactive or asynchronous, decides whether the pattern earns the build.
Consider Adaptive RAG when cost optimisation matters more than latency, and plain Advanced RAG when follow-ups are rare or unpredictable.
Cousin pattern: adaptive optimises cost, speculative optimises latency.
Related pattern: Often built together; memory + speculation makes chat feel instant.
Related pattern: The base retrieval that speculative wraps.