★ Speculative RAG · Specialised pattern · Eval-gated

SPECULATIVE RAG · ANTICIPATE THE NEXT QUERY

Speculative RAG. Pre-fetch likely follow-ups while answering this one.

Inspired by speculative execution in CPUs. While the current query runs, a lightweight predictor anticipates likely follow-ups and pre-fetches their retrieval in the background. When the user asks the predicted question, retrieval is already done.

Predictor LLM · Cache · Eval pipelines

Best for: Real-time chat · interactive UX
Stack: Base RAG + predictor + cache
Wins: Perceived latency on follow-ups
Cost: Wasted on wrong predictions

[AT A GLANCE]

Best for: Real-time chat where conversation flow matters. Customer support where common follow-ups are predictable. Interactive product tours and onboarding flows.

Origin: Wang et al., Speculative RAG (2024)
Year: 2024-2026
Complexity: Complex
Production stage: Emerging

[THE PIPELINE]

Answer now, pre-fetch the likely next.

Speculative RAG runs in parallel with the active answer. A predictor LLM looks at the current query and conversation buffer, guesses likely follow-ups, and triggers pre-fetch retrieval for the top 1-3 candidates. The pre-fetched results are cached; if the user's next query matches, retrieval is already done.

Active query pipeline

Standard Advanced RAG runs for the current query. User-facing latency is unchanged.

Predictor in parallel

Cheap LLM call predicts top-N likely follow-ups based on the query and conversation context. Output is structured: ranked candidates plus confidence.

Pre-fetch top candidates

Background retrieval runs for the top 1-3 predicted follow-ups. Results cached with a short TTL.

Match next query against cache

When the next query arrives, fuzzy match against cached predictions. Cache hit means retrieval is already done; cache miss means the active pipeline runs normally.
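The four stages above can be sketched with `asyncio`. This is a minimal illustration, not a production implementation: `retrieve`, `predict_followups`, `prefetch`, and `handle_turn` are hypothetical names, the retrieval and predictor calls are simulated with sleeps, and the cache lookup uses an exact string match where a real system would use the fuzzy matcher.

```python
import asyncio

# Hypothetical stand-ins for the real retrieval and predictor calls.
async def retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulated retrieval latency
    return [f"doc for: {query}"]

async def predict_followups(query: str, history: list[str]) -> list[str]:
    await asyncio.sleep(0.005)  # simulated cheap predictor call
    return [f"{query} pricing", f"{query} setup"]

async def prefetch(query: str, history: list[str],
                   cache: dict[str, list[str]], top_n: int = 2) -> None:
    # Predict follow-ups, then retrieve for the top candidates concurrently.
    candidates = (await predict_followups(query, history))[:top_n]
    results = await asyncio.gather(*(retrieve(c) for c in candidates))
    for candidate, docs in zip(candidates, results):
        cache[candidate] = docs

async def handle_turn(query: str, history: list[str],
                      cache: dict[str, list[str]]):
    hit = query in cache  # exact match here; fuzzy match in practice
    # Start speculation first so it runs alongside the active retrieval.
    task = asyncio.create_task(prefetch(query, list(history), cache))
    docs = cache[query] if hit else await retrieve(query)
    history.append(query)
    return docs, hit, task

async def demo():
    cache: dict[str, list[str]] = {}
    history: list[str] = []
    _, hit1, task1 = await handle_turn("billing", history, cache)
    await task1  # in a live system the task simply finishes in the background
    docs2, hit2, task2 = await handle_turn("billing pricing", history, cache)
    await task2
    return hit1, hit2, docs2

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The speculative task is created before the active retrieval is awaited, so the user-facing path never waits on the predictor.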

[TECHNICAL STACK]

What we'd actually deploy.

The stack is base RAG plus a predictor and a cache. Predictor accuracy determines whether the speculative work pays back.

PREDICTOR LLM

Claude Haiku or GPT-5.5 (low effort)

Cheap call predicts likely follow-ups. Structured output: ranked candidates plus confidence.
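One way to handle the structured output is sketched below. The JSON shape (`candidates`, `query`, `confidence`), the `Candidate` type, and the default `top_n` and `min_confidence` values are assumptions for illustration; the actual schema depends on how the predictor prompt is written.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    query: str
    confidence: float

def parse_predictions(raw: str, top_n: int = 3,
                      min_confidence: float = 0.3) -> list[Candidate]:
    """Parse predictor output into ranked candidates; return [] on bad output."""
    try:
        items = json.loads(raw).get("candidates", [])
    except (json.JSONDecodeError, AttributeError):
        return []  # malformed output: skip speculation for this turn
    if not isinstance(items, list):
        return []
    parsed = []
    for item in items:
        try:
            parsed.append(Candidate(str(item["query"]), float(item["confidence"])))
        except (KeyError, TypeError, ValueError):
            continue  # drop entries that do not fit the assumed schema
    kept = [c for c in parsed if c.confidence >= min_confidence]
    kept.sort(key=lambda c: c.confidence, reverse=True)
    return kept[:top_n]

# Hypothetical predictor response for a support conversation.
raw = json.dumps({"candidates": [
    {"query": "Can I change my email?", "confidence": 0.41},
    {"query": "How do I reset my password?", "confidence": 0.72},
    {"query": "What is your refund policy?", "confidence": 0.12},
]})
ranked = parse_predictions(raw)
```

Treating malformed output as "no predictions" keeps a predictor failure from ever affecting the active answer.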

PRE-FETCH RETRIEVAL

Same as the base RAG pipeline

Pre-fetch uses the same retrieval as the active path. Just runs ahead of time on predicted queries.

CACHE

Redis or Postgres + TTL

Short TTL (minutes), per-session scoped. Wrong predictions expire on their own instead of accumulating in the cache.
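The TTL and session scoping can be illustrated with an in-memory stand-in. `SessionTTLCache` is a hypothetical class written for this sketch, and the 120-second default is an assumed value; with Redis, the same effect would typically come from session-prefixed keys set with an expiry.

```python
import time

class SessionTTLCache:
    """In-memory sketch of a per-session cache with a short TTL."""

    def __init__(self, ttl_seconds: float = 120.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[tuple[str, str], tuple[float, list[str]]] = {}

    def put(self, session_id: str, predicted_query: str, docs: list[str]) -> None:
        expires_at = self._clock() + self._ttl
        self._store[(session_id, predicted_query)] = (expires_at, docs)

    def get(self, session_id: str, predicted_query: str):
        key = (session_id, predicted_query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, docs = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: a stale prediction is dropped
            return None
        return docs

# Usage with a fake clock so the expiry is visible.
now = [0.0]
cache = SessionTTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("session-a", "how do I reset my password", ["doc-17"])
fresh = cache.get("session-a", "how do I reset my password")
other_session = cache.get("session-b", "how do I reset my password")
now[0] = 61.0
expired = cache.get("session-a", "how do I reset my password")
```

Keying on the session ID keeps one user's speculative results from matching another user's follow-up.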

FUZZY MATCHER

Embedding similarity

Next query embedded and compared against cached predictions. Above threshold means cache hit; below means cache miss.
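A minimal sketch of the threshold match, using cosine similarity over plain Python lists. The function names, the toy three-dimensional vectors, and the 0.85 threshold are all assumptions; real embeddings come from whichever embedding model the base pipeline already uses, and the threshold needs calibrating on production data.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def match_cached(query_vec: list[float],
                 cached: dict[str, list[float]],
                 threshold: float = 0.85):
    """Return the best cached prediction at or above the threshold, else None."""
    best_key, best_score = None, threshold
    for key, vec in cached.items():
        score = cosine(query_vec, vec)
        if score >= best_score:
            best_key, best_score = key, score
    return best_key

# Toy embeddings standing in for real model output.
cached = {
    "reset password": [1.0, 0.0, 0.0],
    "refund policy": [0.0, 1.0, 0.0],
}
hit = match_cached([0.9, 0.1, 0.0], cached)   # close to "reset password"
miss = match_cached([0.5, 0.5, 0.7], cached)  # close to neither prediction
```

A miss returns `None`, which is the signal to fall through to the normal active pipeline.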

[HOW WE DEPLOY]

Day one to live traffic.

Speculative RAG deploys as an optimisation on top of an existing chat-RAG. The predictor calibration is the work; the rest is engineering plumbing.

01. Measure follow-up patterns. From production logs, identify which follow-ups are common after which queries. The signal is whether prediction is feasible at all.
02. Predictor calibration. Lightweight LLM predicts top-N follow-ups. Calibrated against the production-derived follow-up patterns.
03. Pre-fetch budget. Hard cap on pre-fetch retrievals per session. Avoid burning budget on speculative work for users who do not follow up.
04. Cache + fuzzy match. Cache stores retrieval results keyed by predicted query embeddings. Next-query fuzzy match returns hits above a calibrated threshold.
05. Production metrics. Cache hit rate watched as the key metric. If it drops, predictor quality has drifted.
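Steps 03 and 05 lend themselves to a short sketch: a hard per-session cap on pre-fetches and a running cache hit rate. `PrefetchBudget` and `HitRateMeter` are hypothetical helpers, and the cap of 6 is an assumed default rather than a recommended value.

```python
from collections import defaultdict

class PrefetchBudget:
    """Hard cap on speculative retrievals per session."""

    def __init__(self, max_per_session: int = 6):
        self._max = max_per_session
        self._used: dict[str, int] = defaultdict(int)

    def try_spend(self, session_id: str, n: int = 1) -> bool:
        # Refuse the whole request if it would exceed the session cap.
        if self._used[session_id] + n > self._max:
            return False
        self._used[session_id] += n
        return True

class HitRateMeter:
    """Running cache hit rate over follow-up lookups."""

    def __init__(self):
        self.hits = 0
        self.lookups = 0

    def record(self, hit: bool) -> None:
        self.lookups += 1
        if hit:
            self.hits += 1

    @property
    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0

budget = PrefetchBudget(max_per_session=3)
spend_results = [
    budget.try_spend("s1", 2),  # within the cap
    budget.try_spend("s1", 2),  # would exceed the cap, so refused
    budget.try_spend("s1", 1),  # exactly reaches the cap
    budget.try_spend("s2", 3),  # a separate session has its own budget
]

meter = HitRateMeter()
for was_hit in (True, False, False, True):
    meter.record(was_hit)
```

In a deployment, the meter's value would feed the dashboard that step 05 describes, so a drop shows up as predictor drift.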

[ACCURACY + BENCHMARKS]

What the numbers say.

Speculative RAG does not change answer accuracy; it changes perceived latency on follow-ups. The metrics that matter are cache hit rate and time-to-first-token on cached queries.

Cache hit rate on follow-ups (typical): 30-50%
Reduction in time-to-first-token on cache hits: ~50%
Cost on cache misses: Wasted
What this really buys: Latency

Our eval methodology

Speculative RAG eval grades the predictor (precision on follow-up prediction) and the user-perceived latency improvement (on cache hits vs no speculation). Accuracy is held constant.
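The two graded quantities can be computed roughly as follows. This is a sketch under assumptions: the record format, the exact-match default for `matches`, the use of medians, and the sample numbers are all illustrative and are not measurements from a real system.

```python
from statistics import median

def prediction_precision(records, matches=lambda predicted, actual: predicted == actual) -> float:
    """Share of predicted follow-ups that matched the user's actual next query.

    records: iterable of (predicted_queries, actual_next_query_or_None).
    """
    predicted_total = matched_total = 0
    for predictions, actual in records:
        predicted_total += len(predictions)
        if actual is not None:
            matched_total += sum(1 for p in predictions if matches(p, actual))
    return matched_total / predicted_total if predicted_total else 0.0

def ttft_reduction(baseline_ms: list[float], cache_hit_ms: list[float]) -> float:
    """Relative drop in median time-to-first-token on cache hits vs no speculation."""
    baseline = median(baseline_ms)
    if baseline == 0:
        return 0.0
    return (baseline - median(cache_hit_ms)) / baseline

# Made-up sample data, for illustration only.
records = [
    (["a", "b"], "a"),   # one of two predictions matched
    (["c"], None),       # user did not follow up
    (["d", "e"], "x"),   # follow-up arrived but was not predicted
]
precision = prediction_precision(records)
reduction = ttft_reduction([800, 900, 1000], [400, 450, 500])
```

Passing the production fuzzy matcher as `matches` would make the precision figure line up with what the cache actually counts as a hit.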

[COMMUNITY FEEDBACK]

What practitioners report.

Speculative RAG is an emerging pattern: of interest in research, deployed in production where interactive latency matters, and less commonly seen outside chat workloads.

The practitioner consensus is that the value depends entirely on predictor accuracy. Highly conversational workloads with predictable follow-up patterns (customer support over a mature product, structured onboarding flows) see meaningful wins. Open-ended workloads (research, technical Q&A) see less benefit because follow-ups are less predictable.

[COMMON PITFALLS]

Aggressive pre-fetching. Costs blow up when the predictor is mediocre and users do not follow up.
Long cache TTLs. Stale pre-fetched results bleed into wrong follow-up matches.
No production metrics. Cache hit rate must be the key dashboard; otherwise the value is invisible.
Treating it as an accuracy improvement. It is a latency improvement; accuracy is held constant by construction.

[KENSINK LABS EVALUATION]

Our honest take.

We reach for Speculative RAG on interactive chat workloads where perceived latency on follow-ups materially affects the experience. We do not reach for it on cost-sensitive backend workloads.

Speculative RAG is a narrow pattern but a strong one when the workload fits. Real-time chat where users habitually follow up benefits. Asynchronous workloads (report generation, batch processing) do not. The shape of the workload itself decides whether the pattern earns the build.

[WHEN WE REACH FOR IT]

Real-time customer support chat with predictable follow-up patterns.
Structured product onboarding flows where the next step is the next likely query.
Interactive tutoring where conversation flow matters.

What we'd substitute

Adaptive RAG when cost optimisation matters more than latency. Plain Advanced RAG when follow-ups are rare or unpredictable.

[RELATED PATTERNS]

Worth a look next.

Related patterns: Adaptive RAG · Simple RAG with memory · Advanced RAG

[COMMON QUESTIONS]

What buyers ask before they sign.

What is the cache hit rate in production? Workload-dependent. 30-50% is typical on conversational support workloads where follow-ups are predictable. Lower (10-20%) on open-ended workloads where speculation rarely pays off.
How big is the predictor? Cheap LLM call (Haiku-tier). The predictor itself is not the cost; the speculative retrieval is. Keep the predictor lightweight and cap the speculative budget.
Speculative vs Adaptive RAG? Different optimisations. Adaptive routes each query by type to cut cost; speculative accepts some wasted work to cut latency. Sometimes both ship in the same system: adaptive picks the mode, speculative wraps the chat mode for fast follow-ups.

DIRECT RAG · APPLIED K

Bring the corpus. We'll bring the build.

Senior engineers, eval suite at handoff, full source ownership. We integrate against the model and the index the same way we integrate against Postgres. Sized to the work in front of you.
