Buffer prior turns
Conversation store keeps the last N turns (or a summarised buffer for long sessions). Per-user, per-tenant, with retention controls.
Naive (or Advanced) RAG plus a conversation buffer. The retrieval prompt sees prior turns, so 'its population' resolves to the population of Paris when the previous question was about Paris. A cheap addition and a big UX win for chat workloads.
Best for: Customer support chat, tutoring, internal assistants, any RAG workload where the second question depends on the first. Not multi-document research; that's Agentic territory.
The change from plain RAG is small: the conversation buffer feeds into both the retrieval query (so 'its population' becomes 'Paris population') and the generation prompt (so the model remembers the user prefers brief answers).
Cheap LLM call rewrites the new query against the buffer: pronouns, abbreviations, contextual references all resolved.
Standard retrieval against the rewritten query, not the raw one. Recall is meaningfully better on multi-turn sessions.
Prompt sees the prior turns plus the retrieved evidence. Tone, preference, and continuity all maintained.
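The flow described above (buffer, rewrite, retrieve, generate) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `ConversationalRAG`, `rewrite_llm`, `retrieve`, and `answer_llm` are hypothetical names, and the three callables stand in for whatever cheap model, retriever, and generation model a given stack uses. The prompt wording is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


@dataclass
class ConversationalRAG:
    # All three callables are hypothetical placeholders supplied by the caller.
    rewrite_llm: Callable[[str], str]     # cheap model used only for query rewriting
    retrieve: Callable[[str], List[str]]  # returns evidence passages for a query
    answer_llm: Callable[[str], str]      # generation model
    max_turns: int = 5                    # question/answer pairs kept verbatim
    buffer: List[Turn] = field(default_factory=list)

    def _history(self) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.buffer)

    def ask(self, question: str) -> str:
        # Rewrite: resolve pronouns and references against the buffer.
        # With an empty buffer there is nothing to resolve, so skip the call.
        if self.buffer:
            query = self.rewrite_llm(
                "Rewrite the final question so it stands alone.\n"
                f"History:\n{self._history()}\n"
                f"Question: {question}"
            )
        else:
            query = question

        # Retrieve: search with the rewritten query, not the raw one.
        passages = self.retrieve(query)

        # Generate: the prompt sees prior turns plus the retrieved evidence.
        evidence = "\n".join(passages)
        prompt = (
            f"Conversation so far:\n{self._history()}\n\n"
            f"Evidence:\n{evidence}\n\n"
            f"Question: {question}"
        )
        answer = self.answer_llm(prompt)

        # Buffer: record the raw question and the answer, then trim to the last N pairs.
        self.buffer.append(("user", question))
        self.buffer.append(("assistant", answer))
        self.buffer = self.buffer[-2 * self.max_turns:]
        return answer
```

Storing the raw question rather than the rewritten one keeps the buffer faithful to what the user actually typed; whether that is the right choice depends on the workload.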
Stack adds a conversation store and a cheap rewrite call. The harder questions are privacy and retention, not engineering.
Per-user, per-tenant, with retention controls aligned to compliance. Postgres for durable storage, Redis when sub-millisecond reads matter.
Cheap call to resolve the query. Often a few cents per session at scale.
Memory is an add-on. Use it on top of Naive or Advanced as appropriate.
Defaults to short retention; longer windows opt-in with documented purpose. Logged to audit.
Memory adds about a week to a base RAG deploy, mostly on the retention and privacy questions rather than the engineering.
Conversation buffer is sensitive data. Retention window, encryption, deletion-on-request all decided up front with the buyer.
Postgres or Redis depending on read pattern. Per-user, per-tenant isolation enforced at the data layer.
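One way to picture isolation enforced at the data layer is a store whose every read and write is keyed by tenant and user, with the retention window applied in the query itself. The sketch below is an assumption-laden illustration: it uses Python's built-in `sqlite3` as a stand-in for Postgres, and the `ConversationStore` class, the `turns` table, and its column names are hypothetical.

```python
import sqlite3
import time
from typing import List, Optional, Tuple


class ConversationStore:
    """Hypothetical store; sqlite3 stands in for Postgres in this sketch."""

    def __init__(self, conn: sqlite3.Connection, retention_seconds: float):
        self.conn = conn
        self.retention_seconds = retention_seconds
        conn.execute(
            """CREATE TABLE IF NOT EXISTS turns (
                   tenant_id  TEXT NOT NULL,
                   user_id    TEXT NOT NULL,
                   role       TEXT NOT NULL,
                   content    TEXT NOT NULL,
                   created_at REAL NOT NULL)"""
        )
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_turns_owner "
            "ON turns (tenant_id, user_id, created_at)"
        )

    def append(self, tenant_id: str, user_id: str, role: str, content: str,
               now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT INTO turns VALUES (?, ?, ?, ?, ?)",
            (tenant_id, user_id, role, content, now),
        )
        self.conn.commit()

    def last_turns(self, tenant_id: str, user_id: str, n: int,
                   now: Optional[float] = None) -> List[Tuple[str, str]]:
        # Every read is scoped to one tenant and one user, and ignores
        # rows older than the retention window even before they are purged.
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        rows = self.conn.execute(
            "SELECT role, content FROM turns "
            "WHERE tenant_id = ? AND user_id = ? AND created_at >= ? "
            "ORDER BY created_at DESC, rowid DESC LIMIT ?",
            (tenant_id, user_id, cutoff, n),
        ).fetchall()
        return rows[::-1]  # oldest first

    def purge_expired(self, now: Optional[float] = None) -> int:
        # Scheduled cleanup: physically delete rows past the retention window.
        now = time.time() if now is None else now
        cur = self.conn.execute(
            "DELETE FROM turns WHERE created_at < ?",
            (now - self.retention_seconds,),
        )
        self.conn.commit()
        return cur.rowcount

    def delete_user(self, tenant_id: str, user_id: str) -> int:
        # Deletion-on-request for a single user within a tenant.
        cur = self.conn.execute(
            "DELETE FROM turns WHERE tenant_id = ? AND user_id = ?",
            (tenant_id, user_id),
        )
        self.conn.commit()
        return cur.rowcount
```

In a real Postgres deployment, row-level security policies or per-tenant schemas could enforce the same boundary below the application code; encryption and audit logging are left out of this sketch.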
Cheap LLM call before retrieval. Calibrated against the eval set: rewriting too aggressively can hurt single-shot recall.
Update the generation prompt to include the buffer. Keep token budgets visible; long sessions need summarisation.
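Keeping the token budget visible can be as simple as having the prompt builder report how much history it had to drop. The sketch below is illustrative only: `build_prompt` and `rough_token_count` are hypothetical names, the whitespace word count is a crude stand-in for a real tokenizer, and the prompt layout is an assumption.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


def rough_token_count(text: str) -> int:
    # Assumption: word count approximates tokens; swap in a real tokenizer.
    return len(text.split())


def build_prompt(
    turns: List[Turn],
    passages: List[str],
    question: str,
    budget: int,
    count: Callable[[str], int] = rough_token_count,
) -> Tuple[str, int]:
    """Return (prompt, number_of_turns_dropped) within a token budget."""
    header = "Conversation so far:"
    evidence = "\n".join(passages)
    fixed = f"Evidence:\n{evidence}\n\nQuestion: {question}"

    # Evidence and the question are always included; history gets what is left.
    remaining = budget - count(header) - count(fixed)

    kept: List[str] = []
    for role, text in reversed(turns):  # newest turns first
        line = f"{role}: {text}"
        cost = count(line)
        if cost > remaining:
            break  # everything older than this is dropped
        kept.append(line)
        remaining -= cost
    kept.reverse()  # restore chronological order

    history = "\n".join(kept)
    prompt = f"{header}\n{history}\n\n{fixed}"
    return prompt, len(turns) - len(kept)
```

A dropped-turn count that is regularly above zero is a signal that the session length has outgrown the verbatim buffer and summarisation is needed.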
Eval set must include multi-turn sessions. Otherwise memory regressions are invisible.
Multi-turn retrieval quality goes up materially. Hard to benchmark cleanly because most public benchmarks are single-turn; production teams report ~10-25% lift on multi-turn satisfaction metrics.
We grade multi-turn cases separately from single-shot. The eval set explicitly includes pronoun-resolution and context-carry cases. Without that, the memory benefit is invisible.
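Grading multi-turn cases separately mostly comes down to tagging each eval case and reporting pass rates per tag. A minimal sketch, assuming hypothetical names (`EvalCase`, `pass_rates`) and three illustrative case kinds taken from the description above:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class EvalCase:
    case_id: str
    kind: str     # assumed labels: "single_shot", "pronoun_resolution", "context_carry"
    passed: bool  # outcome from whatever grader the team already uses


def pass_rates(cases: Iterable[EvalCase]) -> Dict[str, float]:
    """Pass rate per case kind, plus an aggregate 'multi_turn' bucket."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for case in cases:
        bucket = "single_shot" if case.kind == "single_shot" else "multi_turn"
        # A set avoids double counting when the kind and the bucket coincide.
        for key in {bucket, case.kind}:
            totals[key] += 1
            passes[key] += int(case.passed)
    return {key: passes[key] / totals[key] for key in totals}
```

Reporting the `multi_turn` rate next to the `single_shot` rate makes a memory regression show up as a drop in one number without being averaged away by the other.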
Memory is a near-universal addition for chat-RAG. LangChain, LlamaIndex, and most chat-as-a-service vendors ship some version of it.
Practitioners fall into two camps: short-buffer summarisation and full-history retrieval. Short-buffer is simpler and cheaper, and usually plenty for support and tutoring. Full-history retrieval (treating prior turns as a retrievable corpus) earns the build only when sessions run so long that the buffer no longer fits in the context window.
We add memory whenever the workload is chat. The cost is small, the win is meaningful. The harder questions are about retention, not engineering.
Most of our chat-RAG builds use memory. The pattern that almost always works in production is short-buffer with summarisation: keep the last 3-5 turns verbatim, summarise older context if the session runs long. Full-history retrieval is interesting for long-form research workloads but adds complexity we rarely need.
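The short-buffer-with-summarisation pattern can be sketched as a small class that keeps the most recent messages verbatim and folds anything older into a running summary. The names here are hypothetical, `summarise` is a placeholder for a cheap LLM call, and this sketch counts individual messages rather than question/answer pairs, which is an assumption about what "turn" means.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


@dataclass
class SummarisingBuffer:
    # Hypothetical callable: (previous summary, evicted turns) -> new summary.
    summarise: Callable[[str, List[Turn]], str]
    keep_verbatim: int = 4  # messages kept word for word
    summary: str = ""
    recent: List[Turn] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.recent.append((role, text))
        overflow = len(self.recent) - self.keep_verbatim
        if overflow > 0:
            # Fold the oldest messages into the running summary.
            evicted = self.recent[:overflow]
            self.recent = self.recent[overflow:]
            self.summary = self.summarise(self.summary, evicted)

    def render(self) -> str:
        """Text to place in the rewrite and generation prompts."""
        parts: List[str] = []
        if self.summary:
            parts.append(f"Summary of earlier conversation: {self.summary}")
        parts.extend(f"{role}: {text}" for role, text in self.recent)
        return "\n".join(parts)
```

Summarising on every overflow keeps the logic simple but costs one cheap call per message once the buffer is full; batching evictions is a reasonable variation if that cost matters.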
Plain Advanced RAG for single-shot workloads (FAQ over a website, batch document Q&A). Memory adds compliance surface that is not worth it without multi-turn value.