★ Simple RAG with memory · Specialised pattern · Eval-gated

RAG WITH MEMORY · CONVERSATIONAL CONTEXT

RAG with conversation memory. Follow-ups resolve, pronouns work.

Naive (or Advanced) RAG plus a conversation buffer. The retrieval prompt sees prior turns so 'its population' resolves to 'Paris' from the previous question. Cheap addition, big UX win for chat workloads.

pgvector · Claude · OpenAI · Conversation store

Best for: Chat · multi-turn support
Stack: Base RAG + conversation buffer
Latency: Negligible vs base
Privacy concern: Persistent conversation state

[AT A GLANCE]

Best for: Customer support chat, tutoring, internal assistants, any RAG workload where the second question depends on the first. Not multi-document research; that's Agentic territory.

Origin: Standard chat-RAG pattern, 2022 onward
Year: 2022-present
Complexity: Simple
Production stage: Mature

[THE PIPELINE]

Carry conversation state into retrieval.

The change from plain RAG is small: the conversation buffer feeds into both the retrieval query (so 'its population' becomes 'Paris population') and the generation prompt (so the model remembers the user prefers brief answers).

Buffer prior turns

Conversation store keeps the last N turns (or a summarised buffer for long sessions). Per-user, per-tenant, with retention controls.
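A minimal sketch of the "last N turns" buffer described above, assuming an in-memory structure scoped to one conversation. The names ConversationBuffer, Turn, and max_turns are illustrative, not taken from any particular library, and a real deployment would back this with the conversation store.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


@dataclass
class ConversationBuffer:
    """Holds the most recent max_turns turns for a single conversation."""
    tenant_id: str
    user_id: str
    conversation_id: str
    max_turns: int = 5  # assumption: a short default
    turns: deque = field(init=False)

    def __post_init__(self) -> None:
        # deque(maxlen=...) drops the oldest turn automatically once full.
        self.turns = deque(maxlen=self.max_turns)

    def add(self, role: str, text: str) -> None:
        self.turns.append(Turn(role, text))

    def render(self) -> str:
        """Format the buffer as plain text for inclusion in a prompt."""
        return "\n".join(f"{t.role}: {t.text}" for t in self.turns)
```

Scoping the buffer by tenant, user, and conversation keeps separate sessions from leaking into each other.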

Resolve query against buffer

Cheap LLM call rewrites the new query against the buffer: pronouns, abbreviations, contextual references all resolved.
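The rewrite step can be sketched as a prompt template plus a thin wrapper. The llm argument here is a hypothetical callable that takes a prompt string and returns text; it stands in for whichever cheap model is used and is not a real client API. The template wording is an assumption as well.

```python
from typing import Callable, Sequence

REWRITE_TEMPLATE = """Rewrite the final user question so it stands alone.
Resolve pronouns, abbreviations, and references using the conversation.
If the question already stands alone, return it unchanged.

Conversation:
{history}

Question: {question}
Standalone question:"""


def build_rewrite_prompt(history: Sequence[str], question: str) -> str:
    return REWRITE_TEMPLATE.format(history="\n".join(history), question=question)


def resolve_query(
    history: Sequence[str], question: str, llm: Callable[[str], str]
) -> str:
    # With no prior turns there is nothing to resolve, so skip the call.
    if not history:
        return question
    rewritten = llm(build_rewrite_prompt(history, question)).strip()
    # Fall back to the raw question if the model returns an empty string.
    return rewritten or question
```

Skipping the call on the first turn and falling back on empty output are two cheap guards against the rewrite step hurting single-shot queries.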

Retrieve against the resolved query

Standard retrieval against the rewritten query, not the raw one. Recall is meaningfully better on multi-turn sessions.

Generate with prior context

Prompt sees the prior turns plus the retrieved evidence. Tone, preference, and continuity all maintained.
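Put together, the four steps might look like the sketch below. The rewrite, retrieve, and generate arguments are hypothetical callables standing in for the rewrite LLM, the base retriever, and the answer model; the prompt wording is likewise an assumption.

```python
from typing import Callable, List


def answer_turn(
    history: List[str],
    question: str,
    rewrite: Callable[[List[str], str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    # Resolve the query against the buffer; first turns pass through as-is.
    resolved = rewrite(history, question) if history else question

    # Retrieve against the resolved query, not the raw one.
    passages = retrieve(resolved)
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))

    # The generation prompt sees prior turns plus numbered evidence.
    prompt = (
        "Conversation so far:\n" + "\n".join(history) + "\n\n"
        "Evidence:\n" + evidence + "\n\n"
        f"User question: {question}\n"
        "Answer using only the evidence and cite passages as [n]."
    )
    answer = generate(prompt)

    # Record the exchange so the next turn can refer back to it.
    history.append(f"user: {question}")
    history.append(f"assistant: {answer}")
    return answer
```

Numbering the evidence keeps citations possible: memory changes what the steps see, not the requirement to cite.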

[TECHNICAL STACK]

What we'd actually deploy.

Stack adds a conversation store and a cheap rewrite call. The harder questions are privacy and retention, not engineering.

CONVERSATION STORE

Postgres or Redis

Per-user, per-tenant, with retention controls aligned to compliance. Postgres for durable storage, Redis when sub-millisecond reads matter.
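As a rough illustration of scoping at the data layer, the sketch below uses Python's built-in sqlite3 as a stand-in for Postgres so that it runs without a server. The table and column names are assumptions; the point is only that every read and delete is keyed by tenant, user, and conversation.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS turns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tenant_id TEXT NOT NULL,
    user_id TEXT NOT NULL,
    conversation_id TEXT NOT NULL,
    role TEXT NOT NULL,
    content TEXT NOT NULL,
    created_at REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_turns_scope
    ON turns (tenant_id, user_id, conversation_id, id);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def append_turn(conn, tenant_id, user_id, conversation_id, role, content, now):
    conn.execute(
        "INSERT INTO turns (tenant_id, user_id, conversation_id, role, content,"
        " created_at) VALUES (?, ?, ?, ?, ?, ?)",
        (tenant_id, user_id, conversation_id, role, content, now),
    )
    conn.commit()


def last_turns(conn, tenant_id, user_id, conversation_id, n=5):
    """Return the newest n turns for one conversation, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM turns"
        " WHERE tenant_id = ? AND user_id = ? AND conversation_id = ?"
        " ORDER BY id DESC LIMIT ?",
        (tenant_id, user_id, conversation_id, n),
    ).fetchall()
    return rows[::-1]


def delete_user_data(conn, tenant_id, user_id) -> int:
    """Deletion on request: remove every turn for one user in one tenant."""
    cur = conn.execute(
        "DELETE FROM turns WHERE tenant_id = ? AND user_id = ?",
        (tenant_id, user_id),
    )
    conn.commit()
    return cur.rowcount
```

Keying reads on conversation_id as well as user_id keeps separate sessions from mixing.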

REWRITE LLM

Claude Haiku or GPT-5.5 (low effort)

Cheap call to resolve the query. Often a few cents per session at scale.

BASE RETRIEVAL

Whatever the base RAG is

Memory is an add-on. Use it on top of Naive or Advanced as appropriate.

RETENTION POLICY

Configurable per tenant

Defaults to short retention; longer windows opt-in with documented purpose. Logged to audit.
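One way to picture a per-tenant retention sweep, as a sketch only. The 7-day default is an assumed placeholder (the policy above says only "short"), and StoredTurn and purge_expired are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Assumption: a placeholder default; the real window is a policy decision.
DEFAULT_RETENTION = timedelta(days=7)


@dataclass
class StoredTurn:
    tenant_id: str
    conversation_id: str
    created_at: datetime


def purge_expired(
    turns: List[StoredTurn],
    tenant_retention: Dict[str, timedelta],
    now: datetime,
) -> Tuple[List[StoredTurn], List[dict]]:
    """Split turns into those kept and an audit log of those purged."""
    kept: List[StoredTurn] = []
    audit: List[dict] = []
    for turn in turns:
        # Tenants that opted in to a longer window override the default.
        window = tenant_retention.get(turn.tenant_id, DEFAULT_RETENTION)
        if now - turn.created_at > window:
            audit.append(
                {
                    "tenant_id": turn.tenant_id,
                    "conversation_id": turn.conversation_id,
                    "purged_at": now.isoformat(),
                }
            )
        else:
            kept.append(turn)
    return kept, audit
```

Returning the audit entries alongside the kept rows makes it straightforward to log every purge.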

[HOW WE DEPLOY]

Day one to live traffic.

Memory adds about a week to a base RAG deploy, mostly on the retention and privacy questions rather than the engineering.

01 · Decide retention policy
Conversation buffer is sensitive data. Retention window, encryption, deletion-on-request all decided up front with the buyer.

02 · Build the conversation store
Postgres or Redis depending on read pattern. Per-user, per-tenant isolation enforced at the data layer.

03 · Add the rewrite step
Cheap LLM call before retrieval. Calibrated against the eval set: rewriting too aggressively can hurt single-shot recall.

04 · Generation with prior context
Update the generation prompt to include the buffer. Keep token budgets visible; long sessions need summarisation.

05 · Eval set with multi-turn cases
Eval set must include multi-turn sessions. Otherwise memory regressions are invisible.

[ACCURACY + BENCHMARKS]

What the numbers say.

Multi-turn retrieval quality goes up materially. Hard to benchmark cleanly because most public benchmarks are single-turn; production teams report ~10-25% lift on multi-turn satisfaction metrics.

Multi-turn satisfaction: +10-25%
Latency adder vs base RAG: Negligible
Cost per session: $ cheap
Privacy: New compliance surface

Our eval methodology

We grade multi-turn cases separately from single-shot. The eval set explicitly includes pronoun-resolution and context-carry cases. Without that, the memory benefit is invisible.
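Grading the two kinds of case separately could be as simple as the sketch below. The case format (kind, history, question, expected_doc) and the retrieve signature are assumptions made for illustration, not a description of any specific eval harness.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def hit_rate_by_kind(
    cases: List[dict],
    retrieve: Callable[[List[str], str], List[str]],
) -> Dict[str, float]:
    """Retrieval hit rate reported per case kind ("single" or "multi")."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case["kind"]] += 1
        retrieved = retrieve(case["history"], case["question"])
        if case["expected_doc"] in retrieved:
            hits[case["kind"]] += 1
    # Separate rates stop single-shot cases from masking a multi-turn failure.
    return {kind: hits[kind] / totals[kind] for kind in totals}
```

Running it once with the rewrite step enabled and once without shows whether memory helps multi-turn cases and whether it costs anything on single-shot ones.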

[COMMUNITY FEEDBACK]

What practitioners report.

Memory is a near-universal addition for chat-RAG. LangChain, LlamaIndex, and every chat-as-a-service vendor ships some version of it.

Practitioners split into two camps: short-buffer summarisation and full-history retrieval. Short-buffer is simpler and cheaper, and is usually plenty for support and tutoring. Full-history retrieval (treating prior turns as a retrievable corpus) earns the build only when sessions run so long that the relevant history no longer fits in the buffer.

[COMMON PITFALLS]

No retention policy. Conversation state grows unbounded, becomes a compliance liability.
No multi-turn eval cases. Memory is added, regressions go unnoticed.
Long buffers without summarisation. Per-turn token cost grows linearly with session length, so total session cost grows quadratically.
Storing memory at the user level instead of the conversation level. Mixes sessions, hurts privacy.
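To make the token-cost pitfall concrete, here is a back-of-envelope sketch. The figures (200 tokens per turn, a 300-token summary, 5 verbatim turns) are invented for illustration and are not measurements.

```python
from typing import Optional


def history_tokens(
    prior_turns: int,
    tokens_per_turn: int,
    max_verbatim: Optional[int] = None,
    summary_tokens: int = 0,
) -> int:
    """Tokens of conversation history carried into one prompt."""
    if max_verbatim is None or prior_turns <= max_verbatim:
        # Full history: grows linearly with the number of prior turns.
        return prior_turns * tokens_per_turn
    # Capped buffer: recent turns verbatim plus a fixed-size summary.
    return max_verbatim * tokens_per_turn + summary_tokens


def session_tokens(n_turns: int, tokens_per_turn: int, **kwargs) -> int:
    """History tokens summed over every prompt in an n_turns session."""
    return sum(history_tokens(i, tokens_per_turn, **kwargs) for i in range(n_turns))
```

Under these assumed numbers, a prompt with 40 prior turns carries 8000 history tokens unsummarised, against 1300 with the capped buffer.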

[KENSINK LABS EVALUATION]

Our honest take.

We add memory whenever the workload is chat. The cost is small, the win is meaningful. The harder questions are about retention, not engineering.

Most of our chat-RAG builds use memory. The pattern that almost always works in production is short-buffer with summarisation: keep the last 3-5 turns verbatim, summarise older context if the session runs long. Full-history retrieval is interesting for long-form research workloads but adds complexity we rarely need.
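The short-buffer-with-summarisation pattern can be sketched as a single compaction step. The summarise argument is a hypothetical callable (in practice probably a cheap LLM call) that folds overflow turns into the running summary; the name compact is also illustrative.

```python
from typing import Callable, List, Tuple


def compact(
    summary: str,
    turns: List[str],
    keep: int,
    summarise: Callable[[str, List[str]], str],
) -> Tuple[str, List[str]]:
    """Keep the last `keep` turns verbatim; fold older ones into the summary."""
    if keep <= 0:
        raise ValueError("keep must be positive")
    if len(turns) <= keep:
        # Short session: nothing to summarise yet.
        return summary, turns
    overflow, recent = turns[:-keep], turns[-keep:]
    return summarise(summary, overflow), recent
```

Calling this after each exchange keeps the prompt bounded: at most keep verbatim turns plus one summary, however long the session runs.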

[WHEN WE REACH FOR IT]

Customer support chat with multi-turn issue resolution.
Internal assistants where users ask follow-ups within a session.
Tutoring and educational tools where context-carry is the product.

What we'd substitute

Plain Advanced RAG for single-shot workloads (FAQ over a website, batch document Q&A). Memory adds compliance surface that is not worth it without multi-turn value.

[RELATED PATTERNS]

Worth a look next.

Naive RAG
Advanced RAG
Agentic RAG

[COMMON QUESTIONS]

What buyers ask before they sign.

How long should the conversation buffer be?
Short by default: 3-5 turns verbatim plus a summary of older context. Sessions in production are usually short enough that this is plenty. Longer buffers add cost without much quality gain.

What about privacy and retention?
Conversation state is sensitive data. Default to short retention, encryption at rest, deletion on request, and per-tenant isolation. Document the policy in the build and audit it in production.

Does memory replace the need for citations?
No. Citations are still required for the answer; memory just lets the retrieval and generation steps see the conversation context. Different problem, different solution.

DIRECT RAG · APPLIED K

Bring the corpus. We'll bring the build.

Senior engineers, eval suite at handoff, full source ownership. We integrate against the model and the index the same way we integrate against Postgres. Sized to the work in front of you.

