Kensink Labs
Simple RAG with memory · Specialised pattern · Eval-gated
RAG WITH MEMORY · CONVERSATIONAL CONTEXT

RAG with conversation memory. Follow-ups resolve, pronouns work.

Naive (or Advanced) RAG plus a conversation buffer. The query sent to retrieval is rewritten with the prior turns in view, so 'its population' resolves to 'Paris population' when the previous question was about Paris. Cheap addition, big UX win for chat workloads.

pgvector · Claude · OpenAI · Conversation store
Best for: Chat · multi-turn support
Stack: Base RAG + conversation buffer
Latency: Negligible vs base
Privacy concern: Persistent conversation state
[AT A GLANCE]

Best for: Customer support chat, tutoring, internal assistants, any RAG workload where the second question depends on the first. Not multi-document research; that's Agentic territory.

Origin: Standard chat-RAG pattern, 2022 onward
Year: 2022-present
Complexity: Simple
Production stage: Mature
[THE PIPELINE]

Carry conversation state into retrieval.

The change from plain RAG is small: the conversation buffer feeds into both the retrieval query (so 'its population' becomes 'Paris population') and the generation prompt (so the model remembers the user prefers brief answers).

New query + Conversation buffer → Rewritten retrieval query → Embed → Vector search → LLM generate

01

Buffer prior turns

Conversation store keeps the last N turns (or a summarised buffer for long sessions). Per-user, per-tenant, with retention controls.
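A minimal sketch of the last-N-turns buffer, assuming an in-memory structure for one conversation. `Turn` and `ConversationBuffer` are hypothetical names chosen for illustration; a real deployment would back this with the conversation store and apply retention controls.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Turn:
    role: str  # "user" or "assistant"
    text: str


class ConversationBuffer:
    """Keeps only the most recent `max_turns` turns of one conversation."""

    def __init__(self, max_turns: int = 5) -> None:
        # deque(maxlen=...) drops the oldest turn automatically when full.
        self._turns = deque(maxlen=max_turns)

    def append(self, role: str, text: str) -> None:
        self._turns.append(Turn(role, text))

    def turns(self) -> List[Turn]:
        # Oldest first, so the list can be rendered directly into a prompt.
        return list(self._turns)
```

The fixed `maxlen` keeps prompt size bounded without any explicit trimming code; a summarised buffer for long sessions would replace the silent drop with a fold into a running summary.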

02

Resolve query against buffer

Cheap LLM call rewrites the new query against the buffer: pronouns, abbreviations, contextual references all resolved.
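One way the rewrite step could look, as a sketch rather than a fixed implementation. The prompt wording is an assumption, and `complete` is a hypothetical callable standing in for whichever cheap model client is used (prompt string in, completion string out).

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs, oldest first


def build_rewrite_prompt(history: History, query: str) -> str:
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the final user question so it stands alone. "
        "Resolve pronouns, abbreviations, and references using the "
        "conversation. Return only the rewritten question.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final question: {query}\n"
        "Standalone question:"
    )


def resolve_query(
    history: History, query: str, complete: Callable[[str], str]
) -> str:
    # First turn of a session: nothing to resolve, so skip the LLM call.
    if not history:
        return query
    rewritten = complete(build_rewrite_prompt(history, query)).strip()
    # Fall back to the raw query if the model returns nothing usable.
    return rewritten or query
```

Skipping the call on an empty buffer and falling back to the raw query are two cheap guards against the rewrite step hurting single-shot questions.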

03

Retrieve against the resolved query

Standard retrieval against the rewritten query, not the raw one. Recall is meaningfully better on multi-turn sessions.

04

Generate with prior context

Prompt sees the prior turns plus the retrieved evidence. Tone, preference, and continuity all maintained.
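The four steps above can be sketched as a single function. This is an illustration under stated assumptions: `rewrite`, `retrieve`, and `generate` are hypothetical callables injected by the caller (the rewrite LLM, the base RAG retriever, and the generation model), and the prompt text is a placeholder.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs, oldest first


def answer_turn(
    history: History,
    query: str,
    rewrite: Callable[[History, str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    # Step 02: resolve the raw query against the buffer.
    resolved = rewrite(history, query)
    # Step 03: retrieve with the resolved query, not the raw one.
    passages = retrieve(resolved)
    # Step 04: the generation prompt sees prior turns plus evidence.
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    prompt = (
        f"Conversation so far:\n{transcript}\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"User question: {query}\n"
        "Answer using only the evidence, and cite passage numbers."
    )
    answer = generate(prompt)
    # Step 01 for the next turn: record this exchange in the buffer.
    history.append(("user", query))
    history.append(("assistant", answer))
    return answer
```

Note that only retrieval receives the rewritten query; generation sees the user's original wording alongside the transcript, so tone and stated preferences carry through.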

[TECHNICAL STACK]

What we'd actually deploy.

Stack adds a conversation store and a cheap rewrite call. The harder questions are privacy and retention, not engineering.

CONVERSATION STORE

Postgres or Redis

Per-user, per-tenant, with retention controls aligned to compliance. Postgres for durable storage, Redis when sub-millisecond reads matter.
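To make the isolation requirement concrete, here is an in-memory stand-in for the store. It is a sketch only: the class name and key layout are assumptions, and Postgres or Redis would replace the dictionary in practice. The point it illustrates is that every read and write is scoped by tenant, user, and conversation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Key is (tenant_id, user_id, conversation_id). Scoping down to the
# conversation keeps separate sessions from the same user apart.
Key = Tuple[str, str, str]
Turn = Tuple[str, str]  # (role, text)


class InMemoryConversationStore:
    def __init__(self) -> None:
        self._data: Dict[Key, List[Turn]] = defaultdict(list)

    def append(self, key: Key, role: str, text: str) -> None:
        self._data[key].append((role, text))

    def last_turns(self, key: Key, n: int) -> List[Turn]:
        if n <= 0:
            return []
        # .get avoids creating empty entries for unknown conversations.
        return list(self._data.get(key, [])[-n:])

    def delete_user(self, tenant_id: str, user_id: str) -> int:
        """Deletion on request: drop every conversation for one user."""
        doomed = [k for k in self._data if k[:2] == (tenant_id, user_id)]
        for k in doomed:
            del self._data[k]
        return len(doomed)
```

Because the tenant is part of every key, a lookup cannot return another tenant's turns even when user and conversation ids collide.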

REWRITE LLM

Claude Haiku or GPT-5.5 (low effort)

Cheap call to resolve the query. Often a few cents per session at scale.

BASE RETRIEVAL

Whatever the base RAG is

Memory is an add-on. Use it on top of Naive or Advanced as appropriate.

RETENTION POLICY

Configurable per tenant

Defaults to short retention; longer windows opt-in with documented purpose. Logged to audit.
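A per-tenant policy of this kind might be expressed as follows. This is a sketch: the 7-day default is an assumed placeholder rather than a figure from this page, and `RetentionPolicy` and `expired_conversations` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List

DEFAULT_RETENTION = timedelta(days=7)  # assumed default, for illustration


@dataclass(frozen=True)
class RetentionPolicy:
    window: timedelta = DEFAULT_RETENTION
    purpose: str = ""  # documented purpose, required for longer windows

    def __post_init__(self) -> None:
        # Longer windows are opt-in and must state why they are needed.
        if self.window > DEFAULT_RETENTION and not self.purpose:
            raise ValueError("longer retention needs a documented purpose")


def expired_conversations(
    last_activity: Dict[str, datetime],
    policy: RetentionPolicy,
    now: datetime,
) -> List[str]:
    """Return conversation ids whose last activity is outside the window."""
    cutoff = now - policy.window
    return sorted(cid for cid, ts in last_activity.items() if ts < cutoff)
```

A scheduled job could call `expired_conversations` per tenant, delete the returned ids, and write the deletions to the audit log.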

[HOW WE DEPLOY]

Day one to live traffic.

Memory adds about a week to a base RAG deploy, mostly on the retention and privacy questions rather than the engineering.

  1. Decide retention policy

    Conversation buffer is sensitive data. Retention window, encryption, deletion-on-request all decided up front with the buyer.

  2. Build the conversation store

    Postgres or Redis depending on read pattern. Per-user, per-tenant isolation enforced at the data layer.

  3. Add the rewrite step

    Cheap LLM call before retrieval. Calibrated against the eval set: rewriting too aggressively can hurt single-shot recall.

  4. Generation with prior context

    Update the generation prompt to include the buffer. Keep token budgets visible; long sessions need summarisation.

  5. Eval set with multi-turn cases

    Eval set must include multi-turn sessions. Otherwise memory regressions are invisible.
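For the last step, a simple guard can stop an eval run that contains no multi-turn cases. The sketch below assumes a hypothetical `EvalCase` shape; the field names are illustrative and not taken from any particular eval framework.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EvalCase:
    name: str
    query: str
    expected_substring: str  # what a correct answer should mention
    # Prior (role, text) turns; empty for single-shot cases.
    history: List[Tuple[str, str]] = field(default_factory=list)

    @property
    def is_multi_turn(self) -> bool:
        return bool(self.history)


def check_coverage(cases: List[EvalCase], min_multi_turn: int = 1) -> None:
    """Fail fast if the eval set cannot detect memory regressions."""
    multi = sum(1 for c in cases if c.is_multi_turn)
    if multi < min_multi_turn:
        raise ValueError(
            f"eval set has {multi} multi-turn cases, need {min_multi_turn}"
        )
```

Running `check_coverage` at the start of every eval job turns "no multi-turn cases" from a silent gap into a hard failure.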

[ACCURACY + BENCHMARKS]

What the numbers say.

Multi-turn retrieval quality goes up materially. Hard to benchmark cleanly because most public benchmarks are single-turn; production teams report ~10-25% lift on multi-turn satisfaction metrics.

Multi-turn satisfaction: +10-25%
Latency adder vs base RAG: Negligible
Cost per session: $ (cheap)
New compliance surface: Privacy
Our eval methodology

We grade multi-turn cases separately from single-shot. The eval set explicitly includes pronoun-resolution and context-carry cases. Without that, the memory benefit is invisible.
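Separate grading can be as simple as bucketing results before computing pass rates. The sketch below assumes each result is a `(bucket, passed)` pair; the bucket names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """results: (bucket, passed) pairs, e.g. ("multi_turn", True)."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for bucket, passed in results:
        totals[bucket] += 1
        passes[bucket] += int(passed)
    # One pass rate per bucket, never a single blended number.
    return {bucket: passes[bucket] / totals[bucket] for bucket in totals}
```

With two single-shot passes, one multi-turn pass, and one multi-turn failure, the blended pass rate is 0.75, while the per-bucket view shows 1.0 for single-shot and 0.5 for multi-turn. The blended number hides exactly the gap this pattern is meant to close.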

[COMMUNITY FEEDBACK]

What practitioners report.

Memory is a near-universal addition for chat-RAG. LangChain, LlamaIndex, and the chat-as-a-service vendors all ship some version of it.

Practitioners split into two camps: short-buffer summarisation and full-history retrieval. Short-buffer is simpler and cheaper, and usually plenty for support and tutoring. Full-history retrieval (treating prior turns as a retrievable corpus) earns the build only when sessions run long enough that the relevant context no longer fits in the buffer.

[COMMON PITFALLS]
  • No retention policy. Conversation state grows unbounded, becomes a compliance liability.
  • No multi-turn eval cases. Memory is added, regressions go unnoticed.
  • Long buffers without summarisation. Per-turn token cost grows linearly with session length.
  • Storing memory at the user level instead of the conversation level. Mixes sessions, hurts privacy.
[KENSINK LABS EVALUATION]

Our honest take.

We add memory whenever the workload is chat. The cost is small, the win is meaningful. The harder questions are about retention, not engineering.

Most of our chat-RAG builds use memory. The pattern that almost always works in production is short-buffer with summarisation: keep the last 3-5 turns verbatim, summarise older context if the session runs long. Full-history retrieval is interesting for long-form research workloads but adds complexity we rarely need.
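The short-buffer-with-summarisation pattern could be sketched like this. It is an illustration, not our production code: `summarise` is a hypothetical callable (typically a cheap LLM call) that folds older turns into a running summary, and the default of four verbatim turns is an assumption within the 3-5 range mentioned above.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (role, text)


def compact(
    summary: Optional[str],
    turns: List[Turn],
    summarise: Callable[[Optional[str], List[Turn]], str],
    keep_verbatim: int = 4,
) -> Tuple[Optional[str], List[Turn]]:
    """Fold turns older than the last `keep_verbatim` into the summary."""
    if keep_verbatim < 1:
        raise ValueError("keep_verbatim must be at least 1")
    # Short sessions pass through untouched: no summarisation call needed.
    if len(turns) <= keep_verbatim:
        return summary, turns
    older, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    # The previous summary is passed in so older context is not lost.
    return summarise(summary, older), recent
```

Calling `compact` after each turn keeps the prompt at a bounded size: a summary plus at most `keep_verbatim` turns, regardless of session length.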

[WHEN WE REACH FOR IT]
  • Customer support chat with multi-turn issue resolution.
  • Internal assistants where users ask follow-ups within a session.
  • Tutoring and educational tools where context-carry is the product.
What we'd substitute

Plain Advanced RAG for single-shot workloads (FAQ over a website, batch document Q&A). Memory adds compliance surface that is not worth it without multi-turn value.

[COMMON QUESTIONS]

What buyers ask before they sign.

How long should the conversation buffer be?
Short by default (3-5 turns verbatim plus a summary of older context). Sessions in production are usually short enough that this is plenty. Longer buffers add cost without much quality gain.
What about privacy and retention?
Conversation state is sensitive data. Default to short retention, encryption at rest, deletion on request, per-tenant isolation. Document the policy in the build, audit it in production.
Does memory replace the need for citations?
No. Citations are still required for the answer; memory just makes the retrieval and generation steps see the conversation context. Different problem, different solution.