Kensink Labs
HyDE · Specialised pattern · Eval-gated
HYDE · HYPOTHETICAL DOCUMENT EMBEDDINGS

HyDE. Imagine the answer, then go find it.

Hypothetical Document Embeddings: have the LLM write a fake answer to the question, embed the fake answer, search for real documents similar to it. Counterintuitive, but it lifts recall on technical, jargon-heavy, or sparse queries where the question vocabulary does not match the document vocabulary.

Claude · OpenAI · pgvector
Best for: Technical · jargon-heavy queries
Stack: LLM + base RAG
Cost: Extra LLM call per query
Explainability: Lower than direct search
[AT A GLANCE]

Best for: Technical domains where users ask questions in everyday language but the documents use technical jargon. Scientific literature, medical Q&A, code search, legal queries. The vocabulary gap is where HyDE shines.

Origin: Gao et al., HyDE (2022)
Year: 2022-2026
Complexity: Simple
Production stage: Mature
[THE PIPELINE]

Hypothesise, embed, retrieve, generate.

HyDE inverts the usual flow. Instead of embedding the user's query and finding similar documents, we first have the LLM write a hypothetical answer to the query, then embed that hypothetical and find documents similar to it. The hypothetical is throwaway; the retrieved real documents are what generation sees.

Query → LLM: write hypothetical answer → Embed hypothetical → Vector search → Retrieve real documents → Generate from real docs
01

Hypothetical answer generation

Cheap LLM call writes a 100-200 token hypothetical answer. The answer is allowed to be wrong because it is never shown to the user; it only needs to share vocabulary with the real documents.

02

Embed the hypothetical

Standard embedding model. The hypothetical is closer to document vocabulary than the original query was.

03

Retrieve real documents

Top-K vector search. Returns real documents from the corpus that match the hypothetical's embedding.

04

Generate from real evidence

Standard generation using the retrieved real documents. The hypothetical is discarded. Citation discipline still required.
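
The four steps above can be sketched as one function. This is a minimal illustration, not our production code: the prompt wording, the `HydeResult` type, and the four injected callables (`complete`, `embed`, `search`, `generate`) are all assumptions standing in for whichever LLM client, embedding model, and vector store the base RAG already uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical prompt wording; in practice this is tuned per corpus.
HYPOTHETICAL_PROMPT = (
    "Write a short passage (roughly 100-200 tokens) that could appear in a "
    "reference document answering the question below. Use the terminology "
    "such a document would use. Do not address the reader.\n\n"
    "Question: {query}\n\nPassage:"
)


@dataclass
class HydeResult:
    hypothetical: str     # kept for debugging only; never returned as the answer
    documents: List[str]  # real documents returned by the vector search
    answer: str           # generated from the real documents


def hyde_answer(
    query: str,
    complete: Callable[[str], str],                       # cheap LLM call
    embed: Callable[[str], Sequence[float]],              # embedding model
    search: Callable[[Sequence[float], int], List[str]],  # top-K vector search
    generate: Callable[[str, List[str]], str],            # citation-required generation
    top_k: int = 5,
) -> HydeResult:
    # 01: write the throwaway hypothetical answer.
    hypothetical = complete(HYPOTHETICAL_PROMPT.format(query=query))
    # 02: embed the hypothetical instead of the raw query.
    vector = embed(hypothetical)
    # 03: retrieve real documents near the hypothetical's embedding.
    documents = search(vector, top_k)
    # 04: generate from the original query and the real documents only.
    answer = generate(query, documents)
    return HydeResult(hypothetical, documents, answer)
```

Passing the callables in keeps the sketch independent of any particular vendor SDK and makes the flow easy to exercise with stubs in an eval harness.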

[TECHNICAL STACK]

What we'd actually deploy.

Stack is base RAG plus a cheap LLM call for the hypothetical. The lift on recall is the win; the explainability cost is the trade.

HYPOTHETICAL GENERATOR

Claude Haiku or GPT-5.5 (low effort)

Cheap LLM call writes the hypothetical. Quality of the hypothetical does not have to be high; vocabulary alignment is what matters.

EMBEDDING + STORE

Same as base RAG

HyDE works with any embedding model and any vector store. Nothing changes about the underlying retrieval.

GENERATION

Same as base RAG

Standard citation-required generation. The hypothetical is thrown away once retrieval is done.
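
To illustrate that the store side of this stack stays untouched, here is a sketch of a pgvector top-K query. The table and column names (`chunks`, `id`, `content`, `embedding`) and the psycopg-style `%s` placeholders are assumptions; `<=>` is pgvector's cosine-distance operator. The only HyDE-specific part is which text produced the vector that gets bound.

```python
from typing import Sequence, Tuple

# Assumed schema: chunks(id, content, embedding vector(N)). Names are hypothetical.
TOP_K_SQL = (
    "SELECT id, content FROM chunks "
    "ORDER BY embedding <=> %s::vector "  # <=> is cosine distance in pgvector
    "LIMIT %s"
)


def to_vector_literal(values: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def top_k_query(embedding: Sequence[float], k: int) -> Tuple[str, Tuple[str, int]]:
    """Return the SQL and parameters for a top-K search.

    With HyDE, `embedding` is the embedding of the hypothetical answer;
    without it, the embedding of the raw query. The SQL is identical.
    """
    return TOP_K_SQL, (to_vector_literal(embedding), k)
```

The returned pair can be handed to a database cursor's `execute` call as-is.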

[HOW WE DEPLOY]

Day one to live traffic.

HyDE deploys as a small addition to base RAG. The work is calibrating which queries benefit (jargon-heavy) and which do not (well-aligned).

  1. Identify the vocabulary gap

    On the eval set, measure where the query vocabulary diverges from document vocabulary. HyDE helps where the gap is real and hurts where the gap is small.

  2. Hypothetical prompt

    Cheap LLM call to write a 100-200 token hypothetical. Calibrated to produce document-shaped text, not chat-shaped.

  3. Routing: which queries get HyDE

    On easy queries with no vocabulary gap, HyDE adds cost without recall lift. Light classifier decides per query whether to run HyDE.

  4. Eval set with technical queries

    Eval set must include the technical or jargon-heavy queries where HyDE is supposed to help. Otherwise the benefit is invisible.
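
One crude way to approximate the vocabulary-gap measurement and the routing decision from the steps above is a lexical out-of-vocabulary rate. Everything here is an assumption made for illustration: the tiny stop list, the `vocabulary_gap` metric, and the fixed `threshold` rule, which stands in for the light classifier and would itself need tuning on the eval set.

```python
import re
from typing import Iterable, Set

_TOKEN = re.compile(r"[a-z0-9]+")

# Tiny illustrative stop list; a real one would be larger.
STOPWORDS = frozenset({
    "a", "an", "the", "is", "are", "do", "does", "how", "what", "why",
    "when", "i", "my", "to", "of", "in", "on", "for", "and", "or",
})


def tokens(text: str) -> Set[str]:
    """Lowercased content tokens with stop words removed."""
    return {t for t in _TOKEN.findall(text.lower()) if t not in STOPWORDS}


def corpus_vocabulary(documents: Iterable[str]) -> Set[str]:
    """Union of content tokens across the corpus."""
    vocab: Set[str] = set()
    for doc in documents:
        vocab |= tokens(doc)
    return vocab


def vocabulary_gap(query: str, vocab: Set[str]) -> float:
    """Share of the query's content tokens that never appear in the corpus."""
    q = tokens(query)
    if not q:
        return 0.0
    return len(q - vocab) / len(q)


def should_use_hyde(query: str, vocab: Set[str], threshold: float = 0.5) -> bool:
    """Threshold rule standing in for a trained per-query classifier."""
    return vocabulary_gap(query, vocab) >= threshold
```

Lexical overlap misses synonyms that a good embedding model already handles, so a high score here is a reason to test HyDE on that query population, not proof that it will help.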

[ACCURACY + BENCHMARKS]

What the numbers say.

HyDE lifts recall on queries where vocabulary mismatch is the failure mode. Published numbers are consistent: the lift is meaningful on technical corpora and smaller on general-purpose ones.

Recall on technical queries (typical): +10-20%
Recall lift on well-aligned queries: Negligible
Per-query cost adder: $ cheap
Explainability vs direct search: Lower
Our eval methodology

Our HyDE eval splits queries by vocabulary alignment with the corpus. We expect lift on misaligned queries and break-even (or a slight regression from added noise) on aligned ones. Both populations are gated separately.
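
A sketch of what a split gate like this might look like, assuming recall@k as the metric. The `EvalCase` shape and the `min_lift` and `max_regression` thresholds are hypothetical values chosen for illustration, not figures from our eval suite.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Set


@dataclass
class EvalCase:
    aligned: bool        # True if query vocabulary already matches the corpus
    relevant: Set[str]   # gold document ids for the query
    baseline: List[str]  # ids retrieved by direct query embedding
    hyde: List[str]      # ids retrieved via the hypothetical


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the gold documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def split_report(cases: List[EvalCase], k: int = 5) -> Dict[str, Dict[str, float]]:
    """Mean recall@k for baseline and HyDE, reported per population."""
    report: Dict[str, Dict[str, float]] = {}
    for name, flag in (("aligned", True), ("misaligned", False)):
        bucket = [c for c in cases if c.aligned == flag]
        if not bucket:
            continue
        base = mean(recall_at_k(c.baseline, c.relevant, k) for c in bucket)
        hyde = mean(recall_at_k(c.hyde, c.relevant, k) for c in bucket)
        report[name] = {"baseline": base, "hyde": hyde, "delta": hyde - base}
    return report


def passes_gates(report: Dict[str, Dict[str, float]],
                 min_lift: float = 0.05,
                 max_regression: float = 0.02) -> bool:
    """Misaligned queries must improve; aligned queries must not regress much."""
    misaligned = report.get("misaligned")
    aligned = report.get("aligned")
    lift_ok = misaligned is not None and misaligned["delta"] >= min_lift
    regression_ok = aligned is None or aligned["delta"] >= -max_regression
    return lift_ok and regression_ok
```

Reporting the two populations separately prevents a large lift on misaligned queries from masking a regression on aligned ones in a single blended average.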

[COMMUNITY FEEDBACK]

What practitioners report.

HyDE is a quietly mature pattern. Not glamorous, but it has earned its place in production toolkits for technical-domain RAG. LangChain and LlamaIndex both ship a built-in HyDE component (HypotheticalDocumentEmbedder and HyDEQueryTransform respectively).

The practitioner consensus is that HyDE is a small, reliable lever for technical-domain RAG. The hypothetical generation is cheap, and the recall lift is real on the queries where vocabulary mismatch is the bottleneck. The trade is explainability: it is harder to debug a retrieval that went through a hypothetical than one that went straight from query to documents.

[COMMON PITFALLS]
  • Running HyDE on every query. Adds cost on queries that did not need it.
  • Hypothetical that is too long or too short. 100-200 tokens is the sweet spot.
  • Treating the hypothetical as an answer. It is throwaway; the real documents are what matters.
  • Ignoring the explainability cost. When retrieval misfires, the hypothetical-mediated path is harder to debug than direct query embedding.
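
Two of these pitfalls (hypothetical length and explainability) can be softened with a small amount of bookkeeping. The sketch below is an assumption about how one might do it: `approx_tokens` is a whitespace count standing in for a real tokenizer, and `HydeTrace` is a hypothetical log record that keeps the hypothetical next to what it retrieved.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class HydeTrace:
    query: str                # what the user asked
    hypothetical: str         # the throwaway text that was actually embedded
    retrieved_ids: List[str]  # ids of the real documents that came back


def approx_tokens(text: str) -> int:
    """Rough whitespace count; an assumption standing in for a real tokenizer."""
    return len(text.split())


def length_ok(hypothetical: str, low: int = 100, high: int = 200) -> bool:
    """Check the hypothetical against the 100-200 token sweet spot."""
    return low <= approx_tokens(hypothetical) <= high


def trace_line(trace: HydeTrace) -> str:
    """Serialise one trace as a JSON line for later debugging."""
    return json.dumps(asdict(trace), ensure_ascii=False, sort_keys=True)
```

When retrieval misfires, reading the logged hypothetical usually shows whether the LLM drifted off-topic or the corpus simply lacks the document.
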
[KENSINK LABS EVALUATION]

Our honest take.

We reach for HyDE on technical-domain RAG where the vocabulary gap is real. We do not reach for it on general-purpose workloads where direct query embedding is already well-aligned with documents.

HyDE is a small lever that earns the build in narrow contexts. Most of our production RAG does not use HyDE because the corpus vocabulary aligns well with user queries. The contexts where HyDE earns its place are usually internal-tool RAG over technical or specialist content (engineering wiki, medical literature, legal databases). The eval set decides; we do not deploy HyDE without measuring the benefit first.

[WHEN WE REACH FOR IT]
  • Technical or jargon-heavy corpora where user queries are in everyday language.
  • Scientific and medical Q&A where document vocabulary diverges from query vocabulary.
  • Code search where the question is intent and the corpus is implementation.
What we'd substitute

Query rewriting (cheaper, more explainable) for queries where the issue is ambiguity rather than vocabulary. Better embeddings (longer-term fix) for systematic vocabulary problems.

[COMMON QUESTIONS]

What buyers ask before they sign.

Why would imagining a wrong answer help?
Because the imagined answer uses document vocabulary, while the user's question may not. The embedding similarity is between the hypothetical and real documents, not between the query and real documents. The hypothetical bridges the vocabulary gap.
How long should the hypothetical be?
100-200 tokens. Long enough to share vocabulary with documents, short enough to keep cost down. We calibrate per corpus.
When does HyDE not help?
When the corpus and user vocabulary are already well-aligned. Customer support over consumer-friendly product docs, conversational FAQ, general-knowledge Q&A. HyDE adds cost without recall lift in those settings.
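
To make the first answer above concrete, here is a toy comparison using bag-of-words cosine similarity. The three example texts are invented for illustration, and bag-of-words is a deliberate simplification: real embedding models are dense and already capture some synonymy, so the gap in production is smaller than this toy suggests.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Invented example texts.
QUERY = "why does my chest hurt when I climb stairs"
HYPOTHETICAL = (
    "Chest pain during physical exertion is often angina, caused by "
    "reduced blood flow (myocardial ischemia)"
)
DOCUMENT = (
    "Exertional angina is chest pain caused by myocardial ischemia "
    "during physical exertion"
)

query_sim = cosine(bag_of_words(QUERY), bag_of_words(DOCUMENT))
hypothetical_sim = cosine(bag_of_words(HYPOTHETICAL), bag_of_words(DOCUMENT))

# The hypothetical shares far more vocabulary with the document than the query does.
print(f"query vs document:        {query_sim:.2f}")
print(f"hypothetical vs document: {hypothetical_sim:.2f}")
```

The query shares only the word "chest" with the document, while the hypothetical shares most of the document's clinical terms, which is the vocabulary bridge HyDE relies on.
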