Hypothetical answer generation
Hypothetical Document Embeddings (HyDE): have the LLM write a fake answer to the question, embed the fake answer, and search for real documents similar to it. It is counterintuitive, but it lifts recall on technical, jargon-heavy, or sparse queries where the question vocabulary does not match the document vocabulary.
Best for: Technical domains where users ask questions in everyday language but the documents use technical jargon. Scientific literature, medical Q&A, code search, legal queries. The vocabulary gap is where HyDE shines.
HyDE inverts the usual flow. Instead of embedding the user's query and finding similar documents, we first have the LLM write a hypothetical answer to the query, then embed that hypothetical and find documents similar to it. The hypothetical is throwaway; the retrieved real documents are what generation sees.
A cheap LLM call writes a 100-200 token hypothetical answer. The answer is allowed to be wrong because it is never shown to the user; it only needs to share vocabulary with the real documents.
A standard embedding model embeds the hypothetical, which sits closer to document vocabulary than the original query did.
Top-K vector search. Returns real documents from the corpus that match the hypothetical's embedding.
Standard generation using the retrieved real documents. The hypothetical is discarded. Citation discipline is still required.
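The four steps above can be sketched end to end. This is a minimal illustration under loud assumptions, not a production implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `write_hypothetical` is a hard-coded stub where the cheap LLM call would go, and the corpus and all function names are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def write_hypothetical(query):
    """Stub for the cheap LLM call (step 1). Hard-coded so the sketch runs offline."""
    return (
        "Myocardial infarction presents with chest pain radiating to the "
        "left arm, dyspnea, and diaphoresis."
    )


def top_k(vector, corpus, k):
    """Step 3: rank real documents by similarity to the given vector."""
    return sorted(corpus, key=lambda doc: cosine(vector, embed(doc)), reverse=True)[:k]


def hyde_retrieve(query, corpus, k=2):
    hypothetical = write_hypothetical(query)  # step 1: generate
    vector = embed(hypothetical)              # step 2: embed the hypothetical
    return top_k(vector, corpus, k)           # step 3: search; the hypothetical is dropped here


def build_generation_prompt(query, docs):
    """Step 4: only real documents reach generation, numbered for citation."""
    sources = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return f"Answer using only the sources and cite them by number.\n{sources}\nQuestion: {query}"


# Hypothetical three-document corpus; the relevant document is deliberately not first.
CORPUS = [
    "Seasonal allergic rhinitis is managed with antihistamines.",
    "Fracture of the distal radius is treated with casting or surgical fixation.",
    "Myocardial infarction: symptoms include chest pain, dyspnea, and diaphoresis.",
]

query = "what does a heart attack feel like"
docs = hyde_retrieve(query, CORPUS, k=1)
prompt = build_generation_prompt(query, docs)
print(docs[0])
```

In this toy setup the everyday-language query shares no terms with the relevant document, so a direct lexical match has nothing to work with, while the jargon in the hypothetical pulls the right document to the top. A real dense embedding model narrows that gap on its own, which is why the benefit has to be measured rather than assumed.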
Stack is base RAG plus a cheap LLM call for the hypothetical. The lift on recall is the win; the explainability cost is the trade.
Cheap LLM call writes the hypothetical. Quality of the hypothetical does not have to be high; vocabulary alignment is what matters.
HyDE works with any embedding model and any vector store. Nothing changes about the underlying retrieval.
Standard citation-required generation. The hypothetical is thrown away once retrieval is done.
HyDE deploys as a small addition to base RAG. The work is calibrating which queries benefit (jargon-heavy) and which do not (well-aligned).
On the eval set, measure where the query vocabulary diverges from document vocabulary. HyDE helps where the gap is real; where the gap is small it adds cost and can add noise.
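One crude way to put a number on that divergence is the share of a query's content words that never appear in the corpus. The sketch below assumes lexical overlap is an acceptable proxy for the gap; the `vocabulary_gap` name, the stopword list, and the sample queries and documents are all illustrative.

```python
import re

# Small illustrative stopword list; a real one would be larger.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "for", "in", "what", "how", "do", "does"}


def terms(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def vocabulary_gap(query, corpus_vocab):
    """Fraction of the query's content words that never occur in the corpus."""
    content = terms(query) - STOPWORDS
    if not content:
        return 0.0
    return len(content - corpus_vocab) / len(content)


# Hypothetical corpus and eval queries.
corpus = [
    "Myocardial infarction presents with chest pain and dyspnea.",
    "Hypertension is managed with ACE inhibitors.",
]
corpus_vocab = set().union(*(terms(doc) for doc in corpus))

eval_queries = [
    "what does a heart attack feel like",
    "ACE inhibitors for hypertension",
]
gaps = {q: vocabulary_gap(q, corpus_vocab) for q in eval_queries}
for q, gap in gaps.items():
    print(f"{gap:.2f}  {q}")
```

Queries with a high score are the candidates for the misaligned bucket. The cut-off between buckets is a judgment call and is best checked against a sample of labelled queries.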
Add the cheap LLM call that writes the 100-200 token hypothetical. Calibrate the prompt so it produces document-shaped text, not chat-shaped text.
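A prompt along these lines is one way to push the model toward document-shaped output. The wording, the `doc_style` parameter, and the function name are assumptions for illustration and would need tuning against the actual corpus.

```python
def build_hypothetical_prompt(query, doc_style="a technical reference manual", max_tokens=200):
    """Build a prompt that asks for a passage in the corpus's register, not a chat reply."""
    return (
        f"Write a short passage from {doc_style} that answers the question below.\n"
        "Use the terminology such a document would use. "
        "No greeting, no first person, no caveats.\n"
        f"Stay under {max_tokens} tokens.\n\n"
        f"Question: {query}\n"
        "Passage:"
    )


prompt = build_hypothetical_prompt(
    "why does my service keep running out of memory",
    doc_style="an internal engineering wiki",
)
print(prompt)
```

The length instruction in the prompt is only a request, so the hard cap still belongs in the token limit of the model call itself.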
On easy queries with no vocabulary gap, HyDE adds cost without recall lift. A light classifier decides per query whether to run HyDE.
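The gate can start as something much simpler than a trained classifier. The sketch below assumes an out-of-vocabulary ratio with a fixed threshold is good enough for a first version; the threshold of 0.5, the function names, and the vocabulary are hypothetical.

```python
import re


def out_of_vocab_ratio(query, corpus_vocab):
    """Share of query tokens that the corpus never uses."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in corpus_vocab) / len(tokens)


def text_to_embed(query, corpus_vocab, write_hypothetical, threshold=0.5):
    """Return the text that should be embedded: the hypothetical if the gate fires, else the query."""
    if out_of_vocab_ratio(query, corpus_vocab) >= threshold:
        return write_hypothetical(query)  # pay for the LLM call only when the gap is large
    return query


# Hypothetical corpus vocabulary and a stub that records each LLM call.
corpus_vocab = {"myocardial", "infarction", "chest", "pain", "dyspnea",
                "hypertension", "ace", "inhibitors"}
llm_calls = []


def stub_llm(query):
    llm_calls.append(query)
    return "Myocardial infarction presents with chest pain and dyspnea."


aligned = text_to_embed("ACE inhibitors for hypertension", corpus_vocab, stub_llm)
misaligned = text_to_embed("what does a heart attack feel like", corpus_vocab, stub_llm)
print(len(llm_calls))
```

Only the misaligned query triggers the stubbed LLM call; the aligned one goes straight to embedding. The threshold is exactly the kind of setting the eval set should decide.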
Eval set must include the technical or jargon-heavy queries where HyDE was supposed to help. Otherwise the benefit is invisible.
HyDE lifts recall on queries where vocabulary mismatch is the failure mode. Published numbers are consistent: the lift is meaningful on technical corpora and smaller on general-purpose ones.
The HyDE eval splits queries by vocabulary alignment with the corpus. We expect lift on misaligned queries and break-even (or slight regression from added noise) on aligned ones. Both populations are gated separately.
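A split report of this kind might look like the sketch below. The records are toy data invented to exercise the code, not measured results, and the gate values (`min_lift`, `max_regression`) and record layout are assumptions.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant document ids found in the top k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def mean(values):
    return sum(values) / len(values) if values else 0.0


def split_report(records, k):
    """Mean recall@k for baseline and HyDE, reported per alignment bucket."""
    report = {}
    for bucket in ("aligned", "misaligned"):
        rows = [r for r in records if r["bucket"] == bucket]
        base = mean([recall_at_k(r["baseline"], r["relevant"], k) for r in rows])
        hyde = mean([recall_at_k(r["hyde"], r["relevant"], k) for r in rows])
        report[bucket] = {"baseline": base, "hyde": hyde, "delta": hyde - base}
    return report


def passes_gates(report, min_lift=0.05, max_regression=0.02):
    """Gate each population separately: lift where misaligned, no real loss where aligned."""
    return (
        report["misaligned"]["delta"] >= min_lift
        and report["aligned"]["delta"] >= -max_regression
    )


# Toy records: document ids retrieved by each method, plus the relevant ids.
records = [
    {"bucket": "misaligned", "relevant": ["d1", "d2"], "baseline": ["d9", "d1"], "hyde": ["d1", "d2"]},
    {"bucket": "misaligned", "relevant": ["d3"], "baseline": ["d8", "d7"], "hyde": ["d3", "d7"]},
    {"bucket": "aligned", "relevant": ["d4"], "baseline": ["d4", "d5"], "hyde": ["d4", "d6"]},
    {"bucket": "aligned", "relevant": ["d5"], "baseline": ["d5", "d4"], "hyde": ["d5", "d1"]},
]

report = split_report(records, k=2)
print(report)
print(passes_gates(report))
```

Reporting the two buckets side by side keeps a large lift on misaligned queries from hiding a regression on aligned ones, and the reverse.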
HyDE is a quietly mature pattern. Not glamorous, but it has earned its place in production toolkits for technical-domain RAG. LangChain and LlamaIndex both ship it as a built-in component: `HypotheticalDocumentEmbedder` in LangChain and `HyDEQueryTransform` in LlamaIndex.
The practitioner consensus is that HyDE is a small, reliable lever for technical-domain RAG. The hypothetical generation is cheap, and the recall lift is real on the queries where vocabulary mismatch is the bottleneck. The trade is explainability: it is harder to debug a retrieval that went through a hypothetical than one that went straight from query to documents.
We reach for HyDE on technical-domain RAG where the vocabulary gap is real. We do not reach for it on general-purpose workloads where direct query embedding is already well-aligned with documents.
HyDE is a small lever that earns the build in narrow contexts. Most of our production RAG does not use HyDE because the corpus vocabulary aligns well with user queries. The contexts where HyDE earns its place are usually internal-tool RAG over technical or specialist content (engineering wiki, medical literature, legal databases). The eval set decides; we do not deploy HyDE without measuring the benefit first.
Consider query rewriting (cheaper, more explainable) when the issue is ambiguity rather than vocabulary, and better embeddings (a longer-term fix) when the vocabulary problem is systematic.