Kensink Labs
← All methods·11 · REASONING DISTILLATION · SPECIALISED
DistillationSpecialised
METHOD · TEACHER-STUDENT DISTILLATION

Distillation. Make a small model think like a big one.

Train a small student on a large teacher's outputs. DeepSeek-R1 (Jan 2025) used 800k verified reasoning trajectories to SFT smaller students (Qwen 1.5B to Llama 70B) into frontier-grade reasoning at a fraction of training cost. The 2025 breakout pattern.

TRLvLLM (teacher rollouts)Together AI
Teacher
GPT-4.1, Claude Sonnet 4.5, R1, custom
Student
Open base (Llama, Qwen, Mistral, Gemma, Phi)
Data
100k to 1M generations
Win
10x smaller, often within a few points
[WHY THIS EXISTS]

Frontier models are too expensive and too slow.

Most production traffic does not need a 100B+ frontier model. Distillation captures the teacher's behaviour on the specific task into a 7B to 14B student you can serve cheaply. Reasoning distillation (R1 lineage) extended this from outputs-only to full reasoning traces.

  • Generate teacher outputs (or reasoning traces) on representative prompts
  • Filter for quality (verifier, judge, or human review)
  • SFT the student on the filtered teacher data
  • Optional second stage: DPO or GRPO on top
[THE PIPELINE]

Distillation, end to end.

Generate, filter, train, iterate. Verifier in the loop for reasoning tasks.

Prompt set (representative)
Teacher LLM
Generate N attempts each
Filter (verifier or judge)
SFT student on filtered data
Optional: DPO or GRPO
01

Define the prompt distribution

The student will only generalise where the prompts cover. Sample from real production traffic if you can. Synthesize otherwise with Evol-Instruct or similar.

02

Generate teacher attempts

N=1 to 8 per prompt. For reasoning, capture the full chain of thought, not just the final answer.

03

Filter aggressively

Verifier where possible (correct answer, valid output). LLM-judge with a calibration set otherwise. Bad teacher data poisons the student.

04

SFT, then optionally DPO or GRPO

SFT the student on the filtered teacher data. Add DPO if you have preference data, GRPO if the task has a verifier.

[THE STACK WE'D DEPLOY]

What we run in production for Distillation.

TRLvLLM (teacher rollouts)Together AI distillationPredibase
[ACCURACY · COST · TRADE]

The numbers we measure Distillation on.

DeepSeek-R1 distill 7B (Qwen)
Beats o1-mini on math benchmarks
Per DeepSeek-R1 paper
Teacher data volume
100k to 1M generations
Cost
Teacher inference + student training
When it earns the build

Latency-critical serving, cost-pressured workloads, reasoning behaviour the base does not have. Specialist tasks where you can ship a small task-tuned model instead of a generalist.

When it doesn't

When the teacher does not materially outperform the student on the task. When you cannot legally use the teacher's outputs (check terms of service).

[OUR TAKE]

The 2025 method that changed the small-model market.

We distil aggressively when latency, cost, or vendor lock-in is the constraint. R1-style reasoning distillation is in every serious project plan we make in 2026.

[READ AT THE SOURCE]

Papers, docs, and primary sources.

[COMMON QUESTIONS]

What buyers ask before they sign.

Can we legally distil from OpenAI / Anthropic models?
Check the ToS. OpenAI prohibits using outputs to train competing models. Anthropic has similar terms. Open-weight teachers (Llama, R1, Mistral) are the safe choice for redistribution.
Distillation vs RAG?
Distillation puts the knowledge in the weights. RAG keeps it in a corpus. Distil behaviour and reasoning, retrieve facts.
FINE-TUNING · KENSINK LABS

Considering Distillation? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.