← All methods·11 · REASONING DISTILLATION · SPECIALISED

★ DistillationSpecialised

METHOD · TEACHER-STUDENT DISTILLATION

Distillation. Make a small model think like a big one.

Train a small student on a large teacher's outputs. DeepSeek-R1 (Jan 2025) used 800k verified reasoning trajectories to SFT smaller students (Qwen 1.5B to Llama 70B) into frontier-grade reasoning at a fraction of training cost. The 2025 breakout pattern.

TRLvLLM (teacher rollouts)Together AI

Talk to our team →Fine-tuning hub

Teacher

GPT-4.1, Claude Sonnet 4.5, R1, custom

Student

Open base (Llama, Qwen, Mistral, Gemma, Phi)

Data

100k to 1M generations

Win

10x smaller, often within a few points

[WHY THIS EXISTS]

Frontier models are too expensive and too slow.

Most production traffic does not need a 100B+ frontier model. Distillation captures the teacher's behaviour on the specific task into a 7B to 14B student you can serve cheaply. Reasoning distillation (R1 lineage) extended this from outputs-only to full reasoning traces.

Generate teacher outputs (or reasoning traces) on representative prompts
Filter for quality (verifier, judge, or human review)
SFT the student on the filtered teacher data
Optional second stage: DPO or GRPO on top

[THE PIPELINE]

Distillation, end to end.

Generate, filter, train, iterate. Verifier in the loop for reasoning tasks.

Prompt set (representative)

Teacher LLM

Generate N attempts each

Filter (verifier or judge)

SFT student on filtered data

Optional: DPO or GRPO

Define the prompt distribution

The student will only generalise where the prompts cover. Sample from real production traffic if you can. Synthesize otherwise with Evol-Instruct or similar.

Generate teacher attempts

N=1 to 8 per prompt. For reasoning, capture the full chain of thought, not just the final answer.

Filter aggressively

Verifier where possible (correct answer, valid output). LLM-judge with a calibration set otherwise. Bad teacher data poisons the student.

SFT, then optionally DPO or GRPO

SFT the student on the filtered teacher data. Add DPO if you have preference data, GRPO if the task has a verifier.

[THE STACK WE'D DEPLOY]

What we run in production for Distillation.

TRLvLLM (teacher rollouts)Together AI distillationPredibase

[ACCURACY · COST · TRADE]

The numbers we measure Distillation on.

DeepSeek-R1 distill 7B (Qwen)

Beats o1-mini on math benchmarks

Per DeepSeek-R1 paper

Teacher data volume

100k to 1M generations

Cost

Teacher inference + student training

When it earns the build

Latency-critical serving, cost-pressured workloads, reasoning behaviour the base does not have. Specialist tasks where you can ship a small task-tuned model instead of a generalist.

When it doesn't

When the teacher does not materially outperform the student on the task. When you cannot legally use the teacher's outputs (check terms of service).

[OUR TAKE]

The 2025 method that changed the small-model market.

We distil aggressively when latency, cost, or vendor lock-in is the constraint. R1-style reasoning distillation is in every serious project plan we make in 2026.

[READ AT THE SOURCE]

Papers, docs, and primary sources.

[COMMON QUESTIONS]

What buyers ask before they sign.

Can we legally distil from OpenAI / Anthropic models?: Check the ToS. OpenAI prohibits using outputs to train competing models. Anthropic has similar terms. Open-weight teachers (Llama, R1, Mistral) are the safe choice for redistribution.
Distillation vs RAG?: Distillation puts the knowledge in the weights. RAG keeps it in a corpus. Distil behaviour and reasoning, retrieve facts.

[RELATED FINE-TUNING TOPICS]

Worth a look next.

02 · FINE-TUNING

Considering Distillation? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.

Start a conversation →All fine-tuning topics