★ Fine-tuning methods · Direct LLM · benchmark-first · Production grade

FINE-TUNING · METHODS

The method taxonomy. Ranked by when each ships.

Twelve named techniques in production engineering use. Five we reach for in most conversations. Seven specialised. Each one has a detail page with the stack, the eval bar, and our verdict.

PyTorch · PEFT · TRL · Unsloth · Axolotl


Methods: 12 with detail pages

Primary: LoRA, QLoRA, SFT, DPO, GRPO

Default: LoRA r=16 with DoRA

Discipline: Eval gate before any promotion

[EVERY METHOD, FULL TREATMENT]

Each one sketched, ranked, with a playbook.

Primary methods carry a brand gradient. Specialised methods sit on neutral cards. Every card links to the full playbook: stack, hyperparameters, deploy steps, accuracy notes, our take.

01 · LOW-RANK ADAPTATION · PRIMARY

LoRA

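Two small low-rank matrices whose product is added to the frozen base weights. On most tasks it lands close to full fine-tune accuracy while training around 1% of the parameters or fewer, which cuts gradient and optimizer memory sharply (the frozen base still has to fit in VRAM). The adapter file is small enough to ship as a build artifact. The first method to reach for, the one to beat before considering anything else.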

When it earns the build

Single-task or single-tenant adaptation, multi-tenant SaaS with per-customer adapters, anywhere you want to keep the base current with provider updates, anything where memory is tight.

When it doesn't

When you need 10x more model capacity (full SFT proven necessary in benchmarks), when the LoRA signal is being washed out by the base (try DoRA), when reasoning is the goal (consider GRPO/RFT instead).

Stack: PyTorch · PEFT · TRL
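The low-rank update in the LoRA card can be sketched in plain Python. This is a toy illustration, not the PEFT implementation: the matrices, the `alpha` scaling value, and the helper names are all assumptions made for the example. It shows that a zero-initialised `B` leaves the base untouched at step zero, and how few parameters two rank-16 factors add to a 4096 x 4096 projection.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A without modifying the frozen W."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: the full matrix vs the two LoRA factors."""
    return d_out * d_in, r * (d_out + d_in)


# Toy 2x3 frozen weight with rank-1 factors (hypothetical values).
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
A = [[0.5, -0.5, 1.0]]     # r x d_in
B_init = [[0.0], [0.0]]    # d_out x r, zero init so training starts exactly at W
B_trained = [[2.0], [0.0]]

at_init = lora_effective_weight(W, A, B_init, alpha=2, r=1)
after = lora_effective_weight(W, A, B_trained, alpha=2, r=1)
full, lora = lora_param_counts(4096, 4096, 16)
print(f"trainable fraction at r=16: {lora / full:.2%}")
```

The count covers one projection matrix only. Real savings depend on which modules carry adapters and on the memory the frozen base still needs.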

02 · QUANTIZED LORA · PRIMARY

QLoRA

LoRA on top of a 4-bit NF4 quantized base. A 65B model trains on a single 48GB GPU. The accuracy hit is real but small, the cost reduction is not. Our default for 70B fine-tunes on a single node and for any team without dedicated H100s.

When it earns the build

When the base does not fit in FP16 LoRA VRAM, when single-GPU training is the budget, when iteration speed needs many cheap experiments, when serving will use a separate FP16 path.

When it doesn't

When you can afford FP16 LoRA on the available hardware (FP16 is faster and slightly more accurate), when a 0.5 to 1.5 point accuracy gap is the difference between passing and failing the eval bar.

Stack: bitsandbytes · PEFT · TRL
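A back-of-envelope check on the "65B on a single 48GB GPU" figure, as a sketch. It counts weight storage only and assumes decimal gigabytes. Quantization constants, adapter weights, optimizer state, and activations are deliberately left out and have to fit in the remaining headroom.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 65e9
fp16_gb = weight_memory_gb(n_params, 16)   # 16-bit base
nf4_gb = weight_memory_gb(n_params, 4)     # 4-bit quantized base
headroom_gb = 48 - nf4_gb                  # what is left on a 48GB card

print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB, headroom on 48GB: {headroom_gb:.1f} GB")
```

The 16-bit base alone does not fit on one 48GB card. The 4-bit base does, with room left for the adapter and training state.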

03 · FULL SUPERVISED FINE-TUNING · PRIMARY

Full SFT

Updating every weight in the model on labelled (prompt, response) pairs. The high-cost, high-control option. "LoRA Learns Less and Forgets Less" (Biderman et al., 2024) showed full SFT learns perturbations of 10x to 100x higher rank than typical LoRA configurations, which is the case for it and the case against it in one sentence.

When it earns the build

When LoRA has been tried and benchmarked below the bar, when you are training a new tokenizer surface, when you need a deeply re-tasked base (rare in 2026), when you have the data volume (more than 10k examples) and the compute to support it.

When it doesn't

Under 10k examples (memorization risk, LoRA wins on regularization), when LoRA cleared the bar (do not pay the full SFT tax for marginal gain), when you need fast iteration cycles.

Stack: PyTorch FSDP · DeepSpeed · TRL

04 · PREFERENCE OPTIMIZATION · PRIMARY

DPO

Direct Preference Optimization (Rafailov et al., NeurIPS 2023) reframes RLHF as a classification loss on preference pairs. No reward model, no PPO, no rollouts. The 2024-2026 production workhorse for alignment, with SimPO and ORPO as the strongest challengers.

When it earns the build

After SFT, when you have preference data (thumbs, edits, side-by-side judgments), when alignment, helpfulness, or refusal calibration is the goal.

When it doesn't

Without an SFT base (run SFT first), without preference data (the feedback loop has to exist), when KTO single-response binary feedback is what you have instead.

Stack: TRL (DPOTrainer) · PEFT · HuggingFace Datasets
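The "classification loss on preference pairs" in the DPO card can be written out for a single pair. This is a minimal sketch of the published objective, not TRL's trainer: inputs are summed sequence log-probabilities under the policy and the frozen reference, and `beta=0.1` is an illustrative value, not a recommendation.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin) for one (chosen, rejected) pair.

    Each argument is a sequence log-probability. The margin measures how much
    more the policy prefers the chosen response than the reference does.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: no preference learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(start, improved)
```

The loss starts at log 2 when policy and reference agree and falls as the policy widens the gap. This is why no reward model or rollout is needed: both terms come from forward passes over logged pairs.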

05 · REINFORCEMENT FINE-TUNING · PRIMARY

GRPO / RFT

Group Relative Policy Optimization (introduced in DeepSeekMath, 2024; popularised by DeepSeek-R1, Jan 2025) trains the model with verifiable rewards (was the answer right, does the format parse, did the tool call validate) by normalizing rewards within sampled groups. Drops the value critic, drops half the engineering. OpenAI's Reinforcement Fine-Tuning API is the managed version on o4-mini at $100 per training hour, $5k per-job cap.

When it earns the build

Math, code, structured extraction, tool use, agent decision making. Anywhere a verifier can grade the attempt as correct or not.

When it doesn't

Subjective tasks (style, creative writing) with no clean verifier. Without an SFT starting point. Below ~$50k of compute budget for ambitious runs.

Stack: TRL (GRPOTrainer) · OpenAI RFT API · Unsloth GRPO
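"Normalizing rewards within sampled groups" is the piece that replaces the value critic. A minimal sketch, assuming one prompt with several sampled completions scored by a verifier. The population standard deviation and the `eps` guard are choices made for this example, and implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each sample relative to its own group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, graded 1.0 (verifier passed) or 0.0 (failed).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every sample gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(advantages, flat)
```

Passing samples get a positive advantage, failing ones a negative advantage, and a prompt the model always solves (or always fails) contributes nothing. This is why prompt difficulty selection matters in practice.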

06 · WEIGHT-DECOMPOSED LORA · SPECIALISED

DoRA

Decomposes the weight matrix into magnitude (one scalar per column) and direction (each column normalized to unit length). The LoRA update is applied only to the direction; the magnitude is trained as its own parameter. Reports +1 to +4.4% over LoRA on commonsense benchmarks (LLaMA-7B/13B, LLaMA3-8B). Our default replacement for plain LoRA in 2026.

When it earns the build

Anywhere you would use LoRA. The gain is a flag flip.

When it doesn't

When the framework does not support it (older PEFT, custom training stacks).

Stack: PEFT (use_dora=True) · TRL · Unsloth (DoRA support)
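The magnitude/direction split in the DoRA card, sketched on a 2x2 toy matrix. This illustrates the decomposition only and is not PEFT's implementation: `delta` stands in for the LoRA product, the values are hypothetical, and the code assumes no column is all zeros.

```python
import math


def column_norms(w):
    """L2 norm of each column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]


def add(w, delta):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(w, delta)]


def dora_weight(magnitude, w0, delta):
    """magnitude * (W0 + delta) / ||W0 + delta||, normalized column by column."""
    v = add(w0, delta)
    norms = column_norms(v)  # assumes no all-zero column
    return [[m * x / n for m, x, n in zip(magnitude, row, norms)] for row in v]


W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)                    # magnitude initialised from the base: [5.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
at_init = dora_weight(m, W0, zero)      # reproduces W0
delta = [[1.0, 0.0], [-1.0, 0.0]]       # stands in for the LoRA product B @ A
turned = dora_weight(m, W0, delta)      # column 0 changes direction, keeps norm 5
print(at_init, turned)
```

With the magnitude held fixed, the low-rank update can only rotate a column. Changing its length is left to the separately trained magnitude vector.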

07 · REFERENCE-FREE DPO · SPECIALISED

SimPO

Reference-free preference optimization with a length-normalized log-probability reward. Reports gains of up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard over DPO on the same training data (Meng et al., NeurIPS 2024). The strongest DPO challenger as of 2026.

When it earns the build

High-quality preference data, when memory at training is tight, when DPO has been benchmarked and the upgrade is worth the extra hyperparameter discipline.

When it doesn't

Without an SFT base or preference data, when noisy preference labels make the length normalization unstable.

Stack: TRL (CPOTrainer with loss_type='simpo') · PEFT · HuggingFace Datasets
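The "length-normalized log-probability reward" in the SimPO card, sketched for one preference pair. The `beta` and `gamma` defaults are placeholders chosen for the example, not tuned settings. Note that no reference-model term appears, which is where the training-memory saving comes from.

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """-log sigmoid(reward_chosen - reward_rejected - gamma) for one pair.

    The reward is beta times the average per-token log-probability, so a
    long response is not rewarded or penalised for its length alone.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    return math.log1p(math.exp(-margin))


# Same average log-prob per token (-2.0) despite different lengths:
# the only thing left in the margin is the target gap gamma.
tied = simpo_loss(-20.0, 10, -40.0, 20)
# Chosen response is now more likely per token than the rejected one.
separated = simpo_loss(-10.0, 10, -40.0, 20)
print(tied, separated)
```

The `gamma` term asks the chosen response to win by a margin, not just to win. It is one of the extra hyperparameters the card refers to.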

08 · ONE-STAGE SFT + PREFERENCE · SPECIALISED

ORPO

Hong et al., 2024 merge supervised fine-tuning and preference learning into a single loss. Trains on (prompt, chosen, rejected) data once, no separate SFT step required. Useful when the compute budget is for one run, not two.

When it earns the build

One-shot fine-tunes, tight budgets, projects where the SFT+DPO loop has not paid off.

When it doesn't

When you want explicit control over the SFT stage, or when DPO has measurably won on this data.

Stack: TRL (ORPOTrainer) · PEFT
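The "single loss" in the ORPO card is a supervised term plus a weighted odds-ratio term. A minimal sketch for one example: inputs are average per-token log-probabilities (strictly negative), and the weight `lam=0.1` is an illustrative value.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) where p = exp(avg_logp) is the length-normalized likelihood."""
    p = math.exp(avg_logp)
    return avg_logp - math.log1p(-p)


def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """SFT negative log-likelihood on the chosen response plus the odds-ratio penalty."""
    nll = -avg_logp_chosen
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))  # -log sigmoid(ratio)
    return nll + lam * odds_ratio_term


# Chosen and rejected equally likely: full penalty of lam * log 2 on top of the NLL.
undecided = orpo_loss(-0.5, -0.5)
# Rejected response pushed down: the penalty shrinks, the NLL term is unchanged.
decided = orpo_loss(-0.5, -2.0)
print(undecided, decided)
```

Both terms come from the same forward passes on (prompt, chosen, rejected) data, which is why no separate SFT run or reference model is involved.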

09 · BINARY-FEEDBACK PREFERENCE · SPECIALISED

KTO

KTO (Ethayarajh et al., 2024) trains on individual binary signals: this response was good, this one was bad. Drops the (chosen, rejected) pair requirement of DPO and SimPO. Matches reality: production feedback is thumbs, not side-by-side comparisons.

When it earns the build

When production feedback is thumbs, when you cannot run pairwise comparisons, when iterative shipping is the discipline.

When it doesn't

When you have clean pairwise data (use DPO or SimPO), when class balance is extreme (1:100+).

Stack: TRL (KTOTrainer) · PEFT
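What "thumbs, not pairs" looks like as data, in a short sketch. The input record shape (`prompt`, `response`, `thumb`) is hypothetical. The output columns (`prompt`, `completion`, boolean `label`) follow the unpaired-preference layout that TRL's KTO trainer expects as we understand it, so verify against the TRL version you run. The balance check reflects the 1:100 caution in the card.

```python
def to_kto_rows(feedback):
    """Map thumbs records to unpaired rows: one response, one boolean label."""
    return [
        {"prompt": f["prompt"], "completion": f["response"], "label": f["thumb"] == "up"}
        for f in feedback
    ]


def imbalance(rows):
    """Ratio of the larger class to the smaller one (inf if a class is empty)."""
    good = sum(1 for r in rows if r["label"])
    bad = len(rows) - good
    if min(good, bad) == 0:
        return float("inf")
    return max(good, bad) / min(good, bad)


# Hypothetical production feedback log.
feedback = [
    {"prompt": "Summarise the ticket.", "response": "Customer reports a login loop.", "thumb": "up"},
    {"prompt": "Summarise the ticket.", "response": "I cannot help with that.", "thumb": "down"},
    {"prompt": "Draft a reply.", "response": "Thanks, we are on it.", "thumb": "up"},
]
rows = to_kto_rows(feedback)
ratio = imbalance(rows)  # 2 good vs 1 bad
print(rows[0], ratio)
```

No pairing step is needed, so rows can be appended as feedback arrives. A ratio drifting toward 100 is the signal to rebalance or reweight before training.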

10 · DOMAIN-ADAPTIVE PRETRAINING · SPECIALISED

Continued pretraining

Self-supervised next-token training on a large unlabelled domain corpus. The right answer before SFT when the domain has new vocabulary, tokenization, or scripts (legal Latin, ICD codes, chemistry SMILES, non-Latin languages). Typically 1B to 100B tokens, $10k to $500k.

When it earns the build

New vocabulary, new tokenization, foreign scripts, deep domain language (legal, biomedical, code, non-English).

When it doesn't

Narrow task adaptation (SFT alone), small data (under 1B tokens), domains the base already saw enough of.

Stack: PyTorch FSDP · Megatron-LM · DeepSpeed

11 · REASONING DISTILLATION · SPECIALISED

Distillation

Train a small student on a large teacher's outputs. DeepSeek-R1 (Jan 2025) used about 800k curated samples, most of them filtered reasoning trajectories, to SFT smaller students (Qwen 1.5B up to Llama 70B) into strong reasoning performance at a fraction of the training cost. The 2025 breakout pattern.

When it earns the build

Latency-critical serving, cost-pressured workloads, reasoning behaviour the base does not have. Specialist tasks where you can ship a small task-tuned model instead of a generalist.

When it doesn't

When the teacher does not materially outperform the student on the task. When you cannot legally use the teacher's outputs (check terms of service).

Stack: TRL · vLLM (teacher rollouts) · Together AI distillation
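The filter-then-SFT pattern behind reasoning distillation, as a minimal sketch. It is not DeepSeek's pipeline: the record fields, the exact-match check, and the completion template are all assumptions made for illustration. The idea is only that teacher rollouts are kept when a verifier agrees with the final answer, and the survivors become ordinary SFT rows for the student.

```python
def build_sft_set(teacher_samples, reference_answers):
    """Keep teacher rollouts whose final answer matches the reference."""
    kept = []
    for s in teacher_samples:
        expected = reference_answers.get(s["question_id"])
        if expected is not None and s["final_answer"].strip() == expected.strip():
            kept.append({
                "prompt": s["prompt"],
                "completion": s["reasoning"] + "\n\nAnswer: " + s["final_answer"].strip(),
            })
    return kept


# Hypothetical teacher rollouts: two for q1 (one wrong), one for q2.
teacher_samples = [
    {"question_id": "q1", "prompt": "What is 17 * 3?", "reasoning": "17 * 3 = 51.", "final_answer": "51"},
    {"question_id": "q1", "prompt": "What is 17 * 3?", "reasoning": "17 * 3 = 41.", "final_answer": "41"},
    {"question_id": "q2", "prompt": "What is 9 + 8?", "reasoning": "9 + 8 = 17.", "final_answer": " 17 "},
]
reference_answers = {"q1": "51", "q2": "17"}

sft_rows = build_sft_set(teacher_samples, reference_answers)
print(len(sft_rows), "of", len(teacher_samples), "rollouts kept")
```

Exact string match is the simplest possible verifier. Code and structured tasks would swap in unit tests or schema validation at the same point.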

12 · WEIGHT ARITHMETIC · SPECIALISED

Model merging

Merging combines multiple fine-tunes into one model by averaging, trimming, or arithmetic on the deltas. Model soup averages, TIES trims and resolves sign conflicts, DARE drops and rescales delta parameters. Production use: stitch task-specific LoRAs into a single deployable.

When it earns the build

Multi-skill consolidation, serving cost reduction (one model vs N adapters), behaviour composition (combining a refusal-tuned model with a code-tuned model).

When it doesn't

When the source fine-tunes are deeply incompatible (different bases, different vocabularies), when per-task accuracy is the project's whole goal.

Stack: mergekit · PEFT (LoRA add) · HuggingFace Transformers
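The three merge families named in the card, sketched on flat lists of numbers. These are simplified illustrations, not mergekit's implementations. Real merges run per tensor, and TIES adds the merged delta back onto the base with a scaling factor that is omitted here. The `keep_fraction` and `drop_prob` values are placeholders.

```python
import random


def soup(weights_list):
    """Model soup: element-wise mean of full weight vectors."""
    return [sum(vals) / len(vals) for vals in zip(*weights_list)]


def trim(delta, keep_fraction):
    """TIES step 1: keep the largest-magnitude entries, zero the rest."""
    k = max(1, round(len(delta) * keep_fraction))
    threshold = sorted((abs(d) for d in delta), reverse=True)[k - 1]
    return [d if abs(d) >= threshold else 0.0 for d in delta]


def ties_merge(deltas, keep_fraction=0.5):
    """TIES steps 2 and 3: elect a sign per parameter, average the agreeing values."""
    trimmed = [trim(d, keep_fraction) for d in deltas]
    merged = []
    for vals in zip(*trimmed):
        sign = 1.0 if sum(vals) >= 0 else -1.0  # sign with the larger total magnitude
        agreeing = [v for v in vals if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


def dare(delta, drop_prob, rng):
    """DARE: randomly drop delta entries and rescale survivors by 1 / (1 - p)."""
    return [0.0 if rng.random() < drop_prob else d / (1.0 - drop_prob) for d in delta]


# Deltas (fine-tuned minus base) from two hypothetical fine-tunes of the same base.
delta_a = [0.4, -0.1, 0.3, 0.0]
delta_b = [-0.6, 0.05, 0.5, 0.0]

averaged = soup([delta_a, delta_b])
merged = ties_merge([delta_a, delta_b])          # first entry: the -0.6 side wins the sign vote
sparse = dare(delta_a, 0.5, random.Random(0))
print(averaged, merged, sparse)
```

Plain averaging lets the conflicting first entry partly cancel. TIES keeps the dominant direction instead, which is the behaviour to look for when two fine-tunes disagree.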

[WHAT YOU GET]

What we leave on every fine-tuning build.

Methods considered, one named

Audit: RAG vs fine-tune, written

Eval: Golden set frozen before training

Compliance: Article 25 + DPIA + model card

[COMMON QUESTIONS]

What buyers ask before they sign.

Why is LoRA the default and not full SFT? Biderman et al. (2024) found that full SFT learns perturbations of 10x to 100x higher rank than typical LoRA but forgets more. For most enterprise tasks, the LoRA accuracy gap is within 1 to 3 points and the cost gap is 10x to 100x. We benchmark LoRA first and reach for full SFT only when the numbers force it.
DPO, SimPO, ORPO, KTO: which one? DPO is our production default. SimPO reports up to +6.4 points on AlpacaEval 2 over DPO on clean data and is worth benchmarking. ORPO collapses SFT and preference into one stage when budget is tight. KTO is the right answer when production feedback is thumbs (single response) rather than pairs.
When does GRPO / RFT win? When the task has a verifier: unit tests for code, schema validators for structured output, ground-truth answers for math. DeepSeek-R1 set the precedent. OpenAI's RFT API is the managed version on o4-mini at $100/hr. Without a verifier, stick with SFT and DPO.

[RELATED FINE-TUNING TOPICS]

Worth a look next.

Data pipeline

Platforms

By data + compute scale

Custom model build

Compliance

02 · FINE-TUNING

Pick the method. Bring the data.

We will benchmark LoRA against your real eval bar, name the trade, and ship a measured adapter. Sized to the work, scoped to the residency, signed at the artifact.