METHOD · SUPERVISED FINE-TUNING

Full SFT. When LoRA has been benchmarked and shown to fall short.

Full SFT updates every weight in the model on labelled (prompt, response) pairs. It is the high-cost, high-control option. "LoRA Learns Less and Forgets Less" (Biderman et al., 2024) showed that full fine-tuning learns perturbations of 10x to 100x higher rank than typical LoRA configurations, which is the case for it and the case against it in one sentence.

PyTorch · FSDP · DeepSpeed · TRL

Data: 10k to 1M examples
Hardware: 8 to 64 GPUs (H100 or H200)
Cost: $5k to $200k per run depending on size
Trade: Best accuracy, worst flexibility

[WHY THIS EXISTS]

LoRA cannot model every weight update.

Most enterprise fine-tunes are well served by a low-rank perturbation. Some are not: deep re-tasking (changing the model's fundamental skill), adapting to a new tokenizer or vocabulary, or pushing past LoRA's accuracy ceiling on hard reasoning. For those, full SFT updates every parameter directly.

Every weight updates, every layer adapts
Highest accuracy ceiling, especially on hard reasoning and unusual formats
Largest forgetting risk: base capabilities can degrade if data is narrow
Largest cost: multi-node, multi-day, often well into five figures per run

[THE PIPELINE]

Full SFT, end to end.

Multi-node FSDP or DeepSpeed ZeRO-3, checkpoint at every epoch, eval gate before any deploy.

Dataset (>10k labelled) → Pack + bucket → FSDP or ZeRO-3 → Train 1 to 3 epochs → Checkpoint per epoch → Full eval suite → Safety + bias gates → Ship checkpoint
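
The "Pack + bucket" step in the pipeline above can be sketched as a bin-packing problem over token lengths. This is a minimal illustration using first-fit-decreasing, assuming examples are already tokenized and that each packed sequence is capped at `max_len` tokens; `pack_lengths` is a hypothetical helper, not the packing routine of TRL or any other library.

```python
def pack_lengths(lengths: list[int], max_len: int) -> list[list[int]]:
    """Group example indices into bins whose total token count is <= max_len.

    Uses first-fit-decreasing: longest examples are placed first, each into
    the first bin that still has room.
    """
    bins: list[list[int]] = []   # example indices per packed sequence
    remaining: list[int] = []    # free token budget per packed sequence
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        if lengths[i] > max_len:
            raise ValueError(f"example {i} exceeds max_len; truncate or drop it first")
        for b, free in enumerate(remaining):
            if lengths[i] <= free:
                bins[b].append(i)
                remaining[b] -= lengths[i]
                break
        else:
            # No existing bin had room, so open a new packed sequence.
            bins.append([i])
            remaining.append(max_len - lengths[i])
    return bins


packed = pack_lengths([6, 4, 3, 3, 2], max_len=8)
print(packed)
```

Packing reduces padding waste on multi-GPU runs; whether attention is masked across packed examples depends on the training framework and is outside this sketch.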

Pick the parallelism

FSDP for 8B to 70B: a single 8-GPU node covers the small end, and 70B generally needs multiple nodes or offload. DeepSpeed ZeRO-3 with CPU offload when you barely overflow VRAM. Megatron-LM only at frontier scale and only if you already run NVIDIA's stack.
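
A back-of-envelope memory estimate can help with this choice. The sketch below assumes mixed-precision AdamW at roughly 16 bytes per parameter (weights, gradients, and optimizer state) and ignores activations. `min_gpus` and `usable_fraction` are hypothetical names for illustration, not part of FSDP or DeepSpeed.

```python
import math

# Assumption: 2 bytes bf16 weights + 2 bytes bf16 grads
# + 4 bytes fp32 master weights + 8 bytes fp32 Adam moments.
BYTES_PER_PARAM = 16


def model_state_gb(params_billion: float) -> float:
    """Approximate total model state in GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_billion * BYTES_PER_PARAM


def min_gpus(params_billion: float, gpu_mem_gb: float = 80.0,
             usable_fraction: float = 0.75) -> int:
    """Smallest GPU count whose usable memory holds the fully sharded model state.

    usable_fraction is an assumed headroom factor: the rest of each GPU is
    left for activations, communication buffers, and fragmentation.
    """
    usable_per_gpu = gpu_mem_gb * usable_fraction
    return math.ceil(model_state_gb(params_billion) / usable_per_gpu)


for size in (8, 70):
    print(f"{size}B: ~{model_state_gb(size):.0f} GB of model state, "
          f"at least {min_gpus(size)} x 80 GB GPUs before activations")
```

Under these assumptions a 70B model carries about 1.1 TB of sharded state, more than eight 80 GB GPUs hold, which is where multi-node sharding or CPU offload comes in.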

LR 1e-5 to 5e-5, cosine, low warmup

About an order of magnitude lower than typical LoRA learning rates (around 1e-4 to 3e-4). Cosine decay with 1% to 3% warmup. Weight decay 0.05 to 0.1. An aggressive LR destroys alignment on Instruct bases.
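
The schedule described above can be sketched in a few lines. This is an illustrative implementation, assuming linear warmup followed by cosine decay to `min_lr`; the function name and the defaults (peak 2e-5, 2% warmup) are picked from the ranges above for the example rather than taken from any specific run. In practice an existing helper such as `get_cosine_schedule_with_warmup` in Hugging Face Transformers covers the same ground.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_frac: float = 0.02, min_lr: float = 0.0) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 19, 20, 500, 1000):
    print(s, f"{lr_at_step(s, total_steps=1000):.2e}")
```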

Checkpoint every epoch, eval every checkpoint

Disk is cheap, model regret is not. Keep at least the last 3 epoch checkpoints. Run the golden eval set after each.
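
The retention rule above can be automated with a small helper. The sketch assumes a hypothetical directory layout of `epoch-<N>/` folders under one checkpoint root; `prune_checkpoints` is an illustrative name, not a function from TRL, FSDP, or DeepSpeed.

```python
import shutil
from pathlib import Path


def prune_checkpoints(ckpt_root: Path, keep_last: int = 3) -> list[Path]:
    """Delete all but the newest `keep_last` epoch checkpoints and return the kept ones.

    Assumes checkpoints live in directories named epoch-<N> under ckpt_root.
    """
    def epoch_of(path: Path) -> int:
        return int(path.name.split("-")[1])

    epochs = sorted(
        (p for p in ckpt_root.glob("epoch-*") if p.is_dir()),
        key=epoch_of,
    )
    keep = epochs[-keep_last:] if keep_last > 0 else []
    for stale in epochs[: len(epochs) - len(keep)]:
        shutil.rmtree(stale)  # irreversible: run only after evals are recorded
    return keep
```

A reasonable extension is to pin the checkpoint with the best golden-set score so it is never pruned, even if it is older than the last three.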

Full eval suite before deploy

Standard benchmarks (MMLU-Pro, IFEval), a domain golden set, a safety eval (HarmBench), and a bias audit. Full SFT can regress on benchmarks you did not train for; the suite catches it.
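
A deploy gate over that suite can be expressed as a regression check against the base model. The sketch below assumes every score has been normalized to "higher is better" on a 0 to 100 scale (a safety metric reported as an attack success rate would need inverting first). The metric names, thresholds, and scores are made-up illustrative values, and `eval_gate` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    failures: list[str]


def eval_gate(base: dict[str, float], candidate: dict[str, float],
              max_drop: dict[str, float], default_max_drop: float = 1.0) -> GateResult:
    """Fail if any metric regresses versus the base by more than its allowed drop."""
    failures = []
    for metric, base_score in base.items():
        if metric not in candidate:
            failures.append(f"{metric}: missing from candidate results")
            continue
        drop = base_score - candidate[metric]
        allowed = max_drop.get(metric, default_max_drop)
        if drop > allowed:
            failures.append(f"{metric}: dropped {drop:.1f} (allowed {allowed:.1f})")
    return GateResult(passed=not failures, failures=failures)


# Illustrative numbers only, not measured results.
base_scores = {"mmlu_pro": 50.0, "ifeval": 80.0, "domain_golden": 60.0}
tuned_scores = {"mmlu_pro": 47.0, "ifeval": 79.5, "domain_golden": 85.0}
result = eval_gate(base_scores, tuned_scores, max_drop={"mmlu_pro": 2.0})
print(result.passed, result.failures)
```

In this example the domain score improves sharply but the gate still fails, because the general benchmark fell by more than its allowance.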

[THE STACK WE'D DEPLOY]

What we run in production for Full SFT.

PyTorch FSDP · DeepSpeed · TRL · Megatron-LM · Weights and Biases

[ACCURACY · COST · TRADE]

The numbers we measure Full SFT on.

Accuracy ceiling: Highest of any method. Per Biderman et al. (2024), learns 10x to 100x higher-rank perturbations than LoRA.
Forgetting (vs base): Real and measurable. Mitigate with replay data or a constrained LR.
Compute cost: 10x to 100x LoRA.
Iteration speed: Days, not hours.

When it earns the build

When LoRA has been tried and benchmarked below the bar; when you are adapting the model to a new tokenizer or vocabulary; when you need a deeply re-tasked base (rare in 2026); and when you have both the data volume (>10k examples) and the compute to support it.

When it doesn't

When you have under 10k examples (memorization risk is high, and LoRA wins on regularization); when LoRA cleared the bar (do not pay the full SFT tax for a marginal gain); or when you need fast iteration cycles.

[OUR TAKE]

We benchmark LoRA first. We only reach for full SFT when the numbers say so.

On most enterprise builds, LoRA at rank 16 with DoRA matches full SFT within 1 to 3 points at 1% of the cost. Where the gap is real and the project requires it (hard reasoning, deep re-tasking), we move to full SFT with eyes open.

[COMMON QUESTIONS]

What buyers ask before they sign.

How much data do we need for full SFT?: At least 10k high-quality examples to avoid memorization. Below that, LoRA's implicit regularization wins. Above 100k, the full SFT accuracy ceiling starts to pull away. The right answer is to benchmark both on a held-out set.
Can full SFT make the model worse?: Yes, easily. Narrow training data erodes base capabilities (catastrophic forgetting). Mitigate with replay (blending 10 to 20% of the original instruction-tuning mix back in), a low LR, and a full eval suite gating every deploy.
Full SFT vs continued pretraining (CPT)?: CPT is self-supervised next-token training on a domain corpus; SFT is supervised training on (prompt, response) pairs. Run CPT first when the domain has new vocabulary or tokenization, then SFT. Skip CPT for narrow task adaptation.
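
The replay mitigation from the forgetting answer above can be sketched as a simple dataset mix. This example assumes the replay share is measured as a fraction of the final training set, which is one reading of the 10 to 20% guidance; `mix_with_replay` and its parameters are hypothetical names for illustration.

```python
import random


def mix_with_replay(task_examples: list, replay_pool: list,
                    replay_fraction: float = 0.15, seed: int = 0) -> list:
    """Return a shuffled training set in which about replay_fraction of rows are replay data."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_task + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(task_examples) * replay_fraction / (1.0 - replay_fraction))
    # Never ask for more replay rows than the pool contains.
    n_replay = min(n_replay, len(replay_pool))
    mixed = list(task_examples) + rng.sample(replay_pool, n_replay)
    rng.shuffle(mixed)
    return mixed


task = [("task", i) for i in range(850)]
general = [("replay", i) for i in range(1000)]
train_set = mix_with_replay(task, general, replay_fraction=0.15)
print(len(train_set), sum(1 for kind, _ in train_set if kind == "replay"))
```

Sampling without replacement keeps replay rows unique; if the original instruction-tuning data is unavailable, a public instruction dataset is a common stand-in, at the cost of a looser match to the base model's training mix.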
