Kensink Labs
METHOD · SUPERVISED FINE-TUNING

Full SFT. When LoRA has been benchmarked and shown to fall short.

Full SFT updates every weight in the model on labelled (prompt, response) pairs. It is the high-cost, high-control option. "LoRA Learns Less and Forgets Less" (Biderman et al., 2024) reports that full fine-tuning learns perturbations of 10x to 100x higher rank than typical LoRA configurations, which is the case for it and the case against it in one sentence.

PyTorch · FSDP · DeepSpeed · TRL
Data: 10k to 1M examples
Hardware: 8 to 64 GPUs (H100 or H200)
Cost: $5k to $200k per run, depending on size
Trade: Best accuracy, worst flexibility
[WHY THIS EXISTS]

LoRA cannot model every weight update.

Most enterprise fine-tunes are well-served by a low-rank perturbation. Some are not: deep re-tasking (changing the model's fundamental skill), training a new tokenizer's surface, or pushing past LoRA's accuracy ceiling on hard reasoning. For those, full SFT updates every parameter directly.

  • Every weight updates, every layer adapts
  • Highest accuracy ceiling, especially on hard reasoning and unusual formats
  • Largest forgetting: base capabilities can degrade if the data is narrow
  • Largest cost: multi-node, multi-day, often tens of thousands of dollars per run
[THE PIPELINE]

Full SFT, end to end.

Multi-node FSDP or DeepSpeed ZeRO-3, checkpoint at every epoch, eval gate before any deploy.

Dataset (>10k labelled)
Pack + bucket
FSDP or ZeRO-3
Train 1 to 3 epochs
Checkpoint per epoch
Full eval suite
Safety + bias gates
Ship checkpoint
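
The "Pack + bucket" step above can be illustrated with a minimal sketch of sequence packing: several short examples are placed into one fixed-length training sequence so fewer tokens are wasted on padding. The function name `pack_sequences` and the first-fit-decreasing strategy are assumptions for illustration, not necessarily what any given trainer does internally. Bucketing by length is not shown, and a real pipeline also needs attention masking so packed examples do not attend to each other.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing (hypothetical helper).

    lengths: token count of each example.
    max_len: token budget of one packed training sequence.
    Returns a list of bins; each bin is a list of example indices.
    """
    bins = []   # example indices placed in each packed sequence
    space = []  # remaining token budget of each packed sequence
    # Place the longest examples first; they are the hardest to fit.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"example {i} has {n} tokens, over max_len={max_len}")
        for b, free in enumerate(space):
            if n <= free:  # first existing bin with enough room
                bins[b].append(i)
                space[b] -= n
                break
        else:  # no bin had room: open a new one
            bins.append([i])
            space.append(max_len - n)
    return bins
```

For example, `pack_sequences([700, 300, 500, 500, 100], max_len=1000)` fits five examples into three packed sequences instead of five padded ones.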
01 · Pick the parallelism

FSDP for 8B to 70B, from 8 GPUs at the small end to multiple nodes at the large end. DeepSpeed ZeRO-3 with CPU offload when the model only barely overflows VRAM. Megatron-LM only at frontier scale, and only if you already run NVIDIA's stack.
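
A rough way to choose among these options is to estimate model-state memory before booking hardware. The sketch below assumes a common rule of thumb of about 16 bytes per parameter for mixed-precision Adam (bf16 weights and gradients, fp32 master weights, two fp32 Adam moments). It ignores activations and fragmentation entirely. The 80 GB default and the `headroom` parameter (the share of each GPU reserved for model states) are illustrative assumptions.

```python
# Assumed bytes per parameter for mixed-precision Adam:
# bf16 weights (2) + bf16 grads (2) + fp32 master weights (4)
# + Adam first moment (4) + Adam second moment (4).
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # 16

def model_state_gb(params_billions):
    """Total model-state memory in GB (1 GB = 1e9 bytes), activations excluded."""
    return params_billions * BYTES_PER_PARAM

def per_gpu_gb(params_billions, num_gpus):
    """Per-GPU share when all states are sharded (FSDP full shard or ZeRO-3)."""
    return model_state_gb(params_billions) / num_gpus

def fits(params_billions, num_gpus, gpu_gb=80.0, headroom=0.5):
    """True if sharded model states use at most `headroom` of each GPU.

    headroom=0.5 is a hypothetical budget that leaves the other half of
    memory for activations and communication buffers.
    """
    return per_gpu_gb(params_billions, num_gpus) <= gpu_gb * headroom
```

Under these assumptions an 8B model carries about 128 GB of model state (16 GB per GPU across 8 GPUs), while a 70B model carries about 1,120 GB (140 GB per GPU across 8 GPUs), which is where more GPUs or CPU offload come in.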

02 · LR 1e-5 to 5e-5, cosine, low warmup

Roughly an order of magnitude lower than typical LoRA learning rates. Cosine decay with 1% to 3% warmup. Weight decay 0.05 to 0.1. An aggressive LR can destroy alignment on Instruct bases.
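
The schedule described above (a short linear warmup, then cosine decay) can be sketched in a few lines. This is a minimal standalone version, not the scheduler of any particular library. The function name `lr_at`, the default peak of 2e-5, and the 3% warmup are assumptions chosen from inside the ranges given above.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.03, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step.

    Linear warmup to peak_lr over the first warmup_frac of training,
    then cosine decay from peak_lr down to min_lr.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=1000`, the rate ramps up over the first 30 steps, peaks at 2e-5, and decays toward zero by the end of training. Weight decay is configured on the optimizer and is not part of this sketch.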

03 · Checkpoint every epoch, eval every checkpoint

Disk is cheap, model regret is not. Keep at least the last 3 epoch checkpoints. Run the golden eval set after each.
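
As a small illustration of this retention rule, the sketch below decides which epoch checkpoints stay on disk and which one to promote based on golden-set scores. The function names and the tie-breaking rule (prefer the earlier epoch) are assumptions for illustration, and the scores in the usage note are made-up values.

```python
def checkpoints_to_keep(epochs_done, keep_last=3):
    """Epoch numbers (1-indexed) whose checkpoints should stay on disk."""
    first = max(1, epochs_done - keep_last + 1)
    return list(range(first, epochs_done + 1))

def best_checkpoint(golden_scores):
    """Epoch with the highest golden-set score.

    golden_scores maps epoch number to score (higher is better).
    On a tie the earlier epoch wins, since it has had less opportunity
    to drift from the base model (an assumed preference).
    """
    # max() returns the first maximal item, so iterate epochs in order.
    return max(sorted(golden_scores), key=lambda epoch: golden_scores[epoch])
```

For instance, `best_checkpoint({1: 0.71, 2: 0.78, 3: 0.78})` selects epoch 2, and `checkpoints_to_keep(5)` keeps epochs 3, 4, and 5.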

04 · Full eval suite before deploy

Standard benchmarks (MMLU-Pro, IFEval), a domain golden set, a safety eval (HarmBench), and a bias audit. Full SFT can regress on benchmarks you did not train for; the suite catches it.
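
One way to turn this suite into a deploy gate is to compare each candidate score against the base model and fail on any regression beyond a tolerance. The sketch below assumes every metric is on a 0 to 100 scale where higher is better, so a metric such as attack success rate would need to be inverted first. The metric keys, tolerances, and scores used with it are hypothetical.

```python
def eval_gate(baseline, candidate, max_drop=None, default_max_drop=1.0):
    """Compare candidate scores against the base model.

    baseline, candidate: dicts mapping metric name to score (higher is better).
    max_drop: optional per-metric allowed regression, in points.
    Returns (passed, failures), where failures is a list of messages.
    """
    max_drop = max_drop or {}
    failures = []
    for name, base_score in baseline.items():
        if name not in candidate:
            # A metric that was not run counts as a failure, not a pass.
            failures.append(f"{name}: missing from candidate results")
            continue
        drop = base_score - candidate[name]
        allowed = max_drop.get(name, default_max_drop)
        if drop > allowed:
            failures.append(f"{name}: dropped {drop:.1f} points (allowed {allowed:.1f})")
    return (not failures, failures)
```

A caller would block the deploy whenever `passed` is false and surface `failures` in the run report. Treating a missing metric as a failure keeps an incomplete eval run from passing by accident.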

[THE STACK WE'D DEPLOY]

What we run in production for Full SFT.

PyTorch FSDP · DeepSpeed · TRL · Megatron-LM · Weights & Biases
[ACCURACY · COST · TRADE]

The numbers we measure Full SFT on.

Accuracy ceiling: Highest of any method. Per Biderman et al. 2024, learns 10x to 100x higher-rank perturbations than LoRA.
Forgetting (vs base): Real and measurable. Mitigate with replay data or constrained LR.
Compute cost: 10x to 100x LoRA
Iteration speed: Days, not hours
When it earns the build

When LoRA has been tried and benchmarked below the bar, when you are training a new tokenizer surface, when you need a deeply re-tasked base (rare in 2026), when you have the data volume (>10k) and the compute to support it.

When it doesn't

When you have under 10k examples (memorization risk, LoRA wins on regularization), when LoRA cleared the bar (do not pay the full SFT tax for a marginal gain), or when you need fast iteration cycles.

[OUR TAKE]

We benchmark LoRA first. We only reach for full SFT when the numbers say so.

On most enterprise builds, LoRA at rank 16 with DoRA comes within 1 to 3 points of full SFT at 1% to 10% of the cost. Where the gap is real and the project requires closing it (hard reasoning, deep re-tasking), we move to full SFT with eyes open.

[COMMON QUESTIONS]

What buyers ask before they sign.

How much data do we need for full SFT?
At least 10k high-quality examples to avoid memorization. Below that, LoRA's implicit regularization wins. Above 100k, the full SFT accuracy ceiling starts to pull away. The right answer is to benchmark both on a held-out set.
Can full SFT make the model worse?
Yes, easily. Narrow training data erodes base capabilities (catastrophic forgetting). Mitigate with replay (blending 10 to 20% of the original instruction-tuning mix back in), a low LR, and a full eval-suite gate before deploy.
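
A minimal sketch of the replay mitigation follows. It reads the 10 to 20% figure as the share of the final training set made up of replay examples, which is an assumption; the figure could also be read as a share of the original mix. The function name `mix_with_replay` and its parameters are hypothetical.

```python
import random

def mix_with_replay(task_examples, replay_pool, replay_fraction=0.15, seed=0):
    """Blend general instruction data back into a narrow task dataset.

    replay_fraction is the target share of replay examples in the final
    mixed set (assumed reading of "10 to 20%").
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    # Solve n_replay / (n_task + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(task_examples) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(replay_pool))  # cannot sample more than exists
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = list(task_examples) + rng.sample(list(replay_pool), n_replay)
    rng.shuffle(mixed)
    return mixed
```

With 850 task examples and a 15% target, the sketch draws 150 replay examples for a mixed set of 1,000.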
Full SFT vs continued pretraining (CPT)?
CPT is self-supervised next-token training on a domain corpus. SFT is supervised training on (prompt, response) pairs. Run CPT first when the domain has new vocabulary or tokenization, then SFT. Skip CPT for narrow task adaptation.
FINE-TUNING · KENSINK LABS

Considering Full SFT? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.