Pick the parallelism
FSDP for 8B to 70B on 8 GPUs. DeepSpeed ZeRO-3 with CPU offload when you barely overflow VRAM. Megatron-LM only at frontier scale and only if you already run NVIDIA's stack.
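A back-of-envelope memory estimate can make this choice concrete. The sketch below assumes Adam in mixed precision at roughly 16 bytes per parameter (bf16 weights and gradients, plus fp32 master weights, momentum, and variance) and ignores activations entirely. The function names and the 0.7 headroom factor are hypothetical, chosen only for illustration.

```python
# Rough per-GPU memory for full fine-tuning with Adam in mixed precision.
# Assumption: ~16 bytes per parameter
#   2 (bf16 weights) + 2 (bf16 grads) + 4 + 4 + 4 (fp32 master, momentum, variance).
# Activation memory is NOT included.
BYTES_PER_PARAM = 16
GIB = 1024 ** 3


def sharded_state_gib(params_billion: float, num_gpus: int) -> float:
    """Weights + gradients + optimizer state per GPU when fully sharded."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM
    return total_bytes / num_gpus / GIB


def fits_without_offload(params_billion: float, num_gpus: int,
                         vram_gib: float, headroom: float = 0.7) -> bool:
    """True if sharded state fits in `headroom` of VRAM.

    The remaining fraction is left for activations; 0.7 is an assumed value.
    """
    return sharded_state_gib(params_billion, num_gpus) <= vram_gib * headroom


if __name__ == "__main__":
    for size in (8, 70):
        print(size, "B:", round(sharded_state_gib(size, 8), 1), "GiB per GPU on 8 GPUs")
```

Under these assumptions an 8B model needs about 15 GiB of sharded state per GPU on 8 GPUs, while a 70B model needs about 130 GiB, which is where offload or a second node enters the picture on 80 GiB cards.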
Full SFT updates every weight in the model on labelled (prompt, response) pairs. It is the high-cost, high-control option. "LoRA Learns Less and Forgets Less" (Biderman et al., 2024) showed that full SFT learns perturbations of 10x to 100x higher rank than typical LoRA configurations, which is the case for it and the case against it in one sentence: more capacity to learn the target task, and more capacity to forget what the base model already knew.
Most enterprise fine-tunes are well-served by a low-rank perturbation. Some are not: deep re-tasking (changing the model's fundamental skill), training a new tokenizer's surface, or pushing past LoRA's accuracy ceiling on hard reasoning. For those, full SFT updates every parameter directly.
Multi-node FSDP or DeepSpeed ZeRO-3, checkpoint at every epoch, eval gate before any deploy.
Set the learning rate about two orders of magnitude lower than you would for LoRA. Use a cosine schedule with 1% to 3% warmup and weight decay of 0.05 to 0.1. An aggressive LR destroys alignment on Instruct bases.
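As a minimal sketch of that schedule, the function below implements linear warmup followed by cosine decay in plain Python. The name `lr_at`, the 3% default warmup, and the peak value of 2e-6 in the usage lines are assumptions for illustration, not recommended settings for any particular model.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_frac: float = 0.03, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr over warmup_frac of training, then cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    PEAK = 2e-6  # assumed placeholder, e.g. two orders below a LoRA rate of 2e-4
    for s in (0, 15, 30, 500, 1000):
        print(s, lr_at(s, total_steps=1000, peak_lr=PEAK))
```

Most training frameworks ship an equivalent scheduler; writing it out by hand is mainly useful for checking what peak and final rates a given config actually produces.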
Disk is cheap, model regret is not. Keep at least the last 3 epoch checkpoints. Run the golden eval set after each.
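A retention rule like this is easy to automate. The sketch below assumes each epoch is saved to a directory named `epoch-<N>` under one root, which is a hypothetical naming scheme; adapt the pattern to whatever your trainer writes.

```python
import re
import shutil
from pathlib import Path

# Hypothetical layout: <root>/epoch-1, <root>/epoch-2, ...
_EPOCH_DIR = re.compile(r"^epoch-(\d+)$")


def prune_checkpoints(root: Path, keep: int = 3) -> list[Path]:
    """Delete all but the newest `keep` epoch checkpoints; return removed paths."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    found = []
    for child in root.iterdir():
        match = _EPOCH_DIR.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    found.sort(key=lambda item: item[0])           # oldest epoch first
    stale = [path for _, path in found[:-keep]]    # everything except the last `keep`
    for path in stale:
        shutil.rmtree(path)
    return stale
```

Calling `prune_checkpoints(Path("checkpoints"), keep=3)` after each epoch's eval run leaves the three most recent checkpoints in place and ignores any directory that does not match the pattern.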
Run standard benchmarks (MMLU-Pro, IFEval), a domain golden set, a safety eval (HarmBench), and a bias audit. Full SFT can regress on benchmarks you did not train for; the suite catches it.
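One way to turn the suite into a deploy gate is to compare each checkpoint against the base model and block on any regression beyond a tolerance. The sketch below assumes every score is oriented so that higher is better (for a safety eval, that means something like a refusal rate rather than an attack success rate). The function name and the 1-point default tolerance are hypothetical.

```python
def eval_gate(baseline: dict[str, float], candidate: dict[str, float],
              max_drop: float = 1.0) -> list[str]:
    """Return the metrics that are missing or dropped by more than max_drop points.

    Assumes higher is better for every metric. An empty list means the gate passes.
    """
    failures = []
    for name, base_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None or base_score - new_score > max_drop:
            failures.append(name)
    return failures


def may_deploy(baseline: dict[str, float], candidate: dict[str, float],
               max_drop: float = 1.0) -> bool:
    """True only if no tracked metric regressed beyond the tolerance."""
    return not eval_gate(baseline, candidate, max_drop)
```

Treating a missing metric as a failure is deliberate: a checkpoint that skipped part of the suite should not pass by omission.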
Use full SFT when LoRA has been tried and benchmarked below the bar, when you are training a new tokenizer surface, when you need a deeply re-tasked base (rare in 2026), and when you have the data volume (more than 10k examples) and the compute to support it.
Skip full SFT with under 10k examples (memorization risk; LoRA wins on regularization), when LoRA cleared the bar (do not pay the full SFT tax for a marginal gain), or when you need fast iteration cycles.
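The two lists above can be read as a simple decision rule. The sketch below is one possible reading, not a definitive policy: it treats data volume and compute as prerequisites and lets any of the "skip" conditions win. The function name, parameter names, and return labels are hypothetical.

```python
def recommend_method(num_examples: int, lora_cleared_bar: bool,
                     need_fast_iteration: bool = False,
                     has_compute: bool = True) -> str:
    """Return "lora" or "full_sft" from the criteria in the text (one interpretation)."""
    if num_examples < 10_000:
        return "lora"        # memorization risk; LoRA regularizes better
    if need_fast_iteration or not has_compute:
        return "lora"        # full SFT is too slow or too expensive here
    if lora_cleared_bar:
        return "lora"        # no reason to pay the full SFT tax
    return "full_sft"        # LoRA fell short and the prerequisites are met


if __name__ == "__main__":
    print(recommend_method(50_000, lora_cleared_bar=False))
    print(recommend_method(5_000, lora_cleared_bar=False))
```

Cases such as a new tokenizer surface or deep re-tasking fall under `lora_cleared_bar=False` in this reading, since they are the situations where LoRA is expected to miss the bar.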
On most enterprise builds, LoRA at rank 16 with DoRA matches full SFT within 1 to 3 points at 1% of the cost. Where the gap is real and the project requires it (hard reasoning, deep re-tasking), we move to full SFT with eyes open.
Sourcing, PII redaction (Presidio), synthetic data (Distilabel, Nemotron), DEITA quality scoring, MinHash + SemDedup, labeling vendors, feedback loops.
OpenAI RFT, Anthropic on Bedrock, Vertex, Azure Foundry, Databricks Mosaic, Together, Predibase, NeMo Customizer, Modal, Lambda. Side-by-side with our take.
Under 1k examples to over 1M, single A10G to 128 B200. Indicative cost, recommended method, hardware tier.
Continued pretraining, SFT, preference optimization (DPO, SimPO, ORPO), reasoning distillation (R1 lineage), model merging (TIES, DARE). The full build pipeline.
EU AI Act (Article 25 substantial-modification trap), GDPR, HIPAA, FedRAMP, Colorado AI Act, India DPDP, China GenAI Measures. Region-by-region for tuned LLMs.