Pick the base, not the Instruct
Start from a base model when you intend to teach a new format or behaviour. Start from an Instruct model when you only need to nudge style and want to preserve alignment. Mixing the two is the most common mistake.
LoRA trains two small low-rank matrices that compose with the frozen base weights. It delivers roughly 99% of the accuracy of a full fine-tune at about 1% of the VRAM, and the adapter file is small enough to ship as a build artifact. It is the first method to reach for and the one to beat before considering anything else.
Updating every weight in a 70B model is expensive (multi-node, days), destructive (you cannot keep the base current), and operationally painful (each tenant gets a 140GB checkpoint). LoRA factors the weight update into two low-rank matrices so you train a few million parameters instead of tens of billions, and ship a 50MB adapter instead of a model.
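To make the parameter saving concrete, here is a minimal sketch that counts trainable parameters for one weight matrix, assuming the usual LoRA form where the update to a d_out x d_in weight is the product of B (d_out x r) and A (r x d_in). The layer width of 8192 and the rank of 16 are illustrative assumptions, not figures for any specific model.

```python
def full_update_params(d_in: int, d_out: int) -> int:
    """Parameters touched when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical square projection layer; real models mix several shapes.
d = 8192
rank = 16

full = full_update_params(d, d)
lora = lora_params(d, d, rank)
ratio = lora / full

print(f"full update: {full:,} params")
print(f"LoRA update: {lora:,} params")
print(f"LoRA / full: {ratio:.4%}")
```

For this single layer the adapter trains well under 1% of the parameters a full update would touch. The total adapter size then depends on how many layers are targeted and the precision used to store it.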
Prepare data, freeze base, attach LoRA modules to attention and MLP, train, evaluate, ship adapter.
Rank 16 is our 2026 default. Go to a higher rank (32 to 64) only when DoRA or rsLoRA is also in play; otherwise you trade compute for marginal accuracy. Target all linear layers (q, k, v, o, gate, up, down). Skipping the MLP projections is a quiet accuracy loss.
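One way to see why higher rank pairs with rsLoRA is to compare the scaling factor applied to the low-rank update. The sketch below assumes the standard LoRA scaling of alpha / r and the rank-stabilized variant of alpha / sqrt(r). The alpha value of 32 and the Llama-style module names are assumptions for illustration; check the names against the actual model.

```python
import math

# Assumed Llama-style names for the seven linear projections listed above.
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",   # attention
    "gate_proj", "up_proj", "down_proj",      # MLP
]


def lora_scale(alpha: float, rank: int, rank_stabilized: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    if rank_stabilized:
        return alpha / math.sqrt(rank)  # rsLoRA-style scaling
    return alpha / rank                 # standard LoRA scaling


alpha = 32  # hypothetical value, not taken from the text
for rank in (16, 32, 64):
    standard = lora_scale(alpha, rank)
    stabilized = lora_scale(alpha, rank, rank_stabilized=True)
    print(f"rank={rank:>2}  standard={standard:.2f}  rank-stabilized={stabilized:.2f}")
```

With alpha held fixed, the standard factor shrinks in proportion to the rank, while the rank-stabilized factor shrinks only with its square root, which is one reason raising the rank alone tends to buy little.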
Cosine schedule, warmup 3%, weight decay 0. Watch eval loss every quarter epoch. The window between underfitting and overfitting is small for LoRA, and frequent eval gates catch it.
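The schedule above can be sketched in a few lines of plain Python. This is a minimal illustration assuming linear warmup over the first 3% of steps followed by cosine decay to zero. The peak learning rate of 2e-4 and the step counts are hypothetical, since the text does not specify them.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.03, min_lr: float = 0.0) -> float:
    """Linear warmup for warmup_frac of training, then cosine decay to min_lr."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


def eval_interval(steps_per_epoch: int) -> int:
    """Evaluate four times per epoch, never less often than every step."""
    return max(1, steps_per_epoch // 4)


total_steps = 1000   # hypothetical run length
peak_lr = 2e-4       # hypothetical peak learning rate
for step in (0, 29, 30, 515, 1000):
    print(step, f"{lr_at_step(step, total_steps, peak_lr):.2e}")
print("eval every", eval_interval(steps_per_epoch=500), "steps")
```

Most training frameworks ship an equivalent scheduler; the point here is only to show the shape of the curve and where the quarter-epoch eval points fall.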
200 to 1000 prompts captured before training begins, with expected behaviour written down. Block the adapter from production if it regresses on any one of: factuality, format, safety. Every adapter passes the same gate.
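A gate like this can be sketched as a simple comparison between baseline and candidate scores. The function name, the score dictionaries, and the zero tolerance below are all hypothetical; how each metric is actually scored on the held-out prompts is left out.

```python
GATED_METRICS = ("factuality", "format", "safety")


def gate_adapter(baseline: dict, candidate: dict, tolerance: float = 0.0):
    """Return (ship, regressions). Any single regression blocks the adapter."""
    regressions = [
        name for name in GATED_METRICS
        if candidate[name] < baseline[name] - tolerance
    ]
    return (not regressions, regressions)


# Hypothetical pass rates on the frozen eval set (higher is better).
baseline_scores = {"factuality": 0.91, "format": 0.98, "safety": 0.99}
candidate_scores = {"factuality": 0.93, "format": 0.99, "safety": 0.97}

ship, regressions = gate_adapter(baseline_scores, candidate_scores)
print("ship:", ship, "regressions:", regressions)
```

In this made-up example the candidate improves factuality and format but slips on safety, so the gate blocks it even though two of the three metrics got better.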
Single-task or single-tenant adaptation, multi-tenant SaaS with per-customer adapters, anywhere you want to keep the base current with provider updates, anything where memory is tight.
When you need 10x more model capacity (full SFT proven necessary in benchmarks), when the LoRA signal is being washed out by the base (try DoRA), when reasoning is the goal (consider GRPO/RFT instead).
Across legal extraction, support automation, structured generation, and code modernisation projects, LoRA at rank 16 with all-linear targeting has cleared the bar every time. We reach for full SFT only after benchmarking LoRA and proving the gap.
Sourcing, PII redaction (Presidio), synthetic data (Distilabel, Nemotron), DEITA quality scoring, MinHash + SemDedup, labeling vendors, feedback loops.
OpenAI RFT, Anthropic on Bedrock, Vertex, Azure Foundry, Databricks Mosaic, Together, Predibase, NeMo Customizer, Modal, Lambda. Side-by-side with our take.
Under 1k examples to over 1M, single A10G to 128 B200. Indicative cost, recommended method, hardware tier.
Continued pretraining, SFT, preference optimization (DPO, SimPO, ORPO), reasoning distillation (R1 lineage), model merging (TIES, DARE). The full build pipeline.
EU AI Act (Article 25 substantial-modification trap), GDPR, HIPAA, FedRAMP, Colorado AI Act, India DPDP, China GenAI Measures. Region-by-region for tuned LLMs.