Llama 4 Maverick (85.5% MMLU, late 2025), Qwen 3 (strong multilingual coverage), Mistral Small 4 (Apache 2.0, native function calling), Phi-4 (small but strong), DeepSeek V3 (MoE). Pick by license posture, target size, and the language coverage that matches your data. Start from the base checkpoint, not the Instruct variant, whenever you intend to retrain alignment.
1B to 100B tokens of curated domain text. Extend the vocabulary if the tokenizer splits domain terms into too many tokens. Replay 5 to 20% of the original instruction-tuning mix to limit catastrophic forgetting. Multi-node FSDP or Megatron-LM, checkpoint every 1B tokens. This is the stage that teaches vocabulary the base never saw (legal Latin, ICD codes, SMILES strings for chemistry, regional scripts).
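The replay ratio above can be sketched as a simple sampling rule. This is a minimal illustration, not a production data loader: the function name `mix_with_replay` and its parameters are hypothetical, and it mixes by document count, whereas a real pipeline would more likely balance by token count.

```python
import random
from typing import Iterator, Sequence


def mix_with_replay(
    domain: Sequence[str],
    replay: Sequence[str],
    replay_fraction: float,
    n_samples: int,
    seed: int = 0,
) -> Iterator[tuple[str, str]]:
    """Yield (source, document) pairs for a CPT stream.

    Each draw comes from the replay pool with probability
    `replay_fraction`, otherwise from the domain pool.
    Hypothetical helper: mixes by document, not by token.
    """
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the stream is reproducible
    for _ in range(n_samples):
        if rng.random() < replay_fraction:
            yield "replay", rng.choice(replay)
        else:
            yield "domain", rng.choice(domain)


if __name__ == "__main__":
    stream = list(mix_with_replay(["icd text"], ["general text"], 0.1, 1000))
    share = sum(1 for source, _ in stream if source == "replay") / len(stream)
    print(f"replay share: {share:.3f}")
```

Sampling per draw rather than pre-building a fixed schedule keeps the ratio easy to change between runs, at the cost of small random deviations from the target share.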
Full SFT for deep re-tasking, LoRA at rank 64 for cheaper iteration. LR 1e-5 to 5e-5, cosine decay, short warmup. The transition from CPT to SFT is delicate: too high an LR destroys the CPT capabilities, too low an LR and SFT does not stick.
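The schedule described above (short linear warmup, then cosine decay) can be written out in a few lines. This is a sketch under assumptions: the function name `lr_at`, the default peak of 2e-5 (picked from inside the stated range), and the 100-step warmup are illustrative choices, not values from the pipeline itself.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    peak_lr: float = 2e-5,    # assumed value inside the 1e-5 to 5e-5 range
    warmup_steps: int = 100,  # assumed "short warmup"
    min_lr: float = 0.0,
) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine decay from peak_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 99, 100, 550, 1000):
        print(s, lr_at(s, total_steps=1000))
```

Sweeping `peak_lr` across the stated range while holding the rest fixed is one way to probe where CPT capabilities start to degrade.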
DPO is our default. SimPO reports up to +6.4 points on AlpacaEval 2 over DPO and is worth benchmarking. ORPO collapses SFT and preference optimization into one stage if budget is tight. KTO when production feedback is thumbs up/down rather than preference pairs. Beta 0.1 to 0.5, 1 to 2 epochs.
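To make the role of beta concrete, here is the standard per-pair DPO objective written in plain Python. It is a sketch for intuition only: the function name and argument names are illustrative, the inputs are assumed to be summed sequence log-probabilities, and a real run would use a training library rather than this scalar version.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin is how much more the policy prefers the chosen response
    over the rejected one, relative to the frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); computed in a stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has moved toward the chosen response: loss drops below log(2).
    print(dpo_loss(-9.0, -13.0, -10.0, -12.0, beta=0.5))
```

A larger beta scales the margin more aggressively, so the same shift in log-probabilities moves the loss further, which is why beta acts as the knob for how tightly the policy stays tied to the reference.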
DeepSeek-R1 (Jan 2025) used roughly 800k curated samples, most of them verified reasoning trajectories, to SFT smaller students (Qwen 1.5B up to Llama 70B) into frontier-grade reasoning. This became the breakout pattern of 2025. Distill from R1, GPT-4.1, Claude Sonnet 4.5, or a custom reasoning teacher into the production-target student. Keep a verifier in the loop for math and code.
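"Verifier in the loop" amounts to rejection filtering: only teacher trajectories whose final answer passes a check become SFT examples for the student. The sketch below assumes a hypothetical sample layout (`prompt`, `trajectory`, `final_answer`, `reference`) and uses exact string match as a stand-in verifier; for code, the verifier would instead run unit tests.

```python
from typing import Callable, Iterable


def exact_match_verifier(sample: dict) -> bool:
    """Stand-in math verifier: final answer must equal the reference answer."""
    return sample["final_answer"].strip() == sample["reference"].strip()


def keep_verified(
    samples: Iterable[dict],
    verifier: Callable[[dict], bool],
) -> list[dict]:
    """Turn teacher samples into SFT pairs, dropping any that fail the verifier."""
    kept = []
    for sample in samples:
        if verifier(sample):
            kept.append(
                {"prompt": sample["prompt"], "completion": sample["trajectory"]}
            )
    return kept


if __name__ == "__main__":
    teacher_samples = [  # hypothetical teacher outputs
        {"prompt": "2+2?", "trajectory": "2+2 is 4.", "final_answer": "4", "reference": "4"},
        {"prompt": "3*3?", "trajectory": "3*3 is 6.", "final_answer": "6", "reference": "9"},
    ]
    print(keep_verified(teacher_samples, exact_match_verifier))
```

Passing the verifier as a callable keeps the filter the same for math and code; only the check changes.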
Combine task-specific fine-tunes into a single deployable with mergekit. TIES for general consolidation, DARE for noisy LoRAs, SLERP for two-model blends. Multi-skill consolidation in minutes, no GPU training needed.
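For intuition on why SLERP suits two-model blends, the sketch below interpolates two flattened weight vectors along the arc between them rather than along a straight line. It is an illustration of the math only, written in plain Python on lists; it is not mergekit's code, and the function name and the fallback to linear interpolation for near-parallel or zero vectors are assumptions of this sketch.

```python
import math


def slerp(a: list[float], b: list[float], t: float, eps: float = 1e-8) -> list[float]:
    """Spherical linear interpolation between vectors a (t=0) and b (t=1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        # Angle is undefined for a zero vector: fall back to plain lerp.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Vectors are (anti)parallel: fall back to plain lerp.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


if __name__ == "__main__":
    print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

For orthogonal unit vectors the midpoint keeps unit norm, whereas a straight average would shrink it; that norm preservation is the usual argument for SLERP over linear averaging when blending exactly two models.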
MMLU-Pro, IFEval, MT-Bench, AlpacaEval 2, Arena-Hard, domain golden set. HarmBench + JailbreakBench for safety. Bias audit. LLM-as-judge with calibration. Block the deploy on any regression.
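The "block the deploy on any regression" rule can be expressed as a small gate over benchmark scores. This is a minimal sketch with hypothetical names (`find_regressions`, the metric keys); it assumes every metric is oriented so that higher is better, which means safety metrics such as attack success rate would need to be inverted first.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, Optional[float]]]:
    """Return metrics where the candidate is missing or worse than baseline.

    Assumes higher is better for every metric. `tolerance` allows for
    run-to-run noise; 0.0 means any drop at all blocks the deploy.
    """
    regressions: dict[str, tuple[float, Optional[float]]] = {}
    for name, base_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None or new_score < base_score - tolerance:
            regressions[name] = (base_score, new_score)
    return regressions


if __name__ == "__main__":
    baseline = {"ifeval": 0.80, "mmlu_pro": 0.60}   # illustrative numbers
    candidate = {"ifeval": 0.81, "mmlu_pro": 0.58}  # illustrative numbers
    failed = find_regressions(baseline, candidate)
    if failed:
        raise SystemExit(f"deploy blocked, regressions: {failed}")
```

Treating a missing metric as a regression keeps a skipped benchmark from silently passing the gate.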
Sigstore + in-toto attestation per checkpoint. Each attestation chains the dataset hash to the base model hash to the checkpoint hash. Model card per OECD format with intended use, training data summary, evaluation results, known limitations, copyright posture. Documentation of this kind is required of GPAI providers under the EU AI Act.
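The hash chain can be sketched with the standard library alone. This shows only the digests a per-checkpoint attestation might carry; it does not call Sigstore or in-toto, and the record layout and function names are assumptions made for illustration. Real artifacts would be hashed by streaming files from disk rather than holding bytes in memory.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(
    dataset_bytes: bytes,
    base_model_bytes: bytes,
    checkpoint_bytes: bytes,
) -> dict[str, str]:
    """Link dataset, base model, and checkpoint digests in one record.

    Hypothetical layout: the record digest covers the three artifact
    digests, so changing any artifact changes the record digest.
    """
    record = {
        "dataset_sha256": sha256_hex(dataset_bytes),
        "base_model_sha256": sha256_hex(base_model_bytes),
        "checkpoint_sha256": sha256_hex(checkpoint_bytes),
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["record_sha256"] = sha256_hex(canonical.encode("utf-8"))
    return record


if __name__ == "__main__":
    print(json.dumps(provenance_record(b"data", b"base", b"ckpt"), indent=2))
```

A signing tool would then sign the record (or an attestation wrapping it), which is the step this sketch leaves out.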