Stage 1: SFT to get a strong starting point
Full SFT or LoRA SFT on instruction-following data. DPO works best from a well-behaved base, not from raw chat data.
Direct Preference Optimization (Rafailov et al., NeurIPS 2023) reframes RLHF as a classification loss on preference pairs. No reward model, no PPO, no rollouts. It has been the 2024-2026 production workhorse for alignment, with SimPO and ORPO as the strongest challengers.
RLHF with PPO needs a reward model, a value head, rollouts, and tight hyperparameter discipline. DPO takes the closed-form optimal policy of the KL-constrained RLHF objective, uses it to express the reward in terms of the policy itself, and trains that policy as binary classification on (prompt, chosen, rejected) triples. Same objective, no PPO loop, and it runs on the same infra as SFT.
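For reference, the per-pair objective from the DPO paper can be written as below. Here \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is the frozen SFT reference, y_w and y_l are the chosen and rejected responses to prompt x, \sigma is the logistic function, and \beta scales how strongly deviation from the reference is penalized. The symbol names follow the paper's notation rather than anything stated on this page.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The term inside \sigma is the difference between the implicit rewards of the chosen and rejected responses, which is why the loss behaves like logistic regression on preference pairs.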
SFT first, capture preferences, train DPO with the SFT model as reference, eval, deploy.
Each row is (prompt, chosen response, rejected response). Source from production thumbs, edit deltas (the user's edit becomes the chosen), or LLM-as-judge pairwise comparisons.
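A minimal sketch of turning feedback logs into preference rows, assuming a hypothetical `FeedbackEvent` record (the field names are illustrative, not from any real logging schema). The output uses the `prompt` / `chosen` / `rejected` keys that preference-training libraries such as TRL commonly expect.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackEvent:
    """Hypothetical production log record; field names are assumptions."""
    prompt: str
    response: str
    thumbs_up: Optional[bool] = None  # explicit thumbs signal, if any
    user_edit: Optional[str] = None   # the user's edited version, if any


def pairs_from_edits(events):
    """The user's edit becomes 'chosen'; the model's original becomes 'rejected'."""
    rows = []
    for e in events:
        if e.user_edit and e.user_edit.strip() != e.response.strip():
            rows.append(
                {"prompt": e.prompt, "chosen": e.user_edit, "rejected": e.response}
            )
    return rows


def pairs_from_thumbs(events):
    """Pair thumbs-up and thumbs-down responses that share the same prompt."""
    by_prompt = {}
    for e in events:
        if e.thumbs_up is None:
            continue
        bucket = by_prompt.setdefault(e.prompt, {"up": [], "down": []})
        bucket["up" if e.thumbs_up else "down"].append(e.response)

    rows = []
    for prompt, bucket in by_prompt.items():
        # zip stops at the shorter list, so unpaired responses are dropped.
        for chosen, rejected in zip(bucket["up"], bucket["down"]):
            rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return rows
```

Prompts that only ever received one kind of thumb produce no pair here; that single-response feedback is better suited to a method that does not need pairs.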
Beta 0.1 to 0.5 controls how much the policy can deviate from the SFT reference. Higher beta is more conservative, lower allows bigger preference shifts. 1 to 2 epochs typically.
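To make the role of beta concrete, here is a small pure-Python sketch of the per-pair DPO loss computed from summed sequence log-probabilities. The function names and the helper `logratio_gap_for_loss` are illustrative assumptions, not part of any library; the point is only to show that a higher beta reaches the same loss with a smaller move away from the reference.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta):
    """Per-pair DPO loss from sequence log-probabilities (illustrative)."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


def logratio_gap_for_loss(target_loss, beta):
    """Gap between chosen and rejected log-ratios needed to hit target_loss."""
    margin = -math.log(math.exp(target_loss) - 1.0)
    return margin / beta


# With policy == reference the margin is zero, so the loss is log(2) for any beta.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1)

# Deviation from the reference needed to bring the loss down to 0.3.
gap_low_beta = logratio_gap_for_loss(0.3, beta=0.1)
gap_high_beta = logratio_gap_for_loss(0.3, beta=0.5)
```

In this sketch `gap_low_beta` comes out five times larger than `gap_high_beta`: at beta 0.1 the policy has to move much further from the SFT reference to satisfy the same pair, which is the sense in which a lower beta allows bigger preference shifts.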
Run Arena-Hard and MT-Bench plus the domain golden set. DPO can over-refuse or shorten responses, and the eval suite is where that surfaces.
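Alongside the benchmarks, a cheap regression check on the golden set can flag over-refusal and shortened responses by comparing SFT and DPO outputs for the same prompts. The sketch below is a rough heuristic under stated assumptions: the refusal markers and both thresholds are illustrative placeholders, and a production check would more likely use a classifier or an LLM judge.

```python
# Illustrative refusal phrases; real traffic will need a broader detector.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable", "i won't")


def is_refusal(text):
    """Heuristic: a refusal phrase appears near the start of the response."""
    head = text.strip().lower()[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def summarize(responses):
    """Refusal rate and mean length in words for a list of responses."""
    if not responses:
        raise ValueError("responses must not be empty")
    n = len(responses)
    return {
        "refusal_rate": sum(is_refusal(r) for r in responses) / n,
        "mean_words": sum(len(r.split()) for r in responses) / n,
    }


def regressions(sft_responses, dpo_responses,
                max_refusal_increase=0.05, max_length_drop=0.25):
    """Return flags where the DPO model regressed against the SFT model."""
    before = summarize(sft_responses)
    after = summarize(dpo_responses)
    flags = []
    if after["refusal_rate"] - before["refusal_rate"] > max_refusal_increase:
        flags.append("over_refusal")
    if after["mean_words"] < (1.0 - max_length_drop) * before["mean_words"]:
        flags.append("shortened")
    return flags
```

An empty list means neither threshold was crossed; any flag is a reason to inspect samples before deploying.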
Use DPO after SFT, when you have preference data (thumbs, edits, side-by-side judgments) and the goal is alignment, helpfulness, or refusal calibration.
Skip DPO when there is no SFT base (run SFT first), when there is no preference data (the feedback loop has to exist), or when all you have is single-response binary feedback (use KTO instead).
DPO is the safe, well-supported choice in TRL, vLLM, and every major platform. SimPO can claim better headline numbers (up to +6.4 points on AlpacaEval 2 over DPO, per the SimPO paper), but DPO is the one we ship by default in 2026.
Sourcing, PII redaction (Presidio), synthetic data (Distilabel, Nemotron), DEITA quality scoring, MinHash + SemDedup, labeling vendors, feedback loops.
OpenAI RFT, Anthropic on Bedrock, Vertex, Azure Foundry, Databricks Mosaic, Together, Predibase, NeMo Customizer, Modal, Lambda. Side-by-side with our take.
Under 1k examples to over 1M, single A10G to 128 B200. Indicative cost, recommended method, hardware tier.
Continued pretraining, SFT, preference optimization (DPO, SimPO, ORPO), reasoning distillation (R1 lineage), model merging (TIES, DARE). The full build pipeline.
EU AI Act (Article 25 substantial-modification trap), GDPR, HIPAA, FedRAMP, Colorado AI Act, India DPDP, China GenAI Measures. Region-by-region for tuned LLMs.