Kensink Labs
METHOD · DIRECT PREFERENCE OPTIMIZATION

DPO. RLHF without the PPO loop.

Direct Preference Optimization (Rafailov et al., NeurIPS 2023) reframes RLHF as a classification loss on preference pairs. No reward model, no PPO, no rollouts. The 2024-2026 production workhorse for alignment, with SimPO and ORPO as the strongest challengers.

TRL · PEFT · HuggingFace · Together AI
Data: 5k to 100k preference pairs
Hardware: Same as LoRA SFT
Cost: ~2.5x SFT on Together AI pricing
Use: After SFT, before deploy
[WHY THIS EXISTS]

RLHF works. The PPO loop behind it is brittle, expensive, and a research project in its own right.

RLHF with PPO needs a reward model, a value head, rollouts, and tight hyperparameter discipline. DPO starts from the closed-form optimal policy of the KL-constrained RLHF objective, which lets the reward be written in terms of the policy itself, and trains that policy directly as a classification problem on (prompt, chosen, rejected) triples. It targets the same objective with no PPO loop and runs on the same infra as SFT.
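
For reference, the objective from Rafailov et al. can be written as below. The symbols are labels introduced here: the policy being trained is \pi_\theta, the frozen reference is \pi_{\mathrm{ref}}, (x, y_w, y_l) is a prompt with its chosen and rejected responses, \sigma is the logistic function, and \beta sets how tightly the policy is held to the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```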

  • Loss is the negative log-sigmoid of beta times the gap between the chosen and rejected log-likelihood ratios, each measured against the reference
  • Reference model is the frozen SFT checkpoint; it anchors the policy
  • One to two epochs at a low LR (1e-7 to 1e-6 for full DPO, around 5e-5 for LoRA DPO)
  • Capture preference pairs from thumbs-up/down, edit deltas, and side-by-side judgments
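
The loss for a single preference pair can be sketched in a few lines. This is a minimal illustration, not a training loop: it assumes you already have the summed token log-probabilities of each response under the policy and under the frozen reference. The function and argument names are ours, not from any library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each *_logp is the sequence log-probability of a response given the prompt.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the chosen response gains on the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At step 0 the policy equals the reference, so the margin is 0 and the
# loss is log(2), roughly 0.693.
initial = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After the policy shifts probability toward the chosen response, the loss drops.
shifted = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

In a real run the four log-probabilities come from forward passes over batches, and the loss is averaged before backpropagating through the policy only.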
[THE PIPELINE]

DPO, end to end.

SFT first, capture preferences, train DPO with the SFT model as reference, eval, deploy.

SFT checkpoint → Preference pairs (chosen, rejected) → DPO loss vs reference → Train 1 to 2 epochs → Eval (Arena-Hard, MT-Bench) → Ship

Stage 1: SFT to get a strong starting point

Full SFT or LoRA SFT on instruction-following data. DPO works best from a well-behaved SFT checkpoint, not from a raw pretrained base.

Stage 2: collect preference pairs

Each row is (prompt, chosen response, rejected response). Source them from production thumbs (paired with an alternative response to the same prompt), edit deltas (the user's edit becomes the chosen, the original output the rejected), or LLM-as-judge pairwise comparisons.
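
Turning feedback logs into rows might look like the sketch below. `FeedbackEvent` and its fields are a hypothetical log schema, not a real API. The output keys `prompt`, `chosen`, and `rejected` follow the column names TRL's preference datasets use.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class FeedbackEvent:
    """Hypothetical production log record; field names are illustrative."""
    prompt: str
    model_response: str
    user_edit: Optional[str] = None              # the user's edited version, if any
    preferred_alternative: Optional[str] = None  # response a rater preferred over model_response


def to_preference_rows(events: Iterable[FeedbackEvent]) -> Iterator[dict]:
    """Yield one (prompt, chosen, rejected) row per usable event."""
    for ev in events:
        chosen = ev.user_edit or ev.preferred_alternative
        if not chosen:
            continue  # a lone thumbs signal has no pair, so it cannot feed DPO
        if chosen.strip() == ev.model_response.strip():
            continue  # identical responses carry no preference signal
        yield {"prompt": ev.prompt, "chosen": chosen, "rejected": ev.model_response}


def to_jsonl(rows: Iterable[dict]) -> str:
    """Serialize rows as JSON Lines, one preference pair per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)
```

Events with only a thumbs-up or thumbs-down are dropped here. They are still useful, just not as DPO pairs.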

Stage 3: DPO training with low LR

Beta (typically 0.1 to 0.5) controls how far the policy can deviate from the SFT reference: higher beta is more conservative, lower beta allows bigger preference shifts. Train for 1 to 2 epochs.
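
As a starting point, the ranges on this page can be collected into one place. The helper names and the `dpo-out` directory below are hypothetical. The values are defaults picked from the ranges quoted here, not tuned results, and `to_trl_config` assumes TRL is installed and that its `DPOConfig` accepts `beta`, `learning_rate`, and `num_train_epochs`.

```python
def dpo_hparams(use_lora: bool, beta: float = 0.1) -> dict:
    """Starting hyperparameters for a DPO run, taken from the ranges above."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return {
        "beta": beta,
        # LoRA tolerates a much higher LR than full-parameter DPO.
        "learning_rate": 5e-5 if use_lora else 5e-7,
        # Start at one epoch; move to two only if evals justify it.
        "num_train_epochs": 1,
    }


def to_trl_config(hparams: dict, output_dir: str = "dpo-out"):
    """Build a TRL DPOConfig from the dict (requires `pip install trl`)."""
    from trl import DPOConfig  # imported lazily so the helper above runs without TRL

    return DPOConfig(output_dir=output_dir, **hparams)
```

The config is then passed to TRL's `DPOTrainer` together with the policy, the preference dataset, and, for full DPO, a frozen copy of the SFT checkpoint as the reference model. Argument names have changed across TRL releases, so check the version you have installed.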

Stage 4: eval before deploy

Arena-Hard and MT-Bench plus the domain golden set. DPO can over-refuse or drift in response length (most often toward longer answers); the eval suite is where that shows up.
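
A cheap regression check on the golden set can flag refusal and length shifts before the full benchmarks run. This is a minimal sketch, assuming you have SFT and DPO responses to the same prompts as plain strings. `REFUSAL_MARKERS`, `is_refusal`, and `regression_report` are hypothetical names, and the phrase match is a crude stand-in for a real refusal classifier or judge.

```python
# Hypothetical marker list; a production check would use a classifier or LLM judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


def is_refusal(text: str) -> bool:
    """Flag a response whose opening contains a refusal phrase."""
    head = text.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)


def regression_report(sft_outputs: list[str], dpo_outputs: list[str]) -> dict:
    """Compare DPO outputs against SFT outputs for the same prompts."""
    if not sft_outputs or len(sft_outputs) != len(dpo_outputs):
        raise ValueError("need two non-empty, equally sized output lists")
    n = len(sft_outputs)
    sft_words = sum(len(t.split()) for t in sft_outputs) / n
    dpo_words = sum(len(t.split()) for t in dpo_outputs) / n
    if sft_words == 0:
        raise ValueError("SFT outputs are all empty")
    return {
        # > 1.0 means DPO answers are longer on average, < 1.0 shorter.
        "length_ratio": dpo_words / sft_words,
        "sft_refusal_rate": sum(is_refusal(t) for t in sft_outputs) / n,
        "dpo_refusal_rate": sum(is_refusal(t) for t in dpo_outputs) / n,
    }
```

A length ratio far from 1.0, or a DPO refusal rate well above the SFT rate, is a reason to inspect the preference data before shipping.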

[THE STACK WE'D DEPLOY]

What we run in production for DPO.

TRL (DPOTrainer) · PEFT · HuggingFace Datasets · Together AI DPO · Predibase
[ACCURACY · COST · TRADE]

The numbers we measure DPO on.

Win rate vs SFT baseline: +10 to +30% on Arena-Hard (strong dependence on preference data quality)
Training data: 5k preference pairs is a floor
Cost vs SFT: ~2.5x (per Together AI pricing)
Risk: Over-refusal, length bias
When it earns the build

After SFT, once you have preference data (thumbs, edits, side-by-side judgments), and when the goal is alignment, helpfulness, or refusal calibration.

When it doesn't

Without an SFT base (run SFT first), without preference data (the feedback loop has to exist), or when all you have is single-response binary feedback with no paired alternative. That last case is KTO's input format, not DPO's.

[OUR TAKE]

Our default preference optimization method. We layer SimPO when the data is high quality.

DPO is the safe, well-supported choice: it is built into TRL and offered on every major platform, and the resulting checkpoints serve in vLLM like any other model. SimPO can claim better headline numbers (up to +6.4 points on AlpacaEval 2 over DPO, per the paper), but DPO is the one we ship by default in 2026.

[COMMON QUESTIONS]

What buyers ask before they sign.

DPO vs RLHF with PPO?
DPO wins on engineering tractability. PPO can squeeze out a bit more performance in skilled hands, but the gap has narrowed and the PPO infrastructure tax is large. We default to DPO.
How many preference pairs do we need?
5k is a floor for a noticeable effect. 20k to 50k is where most production wins land. Quality dominates quantity: a small clean set of human preferences beats a large noisy LLM-judge set.
DPO over an Instruct model?
Yes, but lower the LR and start beta at the low end (0.05 to 0.1). Lower beta loosens the anchor to the reference, so the reduced LR and a short run are what keep this a nudge rather than a retrain.
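
What beta does can be made concrete with a small calculation. For one preference pair, the loss is -log(sigmoid(beta * m)), where m is the gap between the chosen and rejected log-ratios against the reference. The sketch below inverts that to show how large m must be to reach a given loss. `margin_for_loss` is a name of ours, and the 0.1 target is arbitrary.

```python
import math


def margin_for_loss(target_loss: float, beta: float) -> float:
    """Log-ratio gap m (in nats) at which -log(sigmoid(beta * m)) equals target_loss.

    Positive for targets below log(2), the loss when policy equals reference.
    """
    if target_loss <= 0 or beta <= 0:
        raise ValueError("target_loss and beta must be positive")
    # Solve log(1 + exp(-beta * m)) = target_loss for m.
    return -math.log(math.expm1(target_loss)) / beta


# The same loss of 0.1 requires a gap inversely proportional to beta.
margins = {beta: margin_for_loss(0.1, beta) for beta in (0.05, 0.1, 0.5)}
```

With these numbers, beta = 0.5 satisfies the loss with a gap of about 4.5 nats, while beta = 0.05 needs about 45. A lower beta asks the policy to move ten times further from the reference on the same pair, which is why it is paired with a low LR.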
FINE-TUNING · KENSINK LABS

Considering DPO? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.