Kensink Labs
METHOD · SIMPLE PREFERENCE OPTIMIZATION

SimPO. DPO minus the reference model, plus length normalization.

Reference-free preference optimization with a length-normalized log-probability reward. The paper reports gains of up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard over DPO on the same training data (Meng et al., NeurIPS 2024). The strongest DPO challenger as of 2026.

TRL · PEFT · HuggingFace
Gain vs DPO: up to +6.4 AlpacaEval 2, up to +7.5 Arena-Hard
Memory: lower than DPO (no reference model)
Trade: slightly higher hyperparameter sensitivity
[WHY THIS EXISTS]

DPO carries a reference model and has length bias.

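DPO keeps a frozen reference model, usually the SFT checkpoint, in memory to anchor the loss. SimPO drops it: the reward is just the length-normalized log-probability of the response under the policy, scaled by a constant β, and the loss asks the chosen response to beat the rejected one by a target margin γ. Same preference data, one less model in memory, length bias controlled.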

  • Reward = mean token log-prob of the response under the policy, scaled by β
  • No reference model = lower memory and faster training
  • Length normalization keeps the reward from favoring longer outputs
  • Same (chosen, rejected) data as DPO
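
As a rough illustration of the reward and loss described above, here is a minimal pure-Python sketch for a single preference pair. It assumes the per-token log-probabilities of each response are already available as lists. The function names and the `beta` and `gamma` values are illustrative choices, not part of any library.

```python
import math

def simpo_reward(token_logps, beta):
    """Length-normalized reward: beta times the mean token log-prob."""
    return beta * sum(token_logps) / len(token_logps)

def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.0):
    """-log(sigmoid(r_chosen - r_rejected - gamma)) for one pair.

    beta and gamma here are placeholder values, not recommendations.
    """
    margin = (
        simpo_reward(chosen_logps, beta)
        - simpo_reward(rejected_logps, beta)
        - gamma
    )
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids
    # overflow when the margin is large in magnitude.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical token log-probs: a short chosen answer, a long rejected one.
chosen = [-0.5, -0.4, -0.6, -0.5]
rejected = [-0.6] * 12

print(simpo_reward(chosen, beta=2.0), simpo_reward(rejected, beta=2.0))
print(simpo_loss(chosen, rejected))
```

Because the reward is a mean rather than a sum, repeating tokens with the same per-token log-prob leaves the reward unchanged, which is the length control the bullets describe.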
[THE PIPELINE]

SimPO, end to end.

Same data shape as DPO. No reference model in memory. Train.

SFT checkpoint → Preference pairs → SimPO loss (length-normalized) → Train 1 to 2 epochs → Eval (Arena-Hard) → Ship
01

SFT base + preference data

Same starting point as DPO. SimPO does not change the data layer.
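
For concreteness, a preference record in the usual DPO-style layout can be sketched as below. This assumes the common `prompt` / `chosen` / `rejected` column names; the example rows and the `is_valid_pair` helper are hypothetical and only show the shape, not a required schema.

```python
from typing import List, TypedDict

class PreferencePair(TypedDict):
    prompt: str
    chosen: str    # preferred response
    rejected: str  # dispreferred response

REQUIRED = ("prompt", "chosen", "rejected")

def is_valid_pair(row: dict) -> bool:
    """True if all three fields are non-empty strings and the responses differ."""
    has_text = all(
        isinstance(row.get(key), str) and row[key].strip() for key in REQUIRED
    )
    return has_text and row["chosen"] != row["rejected"]

# Hypothetical rows for illustration only.
rows = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
    # Identical responses carry no preference signal, so this row is dropped.
    {"prompt": "Name a prime number.", "chosen": "7", "rejected": "7"},
]

clean: List[PreferencePair] = [row for row in rows if is_valid_pair(row)]
print(len(clean), "usable pair(s)")
```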

02

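TRL CPOTrainer with loss_type='simpo', simpo_gamma, and cpo_alpha

TRL supports SimPO via CPOTrainer with loss_type='simpo'. TRL's documentation describes setting cpo_alpha to 0 for the pure SimPO loss; a non-zero value adds CPO's behavior-cloning term on top. Start from the defaults for the rest, then tune simpo_gamma (target margin) and beta (reward scale) if results look off.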
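
A configuration along these lines might look like the sketch below. It assumes the `CPOConfig` field names `loss_type`, `simpo_gamma`, `cpo_alpha`, and `beta` as given in TRL's CPO documentation, and that `CPOConfig` is importable from the top-level `trl` package; the import path can differ between TRL releases, so check the installed version. The numeric values and the `simpo-out` directory are placeholders, not tuned recommendations.

```python
def simpo_config_kwargs(output_dir="simpo-out", beta=2.0, gamma=1.0):
    """Keyword arguments for a SimPO run (illustrative values only)."""
    return {
        "output_dir": output_dir,
        "loss_type": "simpo",    # switch the CPO trainer to the SimPO loss
        "cpo_alpha": 0.0,        # assumed: 0 disables the behavior-cloning term
        "simpo_gamma": gamma,    # target reward margin
        "beta": beta,            # reward scale
        "num_train_epochs": 1,
    }

def build_config(**overrides):
    """Build a CPOConfig; the import is lazy so the kwargs can be
    inspected without TRL installed."""
    from trl import CPOConfig  # assumed import path, may vary by version
    return CPOConfig(**{**simpo_config_kwargs(), **overrides})

print(simpo_config_kwargs())
```

The resulting config would then be passed to `CPOTrainer` together with the policy model, tokenizer, and preference dataset. No reference model argument is involved.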

[THE STACK WE'D DEPLOY]

What we run in production for SimPO.

TRL (CPOTrainer with loss_type='simpo') · PEFT · HuggingFace Datasets
[ACCURACY · COST · TRADE]

The numbers we measure SimPO on.

AlpacaEval 2 vs DPO: up to +6.4 points (same training data)
Arena-Hard vs DPO: up to +7.5 points
Memory at training: lower (no reference model)
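
To put a rough number on the memory line, the sketch below estimates the weights of one frozen reference copy. It assumes full fine-tuning with a separate reference model held in bf16; the 7B parameter count is a hypothetical example. With adapter-based training, some trainers can reuse the base weights as the reference, which shrinks this gap.

```python
def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for one copy of the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Hypothetical 7B-parameter model with bf16 weights (2 bytes per parameter).
reference_copy_gb = weights_gb(7e9)
print(f"Reference copy avoided: about {reference_copy_gb:.0f} GB of weights")
```

This counts weights only. The reference model's forward-pass activations are extra, and the policy's gradients and optimizer state are the same under both methods.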
When it earns the build

When the preference data is high quality, when training memory is tight, and when DPO has already been benchmarked and the upgrade is worth the extra hyperparameter discipline.

When it doesn't

When there is no SFT base or preference data, or when noisy preference labels make the length-normalized reward unstable.

[OUR TAKE]

Strongest DPO challenger. We benchmark it alongside DPO on new projects.

On clean preference data SimPO often wins, in line with the paper's numbers. On noisy data DPO can be more robust. We run both on the same data and pick by held-out evals.
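
The selection step can be sketched as a small comparison over held-out scores. This is an illustrative helper, not a description of any specific eval harness: the `pick_by_heldout` name, the per-example scores, and the `min_gain` threshold are all hypothetical.

```python
from statistics import mean

def pick_by_heldout(scores, baseline="dpo", min_gain=0.0):
    """Return (winner, mean scores).

    scores maps method name to per-example held-out scores. The baseline
    is kept unless a challenger beats its mean by more than min_gain.
    """
    means = {name: mean(values) for name, values in scores.items()}
    best = max(means, key=means.get)
    if best != baseline and means[best] - means[baseline] > min_gain:
        return best, means
    return baseline, means

# Hypothetical per-example judgments (1 = judged good, 0 = judged bad).
heldout = {
    "dpo": [1, 0, 1, 1],
    "simpo": [1, 1, 1, 1],
}
winner, means = pick_by_heldout(heldout)
print(winner, means)
```

Keeping the baseline on ties or small gains reflects the preference for the more battle-tested method unless the numbers clearly favor the challenger.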

[COMMON QUESTIONS]

What buyers ask before they sign.

SimPO vs DPO — which should we use?
Benchmark both. SimPO has the better headline numbers, DPO has more battle testing. On a new project we run both with the same data and pick by golden-set eval.