07 · REFERENCE-FREE DPO · SPECIALISED


METHOD · SIMPLE PREFERENCE OPTIMIZATION

SimPO. DPO minus the reference model, plus length normalization.

Reference-free preference optimization with a length-normalized log-probability reward. The paper reports gains of up to +6.4 points on AlpacaEval 2 and up to +7.5 points on Arena-Hard over DPO on the same training data (Meng et al., NeurIPS 2024). The strongest DPO challenger as of 2026.

TRL · PEFT · HuggingFace


Gain vs DPO

Up to +6.4 AlpacaEval 2, up to +7.5 Arena-Hard

Memory

Lower than DPO (no reference model)

Trade

Slightly higher hyperparameter sensitivity

[WHY THIS EXISTS]

DPO carries a reference model and has length bias.

DPO needs the SFT reference model in memory to anchor its loss. SimPO drops it: the reward is just the length-normalized log-probability of the response under the policy. Same preference data, one less model in memory, and length bias under control.

Reward = mean per-token log-prob of the response under the policy, scaled by β
No reference model = lower memory and faster training
Length normalization stops the reward from favoring longer outputs
Same (chosen, rejected) data as DPO
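
For reference, the reward and loss these bullets describe can be written as below. The notation follows the SimPO paper (Meng et al., 2024) rather than anything defined on this page: π_θ is the policy, x the prompt, y_w and y_l the chosen and rejected responses, |y| the response length in tokens, β a reward scale, and γ the target reward margin.

```latex
r_{\text{SimPO}}(x, y)
  = \frac{\beta}{|y|} \log \pi_\theta(y \mid x)
  = \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i})

\mathcal{L}_{\text{SimPO}}(\pi_\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
      - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
      - \gamma
    \right) \right]
```

No reference policy appears in either expression, which is where the memory saving comes from.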

[THE PIPELINE]

SimPO, end to end.

Same data shape as DPO. No reference model in memory. Train.

SFT checkpoint

Preference pairs

SimPO loss (length-normalized)

Train 1 to 2 epochs

Eval (Arena-Hard)

Ship
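
The "SimPO loss (length-normalized)" step can be sketched for a single preference pair in plain Python. This is an illustration of the objective, not the TRL implementation: the function names are hypothetical, the inputs are assumed to be per-token log-probabilities already computed by the policy, and the default `beta` and `gamma` values are placeholders rather than recommendations.

```python
import math
from typing import Sequence


def simpo_reward(token_logps: Sequence[float], beta: float) -> float:
    """Length-normalized reward: beta times the mean per-token log-prob."""
    if not token_logps:
        raise ValueError("response must contain at least one token")
    return beta * sum(token_logps) / len(token_logps)


def simpo_loss(
    chosen_logps: Sequence[float],
    rejected_logps: Sequence[float],
    beta: float = 2.0,   # placeholder reward scale
    gamma: float = 1.0,  # placeholder target margin
) -> float:
    """-log(sigmoid(r_chosen - r_rejected - gamma)) for one pair."""
    margin = (
        simpo_reward(chosen_logps, beta)
        - simpo_reward(rejected_logps, beta)
        - gamma
    )
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Two responses with the same per-token log-prob get the same reward,
# regardless of length: that is the length normalization at work.
short = [-0.5] * 2
long = [-0.5] * 6
print(simpo_reward(short, beta=2.0), simpo_reward(long, beta=2.0))
```

Without the division by length, the longer response above would have a summed log-prob three times as negative as the shorter one, so response length alone would move the reward.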

SFT base + preference data

Same starting point as DPO. SimPO does not change the data layer.
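
As a sketch of what "same data layer" means in practice, the snippet below checks one preference row before training. It assumes the `prompt` / `chosen` / `rejected` field names commonly used for preference datasets with TRL; `PreferencePair` and `validate_pair` are hypothetical helpers written for this example.

```python
from typing import TypedDict


class PreferencePair(TypedDict):
    prompt: str
    chosen: str
    rejected: str


def validate_pair(row: dict) -> PreferencePair:
    """Check that a row has the three text fields a preference trainer expects."""
    for key in ("prompt", "chosen", "rejected"):
        value = row.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")
    if row["chosen"] == row["rejected"]:
        # An identical pair carries no preference signal.
        raise ValueError("chosen and rejected responses are identical")
    return PreferencePair(
        prompt=row["prompt"], chosen=row["chosen"], rejected=row["rejected"]
    )


pair = validate_pair(
    {
        "prompt": "Summarise the ticket in one sentence.",
        "chosen": "The customer cannot reset their password.",
        "rejected": "The customer wrote a ticket.",
    }
)
```

A dataset that passes this kind of check for DPO should need no reshaping for SimPO.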

TRL CPOTrainer with loss_type='simpo', simpo_gamma + cpo_alpha

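TRL supports SimPO via CPOTrainer with loss_type='simpo'. Set cpo_alpha=0 for the pure SimPO loss; a non-zero value adds CPO's NLL term on the chosen response. Start from the defaults, then tune beta (reward scale) and simpo_gamma (target margin) if results look off.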
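
A minimal configuration sketch follows. It assumes a TRL release that exports `CPOConfig` and `CPOTrainer` at the top level and accepts `processing_class` (older releases name that argument `tokenizer`, and import paths may differ between versions). `build_simpo_trainer` is a hypothetical wrapper, and the `beta` and `simpo_gamma` values are illustrative rather than tuned.

```python
def build_simpo_trainer(model, tokenizer, train_dataset, output_dir="simpo-out"):
    """Return a CPOTrainer configured for the SimPO loss (sketch only)."""
    # Imported lazily so this sketch can be loaded without TRL installed.
    from trl import CPOConfig, CPOTrainer

    args = CPOConfig(
        output_dir=output_dir,
        loss_type="simpo",    # length-normalized, reference-free loss
        cpo_alpha=0.0,        # 0 drops CPO's NLL term, leaving pure SimPO
        beta=2.0,             # reward scale (illustrative value)
        simpo_gamma=1.0,      # target reward margin (illustrative value)
        num_train_epochs=1,   # the pipeline above trains for 1 to 2 epochs
    )
    return CPOTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,  # rows with prompt / chosen / rejected
        processing_class=tokenizer,   # `tokenizer=` on older TRL releases
    )
```

Calling `build_simpo_trainer(model, tokenizer, dataset).train()` would then run the job; only the policy is loaded, with no reference model.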

[THE STACK WE'D DEPLOY]

What we run in production for SimPO.

TRL (CPOTrainer with loss_type='simpo') · PEFT · HuggingFace Datasets

[ACCURACY · COST · TRADE]

The numbers we measure SimPO on.

AlpacaEval 2 vs DPO

Up to +6.4 points

Same training data

Arena-Hard vs DPO

Up to +7.5 points

Memory at training

Lower (no reference model)
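
For a rough sense of scale on the memory line, the arithmetic below estimates what a frozen copy of the weights costs. It counts weights only and ignores activations from the reference forward pass; the 7B parameter count and 2 bytes per parameter (bf16) are assumptions chosen for illustration, and `reference_weights_gb` is a hypothetical helper.

```python
def reference_weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory in GB (1e9 bytes) held by a frozen copy of the model weights."""
    return n_params * bytes_per_param / 1e9


# Hypothetical 7B-parameter model stored in bf16 (2 bytes per parameter).
print(reference_weights_gb(7e9))
```

The real saving depends on how the reference model is held in a given DPO setup, so treat this as an upper-level estimate rather than a measurement.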

When it earns the build

When the preference data is high quality, when training memory is tight, and when DPO has already been benchmarked and the upgrade is worth the extra hyperparameter discipline.

When it doesn't

When there is no SFT base or preference data, or when noisy preference labels make the length-normalized objective unstable.

[OUR TAKE]

Strongest DPO challenger. We benchmark it alongside DPO on new projects.

On clean preference data SimPO often wins, in line with the paper's numbers. On noisy data DPO can be more robust. We run both on the same data and pick by held-out evals.
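
One way to make "pick by held-out evals" concrete is a paired comparison over the same held-out prompts. The sketch below assumes each checkpoint has already been scored per example by whatever judge or metric the project uses (higher is better); `compare_checkpoints` and the score lists are hypothetical.

```python
from statistics import mean
from typing import Sequence


def compare_checkpoints(
    simpo_scores: Sequence[float], dpo_scores: Sequence[float]
) -> dict:
    """Paired comparison of two checkpoints scored on the same held-out prompts."""
    if not simpo_scores or len(simpo_scores) != len(dpo_scores):
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [s - d for s, d in zip(simpo_scores, dpo_scores)]
    n = len(diffs)
    return {
        "mean_diff": mean(diffs),                         # > 0 favors SimPO
        "simpo_win_rate": sum(d > 0 for d in diffs) / n,
        "dpo_win_rate": sum(d < 0 for d in diffs) / n,    # remainder are ties
    }


# Made-up scores for four held-out prompts, for illustration only.
report = compare_checkpoints([1, 0, 1, 1], [0, 0, 1, 0])
print(report)
```

Pairing by prompt keeps prompt difficulty out of the comparison, which matters when the held-out set is small.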

[COMMON QUESTIONS]

What buyers ask before they sign.

SimPO vs DPO: which should we use?

Benchmark both. SimPO has the better headline numbers, DPO has more battle testing. On a new project we run both with the same data and pick by golden-set eval.

02 · FINE-TUNING

Considering SimPO? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.
