Kensink Labs
08 · ONE-STAGE SFT + PREFERENCE · SPECIALISED
ORPO · Specialised
METHOD · ODDS-RATIO PREFERENCE OPTIMIZATION

ORPO. SFT and preference learning, in one stage.

ORPO (Hong et al., 2024) merges supervised fine-tuning and preference learning into a single loss. It trains on (prompt, chosen, rejected) data once, with no separate SFT step. Useful when the compute budget covers one run, not two.

TRL · PEFT
Stages: 1 (vs 2 for SFT then DPO)
Cost: ~ DPO
Trade: Less control over each stage
[WHY THIS EXISTS]

SFT then DPO is two runs. Two runs is two budgets.

ORPO adds an odds-ratio term to a standard SFT loss so the model learns the supervised target while pushing rejected responses down. One dataset, one run, one set of hyperparameters.

  • Loss = SFT NLL + lambda * odds-ratio (chosen vs rejected)
  • Same training data shape as DPO
  • Lambda controls how strong the preference push is
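
Written out, the bullets above correspond to the loss below, as we read the formulation in Hong et al., 2024. The symbols are our notation for this sketch: $y_w$ is the chosen response, $y_l$ the rejected one, $\sigma$ the logistic sigmoid, and $m$ the number of tokens in $y$.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{ORPO}} &= \mathcal{L}_{\mathrm{SFT}} + \lambda \, \mathcal{L}_{\mathrm{OR}} \\
\mathcal{L}_{\mathrm{SFT}} &= -\log P_\theta(y_w \mid x) \\
\mathcal{L}_{\mathrm{OR}} &= -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right) \\
\mathrm{odds}_\theta(y \mid x) &= \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)} \\
\log P_\theta(y \mid x) &= \frac{1}{m} \sum_{t=1}^{m} \log P_\theta(y_t \mid x, y_{<t})
\end{aligned}
```

The odds-ratio term shrinks as the chosen response becomes more likely than the rejected one, which is what pushes rejected responses down while the SFT term keeps fitting the chosen text.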
[THE PIPELINE]

ORPO, end to end.

One run, one loss.

Base model → (prompt, chosen, rejected) → ORPO loss → Train → Eval → Ship
01

Skip the SFT stage

Train directly with the ORPO loss on preference data. No separate SFT checkpoint.
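
Because there is no SFT stage to absorb bad rows, it can help to check the preference data before the single run starts. The sketch below is a minimal, hypothetical validator for the (prompt, chosen, rejected) shape. The names `PreferenceRecord` and `validate_record` are ours for illustration, not part of TRL or any other library.

```python
from typing import TypedDict


class PreferenceRecord(TypedDict):
    """One training example in the (prompt, chosen, rejected) shape."""
    prompt: str
    chosen: str
    rejected: str


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for key in ("prompt", "chosen", "rejected"):
        value = record.get(key)
        # Each field must be present and be a non-blank string.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair with identical responses carries no preference signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


example: PreferenceRecord = {
    "prompt": "Summarise the ticket in one sentence.",
    "chosen": "The user cannot log in after the password reset.",
    "rejected": "Ticket received.",
}
print(validate_record(example))
```

Running a check like this over the whole dataset and dropping or fixing flagged rows is one cheap way to protect the single training window.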

02

Tune lambda

Default 0.1 (TRL's ORPOConfig exposes lambda as `beta`). Higher values push the preference signal harder at the cost of SFT fidelity.
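
To get a feel for what lambda does, the sketch below splits the loss into its SFT part and its weighted preference part for a few lambda values. It is a toy calculation under stated assumptions: the function names are hypothetical, the inputs are length-normalised average log-probabilities, and the values -0.8 and -1.2 are made up for illustration rather than measured on any model.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp). Requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_terms(chosen_avg_logp: float, rejected_avg_logp: float, lam: float):
    """Return (sft_nll, weighted_preference_term) for one example."""
    sft_nll = -chosen_avg_logp
    # Log odds ratio of chosen vs rejected.
    z = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    odds_ratio_loss = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return sft_nll, lam * odds_ratio_loss


# Illustrative values only: chosen is somewhat more likely than rejected.
for lam in (0.05, 0.1, 0.5, 1.0):
    sft, pref = orpo_terms(-0.8, -1.2, lam)
    share = pref / (sft + pref)
    print(f"lambda={lam:<4} sft={sft:.3f} weighted_pref={pref:.3f} share={share:.1%}")
```

The weighted preference term grows linearly with lambda while the SFT term stays fixed, so the share of the loss spent on preference rises with it. That is the SFT-fidelity trade in miniature.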

[THE STACK WE'D DEPLOY]

What we run in production for ORPO.

TRL (ORPOTrainer) · PEFT
[ACCURACY · COST · TRADE]

The numbers we measure ORPO on.

Compute vs SFT+DPO: ~50% (one stage)
Quality vs SFT+DPO: Close on clean data
When it earns the build

One-shot fine-tunes, tight budgets, projects where the SFT+DPO loop has not paid off.

When it doesn't

When you want explicit control over the SFT stage, or when DPO has measurably won on this data.

[OUR TAKE]

A compute-saver. Worth benchmarking when budget pressure is real.

We use ORPO when the project has one compute window and the preference data is clean. Otherwise we stage SFT then DPO for the control.

[READ AT THE SOURCE]

Papers, docs, and primary sources.

[COMMON QUESTIONS]

What buyers ask before they sign.

ORPO vs SFT then DPO?
ORPO collapses two stages into one and roughly halves the compute. Quality is close on clean preference data and worse on noisy data. We benchmark both before committing.
FINE-TUNING · KENSINK LABS

Considering ORPO? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.