09 · BINARY-FEEDBACK PREFERENCE · SPECIALISED

METHOD · KAHNEMAN-TVERSKY OPTIMIZATION

KTO. Preference learning from thumbs up/down, not pairs.

KTO (Ethayarajh et al., 2024) trains on individual binary signals: this response was good, that one was bad. It drops the (chosen, rejected) pair requirement of DPO and SimPO, which matches reality: most production feedback is thumbs, not side-by-side comparisons.

TRL · PEFT

Data shape: (prompt, response, good/bad)

Source: Production thumbs, edit deltas

Trade: Needs class balance discipline

[WHY THIS EXISTS]

Most production feedback is single-response, not paired.

DPO needs a (chosen, rejected) pair. Most production systems capture a thumbs rating on one response at a time. KTO uses that single-response binary signal directly: it scores each response with a Kahneman-Tversky value function, which treats gains and losses asymmetrically (loss aversion), and uses separate weights to balance good against bad examples.

(prompt, response, label) triples, where label is thumbs up or thumbs down
Loss can weight bad responses more heavily than good ones (loss aversion); both weights are tunable
Works best with a desirable:undesirable ratio between roughly 1:10 and 10:1, corrected with the loss weights
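
The per-example loss described above can be sketched in a few lines. This is a minimal illustration following the form of the loss in the KTO paper, not TRL's implementation: the function name `kto_example_loss` is hypothetical, the log-probabilities are assumed to be summed over the response tokens, and `kl_ref` stands in for the batch-level KL reference point that a real trainer estimates.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_example_loss(
    policy_logp: float,      # log pi_theta(response | prompt), assumed summed over tokens
    ref_logp: float,         # log pi_ref(response | prompt) from the frozen reference model
    desirable: bool,         # True for thumbs up, False for thumbs down
    kl_ref: float = 0.0,     # reference point; a trainer estimates this per batch
    beta: float = 0.1,       # how sharply the value saturates
    desirable_weight: float = 1.0,
    undesirable_weight: float = 1.0,
) -> float:
    # Implied reward: how much more likely the policy makes this response
    # than the reference model does.
    reward = policy_logp - ref_logp
    if desirable:
        # Good responses gain value as the reward rises above the reference point.
        value = desirable_weight * sigmoid(beta * (reward - kl_ref))
        return desirable_weight - value
    # Bad responses gain value as the reward falls below the reference point.
    value = undesirable_weight * sigmoid(beta * (kl_ref - reward))
    return undesirable_weight - value
```

Raising `undesirable_weight` above `desirable_weight` is what makes the loss punish bad responses harder than it rewards good ones; with both at 1.0 the two classes are treated symmetrically.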

[THE PIPELINE]

KTO, end to end.

Capture thumbs in production. Train KTO. Repeat.

Production thumbs → (prompt, response, label) → KTO loss → Train → Ship + next round

Capture thumbs from day one

Log production thumbs and edit deltas. With an edit delta, the user's edit becomes the desirable response and the original becomes the undesirable one. This is the feedback loop the fine-tune runs on.
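
Turning those two feedback sources into training rows might look like the sketch below. The helper names `thumbs_to_row` and `edit_to_rows` are hypothetical, and the `prompt` / `completion` / `label` column names are an assumption based on the unpaired-preference format TRL's KTOTrainer expects; check them against the TRL version you run.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, bool]]


def thumbs_to_row(prompt: str, response: str, thumbs_up: bool) -> Row:
    # One thumbs event becomes one labelled example.
    return {"prompt": prompt, "completion": response, "label": bool(thumbs_up)}


def edit_to_rows(prompt: str, original: str, edited: str) -> List[Row]:
    # An edit that changes nothing (ignoring surrounding whitespace) carries no signal.
    if original.strip() == edited.strip():
        return []
    return [
        # The user's edit is treated as the desirable response...
        {"prompt": prompt, "completion": edited, "label": True},
        # ...and the model's original output as the undesirable one.
        {"prompt": prompt, "completion": original, "label": False},
    ]


rows = [thumbs_to_row("Summarise the ticket.", "Customer wants a refund.", True)]
rows += edit_to_rows("Draft a reply.", "Hey, no.", "Thanks for reaching out. We can't offer that.")
```

Each edit delta yields one desirable and one undesirable row, so a feed dominated by edits stays naturally balanced, while a feed dominated by thumbs inherits whatever ratio users click.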

Train KTO with desirable_weight, undesirable_weight

Set the two weights to balance the loss for your class ratio. TRL applies them inside the KTO loss.
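
One way to pick the two weights is to up-weight the minority class until the weighted counts match a target ratio. The sketch below is illustrative: `balance_weights` is a hypothetical helper, and the default target of 1:1 is an assumption (the TRL documentation, as we read it, suggests keeping the weighted desirable:undesirable ratio between 1:1 and 4:3).

```python
from typing import Tuple


def balance_weights(
    n_desirable: int,
    n_undesirable: int,
    target_ratio: float = 1.0,  # desired (weighted desirable) / (weighted undesirable)
) -> Tuple[float, float]:
    """Return (desirable_weight, undesirable_weight), keeping the majority class at 1.0."""
    if n_desirable <= 0 or n_undesirable <= 0:
        raise ValueError("KTO needs at least one example of each class")
    # Weight the undesirable class would need if desirable_weight stayed at 1.0.
    undesirable_weight = n_desirable / (n_undesirable * target_ratio)
    if undesirable_weight >= 1.0:
        return 1.0, undesirable_weight
    # Otherwise desirable examples are the minority: up-weight them instead.
    return target_ratio * n_undesirable / n_desirable, 1.0


# 900 thumbs up, 100 thumbs down: weight each thumbs down 9x.
desirable_weight, undesirable_weight = balance_weights(900, 100)
```

The resulting values are what you would pass as `desirable_weight` and `undesirable_weight` on TRL's `KTOConfig`.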

[THE STACK WE'D DEPLOY]

What we run in production for KTO.

TRL (KTOTrainer) · PEFT

[ACCURACY · COST · TRADE]

The numbers we measure KTO on.

Data shape: Single response + label

Loss asymmetry: Bad weighted ~2x good (tunable)

When it earns the build

When production feedback is thumbs, when you cannot run pairwise comparisons, when iterative shipping is the discipline.

When it doesn't

When you have clean pairwise data (use DPO or SimPO), or when class imbalance is extreme (1:100 or worse).

[OUR TAKE]

Underrated. The right answer when feedback is thumbs.

Most production teams have thumbs data and no pairwise data. KTO is the method that maps onto that reality.

[COMMON QUESTIONS]

What buyers ask before they sign.

KTO vs DPO when we have both kinds of data? If pairs are clean and abundant, DPO. If thumbs are abundant and pairs are sparse, KTO. If you have both, benchmark both; KTO often wins on production data because its training distribution matches what the model sees in deployment.

Considering KTO? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.
