Kensink Labs
09 · BINARY-FEEDBACK PREFERENCE · SPECIALISED
METHOD · KAHNEMAN-TVERSKY OPTIMIZATION

KTO. Preference learning from thumbs up/down, not pairs.

KTO (Ethayarajh et al., 2024) trains on individual binary signals: this response was good, this one was bad. Drops the (chosen, rejected) pair requirement of DPO and SimPO. Matches reality: production feedback is thumbs, not side-by-side comparisons.

TRL · PEFT
Data shape: (prompt, response, good/bad)
Source: Production thumbs, edit deltas
Trade: Needs class balance discipline
[WHY THIS EXISTS]

Most production feedback is single-response, not paired.

DPO needs a (chosen, rejected) pair. Most production captures thumbs on one response at a time. KTO uses the single-response binary signal directly, with a Kahneman-Tversky utility function (loss aversion) to balance the asymmetry between good and bad.

  • (prompt, response, label) triples, where label is thumbs up or thumbs down
  • Loss weights bad more heavily than good (loss aversion)
  • Requires a desirable:undesirable ratio between roughly 1:10 and 10:1
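
As a rough sketch of how that asymmetric loss behaves, the snippet below follows the per-example form described in Ethayarajh et al. (2024): the implied reward is the policy-to-reference log-ratio, measured against a KL reference point and passed through a sigmoid. The function and argument names are illustrative assumptions, not TRL's implementation. In real training the log-probabilities come from the two models and the KL term is estimated per batch.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_example_loss(
    policy_logp: float,        # log pi_theta(y|x), summed over response tokens
    ref_logp: float,           # log pi_ref(y|x) from the frozen reference model
    desirable: bool,           # True for thumbs up, False for thumbs down
    kl_estimate: float = 0.0,  # batch-level KL reference point (assumed given)
    beta: float = 0.1,
    desirable_weight: float = 1.0,
    undesirable_weight: float = 1.0,
) -> float:
    """Per-example KTO loss; all names here are hypothetical."""
    reward = policy_logp - ref_logp      # implied reward: log-ratio
    z0 = max(0.0, kl_estimate)           # reference point is kept non-negative
    if desirable:
        # Good responses: loss falls as the reward rises above the reference point.
        return desirable_weight * (1.0 - sigmoid(beta * (reward - z0)))
    # Bad responses: loss falls as the reward drops below the reference point.
    return undesirable_weight * (1.0 - sigmoid(beta * (z0 - reward)))
```

With both weights at 1.0 the two branches are mirror images. Setting `undesirable_weight=2.0` doubles the penalty on thumbs-down examples, which is one way to express "bad weighted more heavily than good".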
[THE PIPELINE]

KTO, end to end.

Capture thumbs in production. Train KTO. Repeat.

Production thumbs → (prompt, response, label) → KTO loss → Train → Ship + next round
01 · Capture thumbs from day one

Log production thumbs and edit deltas. For an edit delta, the user's edited version becomes the desirable response and the original becomes the undesirable one. This is the feedback loop the fine-tune runs on.
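
One way the capture step could be turned into training rows is sketched below. The `FeedbackEvent` type and its fields are hypothetical stand-ins for whatever your logging emits. The output keys (`prompt`, `completion`, `label`) follow the unpaired-preference layout that TRL's KTOTrainer documents, which is worth confirming against the TRL version you run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackEvent:
    """Hypothetical shape of one logged production interaction."""
    prompt: str
    response: str
    thumbs_up: Optional[bool] = None        # None when the user gave no thumb
    edited_response: Optional[str] = None   # set when the user rewrote the response


def to_kto_records(events: list[FeedbackEvent]) -> list[dict]:
    """Convert logged events into (prompt, completion, label) rows."""
    records = []
    for ev in events:
        if ev.edited_response is not None and ev.edited_response != ev.response:
            # Edit delta: the edit is desirable, the original is undesirable.
            records.append({"prompt": ev.prompt, "completion": ev.edited_response, "label": True})
            records.append({"prompt": ev.prompt, "completion": ev.response, "label": False})
        elif ev.thumbs_up is not None:
            # Plain thumb: one row, labelled by the thumb.
            records.append({"prompt": ev.prompt, "completion": ev.response, "label": ev.thumbs_up})
        # Events with neither a thumb nor an edit carry no signal and are skipped.
    return records
```

In this sketch an edit takes precedence over a thumb on the same event. That is a design choice, not a requirement of KTO.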

02 · Train KTO with desirable_weight, undesirable_weight

Balance the loss for the class ratio. TRL handles the math.
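
A minimal sketch of picking the two weights from label counts follows. The helper name and the default target are assumptions. It keeps the majority class at weight 1.0 and upweights the minority so that `desirable_weight * n_desirable` relative to `undesirable_weight * n_undesirable` lands on a chosen ratio. TRL's KTOTrainer documentation suggests keeping that effective ratio between 1:1 and 4:3.

```python
def balance_weights(
    n_desirable: int,
    n_undesirable: int,
    target_ratio: float = 1.0,
) -> tuple[float, float]:
    """Return (desirable_weight, undesirable_weight) such that
    (desirable_weight * n_desirable) / (undesirable_weight * n_undesirable)
    equals target_ratio. Hypothetical helper, not part of TRL."""
    if n_desirable <= 0 or n_undesirable <= 0:
        raise ValueError("KTO needs at least one example of each class")
    if target_ratio <= 0:
        raise ValueError("target_ratio must be positive")

    if n_desirable < target_ratio * n_undesirable:
        # Desirable examples are short of the target: upweight them.
        return target_ratio * n_undesirable / n_desirable, 1.0
    # Undesirable examples are short of the target (or classes are balanced).
    return 1.0, n_desirable / (target_ratio * n_undesirable)
```

The returned pair would then be passed as `desirable_weight` and `undesirable_weight` on TRL's `KTOConfig`. For example, 100 thumbs up against 1,000 thumbs down gives `(10.0, 1.0)`.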

[THE STACK WE'D DEPLOY]

What we run in production for KTO.

TRL (KTOTrainer) · PEFT
[ACCURACY · COST · TRADE]

The numbers we measure KTO on.

Data shape: Single response + label
Loss asymmetry: Bad weighted ~2x good (tunable)
When it earns the build

When production feedback is thumbs, when you cannot run pairwise comparisons, when iterative shipping is the discipline.

When it doesn't

When you have clean pairwise data (use DPO or SimPO), when class balance is extreme (1:100+).

[OUR TAKE]

Underrated. The right answer when feedback is thumbs.

Most production teams have thumbs data and no pairwise data. KTO is the method that maps onto that reality.

[COMMON QUESTIONS]

What buyers ask before they sign.

KTO vs DPO when we have both kinds of data?
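If pairs are clean and abundant, use DPO. If thumbs are abundant and pairs are sparse, use KTO. If you have both, benchmark the two on your own evals. KTO often wins on production data because the thumbs were collected on responses the deployed model actually produced, so the training distribution matches deployment.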
FINE-TUNING · KENSINK LABS

Considering KTO? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.