Capture thumbs from day one
Capture production thumbs and edit deltas. When a user edits a response, the edit becomes the desirable example and the original becomes the undesirable one. This is the feedback loop the fine-tune runs on.
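As a rough illustration of that mapping, the sketch below turns one captured feedback event into unpaired training records. `FeedbackEvent`, its field names, and `to_kto_records` are hypothetical names invented for this example. The `prompt` / `completion` / `label` keys follow the unpaired-preference layout that TRL's KTO trainer documents, but check them against the TRL version you run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackEvent:
    """One captured interaction (hypothetical schema, not from the source)."""
    prompt: str
    response: str
    thumb: Optional[bool] = None      # True = thumbs up, False = thumbs down, None = no vote
    user_edit: Optional[str] = None   # text the user saved after editing the response


def to_kto_records(event: FeedbackEvent) -> list:
    """Convert one event into zero, one, or two binary-labelled records."""
    records = []
    edited = (
        event.user_edit is not None
        and event.user_edit.strip() != event.response.strip()
    )
    if edited:
        # Edit delta: the user's edit is desirable, the original is undesirable.
        records.append({"prompt": event.prompt, "completion": event.user_edit, "label": True})
        records.append({"prompt": event.prompt, "completion": event.response, "label": False})
    elif event.thumb is not None:
        # Plain thumb: a single binary signal on a single response.
        records.append({"prompt": event.prompt, "completion": event.response, "label": event.thumb})
    return records
```

In this sketch an edit takes precedence over a thumb on the same response. That is one possible design choice, not something the method prescribes.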
KTO (Ethayarajh et al., 2024) trains on individual binary signals: this response was good, this one was bad. It drops the (chosen, rejected) pair requirement of DPO and SimPO, which matches reality: production feedback is thumbs, not side-by-side comparisons.
DPO needs a (chosen, rejected) pair. Most production systems capture thumbs on one response at a time. KTO uses the single-response binary signal directly, with a Kahneman-Tversky value function (which models loss aversion) and separate weights on desirable and undesirable examples to handle the asymmetry between good and bad.
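To make the single-response objective concrete, here is a minimal per-example sketch of the loss, paraphrased from the formulation in the KTO paper. The function and parameter names are our own. The reference point `kl_ref_point` (written z0 in the paper) is estimated per batch in practice; it is passed in as a plain number here to keep the example self-contained.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_example_loss(
    policy_logp: float,         # log pi_theta(y | x) under the model being tuned
    ref_logp: float,            # log pi_ref(y | x) under the frozen reference model
    desirable: bool,            # True for thumbs up, False for thumbs down
    kl_ref_point: float = 0.0,  # z0: batch-level KL estimate (assumed given)
    beta: float = 0.1,          # how sharply the value saturates
    lambda_d: float = 1.0,      # weight on desirable examples
    lambda_u: float = 1.0,      # weight on undesirable examples
) -> float:
    """Per-example loss lambda_y - v(x, y) for one binary-labelled response."""
    reward = policy_logp - ref_logp  # implied reward: log-ratio against the reference
    if desirable:
        value = lambda_d * sigmoid(beta * (reward - kl_ref_point))
        return lambda_d - value
    value = lambda_u * sigmoid(beta * (kl_ref_point - reward))
    return lambda_u - value
```

Raising the policy's log-probability on a desirable response lowers the loss; raising it on an undesirable response increases the loss. Setting `lambda_u` above `lambda_d` makes bad responses count for more, which is where the loss-aversion framing comes in.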
Capture thumbs and edit deltas in production. Train KTO. Repeat.
When you train, balance the loss for the class ratio of desirable to undesirable examples. TRL handles the math.
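The class-ratio balancing can be sketched as a small helper that picks a weight for each class so that weight times example count comes out equal on both sides. `kto_class_weights` is a hypothetical helper, not a TRL function. The two values it returns are meant to be passed to the `desirable_weight` and `undesirable_weight` fields of TRL's `KTOConfig`; confirm those field names against the TRL version you have installed.

```python
def kto_class_weights(n_desirable: int, n_undesirable: int) -> tuple:
    """Return (desirable_weight, undesirable_weight) such that
    desirable_weight * n_desirable == undesirable_weight * n_undesirable.

    The majority class keeps weight 1.0 and the minority class is upweighted.
    """
    if n_desirable <= 0 or n_undesirable <= 0:
        raise ValueError("KTO needs at least one example of each class")
    if n_desirable >= n_undesirable:
        return 1.0, n_desirable / n_undesirable
    return n_undesirable / n_desirable, 1.0


# Example: 900 thumbs up and 100 thumbs down.
desirable_weight, undesirable_weight = kto_class_weights(900, 100)
# Assumed usage with TRL (not executed here):
#   KTOConfig(desirable_weight=desirable_weight,
#             undesirable_weight=undesirable_weight, ...)
```

This targets an exact 1:1 weighted ratio, which is a simplifying assumption. TRL's documentation suggests keeping the weighted ratio of desirable to undesirable examples somewhere between 1:1 and 4:3, so a slightly smaller minority-class weight can also be reasonable.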
Use KTO when production feedback is thumbs, when you cannot run pairwise comparisons, and when iterative shipping is the discipline.
Skip KTO when you have clean pairwise data (use DPO or SimPO instead) or when class imbalance is extreme (1:100 or worse).
Most production teams have thumbs data and no pairwise data. KTO is the method that maps onto that reality.