Skip the SFT stage
Train directly with the ORPO loss on preference data. No separate SFT checkpoint.
ORPO (Hong et al., 2024) merges supervised fine-tuning and preference learning into a single loss. It trains once on (prompt, chosen, rejected) data, with no separate SFT step. Useful when the compute budget covers one run, not two.
ORPO adds an odds-ratio term to a standard SFT loss so the model learns the supervised target while pushing rejected responses down. One dataset, one run, one set of hyperparameters.
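A minimal sketch of that combined loss for a single example, assuming the formulation in Hong et al., 2024: the SFT term is the length-averaged negative log-likelihood of the chosen response, and the odds-ratio term is -log sigmoid of the log odds ratio between chosen and rejected. The function names and the `lam` argument are illustrative, not taken from any library, and the inputs stand in for per-token average log-probabilities that a real model would produce.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) where p = exp(avg_logp); avg_logp must be negative."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, lam: float = 0.1):
    """Return (total, sft_term, odds_ratio_term) for one preference pair.

    chosen_avg_logp / rejected_avg_logp: length-averaged log-probability
    of each response under the model being trained (hypothetical inputs).
    lam: weight on the odds-ratio term (an assumption of this sketch).
    """
    # SFT term: negative log-likelihood of the chosen response.
    sft_term = -chosen_avg_logp
    # Log odds ratio: positive when the chosen response is more likely.
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return sft_term + lam * odds_ratio_term, sft_term, odds_ratio_term


total, sft_term, odds_ratio_term = orpo_loss(-0.4, -1.2, lam=0.1)
print(f"sft={sft_term:.3f} odds_ratio={odds_ratio_term:.3f} total={total:.3f}")
```

When the model rates both responses equally, the odds-ratio term sits at log 2; it shrinks as the chosen response becomes more likely than the rejected one. A training implementation would compute the same quantities on batched tensors and would want a numerically stable log-sigmoid.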
One run, one loss.
Weight on the odds-ratio term (λ in the paper): default 0.1. Higher values push the preference signal harder at the cost of SFT fidelity.
One-shot fine-tunes, tight budgets, projects where the SFT+DPO loop has not paid off.
When you want explicit control over the SFT stage, or when DPO has measurably won on this data.
We use ORPO when the project has one compute window and the preference data is clean. Otherwise we stage SFT then DPO for the control.