Stage 0: define the verifier
The single most important step. A pure-function verifier (does this Python pass these tests, does this JSON match this schema) is the gold standard. LLM-as-judge verifiers work but introduce reward hacking risk.
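As an illustration of what a pure-function verifier can look like, here is a minimal sketch for a structured-extraction task. The schema (`invoice_id`, `total`, `currency`) and the function name are hypothetical and not taken from any particular pipeline; the point is only that the same input always yields the same score, with no model in the loop.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS = {"invoice_id": str, "total": float, "currency": str}


def verify_extraction(completion: str) -> float:
    """Return 1.0 if the completion is a JSON object matching REQUIRED_FIELDS, else 0.0."""
    try:
        payload = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # the output does not parse at all
    if not isinstance(payload, dict):
        return 0.0  # valid JSON, but not an object
    # Reject missing or extra keys so padding the object cannot earn reward.
    if set(payload) != set(REQUIRED_FIELDS):
        return 0.0
    for name, expected_type in REQUIRED_FIELDS.items():
        value = payload[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return 0.0
    return 1.0
```

A stricter verifier would also check value ranges or compare against a labelled answer; this sketch only checks shape.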
Group Relative Policy Optimization (GRPO; introduced by DeepSeek in the DeepSeekMath paper in 2024 and popularized by DeepSeek-R1 in January 2025) trains the model with verifiable rewards (was the answer right, does the format parse, did the tool call validate) by normalizing rewards within sampled groups. It drops the value critic and, with it, about half the engineering. OpenAI's Reinforcement Fine-Tuning API is a managed counterpart, on o4-mini at $100 per training hour with a $5k per-job cap.
Math, code, tool use, and structured extraction all reward longer chains of thought and verifiable outputs. SFT trains on a single answer per prompt. DPO trains on a pair. RFT samples N attempts, scores each against a verifier, and updates the policy to favour the winning attempts. GRPO drops the value critic by scoring each attempt relative to its group: the advantage is the attempt's reward minus the group's mean reward, typically scaled by the group's standard deviation.
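The group-relative scoring can be sketched in a few lines. This is a simplified illustration, not a training loop: the function name and the `eps` guard are assumptions, and using the population standard deviation is a choice made here for simplicity (implementations differ on that detail).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize verifier rewards within one group of attempts for the same prompt.

    Each attempt's advantage is its reward minus the group mean, divided by the
    group standard deviation. No learned value critic is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every attempt gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group scored `[1.0, 0.0, 0.0, 1.0]`, the advantages come out close to `[1, -1, -1, 1]`. If every attempt in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.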
SFT first to get sensible attempts. Then group-sampled RL with a verifier in the loop.
GRPO from scratch on a base model can take on the order of 100x longer. SFT first gets the model producing attempts in a structure the verifier can actually grade.
Sample N attempts, score, normalize, update. A KL constraint against the SFT reference keeps the policy from collapsing. Watch entropy and reward variance at every step.
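The two health signals mentioned above, entropy and reward variance, can be computed per step with a small helper. This is a rough sketch: the function names and both thresholds are hypothetical placeholders to tune per task, and a real trainer would read token probabilities from logits rather than plain lists.

```python
import math
from statistics import pvariance


def mean_token_entropy(token_distributions: list[list[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def step_health(rewards, token_distributions, min_entropy=0.1, min_reward_var=1e-4):
    """Summarize one training step and flag two common failure signs.

    The thresholds are illustrative assumptions, not recommended values.
    """
    entropy = mean_token_entropy(token_distributions)
    reward_var = pvariance(rewards)
    warnings = []
    if entropy < min_entropy:
        warnings.append("entropy collapse")  # the policy has become near-deterministic
    if reward_var < min_reward_var:
        warnings.append("no reward signal")  # all attempts scored about the same
    return {"entropy": entropy, "reward_variance": reward_var, "warnings": warnings}
```

Logging these two numbers alongside the KL term at each step makes it easier to see a collapse coming before the reward curve shows it.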
GRPO will find the verifier's loopholes (long-winded answers, format tricks). Eval on a verifier-independent test set every few thousand rollouts.
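One way to act on this is to compare the training reward against the verifier-independent evaluation at each checkpoint. The sketch below is an assumption-laden illustration: the record keys (`train_reward`, `heldout_accuracy`, `mean_length`) and the thresholds are hypothetical, and it only compares the first and last checkpoints.

```python
def hacking_signals(history, reward_gain=0.05, length_growth=1.5):
    """Flag signs that the policy is exploiting the verifier.

    `history` is an ordered list of checkpoint records, each a dict with the
    hypothetical keys 'train_reward', 'heldout_accuracy', and 'mean_length'.
    Thresholds are illustrative and should be tuned per task.
    """
    first, last = history[0], history[-1]
    signals = []
    # Verifier reward climbs while the independent eval does not improve.
    if (last["train_reward"] - first["train_reward"] >= reward_gain
            and last["heldout_accuracy"] <= first["heldout_accuracy"]):
        signals.append("reward up, held-out accuracy flat or down")
    # Answers grow much longer, a common long-winded loophole.
    if last["mean_length"] >= length_growth * first["mean_length"]:
        signals.append("completion length inflating")
    return signals
```

An empty list does not prove the run is clean; it only means these two coarse checks did not fire.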
Good fits: math, code, structured extraction, tool use, agent decision making, and anywhere else a verifier can grade the attempt as correct or not.
Poor fits: subjective tasks (style, creative writing) with no clean verifier, runs without an SFT starting point, and ambitious runs with less than ~$50k of compute budget.
We use OpenAI RFT when the task fits the o-series and the data is small. We use TRL GRPO with Unsloth's fast rollout backend when we are training open weights, when we need to keep the verifier private, or when we need to ship the model on our own infrastructure.
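For the open-weights path, a private verifier plugs into TRL as a reward function that takes the sampled completions and returns one float per completion. The sketch below assumes TRL's standard (non-conversational) dataset format, where each completion is a plain string. The output path and hyperparameter values are placeholders, the Unsloth rollout backend is not shown, and batch-size settings may need adjusting for a given `num_generations`.

```python
import json


def json_parse_reward(completions, **kwargs):
    """TRL-style reward function: returns one float per completion.

    Assumes each completion is a plain string (standard dataset format).
    """
    rewards = []
    for text in completions:
        try:
            json.loads(text)
            rewards.append(1.0)  # the output parses as JSON
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)  # unparseable or non-string output
    return rewards


def build_trainer(model_name, train_dataset):
    """Wire the verifier into TRL's GRPOTrainer. Requires the `trl` package."""
    # Imported lazily so the reward function can be tested without TRL installed.
    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(
        output_dir="grpo-out",  # hypothetical output path
        num_generations=8,      # N attempts sampled per prompt (placeholder value)
        beta=0.04,              # KL coefficient against the reference (placeholder value)
    )
    return GRPOTrainer(
        model=model_name,
        reward_funcs=json_parse_reward,
        args=config,
        train_dataset=train_dataset,  # expected to contain a "prompt" column
    )
```

Because the reward function is ordinary Python running in your own process, the verifier never leaves your infrastructure, which is one of the reasons given above for choosing this path.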