★ GRPO / RFT · Primary method

METHOD · REINFORCEMENT FINE-TUNING

Reasoning fine-tuning. GRPO, R1's secret, OpenAI's RFT API in production.

Group Relative Policy Optimization (GRPO; introduced by DeepSeek in the DeepSeekMath paper, February 2024, and popularized by DeepSeek-R1 in January 2025) trains the model with verifiable rewards (was the answer right, does the format parse, did the tool call validate) by normalizing rewards within sampled groups. It drops the value critic and, with it, a large share of the engineering. OpenAI's Reinforcement Fine-Tuning API is the managed version, offered on o4-mini at $100 per training hour with a $5k per-job cap.

TRL (GRPO) · OpenAI RFT · Unsloth GRPO · vLLM rollouts

Data: Verifiable-reward tasks (math, code, tool use)
Hardware: Multi-day GPU runs
Cost: $100 / hr managed (o4-mini), $5k cap per job
Use: Reasoning, structured output, agent tools

[WHY THIS EXISTS]

SFT and DPO cannot teach the model to think harder.

Math, code, tool use, and structured extraction all reward longer chains of thought and verifiable outputs. SFT trains on a single answer per prompt. DPO trains on a pair. RFT samples N attempts, scores each against a verifier, and updates the policy to favour the winning attempts. GRPO drops the value critic by computing each attempt's advantage relative to the mean reward of its own group.

Sample N attempts per prompt, score against a verifier (correct, parses, validates)
Advantage = (attempt reward - group mean) / group std deviation
Policy update favours above-mean attempts, penalises below-mean ones
Works only with verifiable rewards. No verifier means no GRPO
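
The normalization step above can be sketched in a few lines. This is a minimal illustration, not a library implementation: the function name `group_advantages`, the `eps` guard, and the choice of population standard deviation are assumptions made for the example (implementations differ on sample versus population std).

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Turn one prompt's N verifier scores into group-relative advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the N attempts
    # eps avoids division by zero when every attempt gets the same score
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four attempts at one prompt, scored 1.0 (verified) or 0.0 (failed)
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every attempt scores the same carries no learning signal
flat = group_advantages([1.0, 1.0, 1.0])
```

Passing attempts end up with positive advantages and failing ones with negative advantages. A group where all attempts agree yields all zeros, which is one reason reward variance is worth watching during training.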

[THE PIPELINE]

GRPO / RFT, end to end.

SFT first to get sensible attempts. Then group-sampled RL with a verifier in the loop.

SFT checkpoint → Verifier (rule, code, LLM-judge) → Sample N attempts → Score, normalize within group → Policy update (PPO-like) → Repeat for thousands of rollouts

Stage 0: define the verifier

The single most important step. A pure-function verifier (does this Python pass these tests, does this JSON match this schema) is the gold standard. LLM-as-judge verifiers work but introduce reward hacking risk.
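
As a rough sketch of what a pure-function verifier can look like for structured output: the field names and types in `REQUIRED_FIELDS` are hypothetical, chosen only for illustration, and a real schema check would likely use a dedicated validator.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type
REQUIRED_FIELDS = {"invoice_id": str, "total": float, "currency": str}

def verify_json_attempt(attempt: str) -> float:
    """Return 1.0 if the attempt parses and has the expected fields, else 0.0."""
    try:
        data = json.loads(attempt)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON at all (including chatty preambles)
    if not isinstance(data, dict):
        return 0.0  # valid JSON, but not an object
    for field, expected_type in REQUIRED_FIELDS.items():
        # A missing field yields None, which fails the type check
        if not isinstance(data.get(field), expected_type):
            return 0.0
    return 1.0

good = verify_json_attempt('{"invoice_id": "A-17", "total": 12.5, "currency": "EUR"}')
chatty = verify_json_attempt('Sure! Here is the JSON you asked for.')
incomplete = verify_json_attempt('{"invoice_id": "A-17"}')
```

The same input always produces the same score and there is no model in the loop, which is what makes this kind of verifier hard to game.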

Stage 1: SFT to a sensible starting policy

GRPO from scratch on a base model can take on the order of 100x longer to converge. SFT first gets the model producing attempts in a structure the verifier can actually grade.

Stage 2: GRPO with N=4 to 16 per prompt

Sample N attempts, score, normalize, update. KL constraint to the SFT reference keeps the policy from collapsing. Watch entropy and reward variance every step.
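
One common way to write the update for a single token is a PPO-style clipped surrogate with a KL penalty towards the reference policy. The sketch below assumes that formulation. The function name and the `clip_eps` and `beta` values are illustrative assumptions, and a real trainer would compute this over tensors of tokens rather than single floats.

```python
import math

def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, beta=0.04):
    """Per-token loss: clipped policy-gradient surrogate plus KL penalty."""
    # Probability ratio between the current policy and the sampling policy
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Take the more pessimistic of the unclipped and clipped objectives
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Non-negative KL estimate towards the reference: exp(d) - d - 1
    d = logp_ref - logp_new
    kl = math.exp(d) - d - 1.0
    # Negate because optimizers minimize; we want to maximize the objective
    return -(surrogate - beta * kl)

# Policy unchanged from both sampler and reference: loss is just -advantage
unchanged = grpo_token_loss(-1.0, -1.0, -1.0, advantage=0.5)

# Policy moved far towards a rewarded token: clipping caps the gain at 1.2
moved = grpo_token_loss(0.0, -1.0, -1.0, advantage=1.0)
```

The clip limits how much a single batch can move the policy, and `beta` sets how strongly it is pulled back towards the SFT reference.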

Stage 3: eval on held-out, watch for reward hacking

GRPO will find the verifier's loopholes (long-winded answers, format tricks). Eval on a verifier-independent test set every few thousand rollouts.
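
A minimal sketch of the kind of check this implies, assuming verifier rewards in the range 0 to 1 so they are comparable to held-out accuracy. The function name, thresholds, and report keys are hypothetical, and the right thresholds depend on the task.

```python
from statistics import mean

def hacking_signals(train_rewards, heldout_correct, lengths, baseline_length,
                    max_gap=0.15, max_length_ratio=1.5):
    """Flag runs where verifier reward rises but independent accuracy does not."""
    train_reward = mean(train_rewards)
    heldout_accuracy = mean([1.0 if ok else 0.0 for ok in heldout_correct])
    # Gap between what the verifier pays out and what the held-out set confirms
    gap = train_reward - heldout_accuracy
    # Growth in average completion length relative to the SFT checkpoint
    length_ratio = mean(lengths) / baseline_length
    return {
        "reward_gap": gap,
        "length_ratio": length_ratio,
        "suspect": gap > max_gap or length_ratio > max_length_ratio,
    }

report = hacking_signals(
    train_rewards=[1.0, 1.0, 1.0, 0.0],
    heldout_correct=[True, False, False, True],
    lengths=[900, 1100, 1000, 1000],
    baseline_length=500,
)
```

Here the verifier reward averages 0.75 while held-out accuracy is 0.5 and completions have doubled in length, so the run is flagged for manual inspection.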

[THE STACK WE'D DEPLOY]

What we run in production for GRPO / RFT.

TRL (GRPOTrainer) · OpenAI RFT API · Unsloth GRPO · vLLM (rollout server) · Predibase RFT

[ACCURACY · COST · TRADE]

The numbers we measure GRPO / RFT on.

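DeepSeek-R1 distillation result: Distilled models beat o1-mini on math/code benchmarks at far smaller scale, using SFT on about 800k curated samples generated with DeepSeek-R1 (roughly 600k reasoning, 200k non-reasoning)
OpenAI RFT (o4-mini): $100/hr, $5k cap per job
Memory efficiency: GRPO drops PPO's value critic, removing roughly one policy-sized model from training memory
Risk: Reward hacking, format collapse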

When it earns the build

Math, code, structured extraction, tool use, agent decision making. Anywhere a verifier can grade the attempt as correct or not.

When it doesn't

Subjective tasks (style, creative writing) with no clean verifier; projects with no SFT starting point; ambitious runs with less than ~$50k of compute budget.

[OUR TAKE]

The breakout method of 2025. OpenAI RFT for managed, TRL GRPO for self-hosted.

We use OpenAI RFT when the task fits the o-series and the data is small. We use TRL GRPO with Unsloth's fast rollout backend when we are training open weights, when we need to keep the verifier private, or when we need to ship the model on our own infrastructure.

[COMMON QUESTIONS]

What buyers ask before they sign.

Do we need a verifier? What counts as one?: Yes. A verifier is anything that takes a model attempt and returns a scalar reward. Best: a unit test, a JSON schema validator, a regex match against a known answer. Acceptable with caveats: an LLM-judge with a calibration set. Without any of these, GRPO has nothing to optimize.
OpenAI RFT vs self-hosted GRPO?: OpenAI RFT is faster to start, the bill is bounded (the $5k cap per job works out to about 50 training hours at $100/hr), and it runs on closed o-series models. Self-hosted GRPO requires a rollout server and verifier infra, can run on any open base, and stays in your VPC. Pick by data sensitivity and base-model preference.
What's the data volume for GRPO?: Hundreds of verifiable prompts are enough if the verifier is good and the prompt distribution is right. Compute, not data, is the constraint.
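
To make the "regex match against a known answer" option concrete, here is a small sketch of a verifier that returns a scalar reward. The `Answer:` output convention, the pattern, and the function name are assumptions for illustration. The prompt and SFT data would need to teach the model that format.

```python
import re

# Assumed convention: the attempt ends with a line like "Answer: 84"
ANSWER_PATTERN = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$")

def verify_numeric_answer(attempt: str, expected: float, tol: float = 1e-6) -> float:
    """Return 1.0 if the final stated answer matches the known answer, else 0.0."""
    match = ANSWER_PATTERN.search(attempt.strip())
    if match is None:
        return 0.0  # no parseable final answer, so nothing to grade
    return 1.0 if abs(float(match.group(1)) - expected) <= tol else 0.0

right = verify_numeric_answer("12 * 7 = 84\nAnswer: 84", expected=84)
wrong = verify_numeric_answer("12 * 7 = 85\nAnswer: 85", expected=84)
unformatted = verify_numeric_answer("I think it is 84", expected=84)
```

An unformatted but correct attempt scores zero here. That is a deliberate choice in this sketch, since it pushes the policy towards outputs the verifier can grade.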

Considering GRPO / RFT? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.
