★ By scaleDirect LLM · benchmark-firstProduction grade

FINE-TUNING · SCALE

Four named playbooks. By data, by compute, by cost.

The right method changes with the data volume and the compute budget. We name four tiers, the method for each, the hardware that fits, and an honest cost range.

H100H200B200A10GFSDPDeepSpeed

Start a conversation →Fine-tuning hub →

Tiers

4 (Tiny · Small · Mid · Large)

Range

<1k → 1M+ labelled examples

Default tiny

Few-shot + DSPy, no fine-tune

Default mid

LoRA r=16 with DoRA, single H100

[FOUR TIERS]

The architecture changes with the scale.

Each tier card carries a different brand gradient so the eye can scan across at a glance. The method, hardware, and indicative cost are the durable parts. Pricing moves quarterly; we re-validate every engagement.

Tier

Tiny

Under 1,000 labelled examples

Proof of concept, narrow specialist, before-data shipping. The cheapest path is usually not fine-tuning.

Method: Few-shot + DSPy / GEPA prompt optimization
Hardware: Inference only
Indicative cost: $0 to $100 in API spend

Tier

Small

1,000 to 50,000 examples

Most enterprise fine-tunes land here: support tuning, style alignment, structured extraction, per-customer adapters. LoRA's sweet spot.

Method: LoRA r=16 with DoRA, all-linear targeting
Hardware: 1 GPU (A10G, 4090, A100, or 1x H100)
Indicative cost: $10 to $500 per run

Tier

Mid

50,000 to 1,000,000 examples

Multi-tenant SaaS with diverse customer data, deep domain adaptation, models that need cross-task generalization.

Method: DoRA + DPO, or QLoRA on a 70B base
Hardware: 1 to 8 H100s, FSDP
Indicative cost: $500 to $5,000 per run

Tier

Large

1M+ examples or CPT + SFT pipelines

Continued pretraining for foreign vocabulary, full SFT on hard reasoning, GRPO/RFT runs that need thousands of rollouts, custom model builds.

Method: Full SFT or CPT+SFT+DPO+GRPO pipeline
Hardware: 8 to 128 H100/H200/B200, FSDP or Megatron
Indicative cost: $5,000 to $200,000+ per run

[HARDWARE IN 2026]

GPUs we deploy on, by tier.

The 2025-2026 supply has shifted. H100 stays the workhorse, H200 broadly available, B200 shipping with ~2.5x H100 training performance.

24 GB

A10G

Single-GPU LoRA on 7B base, QLoRA on 13B. Cheap iteration.

~$1.10/hr cloud

80 GB

H100

Workhorse. FP16 LoRA on 7B-13B, QLoRA on 70B. 8x H100 for FSDP SFT to 70B.

$2-4/GPU-hr cloud

141 GB

H200

New default for 70B+ FP16 fine-tunes on a single card. Broadly available across 24+ providers.

$2.10-$10.60/GPU-hr

192 GB

B200 (Blackwell)

~2.5x H100 training perf. CPT, full SFT at 100B+, frontier RFT runs.

$2.99-$6/GPU-hr

[WHAT YOU GET]

What's documented at handoff.

1 tier

Scale tier picked with rationale

1 method

Method chosen and justified

1 budget

Cost range agreed before any run

1 plan

Iteration loop, not a moonshot

[COMMON QUESTIONS]

What buyers ask before they sign.

How do we estimate cost for our fine-tune?: Rough order of magnitude for SFT on 1B training tokens (1 epoch, 8B base model): ~6-10 H100-hours, so $25-60 cloud. For 70B SFT: ~150-250 H100-hours, $700-$1,500. Multiply by epochs. LoRA at the same scale is roughly 30-50% cheaper. QLoRA on a 70B fits one 48GB GPU so the cost collapses again.
Does QLoRA always fit if FP16 LoRA does not?: Almost always. NF4 + double-quantization shrinks the base to ~25% of FP16. A 70B model in QLoRA needs about 46GB at training; a single A100 80GB or H100 80GB handles it comfortably. The accuracy gap to FP16 LoRA is 0.5 to 1.5 points on typical benchmarks.
When do we need a multi-node B200 cluster?: Full SFT on 70B+ at scale, continued pretraining at 10B+ tokens, GRPO/RFT runs that need thousands of rollouts on a frontier-grade base. Below that, single-node FSDP on 8 H100s is enough. Above that, Lambda 1-Click Clusters or CoreWeave are the right answer.
What's the iteration-cost trade?: Single-GPU LoRA is the fastest iteration loop (minutes to hours per run). Multi-node full SFT is the slowest (hours to days). Build the first three to five iterations on LoRA, validate the data and the approach, then escalate to full SFT only if benchmarks force it.

[RELATED FINE-TUNING TOPICS]

Worth a look next.

01 · FINE-TUNING

Size the work to the data. Not the other way around.

We start small, benchmark fast, escalate only when the numbers force it. The first iteration is cheap by design.

Start a conversation →All fine-tuning topics

Four named playbooks. By data, by compute, by cost.

The architecture changes with the scale.

GPUs we deploy on, by tier.

What's documented at handoff.

What buyers ask before they sign.

Worth a look next.

Methods

Data pipeline

Platforms

Custom model build

Compliance

Size the work to the data. Not the other way around.