Kensink Labs
By scale · Direct LLM · benchmark-first · Production grade
FINE-TUNING · SCALE

Four named playbooks. By data, by compute, by cost.

The right method changes with the data volume and the compute budget. We name four tiers, the method for each, the hardware that fits, and an honest cost range.

H100 · H200 · B200 · A10G · FSDP · DeepSpeed
Tiers
4 (Tiny · Small · Mid · Large)
Range
<1k → 1M+ labelled examples
Default tiny
Few-shot + DSPy, no fine-tune
Default mid
LoRA r=16 with DoRA, single H100
[FOUR TIERS]

The architecture changes with the scale.

Each tier card carries a different brand gradient so the eye can scan across at a glance. The method, hardware, and indicative cost are the durable parts. Pricing moves quarterly, so we re-validate it for every engagement.

Tier
Tiny
Under 1,000 labelled examples

Proofs of concept, narrow specialists, and products that ship before the data exists. At this size the cheapest path is usually not fine-tuning.

Method
Few-shot + DSPy / GEPA prompt optimization
Hardware
Inference only
Indicative cost
$0 to $100 in API spend
Tier
Small
1,000 to 50,000 examples

Most enterprise fine-tunes land here: support tuning, style alignment, structured extraction, per-customer adapters. LoRA's sweet spot.

Method
LoRA r=16 with DoRA, all-linear targeting
Hardware
1 GPU (A10G, RTX 4090, A100, or H100)
Indicative cost
$10 to $500 per run
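
As a rough illustration, the Small-tier method maps onto Hugging Face PEFT settings along these lines. This is a sketch, not a production config: `lora_alpha`, `lora_dropout`, and `task_type` are assumed values the tier card does not specify, and the import is guarded so the snippet still runs where PEFT is not installed.

```python
# Sketch: Small-tier recipe (LoRA r=16 with DoRA, all-linear targeting) as PEFT settings.
lora_kwargs = {
    "r": 16,                         # rank, from the tier card
    "lora_alpha": 32,                # assumption: not stated in the tier card
    "lora_dropout": 0.05,            # assumption: not stated in the tier card
    "use_dora": True,                # DoRA on top of LoRA
    "target_modules": "all-linear",  # adapt every linear layer
    "task_type": "CAUSAL_LM",        # assumption: decoder-only base model
}

try:
    from peft import LoraConfig

    lora_config = LoraConfig(**lora_kwargs)
except (ImportError, TypeError):
    # PEFT is missing, or the installed release predates `use_dora`.
    lora_config = None
```

The resulting `lora_config` would then be passed to `peft.get_peft_model` together with the loaded base model.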
Tier
Mid
50,000 to 1,000,000 examples

Multi-tenant SaaS with diverse customer data, deep domain adaptation, models that need cross-task generalization.

Method
DoRA + DPO, or QLoRA on a 70B base
Hardware
1 to 8 H100s, FSDP
Indicative cost
$500 to $5,000 per run
Tier
Large
1M+ examples or CPT + SFT pipelines

Continued pretraining for foreign vocabulary, full SFT on hard reasoning, GRPO/RFT runs that need thousands of rollouts, custom model builds.

Method
Full SFT or CPT+SFT+DPO+GRPO pipeline
Hardware
8 to 128 H100/H200/B200, FSDP or Megatron
Indicative cost
$5,000 to $200,000+ per run
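
The four tier cards above can be read as a lookup on the number of labelled examples. The sketch below encodes them that way. The names `Tier`, `TIERS`, and `pick_tier` are illustrative, and two choices are assumptions rather than anything the cards state: a count that sits exactly on a boundary (for example 50,000) goes to the larger tier, and any pipeline that needs continued pretraining is treated as Large regardless of example count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_examples: float  # exclusive upper bound on labelled examples
    method: str
    hardware: str
    cost_usd: str


TIERS = [
    Tier("Tiny", 1_000, "Few-shot + DSPy / GEPA prompt optimization",
         "Inference only", "$0 to $100 in API spend"),
    Tier("Small", 50_000, "LoRA r=16 with DoRA, all-linear targeting",
         "1 GPU", "$10 to $500 per run"),
    Tier("Mid", 1_000_000, "DoRA + DPO, or QLoRA on a 70B base",
         "1 to 8 H100s, FSDP", "$500 to $5,000 per run"),
    Tier("Large", float("inf"), "Full SFT or CPT+SFT+DPO+GRPO pipeline",
         "8 to 128 H100/H200/B200, FSDP or Megatron", "$5,000 to $200,000+ per run"),
]


def pick_tier(n_examples: int, needs_cpt: bool = False) -> Tier:
    """Return the tier for a dataset size (boundary values go to the larger tier)."""
    if n_examples < 0:
        raise ValueError("n_examples must be non-negative")
    if needs_cpt:
        # Assumption: CPT + SFT pipelines are Large whatever the example count.
        return TIERS[-1]
    for tier in TIERS:
        if n_examples < tier.max_examples:
            return tier
    return TIERS[-1]
```

For example, `pick_tier(20_000).method` returns the Small-tier LoRA recipe, while `pick_tier(20_000, needs_cpt=True)` returns the Large tier.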
[HARDWARE IN 2026]

GPUs we deploy on, by tier.

Supply shifted over 2025-2026. The H100 remains the workhorse, the H200 is broadly available, and the B200 is shipping with roughly 2.5x the H100's training performance.

24 GB
A10G

Single-GPU LoRA on 7B base, QLoRA on 13B. Cheap iteration.

~$1.10/hr cloud
80 GB
H100

Workhorse. FP16 LoRA on 7B-13B, QLoRA on 70B. 8x H100 for FSDP SFT to 70B.

$2-4/GPU-hr cloud
141 GB
H200

New default for 70B+ FP16 fine-tunes on a single card. Broadly available across 24+ providers.

$2.10-$10.60/GPU-hr
192 GB
B200 (Blackwell)

~2.5x H100 training perf. CPT, full SFT at 100B+, frontier RFT runs.

$2.99-$6/GPU-hr
[WHAT YOU GET]

What's documented at handoff.

1 tier
Scale tier picked with rationale
1 method
Method chosen and justified
1 budget
Cost range agreed before any run
1 plan
Iteration loop, not a moonshot
[COMMON QUESTIONS]

What buyers ask before they sign.

How do we estimate cost for our fine-tune?
Rough order of magnitude for SFT on 1B training tokens (1 epoch, 8B base model): ~6-10 H100-hours, or roughly $25-60 at cloud rates. For 70B SFT: ~150-250 H100-hours, or $700-$1,500. Multiply by the number of epochs. LoRA at the same scale is roughly 30-50% cheaper. QLoRA on a 70B base fits on a single 48 GB GPU, so the cost drops again.
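
The arithmetic behind an estimate like this is GPU-hours per epoch, times the hourly rate, times the number of epochs, with an optional discount for LoRA. A minimal sketch follows; the function names are illustrative and the hourly rate is an input you supply, because cloud pricing varies by provider and moves quarterly.

```python
def estimate_run_cost(gpu_hours_per_epoch: float,
                      usd_per_gpu_hour: float,
                      epochs: int = 1,
                      lora_discount: float = 0.0) -> float:
    """Cost of one run in USD: hours x rate x epochs, less any LoRA discount."""
    if not 0.0 <= lora_discount < 1.0:
        raise ValueError("lora_discount must be in [0, 1)")
    return gpu_hours_per_epoch * usd_per_gpu_hour * epochs * (1.0 - lora_discount)


def estimate_range(hours_range: tuple[float, float],
                   rate_range: tuple[float, float],
                   epochs: int = 1,
                   lora_discount: float = 0.0) -> tuple[float, float]:
    """Low and high cost from a (low, high) pair of GPU-hours and of hourly rates."""
    low = estimate_run_cost(hours_range[0], rate_range[0], epochs, lora_discount)
    high = estimate_run_cost(hours_range[1], rate_range[1], epochs, lora_discount)
    return low, high


# Example: 70B SFT at 150-250 H100-hours per epoch, 3 epochs,
# at an assumed rate of $2-4 per GPU-hour.
low, high = estimate_range((150, 250), (2.0, 4.0), epochs=3)
```

Setting `lora_discount` to a value between 0.3 and 0.5 applies the 30-50% LoRA saving quoted above.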
Does QLoRA always fit if FP16 LoRA does not?
Almost always. NF4 with double quantization shrinks the base weights to roughly 25% of their FP16 size. A 70B model under QLoRA needs about 46 GB during training, so a single 80 GB A100 or H100 handles it comfortably. The accuracy gap to FP16 LoRA is typically 0.5 to 1.5 points on common benchmarks.
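
The 25% figure comes straight from bytes per parameter: FP16 stores 2 bytes per weight, NF4 stores half a byte. The sketch below does that arithmetic in decimal gigabytes. The helper names are illustrative, and `overhead_gb=11` is an assumption back-derived from the ~46 GB training figure above (46 GB minus 35 GB of NF4 weights), standing in for adapters, optimizer state, activations, and quantization constants. Real overhead depends on sequence length and batch size.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "nf4": 0.5}  # 16-bit vs 4-bit weights


def weight_memory_gb(n_params_billion: float, dtype: str) -> float:
    """Memory for the base weights alone, in decimal GB (1e9 bytes)."""
    return n_params_billion * BYTES_PER_PARAM[dtype]


def fits(n_params_billion: float, dtype: str, gpu_gb: float,
         overhead_gb: float = 11.0) -> bool:
    """True if weights plus an assumed training overhead fit on one GPU."""
    return weight_memory_gb(n_params_billion, dtype) + overhead_gb <= gpu_gb


fp16_gb = weight_memory_gb(70, "fp16")  # 70B weights in FP16
nf4_gb = weight_memory_gb(70, "nf4")    # the same weights in NF4
ratio = nf4_gb / fp16_gb                # fraction of the FP16 footprint
```

Under these assumptions `fits(70, "nf4", gpu_gb=80)` holds while `fits(70, "fp16", gpu_gb=80)` does not, which is why a 70B base on a single 80 GB card means QLoRA instead of FP16 LoRA.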
When do we need a multi-node B200 cluster?
Full SFT on 70B+ at scale, continued pretraining at 10B+ tokens, GRPO/RFT runs that need thousands of rollouts on a frontier-grade base. Below that, single-node FSDP on 8 H100s is enough. Above that, Lambda 1-Click Clusters or CoreWeave are the right answer.
What's the iteration-cost trade-off?
Single-GPU LoRA is the fastest iteration loop (minutes to hours per run). Multi-node full SFT is the slowest (hours to days). Build the first three to five iterations on LoRA, validate the data and the approach, then escalate to full SFT only if benchmarks force it.
FINE-TUNING · SCALE · KENSINK LABS

Size the work to the data. Not the other way around.

We start small, benchmark fast, escalate only when the numbers force it. The first iteration is cheap by design.