← All methods·01 · LOW-RANK ADAPTATION · PRIMARY

★ LoRAPrimary method

METHOD · LOW-RANK ADAPTATION

LoRA. The 2026 default for almost everything.

Two small low-rank matrices that compose with frozen base weights. 99% of the accuracy of a full fine-tune, 1% of the VRAM, and the adapter file is small enough to ship as a build artifact. The first method to reach for, the one to beat before considering anything else.

PyTorchPEFTHuggingFaceTRL

Talk to our team →Fine-tuning hub

Data

1k to 50k examples

Hardware

1 to 8 GPUs (H100 or A100)

Cost

Under $500 per run for 7B to 13B

Reversibility

Hot-swappable per tenant

[WHY THIS EXISTS]

Full fine-tunes cost too much, change too much, and lock you in.

Updating every weight in a 70B model is expensive (multi-node, days), destructive (you cannot keep the base current), and operationally painful (each tenant gets a 140GB checkpoint). LoRA factors the weight update into two low-rank matrices so you train a few million parameters instead of tens of billions, and ship a 50MB adapter instead of a model.

Two matrices A (down) and B (up) of rank r approximate the weight delta
Base weights stay frozen and shared across every tenant or task
Adapter is 0.1% to 1% the size of the model. Hot-swap at request time
Recover from a bad fine-tune by deleting the adapter, not restoring a checkpoint

[THE PIPELINE]

LoRA, end to end.

Prepare data, freeze base, attach LoRA modules to attention and MLP, train, evaluate, ship adapter.

Dataset (jsonl)

Tokenise + pack

Freeze base

Attach LoRA (r=16, alpha=32, all-linear)

Train 1 to 3 epochs

Eval vs golden set

Ship adapter (50 to 200 MB)

Pick the base, not the Instruct

Start from a base model when you intend to teach a new format or behaviour. Start from an Instruct model when you only need to nudge style and want to preserve alignment. Mixing the two is the most common mistake.

Rank 16, alpha 32, all-linear

Our 2026 default. Higher rank (32 to 64) only when DoRA or rsLoRA is also in play, otherwise you trade compute for marginal accuracy. Target all linear layers (q, k, v, o, gate, up, down). Skipping the MLP projections is a quiet accuracy loss.

1 to 3 epochs, LR 1e-4 to 2e-4

Cosine schedule, warmup 3%, weight decay 0. Watch eval loss every quarter epoch. The window between underfitting and overfitting is small for LoRA, eval gates catch it.

Eval against a frozen golden set

200 to 1000 prompts captured before training began, with expected behaviour written down. Block the adapter from production if it regresses any one of: factuality, format, safety. Same gate every adapter passes.

[THE STACK WE'D DEPLOY]

What we run in production for LoRA.

PyTorchPEFTTRLUnslothAxolotl

[ACCURACY · COST · TRADE]

The numbers we measure LoRA on.

Trainable params

0.1% to 1% of base

vs 100% for full SFT

Accuracy gap vs full SFT

Within 1 to 3 points

On typical benchmarks; widens on hard reasoning per Biderman et al. 2024

Adapter size

50 to 200 MB for 7B base

Memory at training

20 to 40 GB for 7B in FP16

When it earns the build

Single-task or single-tenant adaptation, multi-tenant SaaS with per-customer adapters, anywhere you want to keep the base current with provider updates, anything where memory is tight.

When it doesn't

When you need 10x more model capacity (full SFT proven necessary in benchmarks), when the LoRA signal is being washed out by the base (try DoRA), when reasoning is the goal (consider GRPO/RFT instead).

[OUR TAKE]

Our default. We will argue with you if you want to start with anything else.

Across legal extraction, support automation, structured generation, and code modernisation projects, LoRA at rank 16 with all-linear targeting has cleared the bar every time. Reach for full SFT only after we have benchmarked LoRA and proven the gap.

[READ AT THE SOURCE]

Papers, docs, and primary sources.

[COMMON QUESTIONS]

What buyers ask before they sign.

What rank should we use?: Default to 16. Lower (8) only on extremely small data (under 1k) where overfitting risk is high. Higher (32 to 64) only if you are also using DoRA or rsLoRA. Past 128 you should ask whether full SFT is the right answer.
Why all-linear targeting?: Targeting only attention layers (the original LoRA paper choice) loses 1 to 3 points on most tasks vs targeting all linear layers including the MLP projections. The compute and memory cost is small. We default to all-linear.
Can we merge the LoRA into the base for serving?: Yes, weight = base + alpha/r * BA. Merging removes the per-request overhead but kills the hot-swap story. If you only serve one adapter at a time, merge. If you serve N customers, use vLLM multi-LoRA or LoRAX and keep them as adapters.

[RELATED FINE-TUNING TOPICS]

Worth a look next.

02 · FINE-TUNING

Considering LoRA? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.

Start a conversation →All fine-tuning topics