Kensink Labs
METHOD · 4-BIT QUANTIZED LORA

QLoRA. LoRA when VRAM is the budget you do not have.

LoRA on top of a 4-bit NF4 quantized base. A 65B model trains on a single 48GB GPU. The accuracy hit is real but small; the cost reduction is large. Our default for 70B fine-tunes on a single node and for any team without dedicated H100s.

PyTorch · bitsandbytes · PEFT · Unsloth
Data: 1k to 100k examples
Hardware: 1 GPU (A10G, 4090, A100 40GB or 80GB, H100)
Cost: Under $10 for 7B on cloud A10G
Trade: 0.5 to 1.5 point drop vs FP16 LoRA
[WHY THIS EXISTS]

FP16 LoRA on a 70B base still needs about 140GB of VRAM for the weights alone.

That is a multi-GPU job for a team with a seven-figure budget. QLoRA stores the frozen base in 4-bit NF4 (a quantization tuned to normally distributed weights), pages optimizer state to CPU memory when GPU memory spikes, then runs LoRA on top. The base shrinks roughly 4x, the LoRA stays in 16-bit, and you train Llama 65B on one 48GB card.

  • 4-bit NF4 quantization of the frozen base; LoRA adapter stays FP16 or bf16
  • Paged optimizers (8-bit) absorb memory spikes without OOM
  • Double quantization shaves another ~0.4 bits per param
  • Reaches within 0.5 to 1.5 points of FP16 LoRA on typical tasks
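
The arithmetic behind these figures can be sketched in a few lines. This is a back-of-envelope estimate of weight memory only. It assumes the block sizes commonly cited as QLoRA defaults (64 weights per NF4 block, 256 scaling constants per second-level block); the function names are illustrative, and activations, gradients, LoRA parameters, and optimizer state are not counted.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


def nf4_bits_per_param(double_quant: bool, block: int = 64, dq_block: int = 256) -> float:
    """4 bits per weight plus the per-block scaling-constant overhead.

    Assumed defaults: one scaling constant per `block` weights. With
    double quantization the constants are stored in 8 bits, plus one
    fp32 constant per `dq_block` of them.
    """
    if double_quant:
        overhead = 8 / block + 32 / (block * dq_block)
    else:
        overhead = 32 / block  # one fp32 constant per block
    return 4 + overhead


if __name__ == "__main__":
    plain = nf4_bits_per_param(double_quant=False)
    dq = nf4_bits_per_param(double_quant=True)
    print(f"70B in FP16:      {weight_gb(70e9, 16):.0f} GB")
    print(f"70B in NF4 + DQ:  {weight_gb(70e9, dq):.1f} GB")
    print(f"DQ saving:        {plain - dq:.3f} bits per param")
```

Under these assumptions the weights alone account for the 140GB figure, and double quantization saves a little under 0.4 bits per parameter.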
[THE PIPELINE]

QLoRA, the cookbook.

Load base in 4-bit NF4, attach LoRA, train with paged optimizer, evaluate, ship.

Base model → Quantize NF4 + DQ → Attach LoRA r=16 → Paged 8-bit AdamW → Train 1 to 3 epochs → Eval vs golden set → Ship adapter
01

Quantize base, freeze, attach LoRA

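bitsandbytes 4-bit NF4 with double quantization. Compute dtype is bf16 on GPUs that support it (Ampere and newer, which includes A10G, 4090, A100, and H100) and fp16 on older cards. The LoRA adapter stays in FP16 or bf16; it is not quantized.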
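
A minimal sketch of this step with Hugging Face Transformers and PEFT follows. Only NF4, double quantization, and r=16 come from the text above; `lora_alpha`, `lora_dropout`, `target_modules="all-linear"`, `device_map="auto"`, and the helper name `build_qlora_model` are illustrative assumptions, not a recommended recipe. The heavy imports live inside the function so the settings can be inspected without a GPU.

```python
# Quantization settings named in the text: 4-bit NF4 with double quantization.
NF4_KWARGS = dict(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# r=16 is from the pipeline above; the other values are assumptions.
LORA_KWARGS = dict(
    r=16,
    lora_alpha=32,                # assumption
    lora_dropout=0.05,            # assumption
    target_modules="all-linear",  # assumption; needs a recent PEFT release
    task_type="CAUSAL_LM",
)


def build_qlora_model(model_id: str):
    """Load a frozen 4-bit base and attach a 16-bit LoRA adapter."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Use bf16 for compute where the GPU supports it, fp16 otherwise.
    compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    bnb_config = BitsAndBytesConfig(bnb_4bit_compute_dtype=compute_dtype, **NF4_KWARGS)

    base = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    base = prepare_model_for_kbit_training(base)  # prepares the quantized base for training
    return get_peft_model(base, LoraConfig(**LORA_KWARGS))
```

Calling `build_qlora_model("<your-model-id>")` requires a CUDA GPU with `bitsandbytes` and `accelerate` installed.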

02

Paged optimizer for spikes

Paged 8-bit AdamW moves optimizer state to CPU memory when GPU memory spikes and pages it back when it is needed. Without it, a batch of unusually long sequences can OOM the GPU partway through a run.
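
In the Hugging Face Trainer, the paged optimizer is selected by name. The sketch below assumes `transformers` with `bitsandbytes` installed; `optim="paged_adamw_8bit"` is an existing Trainer option, while the output directory, learning rate, epoch count, and the helper `make_training_args` are illustrative assumptions.

```python
TRAINING_KWARGS = dict(
    output_dir="qlora-out",        # assumption: any writable path
    optim="paged_adamw_8bit",      # paged 8-bit AdamW from bitsandbytes
    gradient_checkpointing=True,   # trades compute for activation memory
    num_train_epochs=2,            # the text suggests 1 to 3
    learning_rate=2e-4,            # assumption, tune per task
    logging_steps=10,              # assumption
)


def make_training_args(**overrides):
    """Build TrainingArguments, letting callers override any default."""
    from transformers import TrainingArguments

    return TrainingArguments(**{**TRAINING_KWARGS, **overrides})
```

Pass `bf16=True` or `fp16=True` as an override to match the compute dtype chosen when the base was loaded.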

03

Train on 1 GPU, validate on 1 GPU

Most QLoRA jobs run on a single GPU. Gradient accumulation makes up for the small per-device batch size, so the effective batch stays at the target. Check loss every quarter epoch.
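
The relationship between micro-batch, accumulation steps, and a quarter-epoch evaluation cadence can be worked out as below. This is a sketch; the function name and the example numbers (10,000 examples, micro-batch 2, target batch 32) are assumptions for illustration.

```python
import math


def plan_schedule(n_examples: int, micro_batch: int, target_batch: int,
                  evals_per_epoch: int = 4) -> dict:
    """Derive accumulation steps and an eval interval for single-GPU training."""
    # Accumulate enough micro-batches to reach (or slightly exceed) the target batch.
    accum = max(1, math.ceil(target_batch / micro_batch))
    effective_batch = micro_batch * accum
    # One optimizer step consumes one effective batch.
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    # Evaluate roughly every quarter epoch by default.
    eval_steps = max(1, steps_per_epoch // evals_per_epoch)
    return {
        "gradient_accumulation_steps": accum,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "eval_steps": eval_steps,
    }


if __name__ == "__main__":
    print(plan_schedule(n_examples=10_000, micro_batch=2, target_batch=32))
```

With the example numbers this gives 16 accumulation steps, 313 optimizer steps per epoch, and an evaluation every 78 steps.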

04

Merge to FP16 for production serving

Inference in 4-bit is slower than serving an FP16 base with the LoRA merged in. The standard pattern: train with QLoRA, then dequantize the base (or reload it in FP16), merge the LoRA, and serve on vLLM in FP16.
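
One way to produce the FP16 serving artifact is sketched below with PEFT. Note the swapped-in technique: instead of dequantizing the 4-bit weights, this variant reloads the original base checkpoint in FP16 and merges the adapter into it. The function name and its arguments are hypothetical; the machine running it needs enough memory to hold the full FP16 model.

```python
def merge_for_serving(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Merge a trained LoRA adapter into an FP16 base and save the result."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Reload the base in FP16 (not 4-bit) so the merged weights are full 16-bit.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

    # Attach the adapter, fold its low-rank update into the base weights,
    # and drop the adapter wrappers.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

    # Save model and tokenizer side by side so a server can load out_dir directly.
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
```

The saved directory is a standard Hugging Face checkpoint, which is the format vLLM loads.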

[THE STACK WE'D DEPLOY]

What we run in production for QLoRA.

bitsandbytes · PEFT · TRL · Unsloth · Axolotl
[ACCURACY · COST · TRADE]

The numbers we measure QLoRA on.

Accuracy gap vs FP16 LoRA: 0.5 to 1.5 points (on benchmark suites; smaller on narrow tasks)
VRAM (7B training): ~10 GB (vs ~28 GB for FP16 LoRA)
VRAM (70B training): ~46 GB (fits on a single 48GB GPU)
Training throughput: 30% to 50% slower than FP16 LoRA
When it earns the build

When the base does not fit in VRAM for FP16 LoRA, when the budget is a single GPU, when you need many cheap experiments to iterate quickly, or when serving will use a separate FP16 path.

When it doesn't

When you can afford FP16 LoRA on the available hardware (it is faster and slightly more accurate), or when a 0.5 to 1.5 point accuracy gap is enough to fail the job.

[OUR TAKE]

Our default for 70B-class fine-tunes on a single node.

We use QLoRA when the model would not otherwise fit and FP16 LoRA when it would. Same hyperparameters, same eval gates, same multi-tenant adapter serving downstream.

[COMMON QUESTIONS]

What buyers ask before they sign.

Should we serve the QLoRA model in 4-bit too?
Usually no. Dequantize the base to FP16 and merge or attach the LoRA on top, then serve on vLLM. 4-bit inference is slower, and the VRAM saving rarely justifies the latency cost in production.
How does QLoRA compare to LoRA on H100s?
On H100s with enough VRAM for FP16 LoRA, use FP16. QLoRA's value is enabling 70B-class fine-tunes on cheaper hardware, not beating FP16 when you already have the H100 budget.
Does the 4-bit quantization survive merging back to FP16?
The base weights stay in 4-bit throughout training. For serving, you dequantize the base to FP16 (a one-off operation) and merge. The adapter was trained against the quantized base, so the merged model differs slightly from one trained with FP16 LoRA, but the gap is small.
FINE-TUNING · KENSINK LABS

Considering QLoRA? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.