Quantize base, freeze, attach LoRA
bitsandbytes 4-bit NF4 with double quantization. Compute dtype is bf16 on GPUs that support it (Ampere and newer, including H100) and fp16 elsewhere. The LoRA weights stay in FP16 or bf16 and are not quantized.
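As a minimal sketch of this step, the helper below builds the NF4 settings and a loader attaches LoRA to the frozen, quantized base. It assumes the Hugging Face transformers, peft and bitsandbytes packages; the function names, the rank, the alpha and dropout values, and the "all-linear" target are illustrative assumptions, not settings taken from this page.

```python
def nf4_quant_kwargs(bf16_supported: bool) -> dict:
    """Settings for transformers.BitsAndBytesConfig (dtype given by name)."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",          # 4-bit NormalFloat
        "bnb_4bit_use_double_quant": True,     # also quantize the quantization constants
        "bnb_4bit_compute_dtype": "bfloat16" if bf16_supported else "float16",
    }


def load_quantized_with_lora(model_name: str, rank: int = 16):
    """Load a 4-bit base and attach LoRA. rank and alpha are assumed values."""
    # Heavy imports stay local so the helper above works without a GPU stack.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    kwargs = nf4_quant_kwargs(bf16_ok)
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(**kwargs),
        device_map="auto",
    )
    # Freezes the base weights and prepares the quantized model for training.
    model = prepare_model_for_kbit_training(model)
    lora = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,
        lora_dropout=0.05,
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora)
```

Only the adapter parameters returned by get_peft_model are trainable; the 4-bit base stays frozen.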
LoRA on top of a 4-bit NF4 quantized base. A 65B model trains on a single 48GB GPU. The accuracy hit is real but small; the cost reduction is not small. It is our default for 70B fine-tunes on a single node and for any team without dedicated H100s.
Fine-tuning a 65B model without quantization is a multi-GPU job for a seven-figure team. QLoRA quantizes the base to 4-bit NF4 (a data type tuned to normally distributed weights), pages optimizer state to CPU when GPU memory spikes, then runs LoRA on top. The base shrinks roughly 4x relative to FP16, the LoRA weights stay in FP16 or bf16, and you train Llama 65B on one 48GB card.
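A rough back-of-the-envelope check of the memory claim, counting base weights only. The block sizes (64 weights per quantization block, 256 blocks per second-level group) are the values reported for QLoRA and are treated here as assumptions; activations, LoRA weights and optimizer state come on top.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


BLOCK = 64          # weights sharing one quantization constant (assumed)
GROUP = 256         # constants sharing one second-level constant (assumed)

# Without double quantization: one 32-bit constant per block.
plain_overhead_bits = 32 / BLOCK
# With double quantization: 8-bit constants plus one 32-bit constant per group.
double_overhead_bits = 8 / BLOCK + 32 / (BLOCK * GROUP)

fp16_gb = weight_memory_gb(65e9, 16)
nf4_gb = weight_memory_gb(65e9, 4 + double_overhead_bits)

print(f"FP16 base: {fp16_gb:.1f} GB")
print(f"NF4 base with double quantization: {nf4_gb:.1f} GB")
```

Under these assumptions the 65B base drops from about 130 GB to about 33.5 GB, which is what leaves room for activations and adapter training on a 48GB card.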
Load base in 4-bit NF4, attach LoRA, train with paged optimizer, evaluate, ship.
Paged 8-bit AdamW moves optimizer state to CPU memory when GPU memory spikes. Without it, a batch of unusually long sequences can OOM the GPU as late as the second epoch.
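One way to request this optimizer, sketched under the assumption that training runs through the Hugging Face Trainer. The "paged_adamw_8bit" optimizer name is a real Trainer option backed by bitsandbytes; the helper names and the choice to enable gradient checkpointing are illustrative.

```python
def paged_optimizer_kwargs(output_dir: str) -> dict:
    """Keyword arguments for transformers.TrainingArguments."""
    return {
        "output_dir": output_dir,
        "optim": "paged_adamw_8bit",       # 8-bit AdamW with paged optimizer state
        "gradient_checkpointing": True,    # trade compute for activation memory
    }


def build_training_args(output_dir: str):
    """Build TrainingArguments; the import is local so the helper above has no dependencies."""
    from transformers import TrainingArguments

    return TrainingArguments(**paged_optimizer_kwargs(output_dir))
```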
Most QLoRA jobs are single-node. Gradient accumulation compensates for small per-device batch sizes. Check the loss every quarter epoch.
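The arithmetic behind that schedule can be sketched as follows. The function name and the example numbers (10,000 examples, per-device batch of 2, 16 accumulation steps) are hypothetical.

```python
import math


def schedule(num_examples: int, per_device_batch: int,
             grad_accum_steps: int, num_gpus: int = 1) -> tuple[int, int, int]:
    """Return (effective batch size, optimizer steps per epoch, steps between loss checks)."""
    effective_batch = per_device_batch * grad_accum_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    check_every = max(1, steps_per_epoch // 4)   # roughly every quarter epoch
    return effective_batch, steps_per_epoch, check_every


effective, steps, check_every = schedule(10_000, per_device_batch=2, grad_accum_steps=16)
print(effective, steps, check_every)
```

The third value is the number you would pass as the logging or evaluation interval to get a loss reading about four times per epoch.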
Inference on a 4-bit base is slower than on an FP16 base with the LoRA merged in. The standard pattern: train with QLoRA, dequantize the base and merge the LoRA for serving, then run on vLLM in FP16.
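A minimal sketch of the merge step, assuming transformers and peft. Note the simplification: instead of dequantizing the 4-bit copy, it reloads the original FP16 checkpoint and merges the adapter into that. The function name and all three path arguments are hypothetical.

```python
def merge_adapter_for_serving(base_name: str, adapter_dir: str, out_dir: str) -> None:
    """Merge a trained LoRA adapter into an FP16 base and save a plain checkpoint."""
    # Heavy imports are local so importing this module needs no GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base in FP16 (not 4-bit), so the merged weights are full FP16 tensors.
    base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float16)
    # Fold the LoRA update into the base weights and drop the adapter wrappers.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)
    # Save the tokenizer alongside so the directory is self-contained.
    AutoTokenizer.from_pretrained(base_name).save_pretrained(out_dir)
```

The resulting directory is an ordinary FP16 checkpoint with no adapter attached, which is the form an FP16 serving stack such as vLLM expects.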
Use QLoRA when the base does not fit in VRAM for FP16 LoRA, when the budget is a single GPU, when you need many cheap experiments to iterate quickly, or when serving will use a separate FP16 path.
Use FP16 LoRA instead when you can afford it on the available hardware (FP16 is faster and slightly more accurate), or when the 0.5 to 1.5 point accuracy gap would by itself make the job fail.
We use QLoRA when the model would not otherwise fit and FP16 LoRA when it would. Same hyperparameters, same eval gates, same multi-tenant adapter serving downstream.