Continued pretraining (CPT) is self-supervised next-token training on a large unlabelled domain corpus. It is the right step before SFT when the domain brings new vocabulary, tokenization, or scripts (legal Latin, ICD codes, chemistry SMILES, non-Latin languages). A typical run covers 1B to 100B tokens and costs $10k to $500k.
A base model trained on the open web has seen English but not legal Latin, has seen markdown but not LaTeX, has seen Python but not industrial PLC code, has seen Mandarin but not under-resourced regional scripts. SFT teaches behaviour, not vocabulary. CPT teaches vocabulary first.
The pipeline: curate the domain corpus, optionally extend the vocabulary, train, then run SFT and DPO.
Clean, dedup (NeMo Curator or your own MinHash + SemDedup), filter for quality. Garbage tokens at this stage cost real money and degrade the base.
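As a rough illustration of the "your own MinHash" route, here is a minimal pure-Python sketch of near-duplicate removal. It is not how NeMo Curator works internally; the function names, the 5-word shingles, the 64 hash functions, and the 0.8 threshold are all assumptions chosen for readability. It compares every pair of signatures, so a real corpus would need LSH banding or a dedicated library instead.

```python
import hashlib
import re


def shingles(text, k=5):
    """Return the set of k-word shingles of a document (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """One minimum hash value per seeded hash function; empty input gives an empty signature."""
    if not items:
        return ()
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return tuple(signature)


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of the Jaccard similarity."""
    if not sig_a or not sig_b:
        return 0.0
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_dedup(docs, threshold=0.8, k=5, num_perm=64):
    """Keep the first document of each near-duplicate group (quadratic, illustration only)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc, k), num_perm)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue  # too similar to a document we already kept
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

Semantic dedup (SemDedup) and quality filtering would run as separate passes; this sketch only covers the lexical near-duplicate step.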
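Train a domain tokenizer, identify high-frequency new tokens, and extend the base tokenizer and embedding matrix to include them. Initialize the new embeddings carefully, for example as the mean of the existing subword embeddings each new token replaces, rather than at random.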
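The vocabulary-extension step can be sketched in a few lines. This is a toy version under stated assumptions: `base_tokenize` is a hypothetical callable standing in for the base tokenizer, embeddings are plain lists of floats rather than a framework tensor, and mean-of-subwords initialization is one common choice, not necessarily the one used here. The thresholds `min_count` and `min_pieces` are illustrative.

```python
from collections import Counter


def pick_new_tokens(domain_words, base_tokenize, min_count=2, min_pieces=3, top_k=1000):
    """Select frequent domain words that the base tokenizer splits into many pieces."""
    counts = Counter(domain_words)
    candidates = [
        (word, n) for word, n in counts.items()
        if n >= min_count and len(base_tokenize(word)) >= min_pieces
    ]
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in candidates[:top_k]]


def extend_embeddings(vocab, embeddings, new_tokens, base_tokenize):
    """Return copies of vocab and embeddings with one mean-initialized row per new token."""
    vocab = dict(vocab)
    embeddings = [list(row) for row in embeddings]
    for token in new_tokens:
        if token in vocab:
            continue  # already present, nothing to add
        pieces = [p for p in base_tokenize(token) if p in vocab]
        if not pieces:
            raise ValueError(f"no known subword pieces for {token!r}")
        rows = [embeddings[vocab[p]] for p in pieces]
        # New row = element-wise mean of the rows of the pieces it replaces.
        mean_row = [sum(column) / len(rows) for column in zip(*rows)]
        vocab[token] = len(embeddings)
        embeddings.append(mean_row)
    return vocab, embeddings
```

In a real run the same idea applies to both the input embedding matrix and the output (LM head) matrix when they are not tied.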
Use a learning rate of 3e-5 to 1e-4. Mix 5 to 20% of the original instruction-tuning data back in (replay) to limit forgetting.
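To make the replay ratio concrete, here is a small sketch that builds a shuffled training stream in which replay examples make up a chosen fraction of the total. The function name and the document-level (rather than token-level) accounting are assumptions; a production data loader would normally balance by token count.

```python
import random


def mix_with_replay(domain_docs, replay_docs, replay_fraction=0.1, seed=0):
    """Return a shuffled list of (source, doc) pairs with about `replay_fraction` replay docs."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    # Solve n_replay / (n_domain + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(domain_docs) * replay_fraction / (1.0 - replay_fraction))
    if n_replay > 0 and not replay_docs:
        raise ValueError("replay_docs is empty but replay was requested")
    rng = random.Random(seed)
    # Sample replay docs with replacement so a small replay pool still works.
    replay_sample = [rng.choice(replay_docs) for _ in range(n_replay)]
    mixed = [("domain", d) for d in domain_docs] + [("replay", r) for r in replay_sample]
    rng.shuffle(mixed)
    return mixed
```

With 90 domain documents and `replay_fraction=0.1`, the stream contains 10 replay documents, so replay is 10% of the 100 total.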
CPT alone teaches vocabulary but not the task. Follow it with SFT on labelled domain data, then DPO on preferences.
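For readers unfamiliar with what "DPO on preferences" optimizes, the per-example DPO loss can be written as a short function. This is a scalar sketch of the standard formula, assuming the four inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model; `beta=0.1` is an illustrative default, and a real run would use a training library rather than this function.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response relative to the reference.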
Use CPT for: new vocabulary, new tokenization, foreign scripts, deep domain language (legal, biomedical, code, non-English).
Skip CPT for: narrow task adaptation (SFT alone is enough), small data (under 1B tokens), domains the base already saw enough of.
We start with SFT and benchmark. If the gap to a target metric is structural (the base does not know the vocabulary), we add CPT. Otherwise we save the budget.
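One cheap signal for whether a gap is structural is tokenizer fertility: how many tokens the base tokenizer needs per word on domain text compared with general text. The sketch below is an assumption about how such a check could look, not a description of the benchmark used here; `tokenize` is a hypothetical callable for the base tokenizer, and the 1.5 ratio threshold is arbitrary and would need calibrating per model.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    return n_tokens / n_words if n_words else 0.0


def looks_structural(domain_texts, general_texts, tokenize, ratio_threshold=1.5):
    """Flag a likely vocabulary gap when domain text fragments far more than general text."""
    general = fertility(general_texts, tokenize)
    if general == 0:
        raise ValueError("general_texts produced no tokens")
    domain = fertility(domain_texts, tokenize)
    return domain / general >= ratio_threshold
```

A high ratio only suggests that the base tokenizer fragments the domain vocabulary; the SFT benchmark remains the deciding evidence.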