Kensink Labs
10 · DOMAIN-ADAPTIVE PRETRAINING · SPECIALISED
METHOD · CONTINUED PRETRAINING

Continued pretraining. When the domain has its own vocabulary.

Self-supervised next-token training on a large unlabelled domain corpus. The right answer before SFT when the domain has new vocabulary, tokenization, or scripts (legal Latin, ICD codes, chemistry SMILES, non-Latin languages). Typically 1B to 100B tokens, $10k to $500k.

PyTorch FSDP · DeepSpeed · Megatron-LM · NeMo Curator

Data: 1B to 100B unlabelled tokens
Hardware: Multi-node H100/H200 cluster
Cost: $10k to $500k
Pairs with: SFT + DPO afterwards
[WHY THIS EXISTS]

SFT cannot fix vocabulary the base never saw.

A base model trained on the open web has seen English but not legal Latin, has seen markdown but not LaTeX, has seen Python but not industrial PLC code, has seen Mandarin but not under-resourced regional scripts. SFT teaches behaviour, not vocabulary. CPT teaches vocabulary first.

  • Self-supervised: predict the next token on raw domain corpus
  • Optional vocabulary extension (add tokens, expand embedding matrix)
  • Long, low-LR run (3e-5 to 1e-4), far longer than a typical SFT run
  • Then SFT and optional DPO to teach the actual task
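
A minimal sketch of the self-supervised objective from the list above: tokenised documents are concatenated, cut into fixed-length blocks, and each block is shifted by one position to produce next-token labels. The function names, the `eos_id` separator and the toy token ids are assumptions for illustration, not any framework's API.

```python
from typing import Iterable, Iterator, List, Tuple


def pack_sequences(
    docs: Iterable[List[int]], seq_len: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate tokenised documents (each followed by eos_id) and cut the
    stream into blocks of seq_len + 1 tokens. A trailing partial block is dropped."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= seq_len + 1:
            yield buffer[: seq_len + 1]
            buffer = buffer[seq_len + 1:]


def next_token_pair(block: List[int]) -> Tuple[List[int], List[int]]:
    """Inputs are the block without its last token; labels are the block
    without its first token, so labels[i] is the token that follows inputs[i]."""
    return block[:-1], block[1:]


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # hypothetical token ids
    for block in pack_sequences(toy_docs, seq_len=4, eos_id=0):
        print(next_token_pair(block))
```

No labels are needed beyond the raw text itself, which is why CPT can use an unlabelled corpus.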
[THE PIPELINE]

CPT, end to end.

Curate domain corpus, optionally extend vocab, train, then SFT and DPO.

Raw domain corpus → Clean + dedup (NeMo Curator) → Tokenize (optional vocab extension) → CPT (1B to 100B tokens) → Checkpoint per 1B tokens → SFT + DPO on labelled domain data → Ship
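
To make "checkpoint per 1B tokens" concrete, the sketch below converts a token interval into optimizer steps. The batch size and sequence length in the usage example are assumed values, not figures from this page, and `checkpoint_plan` is a hypothetical helper.

```python
from typing import List, Tuple


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating point."""
    return -(-a // b)


def checkpoint_plan(
    total_tokens: int,
    global_batch_sequences: int,
    seq_len: int,
    tokens_per_checkpoint: int = 1_000_000_000,
) -> Tuple[int, int, List[int]]:
    """Return (tokens per step, steps between checkpoints, checkpoint steps).
    The final training step is always included as a checkpoint."""
    tokens_per_step = global_batch_sequences * seq_len
    total_steps = ceil_div(total_tokens, tokens_per_step)
    every = ceil_div(tokens_per_checkpoint, tokens_per_step)
    steps = list(range(every, total_steps + 1, every))
    if not steps or steps[-1] != total_steps:
        steps.append(total_steps)
    return tokens_per_step, every, steps


if __name__ == "__main__":
    # Assumed run shape: 10B tokens, 1,024 sequences per step, 4,096 tokens each.
    per_step, every, steps = checkpoint_plan(10_000_000_000, 1024, 4096)
    print(per_step, every, len(steps))
```

With that assumed run shape (about 4.2M tokens per step), a checkpoint every 1B tokens works out to one every 239 steps.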
01 · Curate the corpus

Clean, dedup (NeMo Curator or your own MinHash + SemDedup), filter for quality. Garbage tokens at this stage cost real money and degrade the base.
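
For teams rolling their own near-duplicate filter, a minimal MinHash sketch using only the standard library might look like the following. The shingle size, the 128 hash functions, the 0.7 threshold and the greedy pairwise comparison are all illustrative assumptions; a production pipeline would typically add LSH banding to avoid comparing every pair, and this sketch does not cover semantic deduplication.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Lower-cased word k-grams; texts shorter than k words yield one shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(items: Set[str], num_perm: int = 128) -> List[int]:
    """One minimum per salted hash function; the salt plays the role of a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(docs: List[str], threshold: float = 0.7) -> List[int]:
    """Greedy O(n^2) filter: keep a document only if it is not a near-duplicate
    of any document already kept. Returns the indices of kept documents."""
    kept: List[int] = []
    kept_sigs: List[List[int]] = []
    for idx, doc in enumerate(docs):
        sig = minhash(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(idx)
            kept_sigs.append(sig)
    return kept
```

The quadratic loop is fine for a quick audit of a sample; it will not scale to a billion-token corpus as written.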

02 · Vocab extension if needed

Train a domain tokenizer, identify high-frequency new tokens, extend the base tokenizer and embedding matrix. Initialize new embeddings carefully.
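
One common way to "initialize carefully" is to start each new token's embedding at the mean of the embeddings of the pieces the old tokenizer split it into. The sketch below shows that idea on plain Python lists; the toy longest-match segmenter stands in for the base tokenizer, and every name here is hypothetical. In a real model the same operation would be applied to the embedding tensor (and to the output projection if it is not tied).

```python
from typing import Dict, List


def greedy_pieces(word: str, vocab: Dict[str, int]) -> List[str]:
    """Toy longest-match segmentation, standing in for the base tokenizer."""
    pieces: List[str] = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return pieces


def extend_vocab(
    vocab: Dict[str, int], embeddings: List[List[float]], new_tokens: List[str]
) -> None:
    """Append one embedding row per new token, initialised to the mean of the
    rows of the pieces the existing vocabulary splits that token into."""
    for token in new_tokens:
        if token in vocab:
            continue  # already a single token, nothing to add
        rows = [embeddings[vocab[piece]] for piece in greedy_pieces(token, vocab)]
        mean_row = [sum(column) / len(rows) for column in zip(*rows)]
        vocab[token] = len(embeddings)
        embeddings.append(mean_row)


if __name__ == "__main__":
    vocab = {"met": 0, "form": 1, "in": 2}               # toy vocabulary
    embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]    # toy 2-d embeddings
    extend_vocab(vocab, embeddings, ["metformin"])
    print(vocab["metformin"], embeddings[-1])
```

Mean initialisation keeps the new row inside the region of embedding space the model already uses, which tends to be gentler on early training than random initialisation.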

03 · CPT with low LR + replay

LR 3e-5 to 1e-4. Mix 5 to 20% general-domain data, close to what the base model was originally trained on, back in (replay) to limit forgetting.
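
A minimal sketch of replay mixing, assuming examples are drawn one at a time and the replay set is small enough to cycle through. `mix_with_replay` and its arguments are hypothetical names; real data loaders usually mix at the shard or batch level instead.

```python
import itertools
import random
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def mix_with_replay(
    domain: Iterable[T],
    replay: Iterable[T],
    replay_fraction: float,
    seed: int = 0,
) -> Iterator[Tuple[str, T]]:
    """Yield (source, example) pairs. Each draw comes from the replay set with
    probability replay_fraction, otherwise from the domain stream. The replay
    set must be non-empty and is cycled; the stream ends when the domain data
    is exhausted."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    domain_iter = iter(domain)
    replay_iter = itertools.cycle(replay)
    while True:
        if rng.random() < replay_fraction:
            yield "replay", next(replay_iter)
        else:
            try:
                yield "domain", next(domain_iter)
            except StopIteration:
                return


if __name__ == "__main__":
    stream = list(mix_with_replay(range(9000), ["general text"], replay_fraction=0.1))
    share = sum(1 for source, _ in stream if source == "replay") / len(stream)
    print(f"replay share: {share:.3f}")
```

Fixing the seed keeps the mix reproducible across restarts, which matters when a run is resumed from a checkpoint.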

04 · Stage 2 + 3: SFT then DPO

CPT alone teaches vocabulary but not the task. SFT on labelled domain data, then DPO on preferences.
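
For readers unfamiliar with the preference stage, the sketch below computes the standard DPO loss for a single preference pair from summed response log-probabilities. The argument names and the `beta=0.1` default are illustrative assumptions; a training library would compute this over batches of tensors.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the log-probability of a full response, summed over its
    tokens, under either the policy being trained or the frozen reference
    model (here, the SFT checkpoint).
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form for both signs.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response relative to the reference.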

[THE STACK WE'D DEPLOY]

What we run in production for continued pretraining.

PyTorch FSDP · Megatron-LM · DeepSpeed · NeMo Curator · Together AI CPT
[ACCURACY · COST · TRADE]

The numbers we measure continued pretraining on.

Token budget: 1B to 100B
Cost (8B base, 10B tokens): $10k to $30k cloud
Risk: Forgetting if replay is missing
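
When sizing a budget, a back-of-envelope estimate of raw training compute can be a useful sanity check. The sketch below uses the common approximation of about 6 FLOPs per parameter per token; the peak throughput, utilisation and hourly price are assumed placeholder values to replace with your own. It counts training FLOPs only and leaves out data curation, evaluation, restarts, storage and engineering time, so read it as a floor on compute rather than a reproduction of any all-in project figure.

```python
from typing import Tuple


def cpt_compute_cost(
    params: float,
    tokens: float,
    peak_flops_per_gpu: float = 989e12,  # assumed dense BF16 peak for one GPU
    mfu: float = 0.4,                    # assumed model FLOPs utilisation
    usd_per_gpu_hour: float = 3.0,       # assumed cloud price
) -> Tuple[float, float]:
    """Return (gpu_hours, usd) for the forward and backward passes alone."""
    total_flops = 6.0 * params * tokens          # ~6 FLOPs per parameter per token
    gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)
    gpu_hours = gpu_seconds / 3600.0
    return gpu_hours, gpu_hours * usd_per_gpu_hour


if __name__ == "__main__":
    hours, usd = cpt_compute_cost(params=8e9, tokens=10e9)
    print(f"{hours:,.0f} GPU-hours, ${usd:,.0f} raw compute")
```

The estimate scales linearly in both model size and token count, so doubling either doubles the compute floor.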
When it earns the build

New vocabulary, new tokenization, foreign scripts, deep domain language (legal, biomedical, code, non-English).

When it doesn't

Narrow task adaptation (SFT alone), small data (under 1B tokens), domains the base already saw enough of.

[OUR TAKE]

Necessary for foreign-vocabulary domains, overkill for everything else.

We start with SFT and benchmark. If the gap to a target metric is structural (the base does not know the vocabulary), we add CPT. Otherwise we save the budget.

[COMMON QUESTIONS]

What buyers ask before they sign.

CPT or SFT for medical terminology?
Try SFT first with domain-rich data. If the model still confuses ICD codes or anatomical terms, CPT on a curated corpus of medical literature is the right next step.
Do we need to extend the vocabulary?
Only when the tokenization is genuinely inefficient on domain text (roughly 10x more tokens per character than on English text). For most enterprise domains, the existing tokenizer is fine.
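
A quick way to check this is to measure tokens per character on a domain sample and on an English reference sample with the same tokenizer. The sketch below does that with a toy byte-fallback tokenizer standing in for the real one; `byte_fallback_tokenize`, the tiny vocabulary and the sample strings are all assumptions for illustration.

```python
from typing import Callable, List, Set


def byte_fallback_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Toy tokenizer: a whitespace-separated word found in vocab is one token,
    anything else falls back to one token per UTF-8 byte."""
    tokens: List[str] = []
    for word in text.split():
        if word in vocab:
            tokens.append(word)
        else:
            tokens.extend(f"<0x{byte:02X}>" for byte in word.encode("utf-8"))
    return tokens


def tokens_per_char(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens emitted per character of input."""
    return len(tokenize(text)) / max(1, len(text))


def inefficiency_ratio(
    domain_text: str, reference_text: str, tokenize: Callable[[str], List[str]]
) -> float:
    """How many times more tokens per character the domain text costs
    compared with the reference text."""
    return tokens_per_char(domain_text, tokenize) / tokens_per_char(
        reference_text, tokenize
    )


if __name__ == "__main__":
    vocab = {"the", "patient", "was", "given", "drug"}
    english = "the patient was given the drug"
    # Two words in Ethiopic script, written with escapes to stay ASCII-safe.
    domain = "\u1230\u120b\u121d \u1208\u12d3\u1208\u121d"
    ratio = inefficiency_ratio(domain, english, lambda t: byte_fallback_tokenize(t, vocab))
    print(f"{ratio:.1f}x more tokens per character")
```

In practice, swap the toy tokenizer for the base model's own and run both samples at a few thousand characters or more; a ratio near the 10x mark above suggests vocabulary extension is worth evaluating.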
FINE-TUNING · KENSINK LABS

Considering continued pretraining? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.