★ CPT · Specialised

METHOD · CONTINUED PRETRAINING

Continued pretraining. When the domain has its own vocabulary.

Self-supervised next-token training on a large unlabelled domain corpus. It is the right step before SFT when the domain brings new vocabulary, tokenization patterns, or scripts (legal Latin, ICD codes, SMILES strings in chemistry, non-Latin languages). Typical runs use 1B to 100B tokens and cost $10k to $500k.

PyTorch FSDP · DeepSpeed · Megatron-LM · NeMo Curator

Data: 1B to 100B unlabelled tokens
Hardware: Multi-node H100/H200 cluster
Cost: $10k to $500k
Pairs with: SFT + DPO afterwards

[WHY THIS EXISTS]

SFT cannot fix vocabulary the base never saw.

A base model trained on the open web has seen English but not legal Latin, has seen markdown but not LaTeX, has seen Python but not industrial PLC code, has seen Mandarin but not under-resourced regional scripts. SFT teaches behaviour, not vocabulary. CPT teaches vocabulary first.

Self-supervised: predict the next token on raw domain corpus
Optional vocabulary extension (add tokens, expand embedding matrix)
A much longer run than SFT, at a low learning rate (3e-5 to 1e-4)
Then SFT and optional DPO to teach the actual task
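
To make the self-supervised objective concrete, here is a minimal sketch of the next-token loss in plain Python. The function name and the list-of-lists logits layout are illustrative assumptions; a real run would compute the same quantity with a framework's cross-entropy over batched tensors.

```python
import math


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: one list of vocabulary scores per input position.
    token_ids: the token ids of the same raw text sequence.
    """
    if len(logits) != len(token_ids):
        raise ValueError("need exactly one logit row per token")
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")

    total = 0.0
    # The last position has no next token, so it contributes no loss term.
    for position in range(len(token_ids) - 1):
        row = logits[position]
        target = token_ids[position + 1]
        # Numerically stable log-sum-exp: subtract the row maximum first.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(score - peak) for score in row))
        # Negative log-probability assigned to the true next token.
        total += log_norm - row[target]
    return total / (len(token_ids) - 1)
```

No labels are involved: the raw text supplies both the inputs and the targets, which is why an unlabelled corpus is enough for this stage.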

[THE PIPELINE]

CPT, end to end.

Curate domain corpus, optionally extend vocab, train, then SFT and DPO.

Raw domain corpus → Clean + dedup (NeMo Curator) → Tokenize (optional vocab extension) → CPT (1B to 100B tokens) → Checkpoint per 1B tokens → SFT + DPO on labelled domain data → Ship
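
The "checkpoint per 1B tokens" step has to be converted into optimizer steps before a run starts. A small sketch of that arithmetic follows; the batch size and sequence length in the usage note are hypothetical values, not settings taken from this page.

```python
def checkpoint_steps(total_tokens, global_batch_sequences, sequence_length,
                     tokens_per_checkpoint=1_000_000_000):
    """Convert a token budget into optimizer steps and checkpoint steps."""
    tokens_per_step = global_batch_sequences * sequence_length
    # Ceiling division: the final partial batch still needs a step.
    total_steps = -(-total_tokens // tokens_per_step)
    # Save at the last whole step that fits inside each token interval.
    steps_per_checkpoint = max(1, tokens_per_checkpoint // tokens_per_step)
    checkpoints = list(range(steps_per_checkpoint, total_steps + 1,
                             steps_per_checkpoint))
    return tokens_per_step, total_steps, checkpoints
```

For example, `checkpoint_steps(10_000_000_000, 512, 4096)` gives 2,097,152 tokens per step, 4,769 steps in total, and a checkpoint every 476 steps (ten checkpoints over the run).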

Curate the corpus

Clean, dedup (NeMo Curator or your own MinHash + SemDedup), filter for quality. Garbage tokens at this stage cost real money and degrade the base.
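
As a rough illustration of the MinHash half of that step, the sketch below drops near-duplicate documents using only the standard library. The shingle size, number of hash functions, and the 0.8 threshold are assumptions chosen for the example, and the pairwise comparison is quadratic; it is not how NeMo Curator implements deduplication.

```python
import hashlib


def shingles(text, size=3):
    """Return the set of overlapping word n-grams in a document."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(text, num_hashes=64, shingle_size=3):
    """One minimum hash value per salted hash function."""
    doc_shingles = shingles(text, shingle_size)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots, which estimates shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def drop_near_duplicates(documents, threshold=0.8):
    """Keep the first document of every near-duplicate group."""
    kept, kept_signatures = [], []
    for doc in documents:
        signature = minhash_signature(doc)
        if all(estimated_jaccard(signature, other) < threshold
               for other in kept_signatures):
            kept.append(doc)
            kept_signatures.append(signature)
    return kept
```

At corpus scale the all-pairs comparison would be replaced by locality-sensitive hashing over signature bands, and semantic deduplication (SemDedup) is a separate, embedding-based pass that this sketch does not cover.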

Vocab extension if needed

Train a domain tokenizer, identify high-frequency new tokens, extend the base tokenizer and embedding matrix. Initialize new embeddings carefully.
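
One common way to initialize new embeddings carefully is to set each new row to the mean of the rows for the sub-tokens the base tokenizer previously split that string into. The sketch below shows the idea on plain lists; the mean-of-pieces heuristic is an assumption about what "carefully" could mean here, and `old_tokenize` is a hypothetical callable standing in for the base tokenizer.

```python
def mean_init_embeddings(embedding_matrix, old_tokenize, new_tokens):
    """Append one row per new token, set to the mean of its old sub-token rows.

    embedding_matrix: list of rows (lists of floats), one per existing token id.
    old_tokenize: callable mapping a string to a list of existing token ids.
    new_tokens: strings being added to the vocabulary.
    """
    dim = len(embedding_matrix[0])
    extended = [row[:] for row in embedding_matrix]  # leave the input untouched
    new_ids = {}
    for token in new_tokens:
        piece_ids = old_tokenize(token)
        if not piece_ids:
            raise ValueError(f"base tokenizer produced no pieces for {token!r}")
        # Average the old rows dimension by dimension.
        mean_row = [
            sum(embedding_matrix[i][d] for i in piece_ids) / len(piece_ids)
            for d in range(dim)
        ]
        new_ids[token] = len(extended)
        extended.append(mean_row)
    return extended, new_ids
```

With Hugging Face Transformers, the extension itself is typically done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; the new rows can then be overwritten along these lines. If the output projection is not tied to the input embeddings, it needs the same treatment.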

CPT with low LR + replay

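Keep the peak LR between 3e-5 and 1e-4. Mix 5 to 20% replay data back in to limit forgetting: general-domain text resembling the base model's original training mix, plus instruction data if you started from an instruction-tuned checkpoint.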
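
A minimal sketch of how the two knobs in this step might look in code: a learning-rate schedule that peaks inside the stated range, and a sampler that interleaves replay documents at a fixed fraction. The linear warmup, cosine decay, 10% floor, and all function names are assumptions for illustration; the page itself only specifies the LR range and the replay share.

```python
import math
import random


def learning_rate(step, total_steps, peak_lr=5e-5, warmup_steps=100, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


def mixed_stream(domain_docs, replay_docs, replay_fraction=0.1, seed=0):
    """Yield (source, document) pairs until the domain corpus is exhausted.

    Each draw comes from the replay pool with probability replay_fraction,
    otherwise it is the next unseen domain document.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    domain_iter = iter(domain_docs)
    while True:
        if rng.random() < replay_fraction:
            # Replay documents are sampled with replacement.
            yield "replay", rng.choice(replay_docs)
        else:
            try:
                doc = next(domain_iter)
            except StopIteration:
                return  # one full pass over the domain corpus is done
            yield "domain", doc
```

Sampling replay with replacement keeps the share stable even when the replay pool is much smaller than the domain corpus, at the cost of repeating some replay documents.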

Stage 2 + 3: SFT then DPO

CPT alone teaches the vocabulary but not the task. Follow it with SFT on labelled domain data, then DPO on preference pairs.

[THE STACK WE'D DEPLOY]

What we run in production for continued pretraining.

PyTorch FSDP · Megatron-LM · DeepSpeed · NeMo Curator · Together AI CPT

[ACCURACY · COST · TRADE]

The numbers we measure continued pretraining on.

Token budget: 1B to 100B
Cost (8B base, 10B tokens): $10k to $30k cloud
Risk: Forgetting if replay is missing

When it earns the build

New vocabulary, new tokenization, foreign scripts, deep domain language (legal, biomedical, code, non-English).

When it doesn't

Narrow task adaptation (SFT alone), small data (under 1B tokens), domains the base already saw enough of.

[OUR TAKE]

Necessary for foreign-vocabulary domains, overkill for everything else.

We start with SFT and benchmark. If the gap to a target metric is structural (the base does not know the vocabulary), we add CPT. Otherwise we save the budget.

[COMMON QUESTIONS]

What buyers ask before they sign.

CPT or SFT for medical terminology? Try SFT first with domain-rich data. If the model still confuses ICD codes or anatomical terms, CPT on a curated corpus of medical literature is the right next step.
Do we need to extend the vocabulary? Only when the tokenization is genuinely inefficient on domain text (10x more tokens per character than English). For most enterprise domains, the existing tokenizer is fine.
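
The tokens-per-character comparison in the second answer is easy to measure before committing to vocabulary extension. A minimal sketch, assuming `tokenize` is any callable that returns a list of tokens for a string (in practice the base model's tokenizer); the toy tokenizer in the usage note is purely hypothetical.

```python
def tokens_per_character(texts, tokenize):
    """Average number of tokens the tokenizer emits per character of text."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_chars = sum(len(text) for text in texts)
    if total_chars == 0:
        raise ValueError("need at least one non-empty text")
    return total_tokens / total_chars


def fragmentation_ratio(domain_texts, reference_texts, tokenize):
    """How many times more tokens per character domain text costs than reference text."""
    return (tokens_per_character(domain_texts, tokenize)
            / tokens_per_character(reference_texts, tokenize))


def toy_tokenize(text, known_words=frozenset({"the", "patient", "has"})):
    """Hypothetical tokenizer: known words stay whole, anything else splits per character."""
    tokens = []
    for word in text.split():
        tokens.extend([word] if word in known_words else list(word))
    return tokens
```

Calling `fragmentation_ratio(["C1=CC=CC=C1"], ["the patient has"], toy_tokenize)` compares a SMILES string against plain English under the toy tokenizer. Run against the real tokenizer on representative samples, the resulting ratio is the figure to hold against the 10x rule of thumb above.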

02 · FINE-TUNING

Considering continued pretraining? Let's pressure-test it first.

We benchmark the cheap method first, name the trade, and only deploy the expensive one when the numbers force it. Sized to your data, your evals, your residency.
