Kensink Labs
Data pipeline | Direct LLM · benchmark-first | Production grade
FINE-TUNING · DATA

The pipeline that feeds production fine-tunes. Sourcing to feedback, eight named stages.

The model is the smaller engineering problem. The data pipeline is where fine-tunes actually live or die. Eight stages, every one with a named tool, every one with a published gotcha.

Presidio · Distilabel · DEITA · NeMo Curator · Argilla · DVC
Stages: 8 (Source · PII · Synth · Score · Dedup · Label · Version · Feedback)
Default PII: Presidio + Nightfall fallback
Quality scoring: DEITA + Nemotron Reward
Versioning: DVC or LakeFS
[THE PIPELINE]

Eight stages. Each with a named tool.

Production data engineering for fine-tuning is mostly about not cutting corners. Each stage has the tool we deploy by default, the trade-off, and what enterprise teams get wrong.

From raw data to signed dataset.

Source, redact, generate, score, dedup, label, version, then feed the loop.

01

Source

Logs + synthetic + human

02

Redact

Presidio + Nightfall fallback

03

Score + dedup

DEITA + SemDedup + MinHash

04

Label

Surge / Argilla

05

Version + sign

DVC / LakeFS + Sigstore

[STAGE BY STAGE]

Named tools, named trades.

01

Sourcing

Production logs · Human curated · Synthetic

Three legitimate sources: production logs (highest signal, biggest PII risk), human-curated (high cost, low volume), synthetic from a frontier model (cheap, diversity-risky). Most enterprise fine-tunes blend all three. Document each source's licensing posture before ingest.

02

PII detection and redaction

Presidio · Nightfall · Skyflow · OWASP LLM Top 10

Microsoft Presidio is the open-source default (recognizers, anonymizers, image redaction). Nightfall and Skyflow add managed services with higher recall. OWASP moved Sensitive Information Disclosure up to LLM02 in the 2025 Top 10 for LLM Applications. Caveat: 2025 research shows scrubbing alone does not block reconstruction attacks; combine it with access control and minimum-necessary scoping.
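A minimal sketch of the redact-at-ingest step, assuming a regex stand-in rather than Presidio itself so it runs with the standard library only. The patterns, placeholder tokens, and the `redact_record` shape are hypothetical; a real deployment would swap in proper recognizers and cover far more entity types.

```python
import re

# Illustrative patterns only (assumption): not Presidio's recognizers.
# Order matters: the loose PHONE pattern would otherwise swallow SSNs.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace each match with a <TYPE> placeholder; return (text, types found)."""
    found = []
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"<{label}>", text)
        if count:
            found.append(label)
    return text, found


def redact_record(record):
    """Return the training-safe copy of a record, carrying a PII flag.

    The raw record is left untouched so it can stay in the access-controlled store.
    """
    redacted, types = redact(record["text"])
    return {**record, "text": redacted, "pii_flag": bool(types), "pii_types": types}
```

Usage: call `redact_record` on every record at ingest and write only its return value to the training corpus. Regex matching has lower recall than a trained recognizer, so treat this as a shape for the step, not a substitute for the tool.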

03

Synthetic data generation

Distilabel · Self-Instruct · Evol-Instruct · Nemotron Reward

Distilabel (Argilla, Apache 2.0) is the pipeline framework: chain LLM calls, judge outputs, serialize to YAML, integrate Argilla for human review. Gretel (now NVIDIA) covers tabular data. Self-Instruct and Evol-Instruct remain the canonical recipes. NVIDIA Nemotron-4 340B Reward filters quality for open-source pipelines.
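A rough sketch of an Evol-Instruct-style loop, assuming a caller-supplied `generate` function that wraps whichever teacher model is in use. The operation wording, the prompt template, and the function names are hypothetical paraphrases for illustration, not the paper's prompts and not Distilabel's API.

```python
import random

# Paraphrased evolution operations (assumption): wording is illustrative only.
EVOLUTIONS = [
    "Add one more constraint or requirement to the instruction.",
    "Replace general concepts with more specific ones.",
    "Rewrite it so that it explicitly requires multi-step reasoning.",
]


def build_evolve_prompt(instruction, operation):
    """Build the prompt sent to the teacher model for one evolution step."""
    return (
        "Rewrite the instruction below into a more demanding version.\n"
        f"{operation}\n"
        "Return only the rewritten instruction.\n\n"
        f"Instruction:\n{instruction}"
    )


def evolve(seeds, generate, rounds=1, rng=None):
    """Evolve seed instructions for `rounds` generations.

    `generate` is any callable str -> str (hypothetical teacher wrapper).
    Returns the seeds plus every successful evolution.
    """
    rng = rng or random.Random(0)
    pool = list(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for instruction in frontier:
            operation = rng.choice(EVOLUTIONS)
            candidate = generate(build_evolve_prompt(instruction, operation)).strip()
            # Drop failed evolutions: empty or unchanged outputs.
            if candidate and candidate != instruction:
                next_frontier.append(candidate)
        pool.extend(next_frontier)
        frontier = next_frontier
    return pool
```

Everything `evolve` returns still needs the scoring and dedup stages; evolution raises difficulty but does nothing to guarantee quality or diversity on its own.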

04

Quality scoring

DEITA · NeMo Curator · Argilla scoring

DEITA (ICLR 2024) scores instruction data on complexity, quality, diversity and picks a budgeted subset that often beats training on the full dataset. NeMo Curator is NVIDIA's GPU-accelerated toolkit: exact + fuzzy + semantic dedup, 30+ heuristic filters, classifiers; powers Nemotron-CC.
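The budgeted selection can be sketched as score-first, diversity-aware picking: rank by complexity times quality, then admit an example only if it is not too close to anything already chosen. This is a simplified illustration, not the DEITA codebase; the field names, the `max_similarity` threshold, and the precomputed scores and embeddings are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_subset(examples, budget, max_similarity=0.9):
    """Pick up to `budget` examples, best score first, skipping near-repeats.

    Each example is a dict with hypothetical keys:
    "complexity", "quality" (floats) and "embedding" (list of floats).
    """
    ranked = sorted(
        examples, key=lambda e: e["complexity"] * e["quality"], reverse=True
    )
    selected = []
    for example in ranked:
        if len(selected) >= budget:
            break
        # Diversity gate: reject if too similar to anything already selected.
        if all(
            cosine(example["embedding"], kept["embedding"]) < max_similarity
            for kept in selected
        ):
            selected.append(example)
    return selected
```

The pairwise check is quadratic in the subset size, which is fine for a few thousand selected examples; larger budgets would want an approximate nearest-neighbour index.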

05

Deduplication

MinHash + LSH · SemDedup · datasketch

MinHash + LSH for surface-level near-dupes (NeMo Curator, datasketch). SemDedup (Abbas et al., Meta) for embedding-space semantic dupes. Skipping dedup is the single biggest data quality footgun on first fine-tunes.
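To make the surface-level half concrete, here is a standard-library sketch of MinHash over character shingles. It stands in for datasketch or NeMo Curator rather than reproducing either; the shingle size, the number of hash functions, and the function names are assumptions, and the LSH bucketing that makes this scale is left out.

```python
import hashlib


def shingles(text, k=5):
    """Character k-grams of the whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_perm=64):
    """One minimum per salted hash function; the salt plays the role of a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        item.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for item in items
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity of the sets."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Two records whose estimate clears a chosen threshold are treated as near-duplicates and one is dropped. Paraphrases share few shingles and will score low here, which is exactly the gap the embedding-space pass is meant to close.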

06

Labelling vendor

Surge AI · Scale AI · Argilla

Scale AI generated $2B in 2025 but lost OpenAI, Google, Anthropic after Meta's June 2025 investment of $14.3B for a 49% stake. Surge AI hit $1.2B ARR (~$25B valuation) and absorbed most of the frontier-lab pipeline. Argilla (now HuggingFace) is the open-source in-house option for teams building their own labelling.

07

Versioning

DVC · LakeFS · HF Datasets

Every fine-tune commit hash should map to a dataset commit hash. DVC for git-style data. LakeFS for object storage with branches. HuggingFace Datasets for managed and shareable corpora. Pin the version, sign the artifact, never train on a moving target.
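The commit-to-commit mapping can be sketched as a small manifest written next to each training run. This is an illustration of the idea using a plain content hash, not what DVC or LakeFS store internally; the manifest fields and function names are hypothetical, and signing the manifest is a separate step not shown here.

```python
import hashlib
import json


def dataset_digest(records):
    """SHA-256 over canonical JSON lines, so key order inside a record never matters.

    Record order does matter: reordering the dataset yields a new digest.
    """
    digest = hashlib.sha256()
    for record in records:
        line = json.dumps(
            record, sort_keys=True, ensure_ascii=False, separators=(",", ":")
        )
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def build_manifest(records, code_commit, dataset_name):
    """Tie one training-code commit to one exact dataset state."""
    return {
        "dataset": dataset_name,
        "dataset_sha256": dataset_digest(records),
        "num_records": len(records),
        "code_commit": code_commit,
    }
```

Before training starts, recompute the digest and refuse to run if it differs from the manifest. That check is what turns "pin the version" from a convention into a gate.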

08

Feedback capture

LangSmith · Helicone · W&B Weave

Production thumbs, edit deltas (what the user changed in the AI output), and rejection events become the seed for the next DPO or KTO round. Tools: LangSmith, Helicone, W&B Weave. The fine-tune is never the last fine-tune.
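A sketch of how those events might be turned into training rows, assuming a hypothetical event schema (`kind`, `prompt`, `model_output`, `edited_output`). Edits carry both a rejected and a preferred answer, so they map to DPO-style pairs; thumbs and rejections carry only a verdict, so they map to KTO-style unpaired rows. The output field names follow a common convention and are an assumption, not a requirement of any one trainer.

```python
def to_training_rows(events):
    """Split feedback events into (dpo_pairs, kto_rows).

    Each event is a dict with hypothetical keys:
    "kind" in {"edit", "thumbs_up", "thumbs_down", "rejection"},
    "prompt", "model_output", and for edits "edited_output".
    """
    dpo_pairs, kto_rows = [], []
    for event in events:
        kind = event["kind"]
        if kind == "edit":
            edited = event.get("edited_output")
            # An edit that changes nothing carries no preference signal.
            if edited and edited != event["model_output"]:
                dpo_pairs.append(
                    {
                        "prompt": event["prompt"],
                        "chosen": edited,
                        "rejected": event["model_output"],
                    }
                )
        elif kind in ("thumbs_up", "thumbs_down", "rejection"):
            kto_rows.append(
                {
                    "prompt": event["prompt"],
                    "completion": event["model_output"],
                    "label": kind == "thumbs_up",
                }
            )
    return dpo_pairs, kto_rows
```

Captured feedback is production data, so these rows go back through the redaction, dedup, and versioning stages before the next round rather than straight into a trainer.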

[WHAT YOU GET]

What's live on every data pipeline we ship.

PII: Redacted at ingest, never raw in training
SemDedup: Embedding-space dedup, not just MinHash
Versioned: DVC or LakeFS, dataset hash signed
Feedback: Thumbs + edits flowing to next round
[COMMON QUESTIONS]

What buyers ask before they sign.

What is the single biggest data mistake we see on incoming projects?
Skipping semantic dedup. MinHash catches near-duplicates at the surface level (overlapping character or word shingles) but misses paraphrased duplicates. SemDedup catches them at the embedding level. Without it, the model trains repeatedly on the same idea phrased differently, which makes the training loss look healthier than it is and overfits without warning.
Synthetic data — is it safe to use?
Yes, with discipline. Mix teachers (do not rely on one frontier model), use Evol-Instruct or similar to inject diversity, score with DEITA or a reward model, deduplicate with SemDedup against the real data, and keep a labelled holdout that is never synthetic. The DeepSeek-R1 lineage shows that synthetic data at scale works when the verifier is right.
How much labelled data do we actually need?
For LoRA SFT: 1k clean examples is a floor, 10k is comfortable. For full SFT: 10k floor, 100k+ comfortable. For DPO: 5k preference pairs, 20k to 50k typical. Quality beats quantity at every stage. Start with the smallest clean set you can label well.
Where does PII redaction sit in the pipeline?
Before any training touch, always. Presidio runs at ingest, every record gets a PII flag, redacted versions go to training, raw versions stay in an access-controlled lake. The training corpus and the production logs are never the same artifact.
FINE-TUNING · DATA · KENSINK LABS

Bring the corpus. We will engineer the pipeline.

Sourcing audit, PII pass, synthetic generation, quality gates, signed dataset versioned for compliance. Sized to the data you have, the data you want, and the residency you need.