Kensink Labs
Model EvaluationDirect LLM · no framework8-week engagement
MODEL EVALUATION · EVAL-FIRST DEVELOPMENT

Write the evals before you write the prompt.

Golden sets, hard assertions, soft LLM-as-judge, drift detection on production traffic. The eval suite is the contract — and every regression closes the gate before users see it.

TypeScriptPromptfooPostgreSQLOpenTelemetry
Cycle
8 weeks · gate-first
Stack
Promptfoo · TypeScript · OpenTelemetry
Output
Eval harness + CI gate + drift detector
Discipline
No prompt ships without a passing eval
[WHY THIS EXISTS]

Vibes are not a release gate.

Most LLM teams ship by feel. A prompt 'seems better,' so it goes out. A model upgrade 'feels smarter,' so the version pin moves. Three weeks later the support queue is on fire and nobody can isolate the regression. Evals make the gate measurable.

  • Golden eval set captured before the first prompt is written
  • Hard assertions on must-pass cases — the build fails if they break
  • Soft LLM-as-judge scoring for quality across edge cases
  • Drift detection on production traffic — alert before users notice
[HOW WE BUILD IT]

Evals as code, gates as discipline.

01

Golden set with the customer

Week-1 workshop: 30–100 must-pass examples mined from real support tickets, sales calls, or operator notes. Versioned alongside the prompt.

02

Hard + soft assertions

Deterministic checks (must contain X, must be valid JSON, must not contain Y) plus LLM-as-judge soft scoring on tone, completeness, and helpfulness.

03

CI gate

Every prompt change runs the suite in CI. Hard assertions failing → build red → no merge. Soft scores tracked over time for drift.

04

Production drift detector

Sampling pipeline pulls a slice of production traffic into the eval harness daily. Alerts fire when the score distribution shifts.

[OUTCOMES AT HANDOFF]

What's live at week eight.

100+
Golden examples covering must-pass paths
CI-gated
No prompt change merges without a green suite
Daily
Drift report on production sample
Days
From regression detected to fix shipped
DIRECT LLM · APPLIED K

Bring the problem.
We’ll bring the build.

Eight weeks, fixed price, eval suite at handoff. Direct LLM engineering on top of the K-Framework. Two Q3 slots remain.