★ Model Evaluation · Direct LLM · no framework · 8-week engagement

MODEL EVALUATION · EVAL-FIRST DEVELOPMENT

Write the evals before you write the prompt.

Golden sets, hard assertions, soft LLM-as-judge, drift detection on production traffic. The eval suite is the contract — and every regression closes the gate before users see it.

TypeScript · Promptfoo · PostgreSQL · OpenTelemetry

Start this engagement → · All LLM services →

Cycle

8 weeks · gate-first

Stack

Promptfoo · TypeScript · OpenTelemetry

Output

Eval harness + CI gate + drift detector

Discipline

No prompt ships without a passing eval

[WHY THIS EXISTS]

Vibes are not a release gate.

Most LLM teams ship by feel. A prompt 'seems better,' so it goes out. A model upgrade 'feels smarter,' so the version pin moves. Three weeks later the support queue is on fire and nobody can isolate the regression. Evals make the gate measurable.

Golden eval set captured before the first prompt is written
Hard assertions on must-pass cases — the build fails if they break
Soft LLM-as-judge scoring for quality across edge cases
Drift detection on production traffic — alert before users notice

[HOW WE BUILD IT]

Evals as code, gates as discipline.

01

Golden set with the customer

Week-1 workshop: 30–100 must-pass examples mined from real support tickets, sales calls, or operator notes. Versioned alongside the prompt.
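As a rough illustration of what "versioned alongside the prompt" could look like, here is a minimal sketch of a golden-set record with a sanity check run before the suite. The type names, field names, and the `validateGoldenSet` helper are hypothetical and chosen for illustration; the 30-example floor mirrors the lower bound of the workshop target above.

```typescript
// Hypothetical shape for one golden example; all field names are illustrative.
interface GoldenExample {
  id: string;
  source: "support_ticket" | "sales_call" | "operator_note";
  input: string;
  mustContain: string[];
  mustNotContain: string[];
}

// The set carries the prompt version it was captured against,
// so both can be committed and reviewed together.
interface GoldenSet {
  promptVersion: string;
  examples: GoldenExample[];
}

// Returns a list of problems; an empty list means the set is usable.
function validateGoldenSet(set: GoldenSet, minSize = 30): string[] {
  const problems: string[] = [];
  if (set.examples.length < minSize) {
    problems.push(
      `expected at least ${minSize} examples, got ${set.examples.length}`
    );
  }
  const seen = new Set<string>();
  for (const ex of set.examples) {
    if (seen.has(ex.id)) problems.push(`duplicate id: ${ex.id}`);
    seen.add(ex.id);
    if (ex.input.trim() === "") problems.push(`empty input: ${ex.id}`);
    if (ex.mustContain.length === 0 && ex.mustNotContain.length === 0) {
      problems.push(`no assertions: ${ex.id}`);
    }
  }
  return problems;
}
```

Keeping the set as typed data in the repository means a change to an example shows up in the same diff as the prompt change that motivated it.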

02

Hard + soft assertions

Deterministic checks (must contain X, must be valid JSON, must not contain Y) plus LLM-as-judge soft scoring on tone, completeness, and helpfulness.
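The split between hard and soft assertions can be sketched roughly as follows. This is an illustrative sketch, not the engagement's actual harness: `HardCheck`, `runHardChecks`, `SoftScores`, and `meanSoftScore` are hypothetical names, and the judge scores are assumed to arrive already normalized to the range 0 to 1 from a separate LLM-as-judge call that is not shown here.

```typescript
// Deterministic checks: each one either passes or produces a failure message.
type HardCheck =
  | { kind: "contains"; value: string }
  | { kind: "notContains"; value: string }
  | { kind: "validJson" };

// Returns the failure messages for one model output; empty means all passed.
function runHardChecks(output: string, checks: HardCheck[]): string[] {
  const failures: string[] = [];
  for (const check of checks) {
    switch (check.kind) {
      case "contains":
        if (!output.includes(check.value)) {
          failures.push(`missing "${check.value}"`);
        }
        break;
      case "notContains":
        if (output.includes(check.value)) {
          failures.push(`forbidden "${check.value}"`);
        }
        break;
      case "validJson":
        try {
          JSON.parse(output);
        } catch {
          failures.push("output is not valid JSON");
        }
        break;
    }
  }
  return failures;
}

// Assumed judge output: one score in [0, 1] per criterion named above.
interface SoftScores {
  tone: number;
  completeness: number;
  helpfulness: number;
}

// Unweighted mean; a real suite might weight criteria differently.
function meanSoftScore(scores: SoftScores): number {
  return (scores.tone + scores.completeness + scores.helpfulness) / 3;
}
```

Hard checks return messages rather than a boolean so that a red build can say which rule broke on which example.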

03

CI gate

Every prompt change runs the suite in CI. Hard assertions failing → build red → no merge. Soft scores tracked over time for drift.
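The gate rule described above (hard failures block the merge, soft scores are recorded but do not block) might reduce to something like the sketch below. The `CaseResult` and `GateDecision` shapes and the `decideGate` function are hypothetical; treating an empty suite as a failure is an added assumption, not something the text above specifies.

```typescript
// One evaluated golden example; names are illustrative.
interface CaseResult {
  id: string;
  hardFailures: string[]; // messages from deterministic checks
  softScore: number; // judge score in [0, 1]
}

interface GateDecision {
  pass: boolean;
  failedIds: string[];
  meanSoftScore: number; // tracked over time, not gated on
}

function decideGate(results: CaseResult[]): GateDecision {
  const failedIds = results
    .filter((r) => r.hardFailures.length > 0)
    .map((r) => r.id);
  const meanSoftScore =
    results.length === 0
      ? 0
      : results.reduce((sum, r) => sum + r.softScore, 0) / results.length;
  // Assumption: a suite that ran zero cases must not count as green.
  const pass = results.length > 0 && failedIds.length === 0;
  return { pass, failedIds, meanSoftScore };
}
```

A CI runner would then exit non-zero whenever `pass` is false and store `meanSoftScore` with the commit for later trend analysis.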

04

Production drift detector

Sampling pipeline pulls a slice of production traffic into the eval harness daily. Alerts fire when the score distribution shifts.
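One simple way to decide that "the score distribution shifts" is a two-sample Kolmogorov-Smirnov statistic between a baseline window and the latest daily sample. The sketch below assumes that technique for illustration only; the text above does not name a specific test, and the `ksStatistic` and `driftDetected` names and the 0.3 threshold are hypothetical.

```typescript
// Two-sample KS statistic: the largest gap between the two empirical CDFs.
// Returns a value in [0, 1]; 0 means identical distributions of scores.
function ksStatistic(a: number[], b: number[]): number {
  if (a.length === 0 || b.length === 0) {
    throw new Error("both samples must be non-empty");
  }
  const xs = [...a].sort((x, y) => x - y);
  const ys = [...b].sort((x, y) => x - y);
  let i = 0;
  let j = 0;
  let d = 0;
  while (i < xs.length && j < ys.length) {
    // Step past every value <= the next smallest one in either sample,
    // then compare the fraction of each sample seen so far.
    const v = Math.min(xs[i], ys[j]);
    while (i < xs.length && xs[i] <= v) i++;
    while (j < ys.length && ys[j] <= v) j++;
    d = Math.max(d, Math.abs(i / xs.length - j / ys.length));
  }
  return d;
}

// Illustrative alert rule; the threshold would be tuned per product.
function driftDetected(
  baseline: number[],
  today: number[],
  threshold = 0.3
): boolean {
  return ksStatistic(baseline, today) > threshold;
}
```

With small daily samples the statistic is noisy, so a production version would likely add a minimum sample size or compare against a rolling window before alerting.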

[OUTCOMES AT HANDOFF]

What's live at week eight.

100+

Golden examples covering must-pass paths

CI-gated

No prompt change merges without a green suite

Daily

Drift report on production sample

Days

From regression detected to fix shipped

[ALSO WORTH READING]

Related LLM engagements.

Feedback Training

Read the engagement

LLM Observability

Read the engagement

Production Agents

Read the engagement

DIRECT LLM · APPLIED K

Bring the problem.
We’ll bring the build.

Eight weeks, fixed price, eval suite at handoff. Direct LLM engineering on top of the K-Framework. Two Q3 slots remain.

Start this engagement → · Read the K-Framework