Golden set with the customer
Week-1 workshop: 30–100 must-pass examples mined from real support tickets, sales calls, or operator notes. Versioned alongside the prompt.
Golden sets, hard assertions, soft LLM-as-judge, drift detection on production traffic. The eval suite is the contract — and every regression closes the gate before users see it.
Most LLM teams ship by feel. A prompt 'seems better,' so it goes out. A model upgrade 'feels smarter,' so the version pin moves. Three weeks later the support queue is on fire and nobody can isolate the regression. Evals make the gate measurable.
Week-1 workshop: 30–100 must-pass examples mined from real support tickets, sales calls, or operator notes. Versioned alongside the prompt.
Deterministic checks (must contain X, must be valid JSON, must not contain Y) plus LLM-as-judge soft scoring on tone, completeness, and helpfulness.
Every prompt change runs the suite in CI. Hard assertions failing → build red → no merge. Soft scores tracked over time for drift.
Sampling pipeline pulls a slice of production traffic into the eval harness daily. Alerts fire when the score distribution shifts.