The model demo that died in week three.
Shipped on a vibes-based prompt. No eval suite. First production user input outside the demo set returned garbage. The team blamed the user.
Our operating system for shipping AI products that survive production.
Three pillars, sixteen layers, one feedback loop. Built from the production incidents we’ve shipped through. Every layer is here because we’ve watched something break when it wasn’t. It’s the discipline that separates a system that survives its first month in production from AI slop that demos well and dies in week three.

They don’t fail because the models are bad. They fail because the team skips data strategy, skips the eval engine, skips automated rollback, skips intellectual ownership — and ships an LLM call wrapped in a UI.
The K-Framework names every layer that has to exist for an AI product to survive its first production incident. We built it from the production incidents we’ve already shipped through. Each layer is here because we have personally watched something break when it wasn’t.
The recipe for AI slop: ship the model, skip the eval, claim transformative results, take the retainer.
The K is the opposite — every layer is a stop you don’t get to skip.
Shipped on a vibes-based prompt. No eval suite. First production user input outside the demo set returned garbage. The team blamed the user.
Bought into a framework. Six months later the framework rewrote its memory layer. The team is doing a migration instead of shipping. Now they're auditing four frameworks for a replacement they'll also migrate off.
Token economics were 'we'll figure it out post-launch'. Post-launch, the per-customer compute cost exceeded the per-customer price. Nobody had instrumented cost-per-intent, so nobody could fix it without rewriting the feature.
Edge cases the eval suite didn't cover. The model hallucinated, the system shipped the hallucination to a customer, the customer screenshot went viral. The team added a human reviewer — at three engineers per shift, the unit economics inverted.
Human-adopted means a human owns the decision, the AI accelerates the work, and the system surfaces what humans need to look at. The team that uses the system is sharper because of it — not dependent on it. Not replaced by it.
Real use-case means we’ve named the user, the moment in their workflow, and the outcome we’re changing — before we touch a model API. Anything that fails to specify those three is an experiment, not a product.
The K-Framework forces both. Every layer in Foundations names a human responsibility. Every layer in Amplification names a specific lever (not a buzzword). Every layer in Judgment is about who decides — and how the team gets sharper, not weaker.
The shape of the K is the shape of the framework. The vertical spine is what holds the system up. The two horizontal arms are the leverage and the judgment that make the system sharper over time.
The bedrock. Skip any layer here and you build on sand.
The leverage layer. How a senior team ships ten times faster without skipping discipline.
What separates senior teams from junior ones. AI amplifies both correct and incorrect decisions — Judgment makes sure they're correct.
The bedrock. Skip any layer here and you build on sand.
Foundations are the vertical spine of the K. They’re the layers a project either has or doesn’t — there’s no halfway. A system without data strategy is not a system; it is an experiment in production.
Every engagement opens with a Foundations review. If the client’s existing system is missing a layer, we either add it or we’re honest that we can’t ship a production AI product on top of it yet.
Architect scalable, resilient systems.
Design every system to survive its own success. "Scalable" is not a buzzword; it's a concrete answer to: what fails first at 10× load, how does the system degrade gracefully, and what does the on-call engineer see at 3 a.m.
Architectures designed by importing a reference diagram from a blog post. Works for the demo, falls over the first time the user behaves unexpectedly.
Reference architecture is signed off before code lands. Every component has an explicit failure mode and a graceful degradation path. NFRs (availability, performance, recoverability) are written and gated in CI.
Collect, structure, govern.
Data is the substrate of every AI feature. Schema, lineage, retention, and consent boundaries get defined before we train, fine-tune, or prompt. The fastest LLM in the world cannot recover from bad data discipline.
Ingesting whatever's available, hoping the model can clean it up. RAG over a hairball. PII surfacing in completions. Lawyers calling.
Data contract first. Lineage tracked from source through retrieval. PII tagged at the schema level and gated in prompt assembly. Retention policy reviewed against the firm's compliance requirements.
Understand core algorithms.
We don't import-and-pray. Knowing why an algorithm works — and when it doesn't — is non-negotiable. The team that built the system can debug it at the algorithmic level when the abstraction leaks.
Treating the LLM as a black box. Stack Overflow-driven retrieval tuning. When the system degrades, nobody can say whether it's the retrieval, the rerank, the prompt, or the model.
Engineers can defend every algorithmic choice — embedding model, retrieval strategy, reranker, decoding parameters — back to first principles. Trade-offs are documented in ADRs.
Principles, fairness, privacy, alignment.
Bias audits, PII handling, and alignment evaluations are built into the eval suite from day one — not bolted on after launch when legal asks. Safety is a feature of the architecture, not an afterthought.
"We'll do the safety review before launch" — said three months ago, never started. A regulator finds the issue first, and the engagement turns into a remediation project.
Safety eval suite alongside the functional one. Red-team prompts in CI. PII detection on inputs + outputs. Alignment checks against the firm's policy stack. Reviewed by the client's InfoSec before each release.
Test, validate, secure, document.
Every line of code is a liability until it's tested. We write less code, test it more, and document the surface our clients have to maintain. A smaller well-tested codebase is easier to extend than a sprawling one.
AI-generated code dropped into the repo without review. Test coverage at 14%. The documentation is the source. The team that takes over after handoff can't extend without rewriting.
Strict TypeScript. Tests gate merges. Architecture decision records (ADRs) accompany non-trivial changes. The codebase reads like a manual — because at handoff, it is.
Anticipate impact, design for longevity.
Six-month, eighteen-month, three-year horizons. Decisions taken today are constraints for the team that owns the system in 2027. We name them — out loud, in writing — so the constraint is intentional, not accidental.
A stack that locked the team into the agent framework du jour. Two years later the framework is dead, the codebase is unmaintained, and the rebuild costs more than the original engagement.
Direct-to-the-model wherever possible. Boring infra (Postgres, queues, edge compute). Choices that age well. Yearly architecture review at the client's request — paid hour, no agenda, just "what would you change today?"
The leverage layer. How a senior team ships ten times faster without skipping discipline.
Amplification is the top arm of the K. It’s where the compounding happens — when the right tools, automated workflows, gated evals, watched economics, and reversible deploys all click together, a senior team doesn’t just ship faster, it ships more confidently.
These five layers are the difference between a team that improvises every release and a team that operates like an actual studio. Each layer is a forcing function for discipline.
Choose & build the right tools.
We don't follow framework hype. The right tool for the job is the simplest one that survives the audit. More often than not, that means going directly against the model API — no orchestration vendor, no agent framework, no lock-in to someone else's release schedule.
Adopting an agent framework because it's on Twitter. Six months later, the framework rewrites its memory layer in a breaking change. Now half the engineering capacity is a migration project.
Tool selection is a decision document, not a vibe. We score against: API stability, vendor lock-in cost, what breaks if the tool dies, debuggability when it does. Most of the time the answer is the model API + Postgres + your existing infra.
Automate pipelines, workflows, ops.
What gets done by humans twice gets automated. CI is a forcing function, not a chore — every test that runs locally also runs on every PR. Data pipelines, deploys, on-call runbooks: all automated, all version-controlled, all runnable in <5 minutes.
Manual deploys. Notion-doc runbooks that drift. A staging environment that doesn't match production. The on-call engineer Google-Doc-ing through the incident.
Deploys via PR-driven CI. Runbooks as code (executable, version-controlled). Staging is a parity copy of production. Synthetic monitoring runs the user journey every five minutes — we know the system is broken before the user does.
Measure, benchmark, iterate.
Without an eval suite, AI projects are vibes. We build the eval before the feature. Every release passes through it. Every model swap is gated on it. The eval is the contract — between us and the client, between the team and the production system.
"We'll do the eval next sprint." That sprint never arrives. Releases ship based on a single test prompt and the team's intuition. Regressions slip through. The model behaves differently in production than in dev.
Labeled holdout set, 500+ examples per intent. Per-field precision/recall. Per-archetype pass rate. A red-team test set. The eval runs on every PR. Production gated on threshold. Threshold reviewed quarterly.
Optimize cost, context, compute.
The unit economics of an LLM-powered feature are not "infinite tokens, never look at the bill." We model cost-per-intent, alert on regression, and rotate models when a smaller model saves 60% with no quality drop. Cost discipline is a feature of the architecture.
Bill arrives at end of month. Nobody knew the prompt grew to 18k tokens because of a prompt-template change. Per-customer compute cost has exceeded per-customer revenue for three weeks. The fix is rewriting the prompt assembly layer.
Cost-per-call metered. Cost-per-intent budget per customer. Alerts on cost-regression PRs. Quarterly review of model choice — newer cheaper model? Distilled local fallback? RAG over prompt-stuffing?
Safe deploys, reversible changes.
Deploys without rollback are bets. Every release is paired with a rollback path that someone has actually tested. Feature flags for the things you want to half-ship. Schema migrations that work both directions. The team can recover from a bad deploy in under 5 minutes.
Bad deploy at 2 a.m. The rollback is a Confluence page nobody has run in a year. The migration is forward-only. The team is hand-editing the database in production. Customer trust takes the hit.
Blue/green or canary deploys. Feature flags as first-class infra. Schema migrations validated both directions in CI. A rollback procedure that's been run on staging the same week as launch.
What separates senior teams from junior ones. AI amplifies both correct and incorrect decisions — Judgment is the layer that makes sure they’re correct.
Judgment is the bottom arm of the K. It’s the layer most vendors skip — and it’s the layer that matters most when your team takes over after handoff. Tools amplify; judgment decides.
The five layers below are what we hand over alongside the code. The team that takes over at week 8 has the visibility, the framing, the ownership, and the purpose to extend the system. Without these, the engagement is a temporary win — a system that needs us to be maintained.
Learn from experts, compress experience.
Senior engineers compress six months of mistakes into six weeks of guidance. The studio pairs your team with engineers who have shipped the system you're trying to build — and the team that takes over after handoff is sharper, not dependent.
Outsourcing the engagement to a black-box vendor. The vendor ships and leaves. Your team can't extend, can't debug, can't make the next call. The dependency is the deliverable.
Pair-engineering by default. Architecture decisions co-authored with your team. Weekly walkthroughs of why a choice was made. By week 8, your engineers can defend every line.
See the big picture, make better calls.
You can't optimize a system you can't see. Architecture diagrams are first-class artifacts on every engagement — kept up to date in version control, referenced in every code review, the substrate for every architecture review board meeting.
The architecture lives in the head of the original engineer. They leave. The team can't see how the pieces fit. Every change is a guess. Refactors are afraid-driven instead of evidence-driven.
Architecture as code: diagrams committed to the repo, generated from real config where possible. Service catalogs. Data-flow maps. ADRs (architecture decision records) for every non-trivial choice. The on-call runbook references the actual diagram.
Question assumptions, stress-test ideas.
“Why this model? Why this latency target? Why this metric?” — asked at every architecture review. The team that builds the most resilient systems is the one that's most uncomfortable with handed-down assumptions.
Adopting a metric because the previous team used it. Optimizing for latency at the wrong percentile. Picking a model because it's the default. The system becomes a museum of inherited decisions nobody can defend.
Each architecture review starts with a re-derivation: why this approach, why now, what would change if assumptions shift. The team that can articulate why something works is the team that can fix it when it doesn't.
Own decisions, own outcomes.
No “the framework made me do it.” Every architectural decision has a name attached, a reason recorded, and a path to revisit it. When something breaks, the question isn't whose fault — it's whose decision, and what would we change next time.
Decisions made by committee with no clear owner. When the system fails, nobody can say who chose the model, the framework, the schema. The blame is diffuse — the fix is nowhere.
ADRs name the decision-maker. Reviews have a designated questioner. Post-incident reviews are about decisions, not people. The pattern compounds: a team that owns its decisions makes better ones.
Stay aligned with impact & purpose.
The hardest constraint: what are we actually doing this for? Restated weekly so the team doesn't ship features that nobody needs, doesn't ship AI that has no real use-case, doesn't ship optimization on the wrong dimension.
Shipping AI features because AI is hot. Adding capabilities the user didn't ask for. Optimizing model accuracy when the actual constraint is latency. The team is busy; the user is unchanged.
Engagement charter signed week 1: who is the user, what moment in their workflow, what outcome are we changing. Re-read out loud at each weekly check-in. Anything that doesn't serve the charter doesn't ship in the engagement.
The K-Framework is not a list you tick once. It’s a loop we run every cycle. Every release passes through the loop. The compounding is the thing — small improvements per cycle, multiplied across the engagement, is how a senior team ships faster than a vendor with three times the headcount.
We ship a thing.
Working code in front of real users. Not a prototype that lives in Figma — a feature, behind a flag, with metrics wired in before launch.
We watch what happens.
Eval suite runs. Cost-per-call landed where? Latency drifted? Did the user actually use the feature, or just discover it and leave? Numbers, not vibes.
We argue about why.
Whole team in the room. What did the numbers tell us? What did they NOT tell us? What was the surprise — good or bad? Reflection beats velocity at the system level.
We change the thing.
Smaller prompt, cheaper model, better eval set, more aggressive caching, less aggressive automation — whichever the reflection pointed at. Then back to Build.
A linear process produces one version of a thing, ships it, calls it done. A loop produces a thing, watches what happens, changes the thing, watches again. The team gets sharper. The system gets smarter. The cost-per-customer gets lower.
This is the same compounding that distinguishes a quarterly software release from a continuously-deployed system. We just extend the discipline up the stack — to the prompt, the eval, the model choice, the architecture. Every layer of the K participates in the loop.
Every engagement moves through these six stages, gated by the relevant K-Framework layers. We don’t skip stages. Don’t-skip is the whole engineering practice — the shortcut is what creates AI slop.
Name the user, the moment, and the outcome you want to change. If we can't name those three, we don't have a problem yet — we have an idea looking for a use case.
What's already been tried? What's the constraint? Where does the cost-per-decision live in the existing workflow? Most projects fail because they skip framing and start coding.
Reference architecture signed off. Eval suite drafted (before the feature exists). Prototype that runs end-to-end on a tiny dataset — to validate the system shape, not the model quality.
Iterative build, weekly demos, eval gating every PR. The cost-per-intent dashboard goes live in week one. Architecture decisions are recorded as ADRs as they're made — not after.
Blue/green or canary deploy. Feature flags for partial rollout. Synthetic monitoring on the user journey. Cost alerts. Eval-pass-rate dashboard in the client's stakeholder Slack.
Quarterly retro on the production system. What did the eval suite miss? What new edge cases emerged? What's the next layer of the K we lean on next quarter? The system improves while the original engagement is closed.
Most “transformative AI” pitches ignore the multipliers and quote a baseline. We name them up front, score the engagement against them, and adjust the plan accordingly. A project that looks like an 8-week sprint can be a 16-week one if three of these are high — and we’d rather know that in week 1 than week 7.
How weird is the business logic? Are there edge cases nobody on the team can explain without consulting a SME?
Co-author a domain glossary with the SME in week 1. Hard-name the edge cases that will trip the model. Adjust the eval suite to cover them before the feature exists.
How messy is the input? How much cleaning, schema migration, or human curation does the data need before the model can do anything useful?
Data contract first. Lineage tracked end-to-end. Labelling protocol designed with the SME. The eval set is the source of truth for what 'correct' means — and the team that builds it owns the contract.
How many surfaces does the system touch? How many integrations need to stay coordinated for the feature to work end-to-end?
Architecture review board cadence. Service catalog up to date. Integration tests on every surface. Feature flags for partial rollout. Synthetic monitoring covers the user journey, not just the endpoints.
How many humans need to coordinate? How many stakeholders have veto authority? How many time zones is the team in?
Engagement charter signed by the actual decision-maker, not the proxy. RAID log shared with the client weekly. Async-first communication, with one synchronous demo per week. Architecture review board for the cross-team escalation channel.
Master the K. Multiply impact.