★ Studio framework · v1 · 03 pillars · 16 layers · Open · not a SaaS

THE K-FRAMEWORK · A LAYERED MAP OF AI DEVELOPMENT

The K-Framework.

Our operating system for shipping AI products that survive production.

Three pillars, sixteen layers, one feedback loop. Built from the production incidents we’ve shipped through. Every layer is here because we’ve watched something break when it wasn’t. It’s the discipline that separates a system that survives its first month in production from AI slop that demos well and dies in week three.

The K-Framework: a layered map of AI development. A.01 Foundations · Systems Thinking. B.02 Amplification · 10× throughput. C.03 Judgment · Intellectual control.

Pillars

03 · Foundations · Amplification · Judgment

Layers

16 named layers across the K

Loop

Build → Measure → Reflect → Improve

Status

Open framework · used in every engagement

[WHY THIS FRAMEWORK EXISTS]

Most AI projects fail
in the boring layers.

They don’t fail because the models are bad. They fail because the team skips data strategy, skips the eval engine, skips automated rollback, skips intellectual ownership — and ships an LLM call wrapped in a UI.

The K-Framework names every layer that has to exist for an AI product to survive its first production incident. We built it from the production incidents we’ve already shipped through. Each layer is here because we have personally watched something break when it wasn’t.

The recipe for AI slop: ship the model, skip the eval, claim transformative results, take the retainer.
The K is the opposite — every layer is a stop you don’t get to skip.

FOUR FAILURE PATTERNS · WHAT WE’VE WATCHED HAPPEN

ANTI-PATTERN · 01

The model demo that died in week three.

Shipped on a vibes-based prompt. No eval suite. First production user input outside the demo set returned garbage. The team blamed the user.

ANTI-PATTERN · 02

The agent framework migration treadmill.

Bought into a framework. Six months later the framework rewrote its memory layer. The team is doing a migration instead of shipping. Now they're auditing four frameworks for a replacement they'll also migrate off.

ANTI-PATTERN · 03

The token-bill that ate the margin.

Token economics were 'we'll figure it out post-launch'. Post-launch, the per-customer compute cost exceeded the per-customer price. Nobody had instrumented cost-per-intent, so nobody could fix it without rewriting the feature.

ANTI-PATTERN · 04

The integration that needed a human in the loop the team didn't plan for.

Edge cases the eval suite didn't cover. The model hallucinated, the system shipped the hallucination to a customer, the customer screenshot went viral. The team added a human reviewer — at three engineers per shift, the unit economics inverted.

WHAT WE BUILD INSTEAD

Human-adopted AI.
Not AI slop.

Human-adopted means a human owns the decision, the AI accelerates the work, and the system surfaces what humans need to look at. The team that uses the system is sharper because of it — not dependent on it. Not replaced by it.

Real use-case means we’ve named the user, the moment in their workflow, and the outcome we’re changing — before we touch a model API. Anything that fails to specify those three is an experiment, not a product.

The K-Framework forces both. Every layer in Foundations names a human responsibility. Every layer in Amplification names a specific lever (not a buzzword). Every layer in Judgment is about who decides — and how the team gets sharper, not weaker.

[THE THREE PILLARS · MAP VIEW]

Three pillars,
sixteen layers.

The shape of the K is the shape of the framework. The vertical spine is what holds the system up. The two horizontal arms are the leverage and the judgment that make the system sharper over time.

PILLAR · A.01

Foundations

Systems Thinking

The bedrock. Skip any layer here and you build on sand.

6 layers

01 System Design
02 Data Strategy
03 Algorithmic Fundamentals
04 Ethics & Safety
05 Code as Liability
06 Long-Term Vision

PILLAR · B.02

Amplification

10× Throughput

The leverage layer. How a senior team ships ten times faster without skipping discipline.

5 layers

01 Model & Tooling
02 Automation Layer
03 Evaluation Engine
04 Token Economics
05 Automated Rollback

PILLAR · C.03

Judgment

Intellectual Control

What separates senior teams from junior ones. AI amplifies both correct and incorrect decisions — Judgment makes sure they're correct.

5 layers

01 Mentorship Speed-Run
02 Architectural Visibility
03 Critical Thinking
04 Intellectual Ownership
05 Values & Purpose

PILLAR A.01 · 06 LAYERS

Foundations.
Systems Thinking.

The bedrock. Skip any layer here and you build on sand.

Foundations are the vertical spine of the K. They’re the layers a project either has or doesn’t — there’s no halfway. A system without data strategy is not a system; it is an experiment in production.

Every engagement opens with a Foundations review. If the client’s existing system is missing a layer, we either add it or we’re honest that we can’t ship a production AI product on top of it yet.

FOUNDATIONS · LAYER 01

System Design.

Architect scalable, resilient systems.

Design every system to survive its own success. "Scalable" is not a buzzword; it's a concrete answer to: what fails first at 10× load, how does the system degrade gracefully, and what does the on-call engineer see at 3 a.m.

★ FAILURE MODE

Architectures designed by importing a reference diagram from a blog post. Works for the demo, falls over the first time the user behaves unexpectedly.

★ WHAT WE DO

Reference architecture is signed off before code lands. Every component has an explicit failure mode and a graceful degradation path. NFRs (availability, performance, recoverability) are written and gated in CI.
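
To make "explicit failure mode and graceful degradation path" concrete, here is a minimal sketch in Python. The names (Answer, answer_with_fallback, flaky_model, canned_reply) are hypothetical and chosen for illustration; a real component would also log the failure and emit a metric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    degraded: bool  # True when the fallback path produced the answer


def answer_with_fallback(
    question: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Answer:
    """Try the primary path; on any failure, degrade to the fallback."""
    try:
        return Answer(text=primary(question), degraded=False)
    except Exception:
        # Explicit failure mode: the primary component errored or timed out.
        return Answer(text=fallback(question), degraded=True)


def flaky_model(question: str) -> str:
    # Stand-in for a model call that is currently failing (assumption).
    raise TimeoutError("model endpoint timed out")


def canned_reply(question: str) -> str:
    # Degraded but honest answer the user still gets.
    return "We could not answer automatically; a teammate will follow up."


result = answer_with_fallback("Where is my invoice?", flaky_model, canned_reply)
print(result.degraded, result.text)
```

The degraded flag is the useful part: it lets dashboards and the on-call engineer see how often the system is running on its fallback.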

FOUNDATIONS · LAYER 02

Data Strategy.

Collect, structure, govern.

Data is the substrate of every AI feature. Schema, lineage, retention, and consent boundaries get defined before we train, fine-tune, or prompt. The fastest LLM in the world cannot recover from bad data discipline.

★ FAILURE MODE

Ingesting whatever's available, hoping the model can clean it up. RAG over a hairball. PII surfacing in completions. Lawyers calling.

★ WHAT WE DO

Data contract first. Lineage tracked from source through retrieval. PII tagged at the schema level and gated in prompt assembly. Retention policy reviewed against the firm's compliance requirements.
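
One way to read "PII tagged at the schema level and gated in prompt assembly" is sketched below. The schema, field names, and assemble_prompt function are illustrative assumptions, not a description of any specific client system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    pii: bool  # tagged once, at the schema level


# Hypothetical data contract for a customer record.
CUSTOMER_SCHEMA = [
    Field("plan", pii=False),
    Field("open_tickets", pii=False),
    Field("email", pii=True),
]


def assemble_prompt(record: dict, schema: list[Field]) -> str:
    """Build prompt context from fields the schema marks as non-PII.

    Fields missing from the schema are dropped too (default deny).
    """
    allowed = [f.name for f in schema if not f.pii]
    lines = [f"{name}: {record[name]}" for name in allowed if name in record]
    return "\n".join(lines)


record = {"plan": "pro", "open_tickets": 2, "email": "ada@example.com"}
prompt = assemble_prompt(record, CUSTOMER_SCHEMA)
print(prompt)
```

Because the gate reads the tag rather than pattern-matching values, a new PII field only has to be marked once in the contract to stay out of every prompt.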

FOUNDATIONS · LAYER 03

Algorithmic Fundamentals.

Understand core algorithms.

We don't import-and-pray. Knowing why an algorithm works — and when it doesn't — is non-negotiable. The team that built the system can debug it at the algorithmic level when the abstraction leaks.

★ FAILURE MODE

Treating the LLM as a black box. Stack Overflow-driven retrieval tuning. When the system degrades, nobody can say whether it's the retrieval, the rerank, the prompt, or the model.

★ WHAT WE DO

Engineers can defend every algorithmic choice — embedding model, retrieval strategy, reranker, decoding parameters — back to first principles. Trade-offs are documented in ADRs.

FOUNDATIONS · LAYER 04

Ethics & Safety.

Principles, fairness, privacy, alignment.

Bias audits, PII handling, and alignment evaluations are built into the eval suite from day one — not bolted on after launch when legal asks. Safety is a feature of the architecture, not an afterthought.

★ FAILURE MODE

"We'll do the safety review before launch" — said three months ago, never started. A regulator finds the issue first, and the engagement turns into a remediation project.

★ WHAT WE DO

Safety eval suite alongside the functional one. Red-team prompts in CI. PII detection on inputs + outputs. Alignment checks against the firm's policy stack. Reviewed by the client's InfoSec before each release.
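
As a rough illustration of "PII detection on inputs + outputs", the sketch below redacts email addresses on both sides of a model call. The single regex is an assumption for brevity; a production detector would cover more PII types and would likely use a dedicated library. guarded_call and the echo model are hypothetical names.

```python
import re
from typing import Callable

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("[EMAIL]", text)


def guarded_call(user_input: str, model: Callable[[str], str]) -> str:
    """Redact PII before the model sees the input and again on its output."""
    safe_input = redact(user_input)
    return redact(model(safe_input))


def echo_model(prompt: str) -> str:
    # Stand-in model that leaks a new address in its completion.
    return prompt + " cc bob@example.org"


print(guarded_call("mail ada@example.com", echo_model))
```

Checking the output as well as the input matters because the model can surface PII that never appeared in the request, for example from retrieved documents.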

FOUNDATIONS · LAYER 05

Code as Liability.

Test, validate, secure, document.

Every line of code is a liability until it's tested. We write less code, test it more, and document the surface our clients have to maintain. A smaller well-tested codebase is easier to extend than a sprawling one.

★ FAILURE MODE

AI-generated code dropped into the repo without review. Test coverage at 14%. The documentation is the source. The team that takes over after handoff can't extend without rewriting.

★ WHAT WE DO

Strict TypeScript. Tests gate merges. Architecture decision records (ADRs) accompany non-trivial changes. The codebase reads like a manual — because at handoff, it is.

FOUNDATIONS · LAYER 06

Long-Term Vision.

Anticipate impact, design for longevity.

Six-month, eighteen-month, three-year horizons. Decisions taken today are constraints for the team that owns the system in 2027. We name them — out loud, in writing — so the constraint is intentional, not accidental.

★ FAILURE MODE

A stack that locked the team into the agent framework du jour. Two years later the framework is dead, the codebase is unmaintained, and the rebuild costs more than the original engagement.

★ WHAT WE DO

Direct-to-the-model wherever possible. Boring infra (Postgres, queues, edge compute). Choices that age well. Yearly architecture review at the client's request — paid hour, no agenda, just "what would you change today?"

PILLAR B.02 · 05 LAYERS

Amplification.
10× Throughput.

The leverage layer. How a senior team ships ten times faster without skipping discipline.

Amplification is the top arm of the K. It’s where the compounding happens — when the right tools, automated workflows, gated evals, watched economics, and reversible deploys all click together, a senior team doesn’t just ship faster, it ships more confidently.

These five layers are the difference between a team that improvises every release and a team that operates like an actual studio. Each layer is a forcing function for discipline.

AMPLIFICATION · LAYER 01

Model & Tooling.

Choose & build the right tools.

We don't follow framework hype. The right tool for the job is the simplest one that survives the audit. More often than not, that means going directly against the model API — no orchestration vendor, no agent framework, no lock-in to someone else's release schedule.

★ FAILURE MODE

Adopting an agent framework because it's on Twitter. Six months later, the framework rewrites its memory layer in a breaking change. Now half the engineering capacity is a migration project.

★ WHAT WE DO

Tool selection is a decision document, not a vibe. We score against: API stability, vendor lock-in cost, what breaks if the tool dies, debuggability when it does. Most of the time the answer is the model API + Postgres + your existing infra.
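
The four scoring criteria above can be turned into a small weighted score. This is a sketch under stated assumptions: the weights and the 1-to-5 ratings below are invented for illustration, and a real decision document would record the evidence behind each rating.

```python
# Criteria come from the text; the weights are illustrative assumptions.
CRITERIA_WEIGHTS = {
    "api_stability": 0.3,
    "low_lock_in": 0.3,
    "survives_tool_death": 0.2,
    "debuggability": 0.2,
}


def score_tool(ratings: dict[str, int]) -> float:
    """Weighted sum of 1-5 ratings; refuses to score a tool with gaps."""
    missing = set(CRITERIA_WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"unrated criteria: {sorted(missing)}")
    return sum(CRITERIA_WEIGHTS[c] * ratings[c] for c in CRITERIA_WEIGHTS)


# Hypothetical ratings for two candidates.
candidates = {
    "direct model API": {
        "api_stability": 4, "low_lock_in": 4,
        "survives_tool_death": 4, "debuggability": 5,
    },
    "agent framework": {
        "api_stability": 2, "low_lock_in": 2,
        "survives_tool_death": 2, "debuggability": 2,
    },
}

ranked = sorted(candidates, key=lambda n: score_tool(candidates[n]), reverse=True)
print(ranked)
```

Raising on unrated criteria is the point of the exercise: a tool cannot win by leaving the uncomfortable question unanswered.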

AMPLIFICATION · LAYER 02

Automation Layer.

Automate pipelines, workflows, ops.

What gets done by humans twice gets automated. CI is a forcing function, not a chore — every test that runs locally also runs on every PR. Data pipelines, deploys, on-call runbooks: all automated, all version-controlled, all runnable in <5 minutes.

★ FAILURE MODE

Manual deploys. Notion-doc runbooks that drift. A staging environment that doesn't match production. The on-call engineer Google-Doc-ing through the incident.

★ WHAT WE DO

Deploys via PR-driven CI. Runbooks as code (executable, version-controlled). Staging is a parity copy of production. Synthetic monitoring runs the user journey every five minutes — we know the system is broken before the user does.

AMPLIFICATION · LAYER 03

Evaluation Engine.

Measure, benchmark, iterate.

Without an eval suite, AI projects are vibes. We build the eval before the feature. Every release passes through it. Every model swap is gated on it. The eval is the contract — between us and the client, between the team and the production system.

★ FAILURE MODE

"We'll do the eval next sprint." That sprint never arrives. Releases ship based on a single test prompt and the team's intuition. Regressions slip through. The model behaves differently in production than in dev.

★ WHAT WE DO

Labeled holdout set, 500+ examples per intent. Per-field precision/recall. Per-archetype pass rate. A red-team test set. The eval runs on every PR. Production gated on threshold. Threshold reviewed quarterly.
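
A minimal sketch of per-field precision/recall with a release gate, assuming an extraction-style task where each example is a pair of (expected, predicted) field dictionaries. The function names, the three sample pairs, and the thresholds are illustrative; a wrong value counts as both a false positive and a false negative.

```python
from collections import defaultdict


def per_field_metrics(pairs: list[tuple[dict, dict]]) -> dict[str, dict[str, float]]:
    """Compute precision and recall per field over (expected, predicted) pairs."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for expected, predicted in pairs:
        for field in expected.keys() | predicted.keys():
            exp, pred = expected.get(field), predicted.get(field)
            if pred is not None and pred == exp:
                counts[field]["tp"] += 1
            else:
                if pred is not None:  # predicted something wrong or unexpected
                    counts[field]["fp"] += 1
                if exp is not None:  # missed or mangled an expected value
                    counts[field]["fn"] += 1
    metrics = {}
    for field, c in counts.items():
        predicted_total = c["tp"] + c["fp"]
        expected_total = c["tp"] + c["fn"]
        metrics[field] = {
            "precision": c["tp"] / predicted_total if predicted_total else 0.0,
            "recall": c["tp"] / expected_total if expected_total else 0.0,
        }
    return metrics


def gate(metrics: dict[str, dict[str, float]], threshold: float) -> bool:
    """Release gate: every field must meet the threshold on both metrics."""
    return all(
        m["precision"] >= threshold and m["recall"] >= threshold
        for m in metrics.values()
    )


pairs = [
    ({"amount": "120", "due": "2024-05-01"}, {"amount": "120", "due": "2024-05-01"}),
    ({"amount": "80", "due": "2024-06-01"}, {"amount": "80"}),
    ({"amount": "15"}, {"amount": "51"}),
]
metrics = per_field_metrics(pairs)
print(metrics, gate(metrics, threshold=0.9))
```

In CI, the gate result would decide whether the PR check passes; the per-field breakdown tells the team which field regressed rather than only that something did.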

AMPLIFICATION · LAYER 04

Token Economics.

Optimize cost, context, compute.

The unit economics of an LLM-powered feature are not "infinite tokens, never look at the bill." We model cost-per-intent, alert on regression, and rotate models when a smaller model saves 60% with no quality drop. Cost discipline is a feature of the architecture.

★ FAILURE MODE

Bill arrives at end of month. Nobody knew the prompt grew to 18k tokens because of a prompt-template change. Per-customer compute cost has exceeded per-customer revenue for three weeks. The fix is rewriting the prompt assembly layer.

★ WHAT WE DO

Cost-per-call metered. Cost-per-intent budget per customer. Alerts on cost-regression PRs. Quarterly review of model choice — newer cheaper model? Distilled local fallback? RAG over prompt-stuffing?
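
The difference between cost-per-call and cost-per-intent can be sketched as follows, assuming one user intent may be served by several model calls that share a request id. The per-1k-token prices, budgets, and names (Call, cost_per_intent, over_budget) are hypothetical placeholders, not real vendor pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    request_id: str  # one user request may trigger several model calls
    intent: str
    input_tokens: int
    output_tokens: int


def call_cost(call: Call, usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    """Cost of a single metered call."""
    return (call.input_tokens / 1000) * usd_per_1k_in + (
        call.output_tokens / 1000
    ) * usd_per_1k_out


def cost_per_intent(
    calls: list[Call], usd_per_1k_in: float, usd_per_1k_out: float
) -> dict[str, float]:
    """Average cost of serving one request, grouped by intent."""
    totals: dict[str, float] = defaultdict(float)
    requests: dict[str, set] = defaultdict(set)
    for call in calls:
        totals[call.intent] += call_cost(call, usd_per_1k_in, usd_per_1k_out)
        requests[call.intent].add(call.request_id)
    return {intent: totals[intent] / len(requests[intent]) for intent in totals}


def over_budget(costs: dict[str, float], budgets: dict[str, float]) -> list[str]:
    """Intents whose cost exceeds budget; an intent with no budget always alerts."""
    return sorted(i for i, c in costs.items() if c > budgets.get(i, 0.0))


calls = [
    Call("r1", "summarize", input_tokens=18_000, output_tokens=500),
    Call("r1", "summarize", input_tokens=2_000, output_tokens=500),
    Call("r2", "classify", input_tokens=1_000, output_tokens=10),
]
costs = cost_per_intent(calls, usd_per_1k_in=0.001, usd_per_1k_out=0.002)
alerts = over_budget(costs, budgets={"summarize": 0.01, "classify": 0.01})
print(costs, alerts)
```

Run against the token counts recorded on a PR's eval run, a check like this is one way to flag a prompt-template change that quietly inflates the prompt.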

AMPLIFICATION · LAYER 05

Automated Rollback.

Safe deploys, reversible changes.

Deploys without rollback are bets. Every release is paired with a rollback path that someone has actually tested. Feature flags for the things you want to half-ship. Schema migrations that work both directions. The team can recover from a bad deploy in under 5 minutes.

★ FAILURE MODE

Bad deploy at 2 a.m. The rollback is a Confluence page nobody has run in a year. The migration is forward-only. The team is hand-editing the database in production. Customer trust takes the hit.

★ WHAT WE DO

Blue/green or canary deploys. Feature flags as first-class infra. Schema migrations validated both directions in CI. A rollback procedure that's been run on staging the same week as launch.
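
"Schema migrations validated both directions" can be checked mechanically. The sketch below uses Python's built-in sqlite3 module with an in-memory database: it snapshots the schema, applies the up migration, applies the down migration, and confirms the schema is back where it started. The table names and the UP/DOWN statements are illustrative assumptions; a real check would run against the project's own database engine and migration tool.

```python
import sqlite3

# Hypothetical migration pair.
UP = "CREATE TABLE intent_costs (intent TEXT PRIMARY KEY, usd REAL NOT NULL)"
DOWN = "DROP TABLE intent_costs"


def schema_snapshot(conn: sqlite3.Connection) -> list:
    """All schema objects SQLite knows about, in a stable order."""
    return conn.execute(
        "SELECT type, name, sql FROM sqlite_master ORDER BY type, name"
    ).fetchall()


def migration_is_reversible(conn: sqlite3.Connection, up: str, down: str) -> bool:
    """True when `up` changes the schema and `down` restores it exactly."""
    before = schema_snapshot(conn)
    conn.execute(up)
    changed = schema_snapshot(conn) != before
    conn.execute(down)
    return changed and schema_snapshot(conn) == before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, plan TEXT)")
print(migration_is_reversible(conn, UP, DOWN))
```

This only validates schema shape. Data-preserving rollbacks need an extra check that rows written after the up migration survive, or are deliberately discarded by, the down migration.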

PILLAR C.03 · 05 LAYERS

Judgment.
Intellectual Control.

What separates senior teams from junior ones. AI amplifies both correct and incorrect decisions — Judgment is the layer that makes sure they’re correct.

Judgment is the bottom arm of the K. It’s the layer most vendors skip — and it’s the layer that matters most when your team takes over after handoff. Tools amplify; judgment decides.

The five layers below are what we hand over alongside the code. The team that takes over at week 8 has the visibility, the framing, the ownership, and the purpose to extend the system. Without these, the engagement is a temporary win: a system that needs us to maintain it.

JUDGMENT · LAYER 01

Mentorship Speed-Run.

Learn from experts, compress experience.

Senior engineers compress six months of mistakes into six weeks of guidance. The studio pairs your team with engineers who have shipped the system you're trying to build — and the team that takes over after handoff is sharper, not dependent.

★ FAILURE MODE

Outsourcing the engagement to a black-box vendor. The vendor ships and leaves. Your team can't extend, can't debug, can't make the next call. The dependency is the deliverable.

★ WHAT WE DO

Pair-engineering by default. Architecture decisions co-authored with your team. Weekly walkthroughs of why a choice was made. By week 8, your engineers can defend every line.

JUDGMENT · LAYER 02

Architectural Visibility.

See the big picture, make better calls.

You can't optimize a system you can't see. Architecture diagrams are first-class artifacts on every engagement — kept up to date in version control, referenced in every code review, the substrate for every architecture review board meeting.

★ FAILURE MODE

The architecture lives in the head of the original engineer. They leave. The team can't see how the pieces fit. Every change is a guess. Refactors are fear-driven instead of evidence-driven.

★ WHAT WE DO

Architecture as code: diagrams committed to the repo, generated from real config where possible. Service catalogs. Data-flow maps. ADRs (architecture decision records) for every non-trivial choice. The on-call runbook references the actual diagram.

JUDGMENT · LAYER 03

Critical Thinking.

Question assumptions, stress-test ideas.

“Why this model? Why this latency target? Why this metric?” — asked at every architecture review. The team that builds the most resilient systems is the one that's most uncomfortable with handed-down assumptions.

★ FAILURE MODE

Adopting a metric because the previous team used it. Optimizing for latency at the wrong percentile. Picking a model because it's the default. The system becomes a museum of inherited decisions nobody can defend.

★ WHAT WE DO

Each architecture review starts with a re-derivation: why this approach, why now, what would change if assumptions shift. The team that can articulate why something works is the team that can fix it when it doesn't.

JUDGMENT · LAYER 04

Intellectual Ownership.

Own decisions, own outcomes.

No “the framework made me do it.” Every architectural decision has a name attached, a reason recorded, and a path to revisit it. When something breaks, the question isn't whose fault — it's whose decision, and what would we change next time.

★ FAILURE MODE

Decisions made by committee with no clear owner. When the system fails, nobody can say who chose the model, the framework, the schema. The blame is diffuse — the fix is nowhere.

★ WHAT WE DO

ADRs name the decision-maker. Reviews have a designated questioner. Post-incident reviews are about decisions, not people. The pattern compounds: a team that owns its decisions makes better ones.
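
A decision record with "a name attached, a reason recorded, and a path to revisit it" can be enforced in a few lines. This is a sketch only: the ADR class, its field names, and the sample values are hypothetical, and many teams keep ADRs as Markdown files with a lint step instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ADR:
    title: str
    decision: str
    reason: str
    decided_by: str   # a named person, not a committee
    revisit_on: date  # the path to revisit the decision

    def __post_init__(self) -> None:
        # Reject records that leave the owner or the reasoning blank.
        for field_name in ("decided_by", "reason"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"ADR '{self.title}' is missing {field_name}")


adr = ADR(
    title="Retrieval store",
    decision="Use Postgres for retrieval",
    reason="Already operated by the team; one less system to run",
    decided_by="A. Example",
    revisit_on=date(2026, 1, 15),
)
print(adr.decided_by, adr.revisit_on.isoformat())
```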

JUDGMENT · LAYER 05

Values & Purpose.

Stay aligned with impact & purpose.

The hardest constraint: what are we actually doing this for? Restated weekly so the team doesn't ship features that nobody needs, doesn't ship AI that has no real use-case, doesn't ship optimization on the wrong dimension.

★ FAILURE MODE

Shipping AI features because AI is hot. Adding capabilities the user didn't ask for. Optimizing model accuracy when the actual constraint is latency. The team is busy; the user is unchanged.

★ WHAT WE DO

Engagement charter signed week 1: who is the user, what moment in their workflow, what outcome are we changing. Re-read out loud at each weekly check-in. Anything that doesn't serve the charter doesn't ship in the engagement.

[THE COMPOUND GROWTH LOOP]

Not a checklist.
A loop you run.

The K-Framework is not a list you tick once. It’s a loop we run every cycle. Every release passes through the loop. The compounding is the thing: small improvements per cycle, multiplied across the engagement, are how a senior team ships faster than a vendor with three times the headcount.

01 / 04 · STEP 01

Build.

We ship a thing.

Working code in front of real users. Not a prototype that lives in Figma — a feature, behind a flag, with metrics wired in before launch.

NEXT→ Measure

02 / 04 · STEP 02

Measure.

We watch what happens.

Eval suite runs. Cost-per-call landed where? Latency drifted? Did the user actually use the feature, or just discover it and leave? Numbers, not vibes.

NEXT→ Reflect

03 / 04 · STEP 03

Reflect.

We argue about why.

Whole team in the room. What did the numbers tell us? What did they NOT tell us? What was the surprise — good or bad? Reflection beats velocity at the system level.

NEXT→ Improve

04 / 04 · STEP 04

Improve.

We change the thing.

Smaller prompt, cheaper model, better eval set, more aggressive caching, less aggressive automation — whichever the reflection pointed at. Then back to Build.

NEXT→ Build

WHY A LOOP, NOT A LIST

Linear processes ship features.
Loops ship better systems.

A linear process produces one version of a thing, ships it, calls it done. A loop produces a thing, watches what happens, changes the thing, watches again. The team gets sharper. The system gets smarter. The cost-per-customer gets lower.

This is the same compounding that distinguishes a quarterly software release from a continuously-deployed system. We just extend the discipline up the stack — to the prompt, the eval, the model choice, the architecture. Every layer of the K participates in the loop.

[AI DEVELOPMENT SCOPE · 06 STAGES]

Six stages.
No skipping.

Every engagement moves through these six stages, gated by the relevant K-Framework layers. We don’t skip stages. Don’t-skip is the whole engineering practice — the shortcut is what creates AI slop.

STAGE · 01 / 06

Identify Problem.

Name the user, the moment, and the outcome you want to change. If we can't name those three, we don't have a problem yet — we have an idea looking for a use case.

Artifact

Engagement charter (1 page).

K Layers

C.05 Values & Purpose · C.04 Intellectual Ownership

STAGE · 02 / 06

Research & Framing.

What's already been tried? What's the constraint? Where does the cost-per-decision live in the existing workflow? Most projects fail because they skip framing and start coding.

Artifact

Research memo · constraint map · cost-per-decision baseline.

K Layers

A.02 Data Strategy · C.03 Critical Thinking

STAGE · 03 / 06

Design & Prototype.

Reference architecture signed off. Eval suite drafted (before the feature exists). Prototype that runs end-to-end on a tiny dataset — to validate the system shape, not the model quality.

Artifact

Reference architecture · eval suite v0 · runnable prototype.

K Layers

A.01 System Design · B.03 Evaluation Engine

STAGE · 04 / 06

Build & Track.

Iterative build, weekly demos, eval gating every PR. The cost-per-intent dashboard goes live in week one. Architecture decisions are recorded as ADRs as they're made — not after.

Artifact

Working app · CI-gated eval · cost dashboard · ADRs.

K Layers

B.01 Model & Tooling · B.02 Automation Layer · A.05 Code as Liability

STAGE · 05 / 06

Deploy & Monitor.

Blue/green or canary deploy. Feature flags for partial rollout. Synthetic monitoring on the user journey. Cost alerts. Eval-pass-rate dashboard in the client's stakeholder Slack.

Artifact

Production rollout · observability dashboard · rollback path.

K Layers

B.05 Automated Rollback · A.04 Ethics & Safety · C.02 Architectural Visibility

STAGE · 06 / 06

Learn & Evolve.

Quarterly retro on the production system. What did the eval suite miss? What new edge cases emerged? Which layer of the K do we lean on next quarter? The system keeps improving after the original engagement has closed.

Artifact

Quarterly review · eval suite v+1 · architecture review board notes.

K Layers

Compound Loop (full).

[COMPLEXITY MULTIPLIERS · 04 DIMENSIONS]

The four dimensions
that compound difficulty.

Most “transformative AI” pitches ignore the multipliers and quote a baseline. We name them up front, score the engagement against them, and adjust the plan accordingly. A project that looks like an 8-week sprint can be a 16-week one if three of these are high — and we’d rather know that in week 1 than week 7.

×01

MULTIPLIER 01

Domain Complexity.

How weird is the business logic? Are there edge cases nobody on the team can explain without consulting a SME?

Signals it’s high

· Legal, medical, financial, or compliance-heavy domain
· More than 12 user roles with different permissions
· Workflows that depend on the day of the week or season
· A spreadsheet that nobody can decode but everyone depends on

★ WHAT WE DO

Co-author a domain glossary with the SME in week 1. Hard-name the edge cases that will trip the model. Adjust the eval suite to cover them before the feature exists.

×02

MULTIPLIER 02

Data Complexity.

How messy is the input? How much cleaning, schema migration, or human curation does the data need before the model can do anything useful?

Signals it’s high

· Data lives in 3+ systems with inconsistent schemas
· Free-text fields that contain critical structured information
· Ground truth requires human labeling, and the labelers disagree
· PII surfaces across the data plane and needs gating

★ WHAT WE DO

Data contract first. Lineage tracked end-to-end. Labelling protocol designed with the SME. The eval set is the source of truth for what 'correct' means — and the team that builds it owns the contract.

×03

MULTIPLIER 03

System Complexity.

How many surfaces does the system touch? How many integrations need to stay coordinated for the feature to work end-to-end?

Signals it’s high

· Touches 5+ systems (CRM, billing, ERP, support, analytics)
· Spans web + mobile + email + back-office in the same flow
· Has both a customer-facing surface AND a regulatory reporting surface
· Inherits an integration debt from a prior vendor

★ WHAT WE DO

Architecture review board cadence. Service catalog up to date. Integration tests on every surface. Feature flags for partial rollout. Synthetic monitoring covers the user journey, not just the endpoints.

×04

MULTIPLIER 04

Human Complexity.

How many humans need to coordinate? How many stakeholders have veto authority? How many time zones is the team in?

Signals it’s high

· More than 4 stakeholders with conflicting priorities
· Cross-team work (engineering + product + legal + ops)
· Distributed team across 3+ time zones
· A 'change committee' that meets monthly

★ WHAT WE DO

Engagement charter signed by the actual decision-maker, not the proxy. RAID log shared with the client weekly. Async-first communication, with one synchronous demo per week. Architecture review board as the cross-team escalation channel.
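
Scoring an engagement against the four multipliers can be as simple as counting how many of the listed signals apply per dimension. In this sketch, the rule that two or more signals make a dimension "high" is an assumption for illustration; the "three high dimensions" trigger follows the 8-week versus 16-week remark earlier in this section. All names are hypothetical.

```python
DIMENSIONS = ("domain", "data", "system", "human")


def high_dimensions(signals: dict[str, int], high_at: int = 2) -> list[str]:
    """Dimensions where at least `high_at` of the listed signals apply."""
    return [d for d in DIMENSIONS if signals.get(d, 0) >= high_at]


def needs_replan(signals: dict[str, int], high_at: int = 2) -> bool:
    """Flag the engagement when three or more dimensions score high."""
    return len(high_dimensions(signals, high_at)) >= 3


# Hypothetical week-1 scoring: number of signals observed per dimension.
week_one = {"domain": 3, "data": 2, "system": 1, "human": 2}
print(high_dimensions(week_one), needs_replan(week_one))
```

The output is a prompt for a planning conversation in week 1, not a duration estimate.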

THE K-FRAMEWORK · IN PRACTICE

AI development is not linear.
It’s a multi-dimensional
system.

Master the K. Multiply impact.

The framework is open. The discipline isn’t. If you want human-adopted AI on top of a system your team can extend — not AI slop on top of a system that needs us to maintain it — bring us the real problem. We’ll bring the K.
