★ Direct LLM Lab · Lab open · 2 slots Q3 · 8 services · one team · No framework lock-in

DIRECT LLM ENGINEERING · EST. 2024

Direct LLM. No frameworks on quicksand.

Senior engineers integrating against the model API the same way we integrate against Postgres. No LangChain, no LlamaIndex, no agent framework that needs a migration every six months. Eight weeks from problem to live, eval suite included, full source ownership at handoff.

Start a build →
See the K-Framework →

Cycle

8 weeks · problem to live

Stack

Direct API · no LangChain, no LlamaIndex

Output

Code + eval suite + runbook

Framework

Applied K-Framework on every build

[THREE PRINCIPLES · ONE LAB]

How direct LLM work
actually stays shipped.

These three rules are why our LLM builds survive their first month in production. Skip any one and you’ve shipped AI slop on a timer.

Direct API integration

We call the model the same way we call Postgres. No wrapper SDK, no agent framework, no graph DSL. If you can read the OpenAI cookbook, you can read our code.

OpenAI, Anthropic, Google, or local: your model choice
TypeScript types from the API spec, not a third-party abstraction
Retry, fallback, timeout logic written by us, owned by you
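
A minimal sketch of what hand-written retry, fallback, and timeout logic could look like in TypeScript. The names `ModelCall`, `withTimeout`, and `callWithFallback` are hypothetical, the default option values are placeholders, and each provider is reduced to a plain async function rather than a real SDK or HTTP call.

```typescript
// Hypothetical shape: any provider call reduced to "prompt in, text out".
type ModelCall = (prompt: string) => Promise<string>;

// Reject if the call does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Try each provider in order. Each provider gets `retries` extra attempts
// with exponential backoff before we fall back to the next one.
async function callWithFallback(
  providers: ModelCall[],
  prompt: string,
  opts = { retries: 2, timeoutMs: 10_000, backoffMs: 200 },
): Promise<string> {
  let lastError: unknown = new Error("no providers configured");
  for (const provider of providers) {
    for (let attempt = 0; attempt <= opts.retries; attempt++) {
      try {
        return await withTimeout(provider(prompt), opts.timeoutMs);
      } catch (error) {
        lastError = error;
        if (attempt < opts.retries) {
          await new Promise((resolve) => setTimeout(resolve, opts.backoffMs * 2 ** attempt));
        }
      }
    }
  }
  throw lastError;
}
```

Note that this timeout only stops waiting; it does not cancel the underlying request. A production version would likely pass an abort signal into the provider call as well.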

Evals before features

We write a golden eval set before we ship a prompt. Every regression closes the gate. Production traffic feeds the next eval cycle. That's the Compound Growth Loop.

10–100 golden examples per task, version-controlled with the prompt
Hard assertions on must-pass cases, soft scoring on quality
Drift detection in production, with alerts before users see it
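
To make the split between hard assertions and soft scoring concrete, here is a small sketch of an eval gate. Everything in it is illustrative: `GoldenExample`, `runGate`, and the 0.8 threshold are assumed names and values, and `generate` is a synchronous stand-in for the real model call.

```typescript
// Hypothetical golden-example shape; field names are illustrative.
interface GoldenExample {
  input: string;
  mustPass?: (output: string) => boolean; // hard assertion: one failure closes the gate
  score?: (output: string) => number;     // soft quality score in [0, 1]
}

interface GateResult {
  passed: boolean;
  hardFailures: string[];       // inputs whose hard assertion failed
  meanSoftScore: number | null; // null when no example defines a score
}

function runGate(
  examples: GoldenExample[],
  generate: (input: string) => string,
  minSoftScore = 0.8,
): GateResult {
  const hardFailures: string[] = [];
  const softScores: number[] = [];
  for (const example of examples) {
    const output = generate(example.input);
    if (example.mustPass && !example.mustPass(output)) hardFailures.push(example.input);
    if (example.score) softScores.push(example.score(output));
  }
  const meanSoftScore =
    softScores.length > 0 ? softScores.reduce((sum, x) => sum + x, 0) / softScores.length : null;
  // The gate passes only with zero hard failures and an acceptable mean soft score.
  const passed =
    hardFailures.length === 0 && (meanSoftScore === null || meanSoftScore >= minSoftScore);
  return { passed, hardFailures, meanSoftScore };
}
```

Keeping the examples in the same repository as the prompt means a prompt change and its eval result land in one commit.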

Boring infrastructure

Postgres, pgvector, Cloudflare Workers. Tools your team already runs, your ops team already monitors, your CTO already approved. No new vendors to onboard.

Postgres + pgvector for retrieval, no separate vector DB
Cloudflare Workers for inference proxy + cost guardrails
Sentry + OpenTelemetry for traces, on your existing observability

APPLIED FRAMEWORK

Every LLM build runs on the K-Framework.

Three pillars, sixteen layers, one feedback loop. The discipline that separates a system that survives production from a demo that dies in week three. Foundations · Amplification · Judgment, applied to every prompt, every retrieval, every eval gate.

Read the K-Framework

The K-Framework: a layered map of AI development. Three pillars (Foundations, Amplification, Judgment) across sixteen named layers.

[EIGHT SERVICES · ONE LAB]

Pick the LLM problem.
We’ll bring the build.

Each service below is a focused eight-week sprint with fixed scope and an eval suite at handoff. Bundle two or three if the problem warrants, or sequence them as a multi-phase program for regulated builds.

SERVICE · 01 / 08

Enterprise LLM

Security + governance

Production LLM that passes legal, security, and procurement on the same go-live date.

SSO + RBAC + audit trails baked in
Vendor-neutral abstraction, so you swap models without rewriting
Data-residency + PII policy enforced at the proxy

TypeScript · Postgres · OpenAI · Anthropic

See the engagement

SERVICE · 02 / 08

On-premise / private LLM

Self-hosted inference

Run your own weights in your own VPC. Latency, cost, and compliance stay under your control.

vLLM + Triton for production-grade throughput
GPU sizing + autoscaling that doesn't melt your finance team
Air-gapped deployments where required

Python · vLLM · Triton · Llama

See the engagement

SERVICE · 03 / 08

Model evaluation

Eval-first development

A golden eval suite before a single prompt ships. Regressions are stopped at the gate, not found by the user.

Golden sets + hard assertions + soft LLM-as-judge
Drift detection on production traffic
A/B prompt tests with statistical significance gates

TypeScript · Promptfoo · OpenTelemetry · PostgreSQL

See the engagement
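
One way an A/B prompt test with a significance gate could work is a two-proportion z-test on eval pass rates. The page does not say which test is used, so the test choice, the names `Arm`, `twoProportionTest`, and `promptBWins`, and the 0.05 threshold are all assumptions for illustration.

```typescript
// Standard normal CDF, using the Abramowitz-Stegun 7.1.26 approximation of erf.
function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Pass counts for one prompt variant over `trials` eval examples.
interface Arm {
  passes: number;
  trials: number;
}

// Two-sided two-proportion z-test with a pooled standard error.
function twoProportionTest(a: Arm, b: Arm): { z: number; pValue: number } {
  const pA = a.passes / a.trials;
  const pB = b.passes / b.trials;
  const pooled = (a.passes + b.passes) / (a.trials + b.trials);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / a.trials + 1 / b.trials));
  if (se === 0) return { z: 0, pValue: 1 }; // both arms all-pass or all-fail
  const z = (pB - pA) / se;
  return { z, pValue: 2 * (1 - normalCdf(Math.abs(z))) };
}

// Gate: promote prompt B only if it is better and the difference is significant.
function promptBWins(a: Arm, b: Arm, alpha = 0.05): boolean {
  const { z, pValue } = twoProportionTest(a, b);
  return z > 0 && pValue < alpha;
}
```

The normal approximation is only reasonable with enough examples per arm; for very small golden sets an exact test would be the safer choice.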

SERVICE · 04 / 08

Feedback training & fine-tuning

When RAG isn't enough

LoRA, DPO, or full fine-tune. We pick based on data volume, not vendor pitch.

RAG vs fine-tuning audit before any training spend
Feedback capture pipeline → labeled dataset → eval gate
LoRA adapters you can hot-swap per customer

Python · PyTorch · LoRA · HuggingFace

SERVICE · 05 / 08

Hybrid retrieval on Postgres + pgvector. No separate vector DB, no five-system synchronisation problem.

pgvector + BM25 hybrid for both recall and precision
Citation-first answers, so every claim links to its chunk
Chunking strategy tuned to your corpus, not someone's blog post

PostgreSQL · pgvector · TypeScript · OpenAI
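
The page does not say how the BM25 and dense result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the lab's actual method. The function name is hypothetical, and `k = 60` is the constant conventionally used with RRF.

```typescript
// Merge several ranked lists of chunk IDs (best first) into one ranking.
// Each list contributes 1 / (k + rank) per item, with rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Example: chunk IDs as they might come back from a keyword query and a
// vector-similarity query against the same table.
const bm25Hits = ["a", "b", "c"];
const denseHits = ["b", "d", "a"];
const fused = reciprocalRankFusion([bm25Hits, denseHits]);
// fused is ["b", "a", "d", "c"]: items found by both searches rise to the top.
```

RRF uses only ranks, so it avoids having to normalise BM25 scores against cosine distances, which live on different scales.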

SERVICE · 06 / 08

Function-calling agents with hard guardrails. Cross-references /ai-agents from the same lab, with a deeper engineering view.

Schema-validated tool calls, no JSON parse roulette
Per-tool rate limits + cost guardrails
Observable agent traces: every loop, every retry, every cost

TypeScript · Anthropic · Zod · OpenTelemetry

See the engagement
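
As a rough illustration of validated tool calls with a per-tool call budget, the sketch below uses hand-written checks so that it stays self-contained; in a real build a Zod schema would typically take the place of the manual validation. `ToolSpec`, `createDispatcher`, and the `get_weather` tool are hypothetical.

```typescript
type ToolResult = { ok: true; output: string } | { ok: false; error: string };

interface ToolSpec {
  maxCallsPerRun: number;
  // Validates untrusted arguments, then runs the tool.
  handle: (args: unknown) => ToolResult;
}

function createDispatcher(tools: Record<string, ToolSpec>) {
  const callCounts = new Map<string, number>();
  return function dispatch(rawCall: string): ToolResult {
    let call: unknown;
    try {
      call = JSON.parse(rawCall);
    } catch {
      return { ok: false, error: "tool call is not valid JSON" };
    }
    if (typeof call !== "object" || call === null) {
      return { ok: false, error: "tool call must be an object" };
    }
    const { name, arguments: args } = call as { name?: unknown; arguments?: unknown };
    if (typeof name !== "string") return { ok: false, error: "tool name must be a string" };
    const tool = Object.prototype.hasOwnProperty.call(tools, name) ? tools[name] : undefined;
    if (!tool) return { ok: false, error: `unknown tool: ${name}` };
    // Count the attempt before validating, so malformed calls also consume budget.
    const used = callCounts.get(name) ?? 0;
    if (used >= tool.maxCallsPerRun) return { ok: false, error: `rate limit reached for ${name}` };
    callCounts.set(name, used + 1);
    return tool.handle(args);
  };
}

// Hypothetical tool with a stubbed implementation.
const getWeather: ToolSpec = {
  maxCallsPerRun: 2,
  handle: (args) => {
    const city = (args as { city?: unknown } | null)?.city;
    if (typeof city !== "string" || city.length === 0) {
      return { ok: false, error: "city must be a non-empty string" };
    }
    return { ok: true, output: `weather for ${city}: (stub)` };
  },
};
```

Returning a typed error instead of throwing lets the agent loop feed the message back to the model as a tool result and log it in the trace.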

SERVICE · 07 / 08

Observability & cost

Telemetry from day one

Token telemetry, drift detection, cost-per-conversation dashboards. What gets measured gets shipped.

Token + cost telemetry per user, per endpoint, per prompt version
Drift alerts before the user notices
Per-tenant cost caps with graceful degradation

OpenTelemetry · Grafana · PostgreSQL · Sentry

See the engagement
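
A small sketch of how per-call cost accounting and a per-tenant cap with graceful degradation might fit together. The prices in the usage note are placeholders, not real vendor pricing, and `costUsd`, `routeForTenant`, and the 80% degrade threshold are assumptions.

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// USD per million tokens. Supply real values from your provider's price list.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Cost of a single model call.
function costUsd(usage: Usage, price: Price): number {
  return (
    (usage.inputTokens * price.inputPerMTok + usage.outputTokens * price.outputPerMTok) / 1_000_000
  );
}

type Route = "primary" | "cheaper" | "blocked";

// Graceful degradation: switch to a cheaper model near the cap and
// block only once the cap itself is reached.
function routeForTenant(spentUsd: number, capUsd: number, degradeAt = 0.8): Route {
  if (spentUsd >= capUsd) return "blocked";
  if (spentUsd >= capUsd * degradeAt) return "cheaper";
  return "primary";
}
```

For example, with placeholder prices of 2 and 10 USD per million input and output tokens, a call using 1,000,000 input and 500,000 output tokens costs 7 USD, and a tenant who has spent 85 USD of a 100 USD cap is routed to the cheaper model.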

SERVICE · 08 / 08

Structured output

Deterministic pipelines

JSON schema enforcement, validator loops, repair prompts. LLM as a structured component, not a chatbot.

Zod schemas mirror the API contract
Validator loop with bounded retries + repair prompts
Type-safe end-to-end, from model output to client

TypeScript · Zod · OpenAI · Anthropic

See the engagement
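
The validator loop with bounded retries and repair prompts could be sketched roughly as follows. The model call is injected as `generate`, the `Invoice` type and its hand-written validator are hypothetical stand-ins for a Zod schema, and the repair-prompt wording is only an example.

```typescript
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

// Ask the model, validate, and on failure re-ask with a repair prompt.
// Gives up after `maxAttempts` so the loop can never run unbounded.
async function generateStructured<T>(
  generate: (prompt: string) => Promise<string>,
  validate: (raw: string) => Validation<T>,
  prompt: string,
  maxAttempts = 3,
): Promise<T> {
  let currentPrompt = prompt;
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await generate(currentPrompt);
    const result = validate(raw);
    if (result.ok) return result.value;
    lastError = result.error;
    // Repair prompt: show the model its own reply and the validation error.
    currentPrompt =
      `${prompt}\n\nYour previous reply was rejected.\nReply: ${raw}\n` +
      `Error: ${result.error}\nReturn only corrected JSON.`;
  }
  throw new Error(`no valid output after ${maxAttempts} attempts: ${lastError}`);
}

// Hypothetical target type and validator.
interface Invoice {
  total: number;
}

function validateInvoice(raw: string): Validation<Invoice> {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "not valid JSON" };
  }
  const total = (data as { total?: unknown } | null)?.total;
  if (typeof total !== "number" || !Number.isFinite(total)) {
    return { ok: false, error: "total must be a finite number" };
  }
  return { ok: true, value: { total } };
}
```

Because the validator returns a typed value, everything downstream of `generateStructured` works with `Invoice` rather than raw strings.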

[WHAT WE HAND OVER]

Six artifacts. All yours at week eight.

Eval harness

TypeScript test runner with golden sets, hard assertions, and LLM-as-judge soft scoring. Runs locally, in CI, and in production against live traffic.

Inference proxy

Cloudflare Worker in front of every model call. Vendor abstraction, retry/fallback, cost caps, PII redaction, and OpenTelemetry traces.

Retrieval layer

Postgres + pgvector with hybrid BM25 + dense search. Chunking pipeline tuned to your corpus. Citation surface in every answer.

Cost dashboard

Per-user, per-endpoint, per-prompt-version token + cost rollups. Drift alerts, daily anomaly reports, per-tenant caps.

Prompt registry

Versioned prompts with eval results attached. Deploy a prompt the way you deploy a service: review, test, ship, rollback.

Printed runbook

Twenty pages, paper-printed at handoff. How to debug, what to monitor, when to call us. Your team owns operations from day one.

[EIGHT-WEEK PROCESS]

The process, not a pitch deck.

Same five-step cadence on every engagement. Aligned to the K-Framework loop: Build, Measure, Reflect, Improve.

01 · Week 1
Discovery
Find the real problem.
Two-day workshop. We map the use case to the K-Framework, write the golden eval set with you, and decide direct-API vs RAG vs fine-tune. Output: a one-page engagement contract.
02 · Weeks 2–3
Build (Foundations)
Stand up the spine.
Inference proxy, retrieval layer if needed, eval harness, observability. The boring infra goes first so the interesting work has a place to land.
03 · Weeks 4–5
Build (Amplification)
Iterate on the prompt + retrieval.
Daily eval runs against the golden set. Hard assertions close the gate. We tune chunking, prompt structure, and model choice. Every change is measured, not guessed.
04 · Weeks 6–7
Build (Judgment)
Guardrails + drift detection.
Cost caps, PII redaction, schema validation, rate limits per tool. Drift-detection pipeline against production traffic. Runbook drafted alongside the engineering.
05 · Week 8
Ship
Handoff, not abandonment.
Code review with your team. Printed runbook. Eval suite walkthrough. 90 days of warranty support. After that, you own everything, including the right to extend it without us.

[NUMBERS · NOT ADJECTIVES]

Lead with a number.
The rest is noise.

8 wk

From problem to production

0

LLM frameworks in our stack

99.7%

Best eval pass rate shipped (Affidavit Mapp)

+18 pt

Activation lift on AICoach onboarding

16

Named K-Framework layers

100%

Source ownership at handoff

DIRECT LLM · APPLIED K

Bring the real LLM problem.
We’ll bring the build.

Eight weeks, fixed price, eval suite at handoff. Pick one of the eight engagements or bring a problem and we’ll scope it against the K. Two Q3 slots remain.

Start a build →
Read the K-Framework
