Kensink Labs
CURRENT VERSIONAnthropicModel brief
CLAUDE OPUS · VERSION 4.8 · 28 MAY 2026

Claude Opus 4.8. Sharper agents, same price.

Anthropic's most capable Opus, with measurably better honesty, sharper agentic judgement, and a cheaper fast mode. Same standard price as 4.7. For the teams we build for, this is the default to move to.

LLM APIclaude-opus-4-8Frontier LLM providersEval pipelines
Released
28 May 2026
Model ID
claude-opus-4-8
Input
$5 / 1M tokens
Output
$25 / 1M tokens
[TL;DR FOR CEO + CTO]

Five things to know.

  • 01

    Price did not change.

    Standard pricing stays at $5 input and $25 output per million tokens, identical to Opus 4.7. No budget review required to move.

  • 02

    Around 4× less likely to claim work it cannot back up.

    Anthropic's evaluations show 4.8 is roughly four times less likely than 4.7 to let flaws in code it writes pass unremarked. For agentic work this is the headline win.

  • 03

    Dynamic workflows scale Claude Code to real migrations.

    Claude Code can now plan a task, run hundreds of parallel subagents in one session, and verify outputs before reporting back. The published example is codebase-scale migrations across hundreds of thousands of lines.

  • 04

    Mid-conversation system messages, without breaking the prompt cache.

    The Messages API now accepts system entries inside the messages array. Permissions, token budgets, or environment context can change as an agent runs, without forcing a cache miss.

  • 05

    Fast mode is 2.5× faster, and cheaper than before.

    Fast mode delivers full Opus 4.8 intelligence at roughly 2.5× the output speed, priced at 2× standard. Anthropic notes fast mode now costs less than it did for prior Opus models.

[BENCHMARKS]

How it stacks up.

From Anthropic's reported numbers. Opus 4.8 leads its class on coding, reasoning, computer use, knowledge work, and financial agentic tasks; trails GPT-5.5 only on agentic terminal coding.

CapabilityOpus 4.8Opus 4.7GPT-5.5Gemini 3.1 Pro
Agentic coding
SWE-Bench Pro
69.2%
+4.9 pts vs 4.7
64.3%
58.6%
54.2%
Agentic terminal coding
Terminal-Bench 2.1
Terminus-2 public harness
74.6%
+8.5 pts vs 4.7
66.1%
78.2%
70.3%
Multidisciplinary reasoning
Humanity's Last Exam
no tools / with tools
49.8% / 57.9%
+2.9 pts vs 4.7
46.9% / 54.7%
41.4% / 52.2%
44.4% / 51.4%
Agentic computer use
OSWorld-Verified
83.4%
+0.6 pts vs 4.7
82.8%
78.7%
76.2%
Knowledge work
GDPval-AA
1890
+137 vs 4.7
1753
1769
1314
Agentic financial analysis
Finance Agent v2
53.9%
+2.4 pts vs 4.7
51.5%
51.8%
43.0%

Numbers as reported by Anthropic on 28 May 2026. We re-run our own evals on customer tasks before recommending a switch.

[SOFTWARE DEVELOPMENT IMPACT]

What it changes for the team building with it.

What changes for the engineering team. Two comparisons that matter: this Opus against the one it replaces (4.7), and against where Sonnet sits in the family (faster and cheaper, near-frontier on routine work).

Dimensionvs Opus 4.7vs latest Sonnet
Production agents
Around 4× less likely to claim work it cannot back up. The honesty improvement matters most on long-running agentic runs.Opus 4.8 is the right default for hard reasoning and autonomy. Route easy, high-volume steps to Sonnet.
Coding workflows
+4.9 pts on SWE-Bench Pro and +8.5 pts on Terminal-Bench 2.1 vs Opus 4.7. Dynamic workflows in Claude Code now run hundreds of parallel subagents in one session.Sonnet handles most local edits and review well at a fraction of the cost. Reach for Opus on architectural changes, migrations, and multi-file refactors.
Cost and latency
Identical standard pricing to Opus 4.7 ($5 / $25 per million). Fast mode is now cheaper than it was for prior Opus.Opus costs more per token than Sonnet across the board. The economics work when an Opus call replaces multiple Sonnet retries.
Migration risk
Behind a vendor-neutral abstraction, the switch is a config change plus an eval pass. Most prompts work identically.Different positioning, not a replacement. We run both behind the same abstraction and route by task difficulty.

Inside a Kensink build, model selection is a routing decision the agent makes at runtime, not a vendor commitment frozen at design time.

[WHAT IS NEW]

The features that ship with it.

01

Mid-conversation system messages

Update Claude's instructions mid-task via system entries inside the messages array. The prompt cache stays warm. Useful for permissions, token budgets, and environment context that change while an agent runs.

02

Dynamic workflows in Claude Code

Claude plans a task, fans out to hundreds of parallel subagents in one session, then verifies outputs before reporting back. Available on Enterprise, Team, and Max plans. Research preview.

03

Self-hosted sandboxes for Claude Managed Agents

Run managed agents while keeping sensitive files, packages, and services in your own infrastructure or with a managed sandbox provider. Public beta.

04

MCP tunnels

Connect agents to MCP servers inside your private network without exposing them to the public internet. Research preview.

05

Effort control on claude.ai

A control alongside the model selector lets users dial how much effort Claude puts into a response. Higher effort means more thinking and better answers. Lower effort means faster responses and lighter rate-limit consumption.

[PRICING]

What it costs.

Standard
$5 input
$25 output
Per million tokens. Identical to Opus 4.7. No budget review required to move.
Fast mode2.5× speed
$10 input
$50 output
Full Opus 4.8 intelligence at roughly 2.5× output speed. Anthropic notes fast mode now costs less than it did on prior Opus.
[ALIGNMENT + SAFETY]

What the alignment data says.

Honesty is the headline.

Anthropic reports that 4.8 is around four times less likely than 4.7 to allow flaws in code it has written to pass unremarked. Early testers report it is more likely to flag uncertainty and less likely to make unsupported claims.

Misalignment rates are substantially lower than 4.7.

Anthropic's Alignment team rates misaligned behaviours (such as deception or cooperation with misuse) as substantially lower than 4.7, and similar to Claude Mythos Preview, their best-aligned model.

Prosocial traits at new highs.

The pre-release alignment assessment notes new highs on measures like supporting user autonomy and acting in the user's best interest. The full assessment is in the Opus 4.8 System Card.

[OUR TAKE]

What this means for the build.

01

We are switching the default to 4.8.

Behind our vendor-neutral abstraction, moving from 4.7 to 4.8 is a config change. Standard price is the same, so there is no commercial reason to wait. We re-run the customer eval suite on the switch, document the diff, and ship.

02

It matters most for long-running agents.

The honesty improvement and stronger judgement on agentic tasks are exactly the failure modes that show up in production agents over hours and days. Translation, deep research, slide-building, codebase migrations: this is where 4.8 earns its keep.

03

Fast mode is now an honest option.

Fast mode is 2× standard pricing for 2.5× the output speed, and cheaper than fast mode on prior Opus. For latency-sensitive paths inside a build (chat UIs, interactive agents) it is now worth modelling against the standard tier on cost.

04

Effort control is a budget tool, not just a UX one.

Higher effort is the default and is the best quality on most tasks. Lower effort saves rate-limit budget for the parts of a workflow that do not need it. We route by task difficulty inside the agent, so cost matches difficulty.

[METHODOLOGY · K-FRAMEWORK]

Integrated through the
K-Framework.

Every model we integrate runs through the same operating system. Three pillars, sixteen layers, one Compound Growth Loop. The methodology that keeps AI work from rotting after the first ship.

Read the K-Framework
01

Foundations

Direct API integration with the model. No LangChain, no orchestration vendor, no agent framework built on quicksand. Typed contracts, the same way we wire up Postgres.

02

Amplification

An eval suite built from your real tasks gates every prompt and model change. Quality is measured before it ships, not vibed in a demo.

03

Judgment

Governance, audit, and oversight wired in from day one. Who called what, with which prompt version, at what cost. Your auditors get answers, not screenshots.

[OBSERVABILITY]

Observability your team can read.

A model in production without observability is roulette. We instrument every integration so engineering and finance can see the same numbers, and so a regression at 3am surfaces before a customer opens a ticket.

Instrumented

Cost per call

Tokens in, tokens out, dollars spent. Sliced by feature, tenant, and route. Budgets enforced where it matters.

Instrumented

Latency p50 / p95 / p99

Real distributions, not averages. We know which routes are slow, and why.

Instrumented

Eval pass rates

The same eval suite that gates a release runs continuously in production. A regression on real traffic surfaces fast.

Instrumented

Prompt + completion logs

PII scrubbed at the proxy, shipped to your SIEM. Retention controls match your compliance window.

Dashboards your team owns, not ours. At handoff you get the queries, the alerts, and the runbook. We are not in the path to read your metrics.

[COMMON QUESTIONS]

Questions we are getting asked.

Should we move from Opus 4.7 to 4.8 right away?
Almost always yes. Pricing is unchanged at $5 input and $25 output per million tokens, and 4.8 is stronger across our usual benchmarks with notable improvements in honesty on agentic work. We re-run the eval suite, document the diff, and switch the default model in a config change.
What is the migration cost?
For projects built behind a vendor-neutral abstraction (the way we build), the migration is a config change plus an eval pass. Most prompts work identically. If your team built directly against the SDK with no abstraction, the cost is one engineering day to add the seam, then the same eval pass.
When does fast mode pay off?
Two cases. First: latency-sensitive interactive paths (chat UIs, in-app agents) where 2.5× speed at 2× price is a worthwhile trade. Second: long-running asynchronous workflows where wall-clock time has a real business cost. For batch or cost-sensitive workloads, standard tier remains the right pick.
Does the new alignment data mean we can relax our own evals?
No. Anthropic's alignment improvements are real and welcome, but model-level alignment does not replace task-specific evals on your data, your prompts, and your guardrails. We run the customer eval suite on every model change, every release.
What about Mythos and the next class above Opus?
Anthropic has signalled a higher class of model (Mythos) is coming, gated behind stronger cyber safeguards. They are using it now with a small number of organisations for cybersecurity work. We will evaluate it for our customers when it is generally available.
DIRECT INTEGRATION · NO FRAMEWORK

Want Claude Opus 4.8
in your product?

Eight weeks, fixed price, eval suite at handoff. We integrate against the model API the same way we integrate against Postgres.