Mid-conversation system messages
Update Claude's instructions mid-task via system entries inside the messages array. The prompt cache stays warm. Useful for permissions, token budgets, and environment context that change while an agent runs.
Anthropic's most capable Opus, with measurably better honesty, sharper agentic judgement, and a cheaper fast mode. Same standard price as 4.7. For the teams we build for, this is the default to move to.
Standard pricing stays at $5 input and $25 output per million tokens, identical to Opus 4.7. No budget review required to move.
Anthropic's evaluations show 4.8 is roughly four times less likely than 4.7 to let flaws in code it writes pass unremarked. For agentic work this is the headline win.
Claude Code can now plan a task, run hundreds of parallel subagents in one session, and verify outputs before reporting back. The published example is codebase-scale migrations across hundreds of thousands of lines.
The Messages API now accepts system entries inside the messages array. Permissions, token budgets, or environment context can change as an agent runs, without forcing a cache miss.
Fast mode delivers full Opus 4.8 intelligence at roughly 2.5× the output speed, priced at 2× standard. Anthropic notes fast mode now costs less than it did for prior Opus models.
From Anthropic's reported numbers. Opus 4.8 leads its class on coding, reasoning, computer use, knowledge work, and financial agentic tasks; trails GPT-5.5 only on agentic terminal coding.
| Capability | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
Agentic coding SWE-Bench Pro | 69.2% +4.9 pts vs 4.7 | 64.3% | 58.6% | 54.2% |
Agentic terminal coding Terminal-Bench 2.1 Terminus-2 public harness | 74.6% +8.5 pts vs 4.7 | 66.1% | 78.2% | 70.3% |
Multidisciplinary reasoning Humanity's Last Exam no tools / with tools | 49.8% / 57.9% +2.9 pts vs 4.7 | 46.9% / 54.7% | 41.4% / 52.2% | 44.4% / 51.4% |
Agentic computer use OSWorld-Verified | 83.4% +0.6 pts vs 4.7 | 82.8% | 78.7% | 76.2% |
Knowledge work GDPval-AA | 1890 +137 vs 4.7 | 1753 | 1769 | 1314 |
Agentic financial analysis Finance Agent v2 | 53.9% +2.4 pts vs 4.7 | 51.5% | 51.8% | 43.0% |
Numbers as reported by Anthropic on 28 May 2026. We re-run our own evals on customer tasks before recommending a switch.
What changes for the engineering team. Two comparisons that matter: this Opus against the one it replaces (4.7), and against where Sonnet sits in the family (faster and cheaper, near-frontier on routine work).
| Dimension | vs Opus 4.7 | vs latest Sonnet |
|---|---|---|
Production agents | Around 4× less likely to claim work it cannot back up. The honesty improvement matters most on long-running agentic runs. | Opus 4.8 is the right default for hard reasoning and autonomy. Route easy, high-volume steps to Sonnet. |
Coding workflows | +4.9 pts on SWE-Bench Pro and +8.5 pts on Terminal-Bench 2.1 vs Opus 4.7. Dynamic workflows in Claude Code now run hundreds of parallel subagents in one session. | Sonnet handles most local edits and review well at a fraction of the cost. Reach for Opus on architectural changes, migrations, and multi-file refactors. |
Cost and latency | Identical standard pricing to Opus 4.7 ($5 / $25 per million). Fast mode is now cheaper than it was for prior Opus. | Opus costs more per token than Sonnet across the board. The economics work when an Opus call replaces multiple Sonnet retries. |
Migration risk | Behind a vendor-neutral abstraction, the switch is a config change plus an eval pass. Most prompts work identically. | Different positioning, not a replacement. We run both behind the same abstraction and route by task difficulty. |
Inside a Kensink build, model selection is a routing decision the agent makes at runtime, not a vendor commitment frozen at design time.
Update Claude's instructions mid-task via system entries inside the messages array. The prompt cache stays warm. Useful for permissions, token budgets, and environment context that change while an agent runs.
Claude plans a task, fans out to hundreds of parallel subagents in one session, then verifies outputs before reporting back. Available on Enterprise, Team, and Max plans. Research preview.
Run managed agents while keeping sensitive files, packages, and services in your own infrastructure or with a managed sandbox provider. Public beta.
Connect agents to MCP servers inside your private network without exposing them to the public internet. Research preview.
A control alongside the model selector lets users dial how much effort Claude puts into a response. Higher effort means more thinking and better answers. Lower effort means faster responses and lighter rate-limit consumption.
Anthropic reports that 4.8 is around four times less likely than 4.7 to allow flaws in code it has written to pass unremarked. Early testers report it is more likely to flag uncertainty and less likely to make unsupported claims.
Anthropic's Alignment team rates misaligned behaviours (such as deception or cooperation with misuse) as substantially lower than 4.7, and similar to Claude Mythos Preview, their best-aligned model.
The pre-release alignment assessment notes new highs on measures like supporting user autonomy and acting in the user's best interest. The full assessment is in the Opus 4.8 System Card.
Behind our vendor-neutral abstraction, moving from 4.7 to 4.8 is a config change. Standard price is the same, so there is no commercial reason to wait. We re-run the customer eval suite on the switch, document the diff, and ship.
The honesty improvement and stronger judgement on agentic tasks are exactly the failure modes that show up in production agents over hours and days. Translation, deep research, slide-building, codebase migrations: this is where 4.8 earns its keep.
Fast mode is 2× standard pricing for 2.5× the output speed, and cheaper than fast mode on prior Opus. For latency-sensitive paths inside a build (chat UIs, interactive agents) it is now worth modelling against the standard tier on cost.
Higher effort is the default and is the best quality on most tasks. Lower effort saves rate-limit budget for the parts of a workflow that do not need it. We route by task difficulty inside the agent, so cost matches difficulty.
Every model we integrate runs through the same operating system. Three pillars, sixteen layers, one Compound Growth Loop. The methodology that keeps AI work from rotting after the first ship.
Read the K-FrameworkDirect API integration with the model. No LangChain, no orchestration vendor, no agent framework built on quicksand. Typed contracts, the same way we wire up Postgres.
An eval suite built from your real tasks gates every prompt and model change. Quality is measured before it ships, not vibed in a demo.
Governance, audit, and oversight wired in from day one. Who called what, with which prompt version, at what cost. Your auditors get answers, not screenshots.
A model in production without observability is roulette. We instrument every integration so engineering and finance can see the same numbers, and so a regression at 3am surfaces before a customer opens a ticket.
Tokens in, tokens out, dollars spent. Sliced by feature, tenant, and route. Budgets enforced where it matters.
Real distributions, not averages. We know which routes are slow, and why.
The same eval suite that gates a release runs continuously in production. A regression on real traffic surfaces fast.
PII scrubbed at the proxy, shipped to your SIEM. Retention controls match your compliance window.
Dashboards your team owns, not ours. At handoff you get the queries, the alerts, and the runbook. We are not in the path to read your metrics.