Deeper agent pool
Ultra coordinates a larger set of expert agents than standard Fugu, trading latency for accuracy. It is tuned for problems where a wrong answer is expensive: security review, scientific reproduction, patent and prior-art search.
Sakana AI's flagship orchestration model assembles and coordinates a deeper pool of expert agents to maximize answer quality on hard, multi-step problems. One OpenAI-compatible endpoint, frontier-class results, no single-vendor lock-in. For high-stakes engineering, science, and research work, this is the orchestration option we benchmark first.
You send a request to a single endpoint. Fugu Ultra decides how to handle it: solve directly when that is enough, or assemble and coordinate a team of expert models when the task calls for it. Model selection, delegation, verification, and synthesis happen internally.
Sakana reports Fugu Ultra standing shoulder to shoulder with leading closed models (Fable 5, Mythos Preview) across rigorous engineering, science, and reasoning benchmarks. Early users point it at Kaggle competitions, paper reproduction, cybersecurity analysis, and patent research.
Because Fugu routes across a pool rather than betting on one lab, you get frontier results without being exposed to a single vendor's outages, price moves, or export-control restrictions. Sakana pitches this explicitly against the controls now on Fable and Mythos.
Pay-as-you-go is $5 input and $30 output per million tokens ($10 / $45 above 272K context), with no fee stacking when several agents work on one request. Flat subscriptions ($20 / $100 / $200 a month) cover both variants for steady usage.
Fugu builds on two ICLR 2026 results: TRINITY, an evolved coordinator that assigns Thinker, Worker, and Verifier roles, and Conductor, an RL-trained policy that discovers natural-language coordination strategies. The orchestration logic is trained, not a brittle hand-designed workflow.
Fugu and Fugu Ultra figures are Sakana's reported numbers. Competitor columns are illustrative orientation, framed the way we frame every model: a starting point, not a verdict. Fugu Ultra leads on the hardest agentic and reasoning tasks; the standard Fugu already clears most everyday work.
| Capability | Fugu Ultra | Fugu | Claude Opus 4.8 | GPT-5.5 |
|---|---|---|---|---|
Agentic coding SWE-Bench Pro | 73.7 +14.7 vs Fugu | 59.0 | 71.0 | 69.5 |
Agentic terminal coding Terminal-Bench 2.1 | 82.1 +1.9 vs Fugu | 80.2 | 80.5 | 79.0 |
Code generation LiveCodeBench | 93.2 +0.3 vs Fugu | 92.9 | 90.0 | 91.0 |
Reasoning GPQA Diamond | 95.5 | 95.5 | 93.0 | 92.0 |
Scientific coding SciCode | 58.7 -1.4 vs Fugu | 60.1 | 57.0 | 56.0 |
Bug-finding depth Reported field use early-user account | 20+ issues | Strong | ~3 issues | ~3 issues |
Fugu and Fugu Ultra scores are as reported by Sakana AI (sakana.ai/fugu). Competitor figures are illustrative and used for orientation only. The bug-finding row reflects an early-user account ("where other tools flag about three issues, Sakana Fugu surfaced more than twenty"), not a controlled benchmark. We re-run our own evals on customer tasks before recommending any model.
What changes for the team building with it. Two comparisons that matter: Fugu Ultra against the standard Fugu it sits above, and against a closed flagship (Claude Opus 4.8) on the dimensions a buyer actually weighs: capability, cost, control, and risk.
| Dimension | vs Fugu (standard) | vs Claude Opus 4.8 |
|---|---|---|
Hard, multi-step work | +14.7 on SWE-Bench Pro over standard Fugu. The deeper agent pool is the difference on long, high-stakes tasks: research reproduction, security analysis, competition-grade problems. | Reported parity with the closed leaders on engineering and science benchmarks, achieved by coordinating a pool rather than a single network. On the hardest agentic coding it edges ahead in Sakana's numbers. |
Cost and latency | Higher per-token price and more latency than standard Fugu, by design: it recruits more agents. Use Ultra where accuracy is the constraint, standard Fugu where responsiveness is. | $5 input / $30 output per million is in the same range as Opus ($5 / $25). You are paying a similar rate for an orchestrated answer with no fee stacking across agents, plus flat-rate subscription options. |
Control and vendor risk | Same single endpoint and proprietary routing as standard Fugu. The pool is hidden by design, which is the trade for the orchestration. | Opus is one vendor and one model. Fugu spreads the bet across a pool, so a single vendor's outage, price change, or export control does not strand you. The cost is that you do not choose or see which models run. |
Routing and verification | Deeper coordination: more Thinker / Worker / Verifier passes per request, so answers are checked before they return. That is where the accuracy gain comes from. | Verification is built into the orchestration rather than something you wire yourself. With a single model you own the eval and retry loop; with Fugu Ultra a learned coordinator does the first pass. |
Inside a Kensink build, Fugu Ultra is a routing option behind the same vendor-neutral abstraction as Claude and GPT. The agent picks it for hard, high-stakes steps and falls back to a cheaper model for the easy ones, decided at runtime, not frozen at design time.
Ultra coordinates a larger set of expert agents than standard Fugu, trading latency for accuracy. It is tuned for problems where a wrong answer is expensive: security review, scientific reproduction, patent and prior-art search.
An evolved coordinator assigns Thinker, Worker, and Verifier roles, and an RL-trained policy designs how the agents talk to each other. The orchestration is trained on coding, math, and reasoning rather than hand-scripted, so it adapts per task.
Ultra supports large contexts, with a pricing step above 272K tokens. That suits codebase-scale tasks and long agent transcripts. As always, retrieval and context hygiene beat stuffing the window, and we build for that.
Fugu speaks an OpenAI-compatible API, so adding it next to Claude and GPT in our provider layer is a config change plus an eval pass, not a rewrite. The multi-agent system is hidden behind one model call.
When several agents work on one request, you are billed once at the model rate, not per agent. Pricing stays predictable even when Ultra recruits a large team internally.
You pay for the answer, not for every agent. Fugu bills per token with no fee stacking when a request fans out across a team of models, plus flat monthly plans that cover both variants.
Fugu blurs the line on purpose. It is a coordinator over a pool of models, packaged so you call it like a single model. The useful question for a buyer is not taxonomy but behaviour: latency, cost, reliability, and whether the answer is right. We evaluate it as we would any model, on your real tasks, and let the results decide.
Sakana does not disclose which models Fugu selects. That hides the orchestration that makes it work, and it removes a lever you would normally control: you cannot pin a version or audit exactly which model produced a given output. For some regulated workloads that opacity is a blocker. For many it is an acceptable trade for not managing the routing yourself.
Sakana frames Fugu against single-vendor dependency and export controls, pointing at the restrictions now on Fable and Mythos. Routing across a pool genuinely spreads that risk. The honest caveat is that you are now dependent on Sakana's orchestrator instead, so we treat it as one routing option behind our own abstraction, not the whole stack.
For hard, high-stakes work where a wrong answer is costly, Ultra is the orchestration option we put on the eval suite before reaching for a single closed flagship. If it clears the quality bar on your tasks, the built-in verification is a real advantage.
The bet is that the best systems are coordinated ecosystems, not single monoliths. We think that direction is right, and Fugu is the clearest commercial expression of it so far. We still prove fit with our own evals rather than taking the framing on faith.
A hidden, proprietary pool is fine for a coding assistant and a problem for an audited, regulated pipeline. We make that call with evidence: content-policy probes, provenance questions, and a clear read of what you can and cannot see.
Fugu Ultra is one more routing option in our provider layer. The agent picks Fugu, Claude, or GPT by task difficulty and cost at runtime. No lock-in, no rewrite, and a fallback path for the steps that need a model you fully control.
Every model we integrate runs through the same operating system. Three pillars, sixteen layers, one Compound Growth Loop. The methodology that keeps AI work from rotting after the first ship.
Read the K-FrameworkDirect API integration with the model. No LangChain, no orchestration vendor, no agent framework built on quicksand. Typed contracts, the same way we wire up Postgres.
An eval suite built from your real tasks gates every prompt and model change. Quality is measured before it ships, not vibed in a demo.
Governance, audit, and oversight wired in from day one. Who called what, with which prompt version, at what cost. Your auditors get answers, not screenshots.
A model in production without observability is roulette. We instrument every integration so engineering and finance can see the same numbers, and so a regression at 3am surfaces before a customer opens a ticket.
Tokens in, tokens out, dollars spent. Sliced by feature, tenant, and route. Budgets enforced where it matters.
Real distributions, not averages. We know which routes are slow, and why.
The same eval suite that gates a release runs continuously in production. A regression on real traffic surfaces fast.
PII scrubbed at the proxy, shipped to your SIEM. Retention controls match your compliance window.
Dashboards your team owns, not ours. At handoff you get the queries, the alerts, and the runbook. We are not in the path to read your metrics.