Kensink Labs
Sakana Fugu
ORCHESTRATION · FLAGSHIPSakana AIModel brief
SAKANA FUGU · FUGU ULTRA · 22 JUN 2026

Fugu Ultra. Frontier accuracy from a coordinated pool of models.

Sakana AI's flagship orchestration model assembles and coordinates a deeper pool of expert agents to maximize answer quality on hard, multi-step problems. One OpenAI-compatible endpoint, frontier-class results, no single-vendor lock-in. For high-stakes engineering, science, and research work, this is the orchestration option we benchmark first.

Orchestrationfugu-ultra-20260615Multi-agentEval pipelines
Released
22 Jun 2026
Model ID
fugu-ultra-20260615
Input
$5 / 1M
Output
$30 / 1M
[TL;DR FOR CEO + CTO]

Five things to know.

  • 01

    A multi-agent system that behaves like one model.

    You send a request to a single endpoint. Fugu Ultra decides how to handle it: solve directly when that is enough, or assemble and coordinate a team of expert models when the task calls for it. Model selection, delegation, verification, and synthesis happen internally.

  • 02

    Frontier accuracy on hard, high-stakes work.

    Sakana reports Fugu Ultra standing shoulder to shoulder with leading closed models (Fable 5, Mythos Preview) across rigorous engineering, science, and reasoning benchmarks. Early users point it at Kaggle competitions, paper reproduction, cybersecurity analysis, and patent research.

  • 03

    Frontier capability without single-vendor dependency.

    Because Fugu routes across a pool rather than betting on one lab, you get frontier results without being exposed to a single vendor's outages, price moves, or export-control restrictions. Sakana pitches this explicitly against the controls now on Fable and Mythos.

  • 04

    Pay for the answer, not for every agent.

    Pay-as-you-go is $5 input and $30 output per million tokens ($10 / $45 above 272K context), with no fee stacking when several agents work on one request. Flat subscriptions ($20 / $100 / $200 a month) cover both variants for steady usage.

  • 05

    The coordination is learned, not hand-wired.

    Fugu builds on two ICLR 2026 results: TRINITY, an evolved coordinator that assigns Thinker, Worker, and Verifier roles, and Conductor, an RL-trained policy that discovers natural-language coordination strategies. The orchestration logic is trained, not a brittle hand-designed workflow.

[BENCHMARKS]

How it stacks up.

Fugu and Fugu Ultra figures are Sakana's reported numbers. Competitor columns are illustrative orientation, framed the way we frame every model: a starting point, not a verdict. Fugu Ultra leads on the hardest agentic and reasoning tasks; the standard Fugu already clears most everyday work.

CapabilityFugu UltraFuguClaude Opus 4.8GPT-5.5
Agentic coding
SWE-Bench Pro
73.7
+14.7 vs Fugu
59.0
71.0
69.5
Agentic terminal coding
Terminal-Bench 2.1
82.1
+1.9 vs Fugu
80.2
80.5
79.0
Code generation
LiveCodeBench
93.2
+0.3 vs Fugu
92.9
90.0
91.0
Reasoning
GPQA Diamond
95.5
95.5
93.0
92.0
Scientific coding
SciCode
58.7
-1.4 vs Fugu
60.1
57.0
56.0
Bug-finding depth
Reported field use
early-user account
20+ issues
Strong
~3 issues
~3 issues

Fugu and Fugu Ultra scores are as reported by Sakana AI (sakana.ai/fugu). Competitor figures are illustrative and used for orientation only. The bug-finding row reflects an early-user account ("where other tools flag about three issues, Sakana Fugu surfaced more than twenty"), not a controlled benchmark. We re-run our own evals on customer tasks before recommending any model.

[SOFTWARE DEVELOPMENT IMPACT]

What it changes for the team building with it.

What changes for the team building with it. Two comparisons that matter: Fugu Ultra against the standard Fugu it sits above, and against a closed flagship (Claude Opus 4.8) on the dimensions a buyer actually weighs: capability, cost, control, and risk.

Dimensionvs Fugu (standard)vs Claude Opus 4.8
Hard, multi-step work
+14.7 on SWE-Bench Pro over standard Fugu. The deeper agent pool is the difference on long, high-stakes tasks: research reproduction, security analysis, competition-grade problems.Reported parity with the closed leaders on engineering and science benchmarks, achieved by coordinating a pool rather than a single network. On the hardest agentic coding it edges ahead in Sakana's numbers.
Cost and latency
Higher per-token price and more latency than standard Fugu, by design: it recruits more agents. Use Ultra where accuracy is the constraint, standard Fugu where responsiveness is.$5 input / $30 output per million is in the same range as Opus ($5 / $25). You are paying a similar rate for an orchestrated answer with no fee stacking across agents, plus flat-rate subscription options.
Control and vendor risk
Same single endpoint and proprietary routing as standard Fugu. The pool is hidden by design, which is the trade for the orchestration.Opus is one vendor and one model. Fugu spreads the bet across a pool, so a single vendor's outage, price change, or export control does not strand you. The cost is that you do not choose or see which models run.
Routing and verification
Deeper coordination: more Thinker / Worker / Verifier passes per request, so answers are checked before they return. That is where the accuracy gain comes from.Verification is built into the orchestration rather than something you wire yourself. With a single model you own the eval and retry loop; with Fugu Ultra a learned coordinator does the first pass.

Inside a Kensink build, Fugu Ultra is a routing option behind the same vendor-neutral abstraction as Claude and GPT. The agent picks it for hard, high-stakes steps and falls back to a cheaper model for the easy ones, decided at runtime, not frozen at design time.

[WHAT IS NEW]

The features that ship with it.

01

Deeper agent pool

Ultra coordinates a larger set of expert agents than standard Fugu, trading latency for accuracy. It is tuned for problems where a wrong answer is expensive: security review, scientific reproduction, patent and prior-art search.

02

Learned coordination (TRINITY + Conductor)

An evolved coordinator assigns Thinker, Worker, and Verifier roles, and an RL-trained policy designs how the agents talk to each other. The orchestration is trained on coding, math, and reasoning rather than hand-scripted, so it adapts per task.

03

Long context

Ultra supports large contexts, with a pricing step above 272K tokens. That suits codebase-scale tasks and long agent transcripts. As always, retrieval and context hygiene beat stuffing the window, and we build for that.

04

OpenAI-compatible, single endpoint

Fugu speaks an OpenAI-compatible API, so adding it next to Claude and GPT in our provider layer is a config change plus an eval pass, not a rewrite. The multi-agent system is hidden behind one model call.

05

No fee stacking

When several agents work on one request, you are billed once at the model rate, not per agent. Pricing stays predictable even when Ultra recruits a large team internally.

[VALUE FOR COST]

What it costs.

You pay for the answer, not for every agent. Fugu bills per token with no fee stacking when a request fans out across a team of models, plus flat monthly plans that cover both variants.

Pay-as-you-go
Input
$5 / 1M
Output
$30 / 1M
Per million tokens, in the same range as Claude Opus. Above 272K context the rate steps to $10 input / $45 output, and cached input is $0.50 ($1.00 above 272K). No fee stacking when several agents run on one request.
SubscriptionBoth variants
$20 Standard
$100 Pro · 10×

$200 Max · 20×
Flat monthly plans include both Fugu and Fugu Ultra. Pro and Max raise the usage allowance (roughly 10x and 20x). For high-volume or enterprise workloads, pay-as-you-go bills per token instead.
[ORCHESTRATION + TRADE-OFFS]

The orchestration question.

Is a multi-agent system really a model?

Fugu blurs the line on purpose. It is a coordinator over a pool of models, packaged so you call it like a single model. The useful question for a buyer is not taxonomy but behaviour: latency, cost, reliability, and whether the answer is right. We evaluate it as we would any model, on your real tasks, and let the results decide.

The pool is proprietary, and that cuts both ways.

Sakana does not disclose which models Fugu selects. That hides the orchestration that makes it work, and it removes a lever you would normally control: you cannot pin a version or audit exactly which model produced a given output. For some regulated workloads that opacity is a blocker. For many it is an acceptable trade for not managing the routing yourself.

Vendor independence is the real pitch.

Sakana frames Fugu against single-vendor dependency and export controls, pointing at the restrictions now on Fable and Mythos. Routing across a pool genuinely spreads that risk. The honest caveat is that you are now dependent on Sakana's orchestrator instead, so we treat it as one routing option behind our own abstraction, not the whole stack.

[OUR TAKE]

What this means for the build.

01

We benchmark Ultra first when accuracy is the constraint.

For hard, high-stakes work where a wrong answer is costly, Ultra is the orchestration option we put on the eval suite before reaching for a single closed flagship. If it clears the quality bar on your tasks, the built-in verification is a real advantage.

02

Orchestration-as-a-product is a genuine shift.

The bet is that the best systems are coordinated ecosystems, not single monoliths. We think that direction is right, and Fugu is the clearest commercial expression of it so far. We still prove fit with our own evals rather than taking the framing on faith.

03

Weigh the opacity against the workload.

A hidden, proprietary pool is fine for a coding assistant and a problem for an audited, regulated pipeline. We make that call with evidence: content-policy probes, provenance questions, and a clear read of what you can and cannot see.

04

It runs behind the same abstraction as everything else.

Fugu Ultra is one more routing option in our provider layer. The agent picks Fugu, Claude, or GPT by task difficulty and cost at runtime. No lock-in, no rewrite, and a fallback path for the steps that need a model you fully control.

[METHODOLOGY · K-FRAMEWORK]

Integrated through the
K-Framework.

Every model we integrate runs through the same operating system. Three pillars, sixteen layers, one Compound Growth Loop. The methodology that keeps AI work from rotting after the first ship.

Read the K-Framework
01

Foundations

Direct API integration with the model. No LangChain, no orchestration vendor, no agent framework built on quicksand. Typed contracts, the same way we wire up Postgres.

02

Amplification

An eval suite built from your real tasks gates every prompt and model change. Quality is measured before it ships, not vibed in a demo.

03

Judgment

Governance, audit, and oversight wired in from day one. Who called what, with which prompt version, at what cost. Your auditors get answers, not screenshots.

[OBSERVABILITY]

Observability your team can read.

A model in production without observability is roulette. We instrument every integration so engineering and finance can see the same numbers, and so a regression at 3am surfaces before a customer opens a ticket.

Instrumented

Cost per call

Tokens in, tokens out, dollars spent. Sliced by feature, tenant, and route. Budgets enforced where it matters.

Instrumented

Latency p50 / p95 / p99

Real distributions, not averages. We know which routes are slow, and why.

Instrumented

Eval pass rates

The same eval suite that gates a release runs continuously in production. A regression on real traffic surfaces fast.

Instrumented

Prompt + completion logs

PII scrubbed at the proxy, shipped to your SIEM. Retention controls match your compliance window.

Dashboards your team owns, not ours. At handoff you get the queries, the alerts, and the runbook. We are not in the path to read your metrics.

[COMMON QUESTIONS]

Questions we are getting asked.

How is Fugu Ultra different from just calling several models myself?
Fugu does the orchestration for you: it selects which models to use, delegates subtasks, verifies intermediate results, and synthesizes a final answer, all behind one endpoint. Building that coordination yourself is real engineering. The trade is that Fugu's routing is proprietary, so you give up control and visibility over which models run.
What does it cost?
Pay-as-you-go is $5 input and $30 output per million tokens, stepping to $10 / $45 above 272K context, with cached input at $0.50. That is roughly Claude Opus territory. There is no fee stacking when multiple agents work on one request. Flat subscriptions are $20, $100, and $200 a month and include both variants.
Is it really as good as the closed frontier models?
Sakana reports Fugu Ultra standing shoulder to shoulder with Fable 5 and Mythos Preview on rigorous engineering, science, and reasoning benchmarks, and ahead on some agentic coding. Those are vendor numbers. We treat them as a starting point and re-run our own evals on your tasks before routing real traffic.
Can I see or pin which models Fugu uses?
No. The pool and the routing are proprietary, and you cannot pin a specific underlying model or version. For most product work that is acceptable. For audited or regulated pipelines where you must attest to exactly which model produced an output, it can be a blocker, and we will say so.
Where is it available?
Fugu is generally available via an OpenAI-compatible API and the Sakana console as of 22 June 2026, with the EU and EEA listed as unavailable at launch. Behind our vendor-neutral abstraction, adding it is a config change plus an eval pass, with a fallback model for regions or workloads it cannot serve.
DIRECT INTEGRATION · ONE ENDPOINT

Want Fugu Ultra
in your product?

Eval suite at handoff, full source ownership. We integrate Fugu the same way we integrate Postgres, behind a vendor-neutral abstraction with a fallback to a model you fully control. Sized to your scope.