The Japanese company Sakana AI launched Fugu on June 22, 2026, and benchmark headlines immediately framed it as a rival to Claude Mythos-Preview (among other models, of course). One number stopped developers cold: 73.7% on SWE-Bench Pro, beating every individual model in its own pool. In our hands-on testing, the honest comparison is more complicated than that, and the gap between Fugu Ultra and the actual Claude Mythos 5 tells a different story than Sakana’s marketing suggests.
Read on to learn all about Sakana Fugu and how it compares to Claude Mythos 5.
What Is Sakana Fugu?

Fugu is not a new base model. Sakana built it as a multi-agent orchestration system that routes prompts across existing frontier LLMs through a single OpenAI-compatible API endpoint. The system handles routing, delegation, verification, and synthesis internally. The underlying architecture draws from two ICLR 2026 papers: Trinity and Conductor.
The model pool currently includes GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro, and others. Notably absent from that pool, however, are Claude Fable 5 and Claude Mythos 5. Sakana states the reason plainly: both models “lack public accessibility.” That is not a capability judgment. On June 12, 2026, a US government export control directive suspended access to both Claude models. Fugu cannot use what it cannot reach.
Fugu ships in two variants with different performance profiles.
- Fugu — optimized for speed and latency, suitable for interactive developer workflows
- Fugu Ultra — optimized for maximum quality, uses the full orchestration depth for complex reasoning tasks
- Context window: 1,000,000 tokens across both variants
- Supports function calling, structured outputs, and reasoning effort levels (high, xhigh/max)
- Not available in EU/EEA regions
- Claude Mythos 5 and Fable 5 are absent from the agent pool entirely

The architecture Sakana calls “Composition of Experts” is the core differentiator. Rather than a single model producing an answer, a conductor component assigns subtasks to specialized agents, then synthesizes results. The conductor decides which model in the pool is best suited for each subtask, runs parallel execution where possible, and merges outputs into a coherent response. The developer never sees this routing. One API call goes in, one structured response comes out.
Sakana built this system with an explicit enterprise thesis. Their framing: “For an organization or a nation, relying on a single company’s APIs for critical infrastructure, finance, or governance is a material vulnerability.” That argument is pointed at procurement decisions, not individual developer workflows. It positions Fugu as a hedge against the exact kind of disruption that hit Claude Mythos 5 in June 2026.
Sakana Fugu vs Claude Mythos: Benchmark Breakdown
Sakana’s published benchmarks compare Fugu Ultra against Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. They do not compare against Claude Mythos 5. The comparison target for Mythos in Sakana’s own materials is “Mythos Preview,” an earlier and weaker model. That framing matters because Mythos 5 scores exceed Fugu Ultra on every major overlapping benchmark where data exists for both.
| Benchmark | Fugu Ultra | Claude Mythos 5 | Claude Opus 4.8 | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|---|---|---|
| SWE-Bench Pro | 73.7% | 80.3% | 69.2% | 54.2% | 58.6% |
| Terminal-Bench 2.1 | 82.1% | 88.0% | 74.6% | 70.3% | 78.2% |
| Humanity’s Last Exam | 50.0% | 59.0% | 49.8% | 44.4% | 41.4% |
| GPQA-Diamond | 95.5% | 94.1% | 92.0% | 94.3% | 93.6% |
| LiveCodeBench | 93.2% | N/A | 87.8% | 88.5% | 85.3% |
| LiveCodeBench Pro | 90.8% | N/A | 84.8% | 82.9% | 88.4% |
The gaps are meaningful. Mythos 5 leads by 6.6 points on SWE-Bench Pro, 5.9 points on Terminal-Bench 2.1, and 9 full points on Humanity’s Last Exam. These are not marginal differences. They represent consistently stronger reasoning across coding, terminal tasks, and multi-discipline knowledge. BenchLM.ai currently ranks Claude Mythos 5 first out of 124 models evaluated.
SWE-Bench Verified tells an even sharper story. Claude Mythos 5 scores 95% on that benchmark. No equivalent Fugu number exists because Sakana did not publish it. That omission is deliberate or incidental, but either way, it leaves a gap in the comparison where Mythos 5 is strongest.
One critical caveat applies to all Fugu Ultra numbers: every benchmark result is vendor-reported. Sakana has not published an evaluation harness. No independent third party has rerun these tasks. That does not mean the numbers are wrong, but it does mean they cannot yet be treated on the same footing as externally verified scores.
Where Fugu Ultra Genuinely Leads
Two results from Fugu Ultra’s benchmark sheet hold up as genuine wins. GPQA-Diamond at 95.5% and LiveCodeBench at 93.2% are both above every individual model in Fugu’s own pool. These scores suggest that the orchestration layer adds real signal, not just latency. The conductor architecture earns something on these tasks that no single model achieves alone.
Beta user feedback on code review reinforces this. One early user reported Fugu Ultra flagged over 20 issues in a code review task where competing models found roughly 3. That ratio is striking, and it fits the architecture. An orchestrator running multiple verification passes will surface more than a single model in one pass.
The early beta covered about 500 users across these use cases.
- Code review — Fugu Ultra’s multi-pass verification outperformed GPT-5.5 and Opus 4.8 by a wide margin in user reports
- Data science research — multi-model synthesis on long analytical tasks benefited from the 1M token context
- Paper reproduction — tasks requiring cross-referencing across large document sets played to Fugu’s orchestration strengths
- Cybersecurity analysis — the structured reasoning pipeline helped with multi-step exploit reasoning
- Literature and patent investigations — breadth-first search across many sources fit the agent delegation model well
These are the task shapes where a conductor routing across multiple strong models has a structural advantage over a single model working alone. They are also the shapes where latency matters less than completeness.
The Social Media Reaction: What Developers Are Actually Saying
The Hacker News community’s dominant read landed on a clear frame: “A premium model router with a very good marketing story.” That is not entirely dismissive, but it signals that developers are not treating Fugu as a new capability frontier. They are treating it as infrastructure with good defaults. The distinction matters for how you evaluate the product.
The researcher’s critique cuts deeper than the user reaction. Elie Bakouch from Prime Intellect identified the core control problem.
“To be clear, this is a closed source orchestrator on top of closed source models. if before you didn’t control the models, now you don’t even control which ones are used or how much.”
Elie Bakouch, Research Engineer, Prime Intellect
This is the real question for teams evaluating Fugu for production use. The orchestrator abstracts away model selection entirely. For developers who need reproducibility, auditability, or cost predictability per request, that abstraction is a liability. For developers who need the best available results and do not care which model produced them, it is a feature. The answer depends entirely on your deployment context.
There is also the question of what happens when Sakana rotates the underlying models. A conductor architecture by design can swap models in or out without changing the API contract. That flexibility is part of the product’s value. It is also a source of behavioral drift that developers building on top of Fugu need to account for in their testing and monitoring.
Sakana Fugu vs Claude Mythos: Pricing Comparison and Access
The pricing comparison looks favorable for Fugu on paper. Fugu runs at $5.00 input / $30.00 output per million tokens. Claude Mythos 5 was priced at $10.00 input / $50.00 output, described as less than half the price of the earlier Mythos Preview. Fugu is cheaper than Mythos 5 on both input and output.
But one of these models is available today and the other is not. Claude Mythos 5 was suspended on June 12, 2026 under a US government export control directive. It has been offline since three days after launch. Access before suspension was restricted to Glasswing partners and a limited cohort of biology researchers. Most developers never had a chance to use it.
- Fugu pricing: $5.00 input / $30.00 output per 1M tokens
- Fugu long-context surcharge: prompts over 272K tokens are billed at 2x input / 1.5x output
- Mythos 5 pricing: $10.00 input / $50.00 output per 1M tokens
- Mythos 5 availability: suspended since June 12, 2026, no resumption date announced
- Fugu availability: live as of June 22, 2026, with geographic restriction (no EU/EEA)
The long-context surcharge on Fugu deserves attention for teams working with large codebases or document sets. Prompts exceeding 272K tokens get billed at double the input rate and 1.5x output. For tasks that routinely push into that range, the effective cost per call rises sharply. Teams running Fugu at scale on large-context workloads should model this pricing carefully before committing.
For the majority of developers, the practical comparison right now is Fugu Ultra versus Claude Opus 4.8 or GPT-5.5, not Fugu Ultra versus Mythos 5. Mythos 5’s suspension makes a head-to-head production comparison hypothetical. Fugu wins the availability argument by default, not by capability.
Who Should Use Which
Given the benchmark data, the pricing, and the access situation, here is a clear decision framework.
Choose Fugu Ultra if:
- You need the best available model today and Mythos 5 access is not realistic for your team
- Your use case is code review, paper reproduction, or deep literature analysis where multi-pass orchestration adds value
- You want model-provider redundancy built into a single API call without managing multiple integrations
- Cost is a constraint and $5/$30 fits your budget better than $10/$50
- You are outside the EU/EEA and do not need auditability over which underlying model ran your prompt
Wait for Claude Mythos 5 (or use Opus 4.8) if:
- You need independently verified benchmark scores before committing to a production integration
- Your workflow requires knowing exactly which model processed a given request for compliance or reproducibility
- You are in the EU/EEA, where Fugu is not available
- You expect Mythos 5 to resume availability and your tasks involve complex multi-discipline reasoning where Mythos 5’s benchmark lead is directly relevant
- You are building on biology, biosecurity, or life-sciences research where Mythos 5’s access pathways may reopen before general availability does
Use Fugu (standard, not Ultra) if:
- Latency matters more than raw benchmark quality in your application
- You are prototyping and want access to frontier-class results without managing multiple API keys
- Your prompts stay under 272K tokens consistently, keeping you below the long-context surcharge threshold
The Bottom Line

In the Sakana Fugu vs Claude Mythos comparison, Mythos 5 leads on every benchmark where direct data exists, but it is not a model you can use right now. Fugu Ultra beats every individual model it orchestrates and delivers genuinely strong results on code review and reasoning tasks. The benchmark caveat is real: all Fugu numbers are vendor-reported and unverified by third parties. That gap will close or widen once independent evaluations run. For developers who need strong results today, Fugu Ultra is the most capable option currently available. For teams willing to wait on Mythos 5 resumption and who need verified scores before committing, patience is justified. The better model, by the numbers, is Mythos 5. The only available model, by circumstance, is Fugu.