Editor’s Note: Anthropic has released Claude Opus 4.8, which takes first place in five of six core benchmarks at an unchanged price. Claude Code now features dynamic workflows, and the market already expects a next-generation Mythos-level model.
More noteworthy than the performance gains is what this release signals: Anthropic is now positioning “trustworthiness” as a core selling point of its cutting-edge models.
In the code honesty test, Opus 4.8 significantly reduced its rate of missing its own errors; in Claude Code, it can orchestrate multiple sub-agents and introduce adversarial self-checks before delivering results. Together, these changes point to a practical concern: when AI moves from chat windows into real workflows, users are often less worried about the model failing to complete a task than about it providing a seemingly complete, fluent, and self-consistent answer despite making mistakes.
Therefore, Opus 4.8 signifies more than just a model upgrade—it sends a clear industry signal: competition among cutting-edge models is shifting from merely chasing benchmarks to competing on reliability, verifiability, and the ability to expose errors. For enterprises and professional users, the next threshold for AI will increasingly depend on whether a model is worthy of trust.
This is also the prerequisite for agents to become truly usable: models must not only accomplish more tasks but also earn enough trust for users to hand them more important and complex work.
The following is the original text:
Anthropic today released Claude Opus 4.8, which secured first place in five out of six benchmarks listed on its release card.
The change I consider most important: in Anthropic’s code summary honesty test, Opus 4.7 failed to flag its own errors in 19.7% of cases, while Opus 4.8 cut that to 3.7%. On the same task, that is roughly a fivefold drop in missed errors (19.7 / 3.7 ≈ 5.3). Anthropic summarized this in its announcement as a “4x” improvement. Whichever calculation you use, this is the key factor in whether you can confidently delegate real work to this model and walk away, and it matters far more than any benchmark score on the release card.
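A quick sanity check of that arithmetic, using only the two rates quoted above. The variable names are mine, and the snippet does not attempt to reproduce how Anthropic arrived at its “4x” figure:

```python
# Miss rates quoted in the article: share of cases in which the model
# failed to flag its own errors in the code summary honesty test.
miss_rate_47 = 0.197  # Opus 4.7
miss_rate_48 = 0.037  # Opus 4.8

# How many times fewer errors go unflagged in 4.8 than in 4.7.
fold_reduction = miss_rate_47 / miss_rate_48

# The same change expressed as a relative reduction in missed errors.
relative_reduction = 1 - miss_rate_48 / miss_rate_47

print(f"fold reduction: {fold_reduction:.1f}x")
print(f"relative reduction: {relative_reduction:.0%}")
```

Both numbers describe the same change; which one sounds bigger depends only on how it is framed.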

What was actually released?
First, the brief version, then the specific numbers:
Reliability has significantly improved. In addition to the code honesty data mentioned above, Opus 4.8 is the first Claude model to score a literal zero on two diligence tests: the frequency of erroneously reporting flawed results fell from 0.25 to 0.00, and the incidence of lazy investigation fell from 25% to 0%. Overconfident incorrect answers dropped by a factor of roughly 11. The tendency to favor its own work, a bias that was measurable in 4.7, has been eliminated.
Claude Code has introduced dynamic workflows, currently in research preview. Claude now writes its own orchestration scripts, schedules dozens to hundreds of sub-agents concurrently within a single session, and runs independent adversarial agents that try to refute the sub-agents’ results before they are presented to you. This is the "Agent Team" concept introduced in Opus 4.6, now turned into an automated capability.
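To make the fan-out-then-challenge pattern concrete, here is a toy sketch in plain Python. It is not Claude Code's orchestration script or API: `run_subagent`, `adversarial_review`, and `orchestrate` are hypothetical stand-ins, and the "review" is a trivial placeholder check.

```python
import asyncio


async def run_subagent(task: str) -> str:
    """Hypothetical stand-in for one sub-agent working on one task."""
    await asyncio.sleep(0)  # placeholder for real work
    return f"result for {task}"


async def adversarial_review(task: str, result: str) -> bool:
    """Hypothetical stand-in for an independent agent trying to refute a result."""
    await asyncio.sleep(0)  # placeholder for real review work
    return task in result  # toy check: accept only results that mention their task


async def orchestrate(tasks: list[str]) -> dict[str, str]:
    # Fan out: run every sub-agent concurrently.
    results = await asyncio.gather(*(run_subagent(t) for t in tasks))
    # Challenge every result before anything is reported.
    verdicts = await asyncio.gather(
        *(adversarial_review(t, r) for t, r in zip(tasks, results))
    )
    # Report only the results that survived review.
    return {t: r for t, r, ok in zip(tasks, results, verdicts) if ok}


accepted = asyncio.run(orchestrate(["migrate module A", "migrate module B"]))
print(accepted)
```

The point of the structure is that review happens as a separate step, by a separate worker, before the user sees anything.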
It leads on its own release card, but not across the board—it won five out of six categories. GPT-5.5 still leads in terminal operation tasks. Additionally, the system card contains some honest setbacks that Anthropic did not include in their presentation slides, which we’ll elaborate on below.
Prices remain unchanged: $5 per million input tokens and $25 per million output tokens, the same as in 4.7. However, the fast mode is now three times cheaper, though it still falls under the premium tier at $10 / $50.
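For budgeting, the quoted rates translate into per-job costs as follows. This is a minimal sketch: the prices are the ones stated above, while the `job_cost` helper and the example token counts are hypothetical.

```python
# Prices in USD per million tokens, as quoted in the article.
PRICES = {
    "standard": {"input": 5.00, "output": 25.00},
    "fast": {"input": 10.00, "output": 50.00},
}


def job_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Return the cost in USD of one job at the quoted per-million-token rates."""
    price = PRICES[mode]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical job: 2M input tokens and 400k output tokens.
print(job_cost(2_000_000, 400_000))           # 20.0
print(job_cost(2_000_000, 400_000, "fast"))   # 40.0
```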
Mythos is coming. Anthropic has explicitly stated that a highly capable, restricted-access Mythos-level model will arrive in the coming weeks. Opus 4.8 is the public gateway to it.
Official Release Card: The Benchmark Picture
Below is the official announcement card, presented in our brand colors.

One metric broke the sweep, and it’s a significant one. On Terminal-Bench 2.1, the benchmark that tests whether models can complete long-horizon agentic tasks in a terminal, GPT-5.5 still leads with 78.2% to Opus 4.8’s 74.6%. Anthropic chose to show this loss on its release card rather than hide it. The divide between “Agent” and “Craftsman” that we noted at GPT-5.5’s launch has not yet been fully bridged: GPT-5.5 remains the stronger pure terminal operator, while Opus 4.8 behaves more like a superior engineer on the tasks most professional users actually care about, such as real-world coding, expert reasoning, computer use, and knowledge work.
Beyond the release card
The release card only displays six benchmarks. The 244-page system card reports over 40 tests, and the most interesting results are not on the slides. The following are noteworthy:
Mathematical ability improved by 27 percentage points. At the USAMO 2026—the U.S. Mathematical Olympiad held this past March—Opus 4.8 scored 96.7%, compared to 69.3% for version 4.7. Since this competition occurred after Opus 4.8’s training cutoff, there is no risk of data contamination. This represents the largest generational leap across the entire card.
Opus 4.8 demonstrates a significant advantage in long-context scenarios. In a million-token graph reasoning test, Opus 4.8 scored 68.1, compared to 40.3 for 4.7 and 45.4 for GPT-5.5. The lead becomes even more pronounced as context length increases and tasks grow more complex.
It reaches its peak in multi-agent setups. A single Opus 4.8 agent lags behind Gemini on web research tasks, scoring 84.3 versus 85.9. However, when an orchestrator coordinates a team of sub-agents, its score rises to 88.5%, the highest result reported to date; a five-agent team can reach the single agent’s best performance in just one-fifth of the time. This is the dynamic workflow capability showing up in benchmark results.
Token efficiency has taken a qualitative leap. On the most challenging coding benchmarks, Opus 4.8 at its lowest effort setting matches Opus 4.7 at its highest effort setting. This means you can now get the previous peak performance at a lower token cost.
It has crossed a threshold no previous model has reached. On Harvey’s Legal Agent Benchmark, a task is considered successful only if every single evaluation criterion is passed. Opus 4.8 is the first model to rank first under this “all-pass” standard. It passed 89% of individual criteria, but the overall task success rate was only 9.6%, highlighting just how stringent the demands of real legal work are.
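The gap between the two numbers follows from how “all-pass” scoring works. The sketch below uses made-up data, not Harvey’s benchmark or its criteria, and the function names are mine; it only shows why a high per-criterion pass rate can coexist with a low task success rate.

```python
def criterion_pass_rate(tasks: list[list[bool]]) -> float:
    """Share of individual criteria passed, pooled over all tasks."""
    checks = [passed for task in tasks for passed in task]
    return sum(checks) / len(checks)


def all_pass_rate(tasks: list[list[bool]]) -> float:
    """Share of tasks in which every single criterion passed."""
    return sum(all(task) for task in tasks) / len(tasks)


# Made-up example: 4 tasks with 5 criteria each. Three tasks miss exactly one criterion.
tasks = [
    [True, True, True, True, True],
    [True, True, True, True, False],
    [True, True, True, False, True],
    [True, True, False, True, True],
]

print(criterion_pass_rate(tasks))  # 0.85 (17 of 20 criteria passed)
print(all_pass_rate(tasks))        # 0.25 (only 1 of 4 tasks fully passed)
```

One missed criterion is enough to fail a whole task, so the more criteria a task has, the harder the all-pass bar becomes.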
There are also honest acknowledgments of regressions. Three metrics are indeed worse than in 4.7, as Anthropic acknowledges in the system card. GPQA Diamond, the expert science benchmark, dropped from 94.2 to 93.6. The model’s ability to refuse inappropriate requests and resist prompt injection has also declined, making 4.8 more susceptible to manipulation in agent scenarios. Additionally, in a year-long simulated business test, it ended the year with only one-third as much cash as 4.7. None of these points appear on the release card, which is precisely why they deserve to be highlighted.
Where does it stand compared to open-weight models?
The release card only compares Opus 4.8 with other proprietary state-of-the-art models. If we widen the scope to include the many inexpensive open-weight models that teams are testing today, the picture closely mirrors the AI industry in 2026: Opus 4.8 leads in capability, but the performance gap between it and free, self-hostable models has narrowed to just a few percentage points, while the price difference remains enormous.

The chart above includes a complete comparison of eight models. DeepSeek's price reflects its permanent 75% discount; Qwen Max's price has not yet been disclosed.
Opus 4.8 wins outright on the coding benchmark. But Qwen3.7-Max, an open-source model you can run yourself, scores 60.6, only about 9 points behind. DeepSeek V4-Pro scores 55.4, and its output cost is roughly one-thirtieth of Opus’s. For high-stakes engineering tasks, the premium of Opus’s $25 per million output tokens is worth paying. For large volumes of routine work, that premium is increasingly hard to justify, and that’s precisely the calculation every serious team is now making.
What does this mean for you?
If you're using Opus 4.7, this is a free upgrade—same price, better data, and significantly more reliable self-assessment. Just switch over.
A more interesting question is: what tasks are you now willing to hand over to it? Every reader has an internal line separating “tasks I can let AI handle” from “tasks I must do myself because I still can’t trust delegating them.” The reliability gains in 4.8 mean you can push that line one step further. The model’s improved ability to flag its own uncertainty lowers the cost of delegations that fail silently and widens the range of tasks worth handing over. This is what the honesty data means in practice, and it matters more than any single score.
This also aligns with what we wrote last week. Anthropic’s own AI Fluency research found that when models produce outputs that appear polished and complete, people are significantly less likely to notice missing context—the answer seems finished, so we stop checking. Opus 4.8 addresses this failure mode from the model side: it’s better at highlighting where a clean, complete-looking answer might still have weaknesses. It can’t replace your judgment, but it gives your judgment something to hold onto.
If you're using Claude Code, try a substantial real-world task this week with dynamic workflows—such as a migration or a comprehensive review of numerous files—while keeping an eye on the token counter. This capability is real, and adversarial self-checking is key to producing more trustworthy outputs. But the cost is real too. This tool is designed for large tasks that individual agents struggle to handle and should not become your default daily choice.
Next: Mythos, coming in a few weeks
The most forward-looking statement in this release isn't actually about 4.8. Anthropic says that Mythos-level models will arrive in the coming weeks, positioning Opus 4.8 as a public step toward it.
You need to understand what this means. Mythos is a restricted frontier model that Anthropic has been internally benchmarking, outperforming the released Opus 4.8 on nearly all metrics: achieving 93.9% on SWE-bench Verified; in cybersecurity tests, it can generate functional exploits for most targets in current browsers, while Opus 4.8 has a success rate of less than 10%. Previously, it was only available to around 52 vetted organizations, priced at five times the standard Opus rate, and treated as infrastructure rather than a general-purpose product.
Therefore, when a more powerful Mythos-level model launches in the coming weeks, it should be understood through the framework of “two markets”: one is the commoditized layer—Opus 4.8—widely accessible, consistently priced, and increasingly challenged by free open-source models; the other is the controlled frontier layer—Mythos—expensive and access-restricted. These are not separate products, but rather different tiers along the same continuous line of capability. The reliability work in 4.8 is precisely what you must build before aiming for the true goal: enabling models to operate with less supervision. And that goal is now just weeks away, not quarters.
How did this line get here?
If you’ve lost track of the past four months, here’s how to understand it: Opus 4.6 introduced the Agent team in February, Sonnet 4.6 brought price compression, Opus 4.7 delivered a leap in reasoning in April, and Mythos remained the faintly visible ceiling above. Opus 4.8 connects these two threads: it continues the narrative arc begun in 4.6 and serves as the gateway to Mythos.
This release cadence is the key fact beneath all the surface-level changes. The flagship model has moved from 4.5 to 4.6 to 4.7 to 4.8 within months, and the model you standardize on for your team today may no longer be the one you’re actually running by autumn. That’s why it’s more valuable to invest in skills that transfer across models, such as clear delegation and rigorous validation, than in techniques specific to any single model.
Benchmark sweeps will generate screenshots for sharing. But the more consequential change is a quieter one: this is the first Claude version whose core selling point is no longer just “it’s smarter,” but “you can entrust it with more.” The whole industry has to move in this direction before agents become truly useful, and it is also the capability hardest to capture in a chart.
Where is your current boundary? Which tasks are you willing to delegate to the model, and which still require your personal involvement? What would need to happen for you to be willing to push that line further forward?
