Surprise! Claude Opus 4.8 Makes a Late-Night Debut with Major Advances in Decision-Making; Agent Capabilities Confirmed

05/29 2026

In May, all AI models are stepping up their Agent functionalities.

After much anticipation, Claude Opus 4.8 has finally been launched.

In the early hours of May 29th, Beijing time, Anthropic officially unveiled Claude Opus 4.8. At first glance, Opus 4.8 may seem like a minor upgrade from Opus 4.7, and this is reflected in the official performance metrics. For example, on Terminal-Bench 2.1, GPT-5.5's score of 78.2% still outperforms Opus 4.8's 74.6%.

(Image Source: Anthropic)

However, Anthropic's real game-changer is not just Claude Opus 4.8, but the new Agent capabilities introduced alongside this flagship model. These include effort control in Claude.ai and dynamic workflows in Claude Code.

In fact, Anthropic is no longer solely focused on making Claude smarter; instead, it aims to enhance Claude's ability to accomplish tasks effectively.

Let's delve into the specific performance of Anthropic's latest flagship model, Claude Opus 4.8.

According to official performance metrics, Opus 4.8 outperforms Opus 4.7, GPT-5.5, and Gemini 3.1 Pro in several areas, including Agentic Coding, Agentic Computer Use, Knowledge Work, and Finance Agent. On SWE-Bench Pro, Opus 4.8 achieves a score of 69.2%, higher than Opus 4.7's 64.3%. On OSWorld-Verified, it reaches 83.4%. On GDPval-AA, it scores 1890, and on Finance Agent v2, it achieves 53.9%.

(Image Source: Anthropic)

In simple terms, the core enhancements of Opus 4.8 lie in coding, terminal usage, computer operation, knowledge work, and financial analysis. Put bluntly, Opus 4.8 is less an upgrade to 'question-answering' than an upgrade to 'agentic execution'.

Over the past year, the main criticism of coding agents has not been that they cannot write code, but that they are overconfident. You ask one to perform a task and it reports the task as done, even though the tests have not actually passed. It may overlook defects in the code it generated and still assure you that 'everything is fine.' In a question-answering setting, this is the familiar problem of an AI stating something false with complete confidence.

While such issues might only result in a subpar experience for chat products, they can lead to production accidents for agents.

The reason is that the essence of an agent is not to answer but to act. The most worrying trait in a model that acts is not a lack of capability but a lack of awareness of its own limits. That is why the improvements in Opus 4.8 matter: it is more willing to acknowledge uncertainty and to pause when evidence is insufficient, waiting for you to supply the missing information before proceeding. Anthropic also says that, compared with its predecessor, Opus 4.8 is significantly less likely to let code defects go unnoticed.

Early feedback from partners including Cursor, Devin, and Databricks, as well as teams building legal AI, financial analysis tools, and browser agents, points in the same direction:

Tool calls are cleaner, task progression is steadier, and long-term context retention is better, making the model more suitable for unattended or semi-unattended complex tasks.

Additionally, the official ClaudeDevs account provided a detailed explanation of dynamic workflows: Claude Code can now write orchestration scripts on the fly and then launch a large number of coordinated subagents in parallel to handle complex tasks. Anthropic also stated explicitly that such workflows suit tasks that are difficult for a single agent loop to complete, such as service-wide bug hunting, large-scale migrations, and design stress testing.
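To make the fan-out pattern described above more concrete, here is a minimal Python sketch of an orchestration script that launches several subagents in parallel with a concurrency cap. It is not Claude Code's actual mechanism or API: `run_subagent`, `orchestrate`, `demo`, and the task strings are all hypothetical names invented for illustration, and the subagent is a stub that does no real model call.

```python
import asyncio


async def run_subagent(task: str) -> dict:
    """Hypothetical stand-in for one subagent (assumption: a real one
    would call a model and tools here instead of returning immediately)."""
    await asyncio.sleep(0)  # yield to the event loop, simulating I/O-bound work
    return {"task": task, "status": "done"}


async def orchestrate(tasks: list[str], max_parallel: int = 4) -> list[dict]:
    """Fan tasks out to subagents, running at most max_parallel at a time."""
    limit = asyncio.Semaphore(max_parallel)

    async def bounded(task: str) -> dict:
        async with limit:
            return await run_subagent(task)

    # gather returns results in the same order as the input tasks
    return await asyncio.gather(*(bounded(t) for t in tasks))


def demo() -> list[dict]:
    # Illustrative work split: one bug-hunting task per (made-up) service module
    modules = ["auth", "billing", "search"]
    return asyncio.run(orchestrate([f"hunt bugs in {m}" for m in modules]))
```

The concurrency cap is the one design choice worth noting: because every subagent consumes tokens, an orchestrator would plausibly want a limit on how many run at once.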

(Image Source: Anthropic)

Bun author Jarred Sumner stated that dynamic workflows are currently one of the cutting-edge methods for reliably using agents to complete medium to large-scale projects. He mentioned that during Bun's rewrite in Rust, dynamic workflows and adversarial code review played crucial roles.

It's clear that Opus 4.8 is not just a standalone, highly capable model; more importantly, it serves as the core execution model within the Claude Code agent system.

Meanwhile, the new capabilities released by Anthropic alongside Opus 4.8 are also intriguing. For instance, Claude.ai now features effort control, allowing users to dictate how much 'effort' Claude should expend on a task, with options for low effort (faster, more cost-effective) and high effort (deeper, more suitable for complex tasks). Opus 4.8 defaults to high effort, but if you want to save on tokens, it's best to manually switch to low effort.

Throughout May, the AI landscape has witnessed almost every major player showcasing their strengths.

OpenAI continues to enhance Codex, demonstrating the construction of self-improving tax agents using Codex. Google unveiled a complete set of AI agent development toolchains at I/O. GitHub, Cursor, and OpenAI are all vying for positions in enterprise-level AI programming agents. Replit Agent is beginning to integrate with automated QA. Luma Agents are used for large-scale generation of authentic UGC ads. Alibaba Cloud is also promoting DataWorks AI data agents and 'all-weather AI labor'.

Chinese domestic models are also iterating at a high frequency. For example, Qwen3.7-Max emphasizes programming capabilities, Zhipu GLM-5.1 High-Speed Edition focuses on API speed, and MiniCPM5-1B and BitCPM-CANN continue to push toward on-device, low-bit, and low-cost deployment. Companies like SenseTime and Tencent Hunyuan are also updating rapidly.

Meanwhile, a price war is quietly brewing.

DeepSeek has once again lowered its prices, and Xiaomi's MiMo large model has entered the market at an extremely low price. On the surface, this appears to be competition over API pricing, but in reality, it's all for the sake of agents, as agents consume an enormous number of tokens.

For plain chatting, a session might consume only a few hundred to a few thousand tokens. Agents are different: they need to read context, break down tasks, write plans, call tools, execute code, check results, fix errors, and sometimes launch multiple subagents to work in parallel. Claude Code's dynamic workflows are a prime example; Anthropic itself warns that the feature is powerful but expensive and burns through tokens quickly.
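A back-of-the-envelope calculation shows why the gap is so large. The Python sketch below uses a deliberately simplified model in which every agent step re-reads a fixed-size context and emits a fixed amount of output. The function names and all of the numbers are illustrative assumptions, not measurements from any real product.

```python
def chat_tokens(turns: int, tokens_per_turn: int) -> int:
    """Rough token count for a chat session (simplified: flat cost per turn)."""
    return turns * tokens_per_turn


def agent_tokens(steps: int, context_tokens: int,
                 output_tokens_per_step: int, subagents: int = 1) -> int:
    """Rough token count for an agent run.

    Simplifying assumption: each step re-reads the same amount of context
    and produces the same amount of output; real contexts grow over time.
    """
    per_agent = steps * (context_tokens + output_tokens_per_step)
    return subagents * per_agent


# Illustrative numbers only
chat = chat_tokens(turns=4, tokens_per_turn=500)             # 2,000
single = agent_tokens(steps=20, context_tokens=8_000,
                      output_tokens_per_step=500)            # 170,000
parallel = agent_tokens(steps=20, context_tokens=8_000,
                        output_tokens_per_step=500,
                        subagents=10)                        # 1,700,000
```

Under these assumed numbers, a single agent run costs 85 times as much as the chat session, and a ten-subagent workflow costs 850 times as much, which is why per-token pricing matters far more for agents than for chat.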

Therefore, the token price war is not just about making chat cheaper but about making high-consumption workloads like agents viable. Even Anthropic has had to cut the price of its fast mode to one-third of its predecessor's price to cope with such high consumption.

(Image Source: Anthropic)

It may look as if everyone is simply updating their models step by step, but one crucial point is easy to overlook: the contest is no longer about chatting but about who can integrate best into real workflows.

In the past, the main battleground for large model competition was conversation: who could answer more naturally, who had stronger reasoning, who had longer context, and who had better multimodality. Now, the main battleground is shifting to agents.

The core of agent competition is not single-shot answers but continuous execution. It requires models to break down tasks, call tools, manage context, handle permissions, control costs, review outputs, and stay on track in complex environments for extended periods.

This is why Anthropic's announcement of Opus 4.8 did not emphasize conversational capabilities but focused on agentic coding, computer use, knowledge work, and financial analysis. Anthropic is well aware that the most valuable model calls in the future may not occur in chat windows but in IDEs, terminals, browsers, data platforms, enterprise backends, and various automated processes.

(Image Source: Anthropic)

From this perspective, dynamic workflows might be even more important than Opus 4.8 itself, because they move Claude Code from being 'an AI programmer' to being 'an AI engineering team'. In the past, when you asked a model to perform a task, it was essentially a single model looping within a single context. Now, it can break down tasks, allocate subagents in parallel, have different agents verify each other, and then summarize the results.
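The step where different agents verify each other can be sketched in a few lines of Python. This is only one plausible scheme, assumed here for illustration: reviewers are assigned by rotating the agent list by one so that no agent reviews its own output, and the final summary keeps only approved outputs. The names `assign_reviewers` and `summarize` are hypothetical and do not come from Claude Code.

```python
def assign_reviewers(agents: list[str]) -> dict[str, str]:
    """Map each author to a different reviewer by rotating the list by one.

    Assumes agent names are unique.
    """
    if len(agents) < 2:
        raise ValueError("cross-review needs at least two agents")
    return {a: agents[(i + 1) % len(agents)] for i, a in enumerate(agents)}


def summarize(outputs: dict[str, str], approvals: dict[str, bool]) -> list[str]:
    """Keep only the outputs whose assigned reviewer approved them."""
    return [outputs[a] for a in outputs if approvals.get(a, False)]


# Illustrative usage with made-up agent names
pairs = assign_reviewers(["agent_a", "agent_b", "agent_c"])
kept = summarize(
    {"agent_a": "patch for auth", "agent_b": "patch for billing"},
    {"agent_a": True, "agent_b": False},
)
```

Here `pairs` maps each author to the next agent in the list, and `kept` contains only the auth patch, since the billing patch was rejected in review.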

Overall, the major model showdown in May is not just about 'models getting stronger' but about 'models being allowed to do more'.

Although Opus 4.8 is positioned as Claude's flagship model, this is not a 'sensational' model release.

It is more like a roadmap that Anthropic is presenting to the market. On this roadmap, models must become not only smarter but also more stable; tasks must be advanced continuously rather than finished in a single round of conversation; and AI must not only provide answers but also explain its process, review results, control costs, and formalize workflows. These are points that every future large model will need to address.

Thus, we can see that Opus 4.8 is responsible for pushing Claude's decision-making and long-term execution capabilities forward. Effort control allows users to actively adjust between quality, speed, and cost. Dynamic workflows propel Claude Code from a single coding agent to an engineering collaboration system that can break down tasks, schedule subagents, execute in parallel, and review results.

What is Claude becoming? The answer is clear: Claude is evolving from a chat model into an engineering collaboration system.

Going forward, the competition among large model companies will increasingly shift away from 'who can talk better' and focus on reliably completing complex tasks more cheaply, supporting high-frequency calls, and truly packaging models, tools, workflows, security, and cost control into a productivity system.

In this direction, Anthropic has already submitted its first answer sheet.

The name Opus comes from the Latin word for 'work' and appears in the phrase for a creator's masterpiece (magnum opus, literally 'great work'). In classical music, opus numbers catalog a composer's works, generally in order of publication. Beethoven's 'Moonlight Sonata' is Op. 27, No. 2, and his Symphony No. 5 is Op. 67. Each number stands for a finished work, the product of blood, sweat, and tears.

From the perspective of leading the AI industry into the workflow era, Claude Opus 4.8 can indeed be considered a masterpiece.

Source: Leikeji