07/22 2025
Preface:
At the beginning of the second half of 2025, OpenAI, which has always been committed to defining the field of AI, officially launched its Agent Mode solution.
This solution allows ChatGPT to invoke text browsers, visual browsers, and terminal tools in a virtual sandbox to autonomously complete multi-step complex tasks, enabling operations from information retrieval to online shopping, and marking a leap from Chat to Agent.
Author | Fang Wensan
OpenAI Unveils Its Own Agent Mode
Recently, Sam Altman and four OpenAI researchers introduced the upcoming Agent Mode through a live broadcast.
Judging from the demonstration, the user experience of this mode closely resembles that of Manus, which garnered widespread attention a few months ago.
After the user submits a request, the system automatically creates a virtual environment and begins executing the task.
During task execution, the Agent repeatedly requests user confirmation of operation steps and allows users to manually take over the process at any time.
Users can also add new instructions while the task is running, so the interaction happens in real time.
Sam Altman, CEO of OpenAI, said that watching the ChatGPT agent use a computer to perform complex tasks made him truly feel the presence of AGI. A computer that thinks, plans, and executes autonomously makes for a markedly different experience.
All operations are completed within the dedicated virtual computer of ChatGPT Agent, which preserves the entire task context information when invoking multiple tools.
The agent can choose to access webpages using a text browser or a visual browser based on needs, perform file download operations, process files through terminal commands, and review output results with the help of a visual browser.
It can also dynamically adjust task strategies to achieve efficient, precise, and rapid execution.
ChatGPT Agent is specifically designed for iterative and collaborative workflows, with interactivity and flexibility far surpassing previous models.
During task execution, users can interrupt the process at any time to further clarify instructions to correct the execution direction or directly change the task goal. The agent will continue to work based on the new information while preserving previous progress.
Similarly, ChatGPT will proactively ask users for additional details when necessary, so that the task does not drift from the intended goal.
If a task takes longer than expected or stalls, users can choose to pause the process, obtain a progress summary, or terminate the task to extract existing results.
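The interrupt, pause, summarize, and terminate behavior described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the `AgentTask` class, its fields, and its method names are hypothetical and are not OpenAI's implementation or API. It shows one way a task could accept new instructions mid-run while keeping the work already done.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTask:
    """Hypothetical model of an interruptible agent task (not OpenAI's API)."""
    goal: str
    pending: list[str]
    done: list[str] = field(default_factory=list)
    paused: bool = False

    def step(self) -> bool:
        """Run one pending step; return False when paused or finished."""
        if self.paused or not self.pending:
            return False
        self.done.append(self.pending.pop(0))
        return True

    def interrupt(self, new_goal: str, new_steps: list[str]) -> None:
        """Redirect the task; completed work in `done` is preserved."""
        self.goal = new_goal
        self.pending = list(new_steps)

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def summary(self) -> str:
        """Report progress without changing state."""
        return f"{self.goal}: {len(self.done)} done, {len(self.pending)} pending"

    def stop(self) -> list[str]:
        """Terminate the task and return whatever results exist so far."""
        self.pending.clear()
        return list(self.done)


task = AgentTask("plan trip", ["search flights", "compare hotels", "book"])
task.step()                                   # first step completes
task.interrupt("plan trip under budget",      # user changes the goal mid-run
               ["filter hotels by price", "book"])
task.step()
print(task.summary())
print(task.stop())
```

The design point is that redirecting the task replaces only the remaining plan, so earlier progress survives the change of direction.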
If users have the ChatGPT mobile app installed, it sends a push notification when the task is complete.
Built by Integrating Operator and Deep Research
According to OpenAI, the Agent Mode can invoke three tools: text browser, visual browser, and terminal. The model has the ability to autonomously select and switch between these tools.
The design of this tool combination is ingenious: the text browser specializes in browsing and retrieving large amounts of text information, while the visual browser is responsible for performing keyboard and mouse operations or reading image information after locating information.
The terminal tool is used to run code, generate files including presentations and spreadsheets, and invoke specific cloud application programming interfaces (APIs).
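The division of labor among the three tools can be sketched as a simple routing function. This is a minimal illustration, assuming hand-written rules. In the actual product the model selects tools autonomously, and the `Tool`, `Step`, and `choose_tool` names below are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tool(Enum):
    TEXT_BROWSER = "text_browser"      # bulk reading and retrieval of text
    VISUAL_BROWSER = "visual_browser"  # mouse/keyboard actions, reading images
    TERMINAL = "terminal"              # running code, generating files, calling APIs


@dataclass
class Step:
    """Hypothetical description of one step in an agent's plan."""
    description: str
    needs_gui: bool = False      # requires clicking, typing, or viewing images
    needs_compute: bool = False  # requires code execution or file generation


def choose_tool(step: Step) -> Tool:
    """Rule-based stand-in for the model's own tool selection."""
    if step.needs_compute:
        return Tool.TERMINAL
    if step.needs_gui:
        return Tool.VISUAL_BROWSER
    return Tool.TEXT_BROWSER


plan = [
    Step("read product reviews across several sites"),
    Step("log in and add the item to the cart", needs_gui=True),
    Step("build a price comparison spreadsheet", needs_compute=True),
]
for step in plan:
    print(f"{step.description} -> {choose_tool(step).value}")
```

Text browsing is the default here because, as described above, it is the cheapest way to work through large volumes of text. The other two tools are chosen only when a step calls for interaction or computation.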
The new Agent Mode launched by OpenAI is not a brand-new technological innovation but is actually an integration of two tools released by the company in the first half of the year: Operator and Deep Research.
Operator was originally a browser agent tool only available to Pro users, with the ability to analyze graphical user interfaces and perform basic operations.
Deep Research is a deep research and analysis tool that can read a large amount of webpage content and directly generate research reports.
While promoting the two tools separately, OpenAI found that many users submitted prompts through Operator that were better suited to Deep Research, such as "planning a travel itinerary and making reservations".
Meanwhile, Deep Research users strongly requested the ability to "log in to websites and access protected resources", which Operator already had.
These two Agent projects, advancing from different dimensions, ultimately achieved integration, resulting in significant synergistic effects.
This not only avoids the inefficiency of relying solely on the browser graphical interface to process text materials but also significantly shortens the time required to generate in-depth research reports.
Achieving a Critical Upgrade in General Agent Capabilities
Unlike earlier generations of foundation models, the general agent can autonomously invoke multiple tools to plan tasks and help users complete complex operations, such as automatically checking the user's calendar, generating editable PowerPoint presentations, and running code.
ChatGPT Agent can connect to users' accounts on platforms such as Gmail and GitHub to obtain information and solve problems, and it can also access various applications through APIs.
OpenAI evaluated the model using benchmarks that simulate complex real-world tasks.
With agentic tool use, the model's measured capability improved significantly.
The model behind ChatGPT Agent scored 41.6% on the Humanity's Last Exam (HLE) benchmark, nearly double the performance of the o3 and o4-mini models.
In OpenAI's internal evaluation of complex, economically valuable knowledge-work tasks, ChatGPT Agent's output matched or surpassed human quality in about half of the cases across a range of task completion times, and it significantly outperformed the o3 and o4-mini models.
In the SpreadsheetBench test of spreadsheet operations, the agent made marked gains in complex spreadsheet editing, function application, and formatting, scoring 45.5%, twice the performance of GPT-4o, and approaching the level of the commercial-grade Excel Copilot for the first time.
In the field of webpage operations, ChatGPT Agent successfully executed real-world tasks such as account login, page navigation, and data collection in the WebArena test, with performance close to the average human level.
In information retrieval, the agent set a new record with a score of 68.9% on the BrowseComp benchmark. This metric bears directly on how reliable the agent is when executing tasks autonomously.
Conclusion:
OpenAI's official entry may reshape the overall narrative framework of the Agent startup sector.
A few months ago, Manus was hailed as the "hope of domestic Agents": it was the first to show the market what the future might look like, before the industry fully understood the concept of Agents, and it proved the real potential of AI to perform complex tasks.
However, in early July this year, the Manus official website quietly closed, and its business in mainland China was fully suspended, with only overseas product lines remaining. This prompted outsiders to re-examine the true survival status of Agent startups.
A few days ago, Zhu Xiaohu publicly asserted that large models will swallow 90% of the Agent market.
Undoubtedly, Manus's retreat is intertwined with multiple complex factors such as regulatory policies, compliance requirements, and the capital environment.
But the question now is how much room remains: with OpenAI itself entering the field, opportunities for general-Agent startups are dwindling.
References:
GeekPark: "Just Now, OpenAI Released Its Own Agent Mode, Manus-Style"
Machine Heart: "Just Now, OpenAI's General Agent ChatGPT Agent Officially Debuts"
Guokr: "Finally, OpenAI's Agent, But This Time, Not Much Applause"
NetEase Tech: "In the Wee Hours, OpenAI Dives into 'General Agent', Did Manus Waste Their Efforts?"