Tremble, Humans: AI Continues to Accelerate at Breakneck Speed

06/15 2026

The next challenge for large models is to make physical AI a reality.

Indeed, AI is still accelerating at breakneck speed.

In 2016, deep learning surged but stalled barely a year later. By 2026, large models, which have been booming for four years, still show no signs of hitting their limits.

At the 2026 Beijing Academy of Artificial Intelligence (BAAI) Conference, Guangzhui Intelligence observed efforts across models, hardware, software, and products to transition AI from the digital to the physical world.

On one hand, the Scaling Law remains a stable driver, propelling the continued development of large language and multimodal models, and the AI industry has now entered the stage of pursuing world models. However, technical approaches have not converged and data questions remain unresolved, so exploration will likely take at least another three to five years.

On the other hand, breakthroughs in agents have accelerated AI's deployment in real-world scenarios. Now that agents have become usable, the industry is applying them in fields such as healthcare and conference assistance. To take agents from usable to highly functional, hardware-software synergy has become crucial. Chip manufacturers dominated the BAAI Conference exhibition, with nearly all leading domestic AI chip companies in attendance.

"We are standing at a new historical inflection point. Artificial intelligence is no longer just a tool for transforming individual industries; it is becoming the underlying force reshaping the world. AI Coding, autonomous agents, and self-evolving models are expanding the possibilities of AI creation. World models, embodied intelligence, and robots are extending intelligence from the digital to the physical world," said Wang Zhongyuan, Dean of the Beijing Academy of Artificial Intelligence.

What exactly is happening as this "underlying force" reshapes the world?

On the first day of the BAAI Conference, guests provided the answer: AI is evolving from "just chatting" to "getting things done." The Scaling Law continues to hold, and world models, whose technical directions have yet to converge, are becoming the focus of the next phase. Meanwhile, agents are progressing from usable to highly functional, though many issues remain to be resolved.

AI Hasn't Hit Its Technical Ceiling, and It Has Learned to Self-Evolve

Over the past year, as high-quality internet text data has been nearly exhausted, a pessimistic sentiment has permeated the industry that the "Scaling Law is reaching its limits."

At multiple forums during the BAAI Conference, the question of whether the "dividends of the Scaling Law are shrinking" was raised repeatedly, and many guests rejected the notion.

"I still firmly believe that scaling is far from over," said Wang He, founder and CTO of Galaxy General. "Looking back today, the Scaling Law hasn't failed; it's just become more diversified."

Scaling continues to play a role in a series of newly released large language models. Analyzing Anthropic's recently released Fable 5, Luo Fuli of Xiaomi noted that the model is itself a product of scaling pursued in a principled way: it combines three dimensions of expansion, namely parameter scale, synthetic data, and reinforcement learning.

"We speculate that Fable 5's parameter scale is several times larger than the current largest open-source model. Additionally, significant compute has been invested in Test-Time Scaling (inference-time scaling) or reinforcement learning. Furthermore, synthetic data generated by humans and agents has brought data scale to new levels," said Luo.

In the multimodal domain, the model performance improvements brought by scaling are equally significant. Zhu Jun, founder and chief scientist of Shengshu Technology, stated that data quality, model size, and large-scale training all contribute to model enhancement. Building on improved foundational model capabilities, models can more efficiently learn physical laws and understand 3D scenes.

While scaling remains effective, the gradual maturation of AI Coding and the accelerated deployment of agents have made AI's trend toward self-evolution increasingly apparent: AI is moving from writing code to independently completing product iterations and updates.

"Much of the digital world is fundamentally constructed through code. Substantial progress in AI Coding, making it mainstream, means AI could gradually take over everything in the digital world," said Wang Zhongyuan.

Globally, using AI for product updates has become the norm.

Overseas, more than 80% of the code merged at Anthropic is written by Claude. Domestically, companies such as Tencent and Baidu have handed coding work to AI. Li Jingqiu, chief architect of Baidu's Dumate product, told Guangzhui Intelligence that 90% of Dumate's code is AI-generated; the product went from kickoff to internal testing in one week and reached its release version a week later.

The World Model: The Next Critical Battleground for Large Models

As breakthroughs extend beyond the digital realm, world models have become the next critical battleground for large models.

"No world model has truly impressed us yet by solving various problems in the real physical world," said Wang Zhongyuan.

World models are at an early stage of development, and the industry has yet to reach consensus on the underlying technologies. With technical approaches still diverging, a series of pressing issues remain unresolved. Take data as an example: as Wang Zhongyuan noted, it is still unclear whether video data, simulation data, or real-world physical data is what is needed.

Galaxy General is one example. At the conference, Wang He described the company's use of synthetic data.

"Before the WAM (World Action Model) paradigm emerged, we conducted extensive experiments with synthetic data for grasping tasks within the VLA paradigm," said Wang. "Using 1 billion frames of simulation data, we proved that scaling data to this extent enables zero-shot generalization: grasping any given object in the real world."

Regarding the development of world models, the Beijing Academy of Artificial Intelligence predicts that "several more years are needed," with the next three to five years marking a phase of continuous evolution and iteration for world models.

Within a few years, the industry will see diverse world models emerging from various technical approaches, each with its strengths.

Take multimodal world models as an example. Zhu Jun noted that video models are closely tied to world models, because a world model requires three core capabilities: understanding and interpreting states, making predictions, and taking actions. Among the training data currently accessible, video data is the most relevant to world models.
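To make the three capabilities concrete, here is a minimal sketch of a world-model-style interface: interpret an observation as a state, then predict the next state given an action. Everything in it is an illustrative assumption rather than anything presented at the conference: the names State, encode, predict, and rollout are hypothetical, and the hand-written point-mass physics stands in for what a real world model would learn from data such as video.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class State:
    """Hypothetical internal state: a 1D point mass."""
    position: float
    velocity: float


def encode(observation: Tuple[float, float]) -> State:
    """Understand: turn a raw observation into an internal state."""
    position, velocity = observation
    return State(position, velocity)


def predict(state: State, action: float, dt: float = 0.1) -> State:
    """Predict: next state given an action (here, an acceleration).

    A real world model would learn this transition; this toy version
    hard-codes simple physics purely for illustration.
    """
    velocity = state.velocity + action * dt
    position = state.position + velocity * dt
    return State(position, velocity)


def rollout(state: State, actions: List[float], dt: float = 0.1) -> List[State]:
    """Act: imagine the outcome of a sequence of actions, step by step."""
    states = [state]
    for action in actions:
        states.append(predict(states[-1], action, dt))
    return states


if __name__ == "__main__":
    start = encode((0.0, 1.0))
    for s in rollout(start, [2.0, 0.0], dt=0.5):
        print(s)
```

The point of the sketch is the shape of the loop, not the physics: an agent can call rollout to compare candidate action sequences before acting in the real world.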

Given the divergence in technical approaches and the lack of industry consensus, the Beijing Academy of Artificial Intelligence categorizes world models into four types:

The first type is language-centric world models, which map other modalities and capabilities into linguistic space, including large language models, VLMs, and VLAs.

The second type is pixel-centric world models. Video generation essentially predicts the next frame; video generation models are related to world models but are not equivalent to them. The World Action Model (WAM), which may become highly popular this year, is built around pixel-centric approaches.

The third type is 3D structure-centric world models, in which 3D reconstruction is used to represent a purely three-dimensional world.

The fourth type is vision-representation-centric world models.

Currently, the Beijing Academy of Artificial Intelligence is exploring a "fifth" approach: latent space representation, a fusion of the language-centric and vision-representation-centric types. It compresses text, images, and other information into a vector space that represents the various states of the real physical world.

"In the future, unified latent space modeling will extend beyond visual space to encompass all-modal latent spaces, likely representing the next viable path for world models," said Wang Zhongyuan.

At the conference, the Beijing Academy of Artificial Intelligence introduced a world model it has under development, WuJie·Physis-v0.1. Centered on modeling physical space and predicting the next physical state, it is positioned as the world's first general-purpose world foundation model and emphasizes four key capabilities: "physical correctness, action causality traceability, long-term consistency, and generalizability."

Currently, the model is still in training. The Beijing Academy of Artificial Intelligence will continue sharing progress in the second half of the year and open-source the model after training completion.

From "Usable" to "Highly Functional": Agents Face More Challenges

On the model side, progress in world models drives the realization of physical AI. On the product side, agents have become key products bringing AI into everyday life.

Since 2025, dubbed the "first year of agents," some impressive agent products have emerged, showing signs of explosive growth. However, the overwhelming popularity of "Longxia" this year still caught everyone off guard.

"We felt early this year that agents might experience explosive growth, so we made extensive plans centered around agent applications," Li Jingqiu told Guangzhui Intelligence. "But we didn't expect it to come so urgently: almost as soon as we finished planning and started developing product lines, the general-purpose agent Longxia went viral, driving massive user demand."

Compared to last year, when agents mainly executed instructions, this year's agents have become more proactive and capable, helping users perform more complex tasks independently.

At this year's BAAI Conference, the Beijing Academy of Artificial Intelligence also released four domain-specific agents: BAAI Cardiac Agent, the world's first cardiac MRI-assisted diagnostic agent, which supports doctors' decision-making by integrating multimodal capabilities with medical expertise; AREX, an autonomous agent for scientific research; SoulAgent, which listens to conference sessions in real time and captures key points for users; and an agent for discovering risks related to the acquisition of harmful proteins.

Taking the conference-listening agent as an example, Guangzhui Intelligence tested its ability to summarize content from different sessions. SoulAgent provided a simple summary of the conference content. While not as comprehensive as official minutes, the core points were accurate, making it suitable for situations where parallel sessions overlap in time.

However, agents still face numerous technical issues that require further optimization. An Yang, Chair Professor at Nanyang Technological University, said that sustained improvement in agent capabilities currently depends most on context engineering, including memory and orchestration.

At the agent sub-forum, Harness (literally "horse harness," referring to the engineering framework or environment built around an agent) became one of the most frequently mentioned keywords. The term drew little notice last year but is highly popular this year.

"If the model determines an agent's capabilities, then Harness determines the upper limit of those capabilities," said Li Jingqiu. "The challenge lies in further clarifying, verifying, and providing feedback on issues beyond the model's foundation."

For example, relying solely on the model to understand a problem has limits. Harness must refine and enrich a user's one-sentence instruction so the model can better grasp the requirements. That requires Harness to understand the user's intent, design the task workflow once it receives the assignment, and then dispatch the model to execute it. The process may involve manual intervention and correction along the way, followed by checks before the task is marked complete.

In short, as with a real personal assistant, every detail of this process requires careful product work on Harness to improve how well an agent executes.

Currently, agents are still in their early development stages. The industry clearly has significant room for improvement, and advances in model capabilities and engineering details will further strengthen agents' ability to handle tasks.
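The Harness loop described above (enrich the instruction, plan a workflow, dispatch the model, allow human correction, check before completion) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Baidu's or any vendor's implementation: the names run_harness, enrich_request, plan_workflow, and HarnessResult are hypothetical, the fixed three-step plan is a placeholder, and the model is any callable that maps a prompt string to a text output.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class HarnessResult:
    """Hypothetical record of one harness run."""
    enriched_request: str
    plan: List[str]
    outputs: List[str] = field(default_factory=list)
    passed_checks: bool = False


def enrich_request(user_request: str) -> str:
    """Intent understanding: expand a terse instruction into an explicit brief."""
    return (f"Task: {user_request.strip()}. "
            "Clarify the goal, constraints, and expected output format.")


def plan_workflow(enriched: str) -> List[str]:
    """Workflow design: a fixed placeholder plan (a real harness would adapt it)."""
    return [f"analyze: {enriched}", "draft result", "revise result"]


def run_harness(
    user_request: str,
    model: Callable[[str], str],
    reviewer: Optional[Callable[[str, str], str]] = None,
) -> HarnessResult:
    """Run every planned step through the model, with optional human correction."""
    enriched = enrich_request(user_request)
    plan = plan_workflow(enriched)
    result = HarnessResult(enriched, plan)
    for step in plan:
        output = model(step)
        if reviewer is not None:
            # Manual intervention: a reviewer may correct each step's output.
            output = reviewer(step, output)
        result.outputs.append(output)
    # Pre-completion check: every planned step produced non-empty output.
    result.passed_checks = (
        len(result.outputs) == len(plan)
        and all(o.strip() for o in result.outputs)
    )
    return result


def echo_model(step: str) -> str:
    """Stand-in for a real model call, used only for demonstration."""
    return f"done({step})"


if __name__ == "__main__":
    run = run_harness("summarize the keynote", echo_model)
    print(run.plan)
    print(run.passed_checks)
```

Keeping the model behind a plain callable is a deliberate choice in this sketch: the harness logic (enrichment, planning, review, checks) can then be refined and tested independently of whichever model sits underneath.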
