Can Transformer Architecture Propel the Next Generation of AI Agents?

December 22, 2025

Author: Li Yue

Editor: Key Focus Editor

On December 18th, the 2025 Tencent ConTech Conference and Tencent Technology Hi Tech Day kicked off, gathering academicians from the Chinese Academy of Engineering, renowned experts and scholars, founders of leading tech firms, and prominent investors to delve into the opportunities and challenges of the intelligent era.

During the roundtable discussion, when the moderator passed the microphone to Zhang Xiangyu, Chief Scientist at Jieyue Xingchen, and asked about the future trajectory of model architectures, this academic heavyweight made a bold assertion: the existing Transformer architecture is ill-suited to support the next generation of AI agents.

Recently, Li Feifei, a Stanford University professor often referred to as the "Godmother of AI," stated candidly in an in-depth interview that the current Transformer architecture may fall short in generating high-level abstractions comparable to the theory of relativity. Within the next five years, the industry will need to discover a new architectural breakthrough to enable AI to transition from mere statistical correlations to true causal reasoning and physical understanding.

Ilya Sutskever, the mastermind behind the GPT series and a co-founder and former chief scientist of OpenAI, echoed a similar sentiment in a recent interview. He noted that the "era of scaling"—characterized by a relentless focus on increasing computational power and data volume—is reaching its limits, and the industry is now pivoting back to an "era of research" centered on foundational innovation.

Over the past seven years, from Google's BERT to OpenAI's GPT series, and then to the groundbreaking DeepSeek, nearly all AI models that have made global headlines have been built on the Transformer architecture. It has propelled NVIDIA's market value to unprecedented heights and enabled countless startups to secure substantial funding.

However, even those who understand it best are beginning to question its limitations.

Humanity seems to be standing on the cusp of another paradigm shift. As the marginal benefits of the Scaling Law diminish and trillion-parameter models still struggle to navigate the physical world as adeptly as humans, we must confront a critical question:

Has the Transformer architecture, once hailed as the path to Artificial General Intelligence (AGI), already reached its ceiling?

The Overachiever Who Excels Only in Tests

Prior to 2017, the dominant approach to natural language processing (NLP) in AI relied on Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs). These models processed text one token at a time, like a meticulous reader moving word by word, which made them slow to train and prone to losing track of long-distance semantic relationships.

In 2017, Google's seminal paper "Attention Is All You Need" revolutionized the field.

The Transformer architecture abandoned sequential processing and introduced the "self-attention mechanism." Instead of reading word by word, it could simultaneously focus on all words in a sentence and calculate their correlation weights.
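The "correlation weights" idea can be sketched in a few lines. This is a toy, single-head version with no learned projections, multiple heads, or masking, which real Transformers all have; it only illustrates how every token attends to every other token at once:

```python
import numpy as np

def self_attention(X):
    """Toy single-head self-attention over a whole sequence at once.

    X: (seq_len, d) matrix of token embeddings. In a real Transformer,
    queries, keys, and values come from learned linear projections;
    here we reuse X directly to keep the sketch minimal.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # pairwise correlation scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: rows sum to 1
    return weights @ X                                # each output mixes all tokens

X = np.random.randn(5, 8)        # 5 tokens, 8-dimensional embeddings
out = self_attention(X)
print(out.shape)                 # (5, 8): every token computed in parallel
```

Because each output row is built from the whole sequence in one matrix multiplication, nothing has to be processed word by word, which is what makes the architecture so friendly to GPUs.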

This breakthrough enabled parallel computing, and models began to demonstrate astonishing emergent capabilities when provided with sufficient computational power (GPUs) and data. The empirical relationship between scale and capability later became known as the Scaling Law.

The combination of Transformer and GPUs was akin to the invention of the internal combustion engine paired with oil, igniting an AI wave on par with the Third Industrial Revolution.

However, Transformer is, at its core, an extreme statistician.

Li Feifei pointed out that one of the most significant breakthroughs in generative AI was the discovery of the "next token prediction" objective function. While elegant, it is also inherently limiting. The core logic of Transformer is probability prediction based on massive datasets. It has read all the books on the internet, so when you jump off a cliff, it knows the next word should be "fall," not "fly."
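The "fall, not fly" point is purely statistical, and a toy bigram model makes that concrete. The corpus below is a made-up stand-in for "all the books on the internet"; the model picks whichever next word co-occurred most often, with no notion of gravity:

```python
from collections import Counter

# Tiny invented corpus standing in for the training data.
corpus = (
    "he jumped off the cliff and began to fall "
    "she jumped off the ledge and began to fall "
    "the bird spread its wings to fly"
).split()

# Estimate P(next | current) from raw co-occurrence counts.
bigrams = Counter(zip(corpus, corpus[1:]))

def predict_next(word):
    """Return the statistically most frequent next token, nothing more."""
    candidates = {nxt: c for (w, nxt), c in bigrams.items() if w == word}
    return max(candidates, key=candidates.get) if candidates else None

print(predict_next("to"))   # "fall": it wins on frequency, not on physics
```

Real models replace counting with a neural network over billions of tokens, but the objective is the same: maximize the probability of the observed next token.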

Ilya offered a metaphor: current models are like students who have practiced for ten thousand hours to win programming competitions. They've memorized all algorithms and tricks, seen every possible exam question, and covered all blind spots through data augmentation. They seem strong and can score high, but they're essentially just retrieving memories.

In contrast, a truly talented student might only practice for a hundred hours but possesses profound intuition and genuine generalization capabilities. Current Transformer models resemble the rote-memorizing overachiever—their performance drops sharply when encountering unfamiliar domains.

Ilya believes this is because models lack certain qualitative factors, causing them to learn to cater to evaluation standards without truly mastering reasoning.

Li Feifei made a similar observation: "Most depictions of water flowing or trees swaying in current generative videos are not based on Newtonian mechanics calculations but on statistical emergence from massive data."

In other words, AI has simply seen countless examples of water flowing and mimicked them. It doesn't understand the intermolecular forces or gravitational acceleration at play.

Transformer is a perfect curve-fitter, capable of infinitely approximating reality but unable to derive the underlying rules, because it captures only correlations, not causation.

The Curse of Long Contexts and the Absence of Slow Thinking

In 2025, a clear trend in the AI industry is long-text processing. However, in Zhang Xiangyu's view, this could be a trap: "Our current Transformers, no matter how many tokens they claim to support, basically become unusable beyond 80,000 tokens... Even if the context length can be very long, testing shows degradation around 80,000 tokens."

This degradation isn't a matter of the model forgetting; rather, its effective intelligence declines rapidly as the text grows longer.

Zhang Xiangyu explained the underlying mathematical limitation—Transformer's information flow is unidirectional: "All information can only flow from layer L-1 to layer L. No matter how long the context is, the model's depth doesn't increase; it only has L layers." Its thinking depth is fixed and doesn't deepen just because the text gets longer.
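Zhang's point can be stated concretely: however long the input, each token passes through the same fixed stack of L layers, so the amount of sequential "thinking" per token never grows. The schematic below (hypothetical layer functions, illustrative only) counts those steps:

```python
def transformer_forward(tokens, layers):
    """Information flows strictly upward: layer l-1 feeds layer l.

    No matter how many tokens arrive, every token undergoes exactly
    len(layers) transformation steps; depth is fixed at design time.
    """
    h = tokens
    steps = 0
    for layer in layers:                 # fixed L iterations, independent of len(tokens)
        h = [layer(x) for x in h]
        steps += 1
    return h, steps

layers = [lambda x: x + 1] * 4                            # stand-in for 4 layers
_, short_depth = transformer_forward(list(range(10)), layers)
_, long_depth = transformer_forward(list(range(100_000)), layers)
print(short_depth, long_depth)                            # both 4: longer context, same depth
```

A ten-thousand-fold longer context widens the input but adds not a single extra step of sequential reasoning, which is the mathematical limitation Zhang is describing.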

This resembles Ilya's emphasis on value functions. He pointed out that humans are efficient because we possess intrinsic value functions—you don't need to finish an entire chess game to know losing a piece was a mistake; you get signals during the process.

Current Transformers lack this mechanism. They must lay all information out flat and consult their entire lifetime's records for every decision. Like human fast-thinking intuition, they can blurt out answers, but they cannot engage in slow, deliberate thinking.

Ilya believes true intelligence isn't just predicting the next token but pre-judging path quality through internal value functions before acting. For future agents to survive in an infinite-stream world, continuing with Transformer's flat-memory architecture isn't just computationally unsustainable—it's logically untenable.

Visual Aphasia and Physical Blind Spots

The Transformer crisis isn't limited to language and logic; it extends to its inability to understand the physical world.

Li Feifei argues: "Language alone is insufficient to build artificial general intelligence." Current Transformers often crudely adapt next-word prediction to next-frame prediction for visual tasks, resulting in videos lacking spatiotemporal consistency.

A deeper contradiction lies in sample efficiency.

Ilya raised a question in an interview: Why can a teenager learn to drive in just ten hours, while AI requires massive data training?

The answer lies in "prior knowledge." Humans possess powerful evolutionary prior knowledge and intuition (value functions constituted by emotions and instincts). We don't need to witness a million car crashes to learn avoidance; our biological instincts give us a natural sense of danger in the physical world.

He Xiaopeng expressed a similar insight at the conference: Books can't teach you to walk; physical world skills must be acquired through interaction.

Current Transformer models lack this world model based on physical and biological intuition. They try to compensate for their lack of physical law cognition by enumerating all data. Ilya pointed out that the dividends of pre-training data will eventually dry up—data is finite. When you scale up 100 times, mere quantitative changes may no longer bring qualitative leaps.

Physical AI needs a "digital container" embedded with 3D structure, causal logic, and physical laws, not just a language model guessing the next frame based on probabilities.

Returning to the Era of Research

If Transformer might be a dead end, where is the path forward?

Ilya offered a macro perspective: we are bidding farewell to the "era of scaling" (2020-2025) and returning to the "era of research" (2012-2020). This isn't historical regression but a spiral ascent: we now possess immense computational power, but we need to find new recipes.

This new recipe won't be minor tweaks to a single technology but a systemic reconstruction.

Li Feifei's World Labs aims to build models with "spatial intelligence," creating a closed loop of seeing, doing, and imagining. Future architectures will likely be hybrids: a core of highly abstract causal logic (implicit) interfaced with a rich sensory world (explicit).

Zhang Xiangyu revealed a highly forward-looking direction: "nonlinear RNNs." This architecture no longer flows unidirectionally but can cycle, ruminate, and reason internally. As Ilya envisioned, models need human-like "value functions" to perform multi-step internal thinking and self-correction before outputting results.
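No public specification of this "nonlinear RNN" direction exists, so the contrast with a fixed layer stack can only be caricatured. The sketch below (purely illustrative, not any published architecture) shows a recurrent update that keeps refining its state until it converges, spending more internal steps on harder inputs:

```python
import math

def ruminate(x, max_steps=50, tol=1e-6):
    """Toy recurrent 'rumination': iterate a nonlinear update until the
    state stops changing. Unlike a fixed layer stack, the number of
    internal steps adapts to the input.
    """
    h = x
    for step in range(1, max_steps + 1):
        h_next = math.cos(h)             # stand-in nonlinear recurrence
        if abs(h_next - h) < tol:        # state has settled: stop thinking
            return h_next, step
        h = h_next
    return h, max_steps

_, steps_easy = ruminate(0.739085)       # start near the fixed point of cos
_, steps_hard = ruminate(10.0)           # start far away: more rumination
print(steps_easy, steps_hard)
```

The adaptive step count is the crux: an architecture that can cycle internally gets variable thinking depth per decision, which is exactly what a fixed-depth Transformer forward pass cannot provide.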

Ilya believes future breakthroughs lie in giving AI human-like "continual learning" capabilities rather than shipping static pre-trained products. This requires more efficient reinforcement learning paradigms, shifting models from the rote ten-thousand-hour student toward the intuitive hundred-hour one, with genuine taste and generalization.

If foundational architectures undergo dramatic changes, the entire AI industry chain will face a shakeup.

Current hardware infrastructure, from NVIDIA's GPU clusters to various communication interconnect architectures, is largely tailored for Transformer.

Once architectures shift from Transformer to nonlinear RNNs or other graph-computation hybrid models, specialized chips may face challenges, while the flexibility of general-purpose GPUs will once again become a competitive advantage.

The value of data will also be re-evaluated. Video data, physical world sensor data, and robotic interaction data will become the new oil.

Conclusion

At the end of the interview, Li Feifei made a profound remark: "Science is the nonlinear inheritance of ideas across generations."

We often favor singular hero myths—Newton discovering physical laws, Einstein discovering relativity, Transformer ushering in the AI era. But in reality, science is a river where countless tributaries converge, divert, and flow back.

Transformer is a monument, but it may not be the destination. It has shown us the dawn of intelligence, yet its inherent flaws in causal reasoning, physical understanding, and infinite contexts doom it to be merely a stepping stone on the path to AGI, not the ultimate key.

When Li Feifei says the industry needs new architectural breakthroughs, Ilya declares the Scaling era over, and Zhang Xiangyu states Transformer cannot support the next generation of agents, they're not entirely negating its historical merits but reminding us: Don't fall asleep in comfort zones.

Over the next five years, we may see Transformer gradually recede to become a submodule, while a new architecture—blending spatial intelligence, embodied interaction, and deep logical reasoning—takes center stage.

For tech companies navigating this landscape, it's both a monumental challenge and a rare opportunity.
