12/11 2024
"The fact that Zhipu AI is compelled to 'expand in multiple directions' highlights that in China's AI innovation ecosystem and investment climate, 'speed' often takes precedence over 'depth'."
@TechNewKnowledge Original
On an ordinary winter day in 2024, domestic AI technology once again changed how we interact with technology.
On November 29, Zhipu AI staged an impressive technological spectacle at its Technology Open Day: For the first time, an AI distributed red envelopes!
With just three voice commands, the company's CEO Zhang Peng enabled its AI agent, AutoGLM, to traverse multiple applications like WeChat and Alipay, sending out two large red envelopes to both on-site and online audiences.
Behind this scene lies a groundbreaking advancement in AI Agent technology.
Currently, traditional AI assistants are confined to passive responses and single-scenario interactions, whereas Zhipu AI's AutoGLM can actively comprehend complex instructions, collaborate across applications, and accurately execute user intentions.
Apart from sending red envelopes, AutoGLM can also autonomously perform complex, multi-step tasks, such as comparing prices across multiple apps when ordering takeout.
However, this 'surprise' is merely one of the many 'fruits' borne by Zhipu AI this year.
In July of this year, Zhipu AI officially released the fourth generation of its code generation model, CodeGeeX, supporting basic functions such as code completion, code commenting, code repair, and code translation. At the end of July, Zhipu AI also unveiled its video generation model, 'Zhipu Qingying', capable of generating 6-second videos with a resolution of 1440x960.
In October, Zhipu AI launched and open-sourced the end-to-end speech model GLM-4-Voice. Similar to GPT-4's speech functionality, GLM-4-Voice enables real-time speech dialogue, with breakthroughs in emotional expression and multilingual capability, and it allows the user to interrupt at any time.
It is evident that this year, Zhipu AI has made significant strides in multiple directions, including code, multimodality, and Agent. This comprehensive technological layout underscores Zhipu AI's resolve to keep pace with global AI giants.
However, behind this seemingly all-encompassing progress lies a sobering question: given that Zhipu AI's size and capital are smaller than those of giants like OpenAI, will such broad technological coverage limit the depth of its research in each field?
01
Concerns About Multi-directional Expansion
Overall, in this year's AI race, Zhipu AI, one of the 'Six AI Tigers,' has covered a lot of ground but performed only 'average' on each front. Its product direction tends to follow trends rather than showcase original breakthroughs.
Take Zhipu AI's recently released flagship model GLM-4-Plus as an example. Positioned as a deep reasoning model akin to OpenAI's GPT-4, it excels in deep reasoning, long text processing, and instruction following, capable of handling more complex mathematical and logical problems. However, this 'power' does not signify an absolute advantage but rather exposes some underlying contradictions.
The contradiction lies in the mismatch between GLM-4-Plus, a 'GPT-4-like' deep reasoning model, and Zhipu AI's own ecosystem positioning.
Zhipu AI's positioning differs from giants like OpenAI. While Zhipu AI does lean more towards the B-end market, this market is not monolithic but comprises diverse levels and types of demands.
The demand for high-performance deep reasoning primarily stems from research, high-tech industries, or specific fields (like programming or scientific computing), which are relatively limited in scope.
Zhipu AI's B-end clients focus on a broader range of industry applications, including finance, education, energy, and communications. These enterprises prioritize cost-effective, easily integrable, and flexible models rather than deep reasoning models that require substantial computational power.
If competing in high-end reasoning models is simply something Zhipu AI must do to showcase its core technological prowess at a time when the 'scaling law' is at risk of failing, its push into multimodality reflects a deeper 'loss of positioning'.
02
Beyond Reach: Multimodality
In 2024, Zhipu AI released multimodal voice assistant features, particularly its 'Zhipu Qingyan' system based on the GLM series. By integrating real-time speech, video calls, and multimodal understanding technology, it attempted to expand into new scenarios in the C-end application field.
However, compared to iFLYTEK's 'Spark' large model and ByteDance's 'Doubao' voice AI, Zhipu AI's performance presents some intriguing contradictions.
iFLYTEK has been deeply entrenched in the speech domain for years, with mature speech recognition, translation, and scenario-based applications (like meeting minutes and smart customer service) that have good adoption rates in practical scenarios. ByteDance's 'Doubao' relies on a robust content ecosystem, possessing the potential to apply voice AI to consumer-grade scenarios like social media, entertainment, and short video creation.
Lacking a comparable ecosystem, Zhipu AI's multimodal voice assistant has yet to show any notable differentiation. While its video call function supports low-latency, more natural interaction, its answers in voice and video mode are noticeably less intelligent than in text-based interaction, a problem it shares with ByteDance's 'Doubao' and iFLYTEK's 'Spark'.
In addition, Zhipu AI also demonstrated ambition in the field of text-to-video generation in 2024. Through its newly released CogVideoX v1.5 model and open platform 'Qingying,' it offers a range of functions from text-to-video (T2V) generation to multimodal integration. Its technological highlights include support for generating 5- to 10-second HD videos, 4K resolution, and multi-channel output (generating multiple videos at once).
However, frankly speaking, compared to the text-to-video large models of ByteDance, Kuaishou, and other major players, 'Qingying' still falls short.
Although it boasts free usage, high definition, and even added AI sound effects in later stages, the videos it generates are peculiar, distorted, and have obvious motion errors.
For example, upon inputting the prompt: 'A vast beach with a humanoid robot and a cat walking together,' the video generated by Qingying shows two robots instead of one, moving in a strange, sideways crab-like manner.
Even more bizarre is the cat in the scene, whose head turns into a tail as it walks, as if its organs have switched places.
03
Helplessness Amidst Price Wars
The aforementioned phenomena of 'broad but not deep' reflect a deeper issue: Zhipu AI seems to be wavering between the B-end and C-end directions.
Taking video generation as an example, ByteDance tightly integrates MagicVideo-V2 into platforms like TikTok and Douyin, so that the technology and the commercial platform reinforce each other. Similarly, Kuaishou can embed video generation into its short video platform.
The short video sector is naturally the closest to the C-end market and the best fit for this kind of technology.
Currently, from an ecosystem perspective, Zhipu AI's overall strategy leans more towards the B-end market, serving clients in finance, education, energy, and manufacturing. Most of these collaborations focus on scenarios requiring high technological support and private deployment, such as industrial process optimization and intelligent customer service.
However, Zhipu AI's multi-directional strategy this year seems to indicate that it hopes to expand the ToB market while also creating a multimodal interactive super app for the C-end, forming a 'two-pronged' strategy.
With overall resources inferior to those of both OpenAI and the BAT giants, this strategy ultimately disperses resources, making it difficult to build a prominent competitive advantage in any one direction.
In reality, this multi-directional strategy reveals a 'helpless breakthrough' amidst commercialization difficulties.
According to the 'China Large Model Bidding Project Monitoring Report,' from January to September 2024, Zhipu Huazhang won 22 large model projects with a disclosed bid amount of 24.723 million yuan. These 22 winning projects are primarily distributed across industries like communications, finance, energy, education, and science, with central state-owned enterprises as the primary clients.
In terms of the number of large model bidding projects won, Zhipu Huazhang ranks in the first tier alongside iFLYTEK and Baidu. However, the 'cost' paid by Zhipu Huazhang to secure these projects is not insignificant.
This 'cost' is the extreme price war.
This year, to counter price pressure from competitors, Zhipu AI has lowered its model invocation prices to the lowest levels in the industry. For instance, GLM-4-Flash costs only 0.06 yuan per million tokens, while OpenAI's GPT-4 Turbo costs 10 USD per million tokens (roughly 70 yuan, assuming an exchange rate of about 7 yuan to the dollar), a difference of over a thousand times. Within a year, Zhipu AI's price has dropped from an initial 0.5 yuan per thousand tokens (500 yuan per million tokens) to the current price, a reduction by a factor of roughly 8,000, or close to four orders of magnitude.
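The price comparison above mixes units (per thousand versus per million tokens) and currencies, so it helps to normalize everything to yuan per million tokens. The following is a minimal sketch of that arithmetic using only the prices quoted in the article; the exchange rate of 7.2 yuan per US dollar is an assumption introduced here for illustration, not a figure from the article.

```python
# Prices as quoted in the article.
GLM_4_FLASH_CNY_PER_MILLION = 0.06    # current GLM-4-Flash price
ZHIPU_INITIAL_CNY_PER_THOUSAND = 0.5  # Zhipu AI's initial price
GPT4_TURBO_USD_PER_MILLION = 10.0     # GPT-4 Turbo price as quoted

# ASSUMPTION: the article gives no exchange rate; 7.2 is illustrative only.
CNY_PER_USD = 7.2


def per_million(price_per_thousand: float) -> float:
    """Convert a per-thousand-token price to a per-million-token price."""
    return price_per_thousand * 1000


# Normalize both comparison prices to yuan per million tokens.
initial_cny_per_million = per_million(ZHIPU_INITIAL_CNY_PER_THOUSAND)
gpt4_turbo_cny_per_million = GPT4_TURBO_USD_PER_MILLION * CNY_PER_USD

# How many times cheaper the current price is than each reference price.
drop_factor = initial_cny_per_million / GLM_4_FLASH_CNY_PER_MILLION
gap_factor = gpt4_turbo_cny_per_million / GLM_4_FLASH_CNY_PER_MILLION

print(f"Initial Zhipu price: {initial_cny_per_million:.0f} yuan per million tokens")
print(f"Price drop within a year: about {drop_factor:,.0f}x")
print(f"Gap to GPT-4 Turbo: about {gap_factor:,.0f}x")
```

Under these assumptions the one-year drop works out to about 8,333 times and the gap to GPT-4 Turbo to about 1,200 times. The second figure moves with the exchange rate, but it stays above a thousand for any rate above 6 yuan per dollar.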
This aggressive pricing strategy has further compressed profit margins. As a result, to survive, Zhipu AI, as a large model vendor, must rely on funding.
In the past six months, capital's attitude towards domestic large model vendors has gradually cooled. If large model vendors want to secure new rounds of funding, the most important thing is to demonstrate their commercialization capabilities.
In practice, vendors try to demonstrate such 'capabilities' by rolling out one technological marvel after another.
Over the past few months, Zhipu AI has successively released the AI-generated video model Qingying, the emotional speech model GLM-4-Voice, and the AI assistant tool AutoGLM, all in an attempt to attract market attention by pursuing technological hotspots.
However, looking at the AI industry as a whole, even with large model commercialization stuck at a bottleneck, a 'multi-directional attack' is not the only option open to AI enterprises.
At a time when large models have not yet achieved significant profitability in the C-end market, have there been AI enterprises that have maintained their composure, focused on specific directions, and made breakthroughs that exceed industry limits?
The answer is certainly yes, and Anthropic, a fierce competitor of OpenAI, is a good example.
Compared with large companies that spread across multiple product lines, Anthropic clearly focuses on mechanistic interpretability and AI alignment. Its research objectives are highly concentrated: for example, it uses the concept of "Constitutional AI" to improve the safety and ethics of AI and make its models' behavior more transparent and controllable. This focus not only improves the depth and quality of its research but also attracts capital willing to invest in this field for the long term, including Sam Bankman-Fried's FTX Foundation and Google Cloud.
That Anthropic can do this while Zhipu AI is forced to "advance on multiple fronts" reflects a deeper reality: in China's AI innovation ecosystem and investment environment, "speed" often takes precedence over "depth".
This is not merely a matter of corporate choice but a product of the entire innovation ecosystem.
The dilemma that domestic large model vendors like Zhipu AI face in "chasing hot spots" is essentially a "prisoner's dilemma": every company knows the importance of in-depth cultivation, but under fierce market competition and capital pressure, each is compelled to choose a more aggressive strategy. Behind this phenomenon lies the fact that China's technological innovation ecosystem does not yet fully understand or respect "slow variables".