Doubao Dives Deep into Smartphones, Qianwen Bets on Glasses: Who Will Seize the 'Power Button' for Agents?

04/22 2026

The Battle for Entry Points in the Agent Era: Doubao and Qianwen Take Different Paths

"A tool becomes a tool only when it is in the hands of the user." Heidegger's words ring equally true when applied to AI hardware today.

The question is: when the 'hand' of large models reaches out, will users choose to hold it in their palms (smartphones), place it on their noses (glasses), or keep it perpetually at their ears (earbuds)?

According to an exclusive report by *Z Finance*, ByteDance has internally decided to suspend the Doubao AI glasses project. To understand this choice, we must first answer a more fundamental question: Why are large model companies making hardware?

Per *LatePost*, average daily calls to Doubao's large model on Volcano Engine now exceed 120 trillion tokens, a 4x increase in six months. Data from the National Data Bureau shows China's average daily token consumption has surged 300x in 1.5 years. At GTC 2026, Jensen Huang declared tokens the most critical commodity of the future digital world.

Yet this token consumption boom obscures a more fundamental issue: Where are these tokens triggered, and through what interface do they enter users' lives?

For the past two years, the answer has been smartphone screens and chat dialogues. OpenClaw's viral success propelled Agents from developer tool to mass-market product, driving demand for "AI execution anytime, anywhere." As large model competition shifts from generative Q&A to task execution, the operation chain requires a physical anchor closer to the user's body.

Doubao chose operating systems as its entry point, entering the AI smartphone race through "OS-level collaboration" with manufacturers to gain core permissions for screen UI recognition and simulated human operations. Qianwen opted for glasses: its AI glasses launched with initial "AI task completion" capabilities supporting mobile recharges, bike-sharing scans, parking payments, and voice-ordered food delivery. These functions share a common trait: AI now finishes tasks in the real world.

The essence of these two paths represents different engineering solutions to the same problem: Who should serve as the physical interface for Agent execution chains?

Two Advantages, Two Extensions

In making hardware, large model companies are essentially answering a question derived from token economics.

The past two years of AI competition centered on model capabilities and pricing. Price wars have driven unit token costs down nearly 300x from three years ago. Yet this collapse hasn't made AI spending predictable, because Agent applications have pushed per-task token consumption to dozens or even hundreds of times that of an ordinary conversation. According to *Tencent Technology*, a 6-person team at Asia-Pacific e-commerce tech firm Branch8 spent $2,400 in their first month using Claude Code, and only brought costs down to $680 after eight weeks of intensive optimization. Managing token expenditure has become a specialized skill.
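A back-of-envelope calculation shows why cheaper tokens haven't meant cheaper bills. All figures below are illustrative assumptions, not actual vendor pricing: a roughly 300x drop in unit price can be largely offset when an Agent-style task consumes hundreds of times the tokens of a chat turn.

```python
# Illustrative token economics: all prices and token counts are assumptions.
OLD_PRICE_PER_MTOK = 30.0                        # $/million tokens, three years ago (assumed)
NEW_PRICE_PER_MTOK = OLD_PRICE_PER_MTOK / 300    # after the price wars (assumed 300x drop)

CHAT_TOKENS_PER_TASK = 2_000                     # a typical Q&A exchange (assumed)
AGENT_TOKENS_PER_TASK = CHAT_TOKENS_PER_TASK * 200  # an Agent task at ~200x a chat (assumed)

old_chat_cost = CHAT_TOKENS_PER_TASK / 1e6 * OLD_PRICE_PER_MTOK
new_agent_cost = AGENT_TOKENS_PER_TASK / 1e6 * NEW_PRICE_PER_MTOK

print(f"chat turn at old prices:  ${old_chat_cost:.4f}")
print(f"agent task at new prices: ${new_agent_cost:.4f}")
```

Under these assumptions a single Agent task today costs about two-thirds of what a single chat turn cost three years ago: the per-task spend shrinks far less than the 300x unit-price drop suggests, which is why token budgets remain a live concern.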

This cost structure fundamentally alters competition logic for many AI products. Whoever controls Agent entry points controls token consumption sources. Entry point density depends on device proximity to users and frictionless activation. This drives Doubao and Qianwen toward hardware: establishing physical nodes at the token consumption chain's front end.

Qianwen's AI glasses team tracks a key metric: user interaction frequency, or how often AI completes tasks for users. After Quark Glasses S1 launched, user interactions increased 6x compared to third-party smartphone AI assistants. Face-worn AI sees more frequent use because perception remains constantly online with near-zero activation friction.

For Agents, this sustained presence means richer contextual accumulation and more task-triggering opportunities.

In April 2026, Qianwen AI glasses' first OTA update introduced "AI task completion" capabilities by integrating with Taobao Flash Sales and Alipay, supporting mobile recharges, bike-sharing scans, parking payments, and voice-ordered food delivery. The product definition shifted: AI evolved from answering questions to completing tasks.

Doubao's chosen path runs equally deep but in a different direction. Last December, Doubao AI smartphone assistant entered the AI phone race through "OS-level collaboration" with manufacturers, gaining core permissions for screen UI recognition and simulated human operations.

Tests show Doubao can automatically complete in the background a 12-step operation spanning three apps ("compare KFC combos, place the order, and send a screenshot"), requiring human intervention only at payment, and finishing 72% faster than doing it by hand.

Currently, widespread Agent usage habits among mass users still need time to develop. Doubao and Qianwen's current hardware investments represent early positioning for an anticipated demand peak. This follows classic platform logic: secure perception nodes first, then data flows and usage volumes will naturally converge there as Agents mature.

But platform logic requires devices to already be on users when demand arrives. This explains Qianwen's expansion beyond glasses to rings and earbuds—no single form factor can cover 24/7 perception needs; only a matrix can.

Doubao and Qianwen's hardware paths both extend from core strengths but represent different optimal forms for those advantages.

The suspension of the Doubao AI glasses project followed a reasonable internal judgment: the mainstream paradigm (large frames, photography, voice, translation) has already been standardized by Ray-Ban Meta. In 2025, Meta's smart glasses sold over 7 million units globally, capturing an 85.2% market share. In this landscape, "can we make it?" is no longer the question.

Qianwen's choice stems from equally clear logic. Alibaba's app ecosystem—mobile recharges, food delivery, parking payments—can directly integrate into Agent execution chains through glasses, repackaged as AI-native interactions. For companies lacking this ecosystem foundation, glasses remain merely face-worn voice assistants; for Alibaba, glasses serve as practical nodes connecting existing apps with new touchpoints.

Doubao integrates deeply into smartphone OSes to establish Agent entry points within its traffic distribution domain. Qianwen bets on a wearable device matrix to repackage Alibaba's app ecosystem into AI-native interactions.

Looking further out, today's product showcases and sales volumes won't determine the final outcome. Two years from now, when Agents are woven into workflows like networks, the only moat will be user habits around specific entry points.

How On-Device Inference Reshapes Cost Structures

The hardware entry point competition ultimately circles back to a fundamental question: Where do tokens originate, where do they go, and who pays?

Token prices are transparent, but users cannot discern how much "intelligence" each token contains. In April, AMD AI Strategy Director Stella Laurenzo's analysis of 6,852 Claude Code sessions revealed a sharp decline in Claude Opus 4.6's reasoning depth since late February. *Tencent Technology* also reported that "file reading times before each code edit" plummeted from 6.6 to 2.0—a 70% drop.

These changes occurred without prominent user notifications. Many developers only suspected "the model got dumber" after noticeable code quality declines.

More concealed is how cache hit rates affect actual costs. One developer's week-long tracking of Claude Code showed that 91% of tokens normally came from cache hits, priced at one-tenth the rate of standard inputs. If caching were to fail completely, input costs would surge 5.7x.
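The arithmetic behind that multiplier can be checked directly. With the reported 91% hit rate and a flat one-tenth cache price, full cache failure multiplies input cost by roughly 5.5x, in the same range as the reported 5.7x (which presumably reflects unrounded underlying figures):

```python
# Sanity check of the cache economics described above, using the reported figures.
cached_share = 0.91     # fraction of input tokens served from cache (reported)
cache_discount = 0.10   # cached tokens billed at 1/10 the standard rate (reported)

# Blended cost per input token, relative to the standard (uncached) rate:
blended = cached_share * cache_discount + (1 - cached_share) * 1.0

surge_if_cache_fails = 1.0 / blended
print(f"blended relative cost: {blended:.3f}")
print(f"cost multiplier if caching fully fails: {surge_if_cache_fails:.1f}x")
```

The point is less the exact multiplier than its fragility: most of the bill's stability rests on a cache behavior the user neither sees nor controls.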

This cost structure underpins one core value proposition of on-device models. After initial deployment, on-device inference has near-zero marginal costs, eliminating cache hit uncertainty and cloud peak pricing fluctuations. For hardware devices frequently triggering Agent tasks, this advantage grows with usage density.
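A hypothetical break-even sketch makes the "advantage grows with usage density" point concrete. Every number below is an assumption chosen for illustration, not actual hardware or cloud pricing:

```python
# All figures are illustrative assumptions, not real pricing.
DEVICE_FIXED_COST = 8.0      # amortized per-device cost of shipping a local model ($, assumed)
CLOUD_COST_PER_TASK = 0.002  # average cloud token bill per Agent task ($, assumed)

# Number of tasks after which local inference beats cloud billing:
break_even_tasks = DEVICE_FIXED_COST / CLOUD_COST_PER_TASK

# At an assumed 10 Agent tasks per day, days until break-even:
days_to_break_even = break_even_tasks / 10

print(f"break-even after {break_even_tasks:.0f} tasks "
      f"(~{days_to_break_even:.0f} days at 10 tasks/day)")
```

The higher the task frequency, the sooner the fixed cost is recovered, which is why the economics favor on-device inference most on high-density form factors like always-worn glasses.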

Google DeepMind's April release of Gemma 4 redefined on-device model capabilities. Its E2B and E4B models activate only 2 billion and 4 billion effective parameters, respectively, during inference, and can process 4,000 input tokens spanning two independent skills within 3 seconds under the LiteRT-LM framework. Both natively support function calls, covering the core reasoning paths of Agent workflows. With a 128K-token context window, they fit within the spare memory of a modern mid-range smartphone.

This means an on-device Agent capable of calling external tools and executing multi-step plans now requires only mid-range smartphone memory.
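The function-calling capability at the heart of this can be pictured as a simple dispatch loop. The model stub and tool names below are purely illustrative (this is not the LiteRT-LM or Gemma API), but the shape of the loop is what an on-device Agent runtime executes: the model emits a structured tool call, the runtime dispatches it to local code, and the result flows back without touching the cloud.

```python
import json

# A hypothetical local tool; name and signature are invented for illustration.
def recharge_phone(number: str, amount: int) -> str:
    return f"recharged {number} with {amount} CNY"

TOOLS = {"recharge_phone": recharge_phone}

def fake_model(prompt: str) -> str:
    # Stand-in for an on-device model that answers with a JSON tool call.
    return json.dumps({"tool": "recharge_phone",
                       "args": {"number": "138****0000", "amount": 50}})

def run_agent(prompt: str) -> str:
    reply = fake_model(prompt)            # 1. model proposes a tool call
    call = json.loads(reply)              # 2. runtime parses the structured call
    fn = TOOLS[call["tool"]]              # 3. dispatch to the registered local tool
    return fn(**call["args"])             # 4. execute and return the result

print(run_agent("Top up my phone with 50 yuan"))
```

Real runtimes add validation, multi-step planning, and fallback to the cloud, but the loop above is the minimal unit that makes "AI task completion" possible entirely on-device.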

Qianwen currently employs a hybrid architecture combining cloud-based large models with local lightweight agents—a sound solution under current on-device computing constraints. According to *36Kr*, Qianwen's 2026 hardware lineup includes AI rings and earbuds alongside glasses, covering visual interaction, unobtrusive wear, and audio interaction across three dimensions to form an all-weather perception matrix.

This matrix's core value lies in glasses capturing first-person behavioral data streams that feed back into Qianwen's large model iterations. Improved model capabilities then optimize hardware experiences, creating a closed loop. However, models like Gemma 4 are shortening this "current" phase's lifespan. When on-device models can independently complete more Agent tasks locally, the necessity for cloud fallback in high-frequency, lightweight scenarios will continuously decline, altering token consumption paths.

This challenges current AI hardware's dominant cloud-based model in two ways: First, enhanced on-device capabilities reduce hardware reliance on the cloud, making device-side AI more competitively priced. Second, when users complete more Agent tasks locally, the business closed loop relying on cloud data reflow to drive model iterations must redesign its data acquisition paths.

The structural question of how much incremental demand remains cloud-based versus shifting locally will become critical for MaaS business models.

Final Thoughts

As token consumption migrates from conversation to execution layers, with Agents operating apps on behalf of users, the question arises: Should these tasks be billed through cloud computing or completed locally on-device? The answer will shape token consumption structures and influence MaaS revenue models.

Volcano Engine's MaaS revenue target exceeding ¥10 billion has risen alongside model releases like Seed 2.0 and Seedance 2.0, as well as OpenClaw's viral success. Alibaba has established the ATH Business Group. The token wars in the cloud and entry point battles in hardware represent two fronts of the same competition. Whoever establishes sufficiently frequent Agent usage habits on the hardware side gains demand-side initiative in the next phase of cloud-based MaaS growth.

The 2026 AI hardware competition appears as a form factor battle between glasses and smartphones but fundamentally represents early positioning for Agent-era token consumption entry points. No quick conclusions will emerge, as mass users' actual Agent usage habits are still forming, on-device model capabilities continue advancing through models like Gemma 4, and cloud token cost structures shift quietly through cache hit rates, reasoning depths, and pricing strategies.

*New Position* believes the decisive factor will be which company possesses sufficiently dense and high-frequency application scenarios enabling Agents to continuously accumulate context and optimize execution capabilities in real usage, thereby deepening user understanding.

This variable depends more on ecosystem foundations. Qianwen and Doubao's hardware divergence reflects two different ecosystem bases making distinct bets at the same technological inflection point, each seeking answers where they excel most.

