Siri Gave the Phone a 'Voice'; Doubao Connects Its Mind to Hands and Feet

12/03 2025

Phones with Doubao built in take the first step.

Author | Gu Nian

Editor | Yang Zhou

On December 1st, ByteDance's Doubao team released a video demonstrating a range of features of the Doubao Mobile Assistant.

In the scenarios shown in the video, Doubao Mobile Assistant handles cross-app task execution, on-screen content interpretation, multimodal recognition, system-level invocation, cross-platform price comparison and ordering, remote vehicle control, and more. With the memory function enabled, it needs noticeably fewer follow-up questions to carry out a task, showing some ability to handle continuous tasks.

These capabilities cover nearly every use case users have imagined for an 'AI mobile assistant' over the past several years. As a result, although it is still a technical preview, it has revived mainstream discussion of the voice assistant, a feature that has lain dormant on phones for years.

This mobile assistant isn't a standalone application but is launched through a partnership with mobile phone manufacturers. Presently, the demonstrated model is a product co-developed by Doubao and ZTE, with Doubao leading the product definition and interaction experience of the AI assistant, while ZTE handles hardware R&D and manufacturing.

On December 2nd, just a day after the release, the first co-branded model equipped with Doubao Assistant was already marked 'sold out' on ZTE Mall. Regarding inventory and sales figures, Nubia, ZTE's smartphone brand, publicly stated, 'Currently, we lack sales data as it is the inaugural model and was released in limited quantities.'

On the second-hand platform Xianyu, the phone is generally listed at 4,200 to 4,999 yuan, a premium of 700 to 1,500 yuan over the official price. Doubao clarified that it currently has no plans to develop its own phone and is instead focusing on working with multiple phone manufacturers.

The concept of a mobile assistant isn't novel; it has been a part of smartphones for quite some time, yet its positioning has always been ambiguous. Almost all mainstream manufacturers have persisted in incorporating voice assistant access points in system updates, but it has never been a pivotal factor in users' purchasing decisions.

The reason is straightforward. In most users' real-world experience, voice assistants can respond but often fail to provide meaningful help. Over the past decade, their capabilities have largely topped out at recognizing sentences, answering questions, and launching apps, never moving beyond the conversational layer.

Chronologically, Siri debuted with the iPhone 4S in 2011 and Google Assistant launched in 2016; since then, Chinese phone manufacturers have steadily improved their own localized assistants. From the user's standpoint, all of these still belong to the same generation of products: they can understand intent but cannot actually carry out tasks on the user's behalf.

Recently, the industry has seen long-awaited change. First, Google announced last month that it would discontinue Assistant in 2026 and move the system-level entry point to Gemini; then, this month in China, ByteDance released a preview version of Doubao Mobile Assistant that lets the AI perform cross-app operations rather than merely hold a semantic dialogue.

Although these two events took place in different markets, they point in the same direction: mobile assistants may finally be escaping the predicament of being all talk and no action.

01 From Q&A to Cross-App Execution

When introducing Siri, Jobs famously stated that it is not a search engine but artificial intelligence. If voice assistants possess learning capabilities and interact with millions of users over the long haul, theoretically, their performance should increasingly mirror natural language communication rather than repeatedly reverting to 'Sorry, I don't understand what you're saying.'

In 2011, at the iPhone 4S launch event, Siri made its grand entrance, marking the introduction of voice interaction as a system-level capability in smartphones. Siri's full name, 'Speech Interpretation & Recognition Interface,' underscores semantic recognition rather than command execution.

At the time, this question-and-answer mode represented a directional breakthrough in human-computer interaction. However, this trajectory did not continue to advance over the subsequent decade.

Voice assistants, exemplified by Siri, have continuously expanded their functional scope from voice control of phone calls and text messages to system capabilities such as voice subtitles, smart calling, screen recognition, and home automation linkage. Nevertheless, their core capability—understanding and executing tasks—has remained stagnant at the voice Q&A stage with minimal substantive progress.

This is precisely why the demonstration video of Doubao Mobile Assistant has reignited industry attention: it transcends being a mere conversational interface and evolves into an execution interface. The product's core transformation is to elevate the voice assistant from 'information return' to 'task completion,' directly translating user semantics into a comprehensive set of operational paths.

Its priority is not entertaining users with jokes and riddles but the tasks it can help them complete.

For instance, in a price comparison shopping scenario, the user simply needs to say, 'Help me compare the prices of this shampoo across all my shopping apps and place an order for the cheapest one.' The assistant will automatically search, compare prices, and apply coupons in apps like Taobao, JD, Pinduoduo, and Douyin Mall. After identifying the lowest price, it will pause at the payment page, awaiting user confirmation to prevent misoperations or unauthorized deductions.
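The price comparison flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Doubao's implementation: the `Offer` type, the prices, and the `awaiting_confirmation` status are all hypothetical names chosen for the example. It shows the two ideas in the scenario, picking the lowest price after coupons and stopping before payment until the user confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    app: str              # hypothetical label for a shopping app
    list_price: float     # price shown in the app, in yuan
    coupon: float = 0.0   # coupon discount in yuan

    @property
    def final_price(self) -> float:
        # Price after the coupon, never below zero.
        return max(self.list_price - self.coupon, 0.0)


def prepare_order(offers: list[Offer]) -> dict:
    """Pick the cheapest offer and stop at the payment step."""
    if not offers:
        raise ValueError("no offers to compare")
    best = min(offers, key=lambda o: o.final_price)
    # The order is only prepared; nothing is paid at this point.
    return {"app": best.app, "amount": best.final_price,
            "status": "awaiting_confirmation"}


def confirm(order: dict, user_approved: bool) -> dict:
    """Payment proceeds only if the user explicitly approves."""
    status = "paid" if user_approved else "cancelled"
    return {**order, "status": status}


offers = [
    Offer("Taobao", 59.0, coupon=5.0),
    Offer("JD", 55.0),
    Offer("Pinduoduo", 58.0, coupon=6.0),
]
order = prepare_order(offers)
print(order["app"], order["amount"], order["status"])
```

Keeping `prepare_order` and `confirm` as separate functions mirrors the pause at the payment page: no code path reaches `paid` without a user decision.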

Another example is image processing. When the user says, 'Remove the person from this photo' or 'Clean up the background,' the assistant can automatically identify the target area, invoke image editing tools, and complete the operation without requiring the user to navigate through the app step by step.

The demo also shows Doubao handling more intricate cross-scenario chains. Users can issue several instructions at once, such as 'Subscribe to updates from this podcast and add it to my playlist → Open the front trunk of my Tesla → Make a restaurant reservation for 8:30 PM tonight.' The assistant then executes the operations one after another in the corresponding apps, linking local apps with offline actions.

Counting from Siri's debut in 2011, it took fourteen years for mobile assistants to truly move from answering questions to completing tasks.

02 Native AI Interaction System

A mobile assistant capable of cross-app and multi-scenario execution isn't merely a stack of semantic understanding capabilities. Doubao Mobile Assistant's features rest on two capabilities built in parallel: the model's execution planning and system-level native access.

Firstly, at the model level, the Doubao model performs not only semantic understanding but also interface interpretation and operation planning. It can recognize the text, buttons, layouts, and step logic on the screen and generate stable operational paths. The end result is not the assistant telling the user what to do but the assistant itself completing a series of taps and inputs.

This distinguishes it from traditional voice assistants that remain confined to responding to instructions; it is essentially a set of GUI operation capabilities.
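A GUI operation capability of this kind is commonly structured as an observe, plan, act loop. The sketch below is a toy version under loud assumptions: `FakePhone` stands in for a real device, and `plan_next` is a rule-based placeholder where a real system would call a model. None of these names come from Doubao.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    kind: str       # "type" or "tap"
    target: str     # label of the on-screen element to act on
    text: str = ""  # text to enter, used only by "type"


class FakePhone:
    """Toy device: maps (screen, action kind, target) to the next screen."""

    def __init__(self) -> None:
        self.screen = "home"
        self.typed: dict[str, str] = {}
        self._transitions = {
            ("home", "type", "Search box"): "results",
            ("results", "tap", "Add to cart"): "cart",
        }

    def visible(self) -> list[str]:
        # Observation step: labels of the elements on the current screen.
        return [t for (s, _, t) in self._transitions if s == self.screen]

    def perform(self, action: Action) -> None:
        key = (self.screen, action.kind, action.target)
        if key not in self._transitions:
            raise ValueError(f"cannot {action.kind} {action.target!r} on {self.screen!r}")
        if action.kind == "type":
            self.typed[action.target] = action.text
        self.screen = self._transitions[key]


def plan_next(goal: str, visible: list[str]) -> Optional[Action]:
    """Placeholder planner: a real assistant would ask a model here."""
    if "Search box" in visible:
        return Action("type", "Search box", goal)
    if "Add to cart" in visible:
        return Action("tap", "Add to cart")
    return None  # nothing left to do


def run(goal: str, phone: FakePhone, max_steps: int = 10) -> list[Action]:
    history: list[Action] = []
    for _ in range(max_steps):  # hard cap so the loop always terminates
        action = plan_next(goal, phone.visible())
        if action is None:
            break
        phone.perform(action)
        history.append(action)
    return history


phone = FakePhone()
history = run("shampoo", phone)
print([a.kind for a in history], phone.screen)
```

The planner only ever sees what is currently on screen, which is what separates this loop from a fixed script: each action is chosen after a fresh observation.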

This capability builds on the Doubao model's performance in reasoning, visual understanding, image creation, video generation, voice, and other areas, which is described as world-class. Its graphical interface operation capability has reportedly achieved the best results in the industry in multiple authoritative evaluations, enabling it to operate a phone like a human and complete a variety of complex tasks.

Secondly, at the interaction level, Doubao Mobile Assistant isn't a standalone app but has obtained operating-system-level permissions through collaboration with phone manufacturers. This means the model no longer operates at the application level but can dispatch system resources and coordinate actions across apps.

The technical preview demonstration video shows that, after deep collaboration with phone manufacturers, the assistant integrates the Doubao large model into the native interaction system, so it can be invoked directly at any point during phone use.

From the demonstration video, it is apparent that the interaction method undergoes a significant transformation after combining these two capabilities:

Users no longer need to copy content or switch apps; they can ask a question directly from any interface, and the screen content is understood instantly. For example, asking about a photo, 'Where is this attraction?' or 'From what perspective was it taken?', prompts the assistant to answer directly. Voice calls, video calls, and screen sharing from the original Doubao ecosystem are integrated at the system level, so a double press of the AI key starts a real-time dialogue.

With these two capabilities combined, Doubao Mobile Assistant is no longer an add-on feature but a capability built into the system's foundation. This marks the first time that a large model no longer exists as a 'plugin' but is beginning to be embedded in system-level interactions, becoming a native AI node in the phone's operation chain.

03 The Complete Form of an AI-Native Mobile Phone

The demonstration suggests that the displayed capabilities can also be customized and personalized. Doubao Mobile Assistant offers an optional memory function. With user authorization, the assistant can plan execution paths that better match the user's habits, based on the personal preferences it has recorded.

At the same time, Doubao is also exploring a Pro mode for mobile phone operation. Compared to the basic mode, which relies on step-by-step clicking by the GUI Agent, the Pro mode can directly call system tools and plan operation schemes in conjunction with memory data. The key shift in this mode is from executing instructions one by one to grasping the user's true intent as a whole.

The example in the release video illustrates this difference:

When the user says, 'Help me recommend a few gifts for my daughter and add them to the shopping cart,' basic mode treats this as a complex task involving multiple search, filtering, and ordering actions. In Pro mode, however, if the child's age and interests are already recorded in memory, the assistant goes straight to matching products and adds them to the shopping cart without requiring the user to supply the conditions one by one.
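One way to picture how memory reduces follow-up questions is slot filling: the assistant asks only for details it does not already know. The sketch below is a simplified assumption, not the product's logic; the slot names `age` and `interests` are taken from the example above, while `REQUIRED_SLOTS` and `plan_gift_search` are hypothetical.

```python
REQUIRED_SLOTS = ("age", "interests")  # hypothetical details needed to pick a gift


def plan_gift_search(request: dict, memory: dict) -> tuple[dict, list[str]]:
    """Return the known details and the follow-up questions still needed."""
    known = {**memory, **request}  # values in the current request override memory
    filled = {slot: known[slot] for slot in REQUIRED_SLOTS if slot in known}
    questions = [slot for slot in REQUIRED_SLOTS if slot not in known]
    return filled, questions


# Without memory, the assistant has to ask for everything.
_, questions_without_memory = plan_gift_search({}, {})

# With memory, the same request needs no follow-up questions.
memory = {"age": 8, "interests": ["drawing"]}
filled, questions_with_memory = plan_gift_search({}, memory)

print(questions_without_memory, questions_with_memory)
```

Letting the current request override memory keeps the user in control: a detail stated in the moment always wins over a stored preference.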

Cross-app execution addresses the issue of 'what can be done,' while memory and Pro mode attempt to solve 'what should be done reasonably.'

It is important to note that although the demonstration is an authentic recording, Doubao still classifies it as a technical preview. The company cautions that the large model remains unpredictable at the current stage and cannot be guaranteed to reproduce these results reliably in all scenarios. In other words, while the direction is clear, there is still a long way to go before the product reaches its complete form.

However, even at this early stage, Doubao Mobile Assistant already demonstrates a relatively comprehensive system design in terms of 'controllability of execution rights.' At least three layers of controllable mechanisms are presented in the current demonstration:

Firstly, task status visualization. When the phone is executing tasks through the assistant, the screen will display dynamic light effects as prompts. Even if the user takes over the operation midway, the screen border will still display task prompts to avoid information asymmetry caused by background operation.

Secondly, the status bar capsule mechanism. When there is no foreground interface, all task progress is displayed in the form of capsules in the status bar. When critical nodes involving payment or authorization are reached, the system will also issue clear reminders.

Thirdly, the information supplementation mechanism. Users can enter the interaction interface at any time during task execution and add necessary information through supplementation entry points to ensure the accuracy and practicality of task results.

The core logic behind these mechanisms isn't to pursue 'maximum automation' but to achieve 'transparent, interruptible, and negotiable' automatic execution capabilities within clear boundaries.
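The 'transparent, interruptible, and negotiable' idea can be expressed as a small task object. This is a minimal sketch with hypothetical names (`Step`, `Task`, `status`, `add_info`); it assumes that sensitive steps such as payment are flagged in advance and does not describe how the actual product tracks tasks.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    sensitive: bool = False  # e.g. payment or authorization


@dataclass
class Task:
    steps: list[Step]
    done: int = 0
    paused: bool = False
    notes: list[str] = field(default_factory=list)

    def status(self) -> str:
        """Transparent: text a status-bar capsule might display."""
        if self.done == len(self.steps):
            state = "finished"
        elif self.paused:
            state = "waiting for confirmation"
        else:
            state = "running"
        return f"{self.done}/{len(self.steps)} steps, {state}"

    def advance(self, confirmed: bool = False) -> None:
        """Interruptible: a sensitive step pauses until the user confirms."""
        if self.done == len(self.steps):
            return
        if self.steps[self.done].sensitive and not confirmed:
            self.paused = True
            return
        self.paused = False
        self.done += 1

    def add_info(self, note: str) -> None:
        """Negotiable: the user can supplement information mid-task."""
        self.notes.append(note)


task = Task([Step("search"), Step("pay", sensitive=True)])
task.advance()                # runs "search"
task.advance()                # reaches "pay" and pauses
paused_status = task.status()
task.add_info("use the work address")
task.advance(confirmed=True)  # user approves the payment step
print(paused_status, "|", task.status())
```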

It is noteworthy that the emergence of Doubao Mobile Assistant isn't an isolated phenomenon but aligns with trends emerging in the global smartphone industry.

Google announced last month that it would discontinue Assistant in 2026 and move the system-level entry point to Gemini. The system-level agent that Gemini is developing with Samsung is also being tested for deployment. An industry consensus is gradually forming: deep collaboration between phone manufacturers and large model developers, rather than manufacturers building voice assistants alone, is the main path for bringing AI to the next generation of phones.

Transitioning from an assistant entry point to a system capability, the exploration of AI operating systems in the mobile phone industry may be entering a clearer stage.
