08/28 2024
Every technological era has its own "entry point" and "driving force."
In the PC era, browsers and search engines were the primary entry points, with users interacting through keyboards and mice. In the mobile internet era, apps and app stores became typical entry points, with users accessing the internet world through fingers and touch screens. In the current AI era, the industry has embraced voice interaction as a crucial entry point, offering a richer, more natural, and convenient interaction experience.
Throughout history, companies that seize these entry points and drive the development of the times can gain the initiative in competition and achieve long-term growth momentum. Examples include Google in the PC era and Apple in the mobile internet era.
Many major companies are therefore investing heavily in voice interaction, aiming to secure a strategic position in the AI era. The two most notable are OpenAI internationally and iFLYTEK in China.
In May this year, OpenAI released GPT-4o, which demonstrated stronger voice interaction capabilities, including faster response times and more natural speech. Unfortunately, GPT-4o is not yet available to users in China, so most people there cannot try it firsthand.
What many people don't know is that China's iFLYTEK has not only achieved a voice interaction experience comparable to GPT-4o's but also lets people try it firsthand.
On August 19th this year, iFLYTEK launched its Spark Superhuman Interaction Technology, achieving significant breakthroughs in response and interruption speed, emotional perception and resonance, controllable voice expression, and character role-playing. This technology will be open to all users on the Spark App by the end of August, allowing ordinary users to experience it personally.
At iFLYTEK's 2024 mid-year performance briefing, Yidian Finance observed iFLYTEK's Secretary to the Board Jiang Tao personally demonstrating the Spark Superhuman Interaction Technology, providing a more intuitive view of its operational experience.
iFLYTEK invests heavily in research and development but comparatively little in marketing, so the significance of this technology is easy to overlook. In fact, it could profoundly affect the industry's transformation. Meanwhile, iFLYTEK is building up technical potential that is expected to translate into strong development momentum in the future.
01
The "Ideal" and "Reality" of Voice Interaction
In 2014, the film "Her," which tells the story of a human-AI romance, gained immense popularity and won the Academy Award for Best Original Screenplay.
In the movie, the protagonist, Theodore, writes love letters for people who struggle to express their emotions. He uses a voice-controlled device to dictate and print letters, play songs, and check email and news. The film sparks the imagination when Theodore meets Samantha, an AI operating system with a warm voice who understands and cares for him deeply. Through long-term voice interactions, Theodore falls in love with Samantha, and a human-AI relationship begins.
Over the past decade, the sci-fi scenarios in this movie have gradually become a reality, with various voice interaction products and technologies continuously evolving to enrich users' experiences.
However, many users still feel a gap between their imagined and actual experiences due to common pain points in voice interaction technologies, including slow response times, difficulty in empathy, lack of personalization, and challenges in endpoint detection.
In short, many current voice interaction technologies still feel distinctly machine-like, lacking human-like qualities and emotional value. This leads to poor user experiences and holds back the industry, so companies need to address these pain points to drive progress.
Currently, iFLYTEK is an essential driving force. Its Spark Superhuman Interaction Technology significantly enhances user experiences in four areas: response and interruption speed, emotional perception and resonance, controllable voice expression, and character role-playing. In summary, it offers not only speed in response but also warmth in emotion, providing additional emotional value.
1. Speed in Response
During voice interactions, users expect fast responses that feel available on demand, and they expect the system to recover quickly after frequent interruptions. However, current mainstream voice interaction applications typically take 2 to 2.5 seconds to respond and even longer after an interruption, causing noticeable delays that disrupt the rhythm of the conversation and make the system feel less intelligent.
The first impression the Spark Superhuman Interaction Technology gives is speed: it reduces response time to 0.9 seconds, a delay that is barely noticeable. Users can also interrupt or interject at any time, and the technology still responds promptly, which makes the conversation feel more realistic.
2. Warmth in Emotion
In voice interactions, even timely responses lose their appeal when delivered in a cold, robotic tone, because no one wants to engage with an emotionless machine. Traditional command-based voice technologies rely on recognizing specific speech sounds and lack emotional perception. In contrast, the Spark Superhuman Interaction Technology adds emotional intelligence: it judges users' emotions (happiness, sadness, anger, fear, and so on) and recognizes nonverbal signals such as coughs and pet noises, fostering deeper emotional connections.
While recognizing emotions, the technology also responds with emotion. Its flexible expression can control dozens of emotions, styles, and dialects and automatically adjust speech rate, tone, and mood, making conversations more heartfelt. Furthermore, it allows users to switch between various characters such as Sun Wukong, Crayon Shin-chan, and Peppa Pig, adding the fun of interacting with different personas.
In essence, traditional voice interaction technologies feel machine-like, while the Spark Superhuman Interaction Technology feels more human. This significant improvement in user experience is the product of iFLYTEK's long-term technological development and accumulation.
02
The Evolution of Voice Interaction: Technology is Key
The evolution of voice interaction mirrors the history of technological iterations.
The earliest voice interaction technologies date back to the 1960s, based on rules that analyzed and generated speech according to grammatical rules, responding with predefined sentences. This method had low intelligence and flexibility, limited to specific tasks like weather inquiries and ticket bookings, struggling with more complex commands.
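To make the rule-based approach concrete, here is a minimal sketch in Python. The rule table, the canned replies, and the function name `rule_based_reply` are all invented for illustration and do not describe any historical system; the point is only that each rule maps a fixed pattern to a predefined sentence, so anything outside the rules falls through.

```python
import re

# Hypothetical rule table: each pattern maps to one predefined reply.
RULES = [
    (re.compile(r"\bweather\b", re.IGNORECASE), "Today will be sunny."),
    (re.compile(r"\b(book|ticket)\b", re.IGNORECASE),
     "Which destination would you like to book?"),
]

FALLBACK = "Sorry, I did not understand that."


def rule_based_reply(utterance: str) -> str:
    """Return the canned reply of the first rule whose pattern matches."""
    for pattern, reply in RULES:
        if pattern.search(utterance):
            return reply
    # No rule matched: the system has no way to handle the request.
    return FALLBACK


if __name__ == "__main__":
    print(rule_based_reply("What's the weather like?"))
    print(rule_based_reply("Remind me to call mom after the meeting"))
```

The second request falls through to the fallback, which is the inflexibility the paragraph above describes: every new kind of command needs a new hand-written rule.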
In the 1990s, voice interaction technologies entered a new phase based on statistical models. These models relied on probability theory to generate appropriate responses based on context, handling more instructions and adapting to various scenarios.
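As a rough illustration of the statistical idea, the sketch below uses a tiny naive Bayes classifier with add-one smoothing to pick the most probable intent for an utterance. This is a generic textbook technique chosen for brevity, not the specific models used in that period, and the training sentences and intent names are made up.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration.
TRAINING = [
    ("what is the weather today", "weather"),
    ("will it rain tomorrow", "weather"),
    ("book a train ticket", "booking"),
    ("i want to book a flight", "booking"),
]


def train(samples):
    """Count words per intent and collect the vocabulary."""
    word_counts = defaultdict(Counter)
    intent_counts = Counter()
    vocab = set()
    for text, intent in samples:
        intent_counts[intent] += 1
        for word in text.split():
            word_counts[intent][word] += 1
            vocab.add(word)
    return word_counts, intent_counts, vocab


def classify(text, model):
    """Return the intent with the highest log-probability for the text."""
    word_counts, intent_counts, vocab = model
    total = sum(intent_counts.values())
    best_intent, best_score = None, -math.inf
    for intent, count in intent_counts.items():
        score = math.log(count / total)  # prior probability of the intent
        denom = sum(word_counts[intent].values()) + len(vocab)
        for word in text.split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[intent][word] + 1) / denom)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


if __name__ == "__main__":
    model = train(TRAINING)
    print(classify("is it going to rain", model))
    print(classify("book a ticket to paris", model))
```

Unlike a fixed rule table, the classifier can score sentences it has never seen, which is why statistical methods handled more instructions and adapted to more scenarios.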
In 2006, with the rise of deep learning, voice interaction technologies underwent a qualitative leap: models could now learn complex speech signal features automatically, ushering in the DNN (Deep Neural Network) era. For example, an RNN (Recurrent Neural Network) processes sequential data while carrying a memory of earlier inputs, which improves recognition accuracy on continuous speech.
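The recurrence behind an RNN can be sketched in a few lines. The example below is a deliberately simplified scalar version with arbitrary, untrained weights (`w_x`, `w_h`, and `bias` are illustrative values, not taken from any real model); it only shows how each hidden state depends on both the current input and the previous state.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a scalar Elman-style recurrence and return every hidden state."""
    hidden = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state,
        # so information from earlier time steps is carried forward.
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
        states.append(hidden)
    return states


if __name__ == "__main__":
    # A single impulse followed by silence: later states stay non-zero
    # because the hidden state remembers the first input.
    for step, state in enumerate(rnn_forward([1.0, 0.0, 0.0])):
        print(step, round(state, 4))
```

In this toy setting the memory of the first input fades at each step, which is one reason practical speech systems moved to gated variants and larger architectures.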
Since then, voice interaction technologies have continued to evolve, with CNN (Convolutional Neural Network) and DFCNN (Deep Fully Convolutional Neural Network) enhancing user experiences. iFLYTEK has been at the forefront of this technological evolution.
In 2012, iFLYTEK introduced BN-feature and DNN-HMM deep learning solutions on its input method and open voice platform, becoming the first Chinese company to commercialize deep learning speech recognition and boosting recognition accuracy from 60% to approximately 88% in real-world scenarios.
Today, the Spark Superhuman Interaction Technology employs a unified neural network for end-to-end speech-to-speech modeling, an approach that has proven powerful. Traditional speech recognition systems comprise multiple modules, such as acoustic models, language models, and pronunciation dictionaries. End-to-end modeling integrates these modules into a single network; in speech recognition, for example, it maps raw speech signals directly to final text. This improves voice interaction in several ways:
First, it simplifies traditional speech recognition systems, reducing the difficulty of integrating separate modules. Second, it handles noise and variation in speech signals better, making the system more robust against external interference. Third, it offers faster training and inference, which suits real-time scenarios.
Behind the technical advantages of Spark Superhuman Interaction lie iFLYTEK's sustained R&D investment and accumulated expertise.
According to iFLYTEK's 2024 mid-year financial report, revenue reached 9.325 billion yuan, an 18.91% year-on-year increase. Notably, R&D investment amounted to 2.19 billion yuan, up 32.23% year-on-year and accounting for 23.5% of revenue.
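The 23.5% figure can be checked directly from the numbers quoted above. The short calculation below also backs out the implied prior-year values from the stated growth rates; those derived figures are simple arithmetic on the article's numbers, not values taken from the report itself, and the function name is only illustrative.

```python
def rd_intensity(revenue, rd_spend, revenue_growth, rd_growth):
    """Return current and implied prior-year R&D share of revenue."""
    share = rd_spend / revenue
    # Back out last year's figures from the year-on-year growth rates.
    prior_revenue = revenue / (1 + revenue_growth)
    prior_rd = rd_spend / (1 + rd_growth)
    prior_share = prior_rd / prior_revenue
    return share, prior_share


if __name__ == "__main__":
    # Figures in billion yuan, as quoted in the article.
    share, prior_share = rd_intensity(9.325, 2.19, 0.1891, 0.3223)
    print(f"R&D share of revenue this period: {share:.1%}")
    print(f"Implied share a year earlier:     {prior_share:.1%}")
```

Because R&D spending grew faster than revenue, the implied share a year earlier comes out lower than the current one, which is consistent with the article's point that iFLYTEK keeps raising its R&D investment.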
Among corporate development factors, marketing and promotion are "fast variables" with short-term effects but instability and low barriers to entry. In contrast, technology and R&D are "slow variables" requiring substantial upfront investments, akin to pushing a stationary wheel that gains momentum with continued effort, triggering a "flywheel effect" and transforming into outstanding technologies, products, and competitive advantages. iFLYTEK's launch of Spark Superhuman Interaction Technology exemplifies this; sometimes, "slow" means "fast."
In fact, the large model capabilities behind Spark Superhuman Interaction Technology hold immense potential.
03
Looking Ahead: Large Models Reshaping the Voice Industry
Today, "large model+" evokes the same excitement as "internet+" did in its time.
Under the wave of large models, many industries are ripe for reinvention, including automotive, robotics, consumer electronics, and home appliances. iFLYTEK offers end-to-end, hardware-software integrated deployment of the Spark large model across cloud, edge, and device, meeting diverse large model requirements in complex scenarios and positioning it to benefit across these industries.
Let's start with the automotive sector. In the first half of this year, China produced 13.891 million vehicles and sold 14.047 million, maintaining the global lead. Exports totaled 3.48 million vehicles, up 25% year-on-year. Eight of China's top 10 automotive exporters collaborate with iFLYTEK. The future of the automobile lies in intelligence, and large models will enhance user experiences in smart cockpits and autonomous driving, fueling the growth of China's automotive industry.
Since 2011, iFLYTEK has pioneered localized in-vehicle voice technology, which has become standard in the Chinese automotive market, though overseas use has been limited by single-language support. The Spark Voice Large Model offers seamless switching among 72 languages and dialects and significantly enhances smart cockpit experiences through cloud-edge-device and hardware-software integrated solutions. iFLYTEK has also developed a car assistant based on the Spark Large Model that monitors vehicle conditions in real time and answers user queries precisely.
iFLYTEK collaborates with over 90% of China's mainstream domestic and joint venture automakers on smart automotive products. Its automotive business continued strong growth, with first-half revenue reaching 350 million yuan, up 65.49% year-on-year.
Like the automotive sector, robotics represents a significant future trend.
Especially promising is the humanoid robot market. According to the Humanoid Robot Industry Research Report, China's humanoid robot market will reach approximately 2.76 billion yuan in 2024 and 75 billion yuan by 2029, accounting for 32.7% of the global total, ranking first worldwide.
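To get a feel for what these projections imply, the sketch below computes the compound annual growth rate between the two quoted market sizes and the global market size implied by the 32.7% share. It assumes the share refers to the 2029 figure, which the text does not state explicitly, and the function names are illustrative.

```python
def compound_annual_growth(start_value, end_value, years):
    """Return the constant yearly growth rate linking two values."""
    return (end_value / start_value) ** (1 / years) - 1


def implied_total(part, share):
    """Return the total implied by a part and its share of that total."""
    return part / share


if __name__ == "__main__":
    # Figures in billion yuan, as quoted in the article.
    cagr = compound_annual_growth(2.76, 75.0, 2029 - 2024)
    # Assumption: the 32.7% share applies to the 2029 projection.
    global_2029 = implied_total(75.0, 0.327)
    print(f"Implied annual growth rate: {cagr:.1%}")
    print(f"Implied global market in 2029: {global_2029:.1f} billion yuan")
```

The projection amounts to the market nearly doubling every year for five years, which shows how aggressive the forecast is.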
Large models' chain-of-thought reasoning capabilities significantly improve robots' understanding of complex tasks, providing commonsense-based task decomposition and planning. The combination of embodied perception and decision models further enhances humanoid robots' multimodal perception and understanding in real-world scenarios.
In complex task decomposition, open-scene object recognition, multimodal perception, and understanding, the Spark Large Model notably elevates humanoid robots' intelligence. At the 2024 World Robot Conference, iFLYTEK showcased the latest progress in its "large model + embodied AI" humanoid robots, doubling overall motion performance, achieving over 95% success in complex task decomposition, and enhancing interaction and motion capabilities.
In terms of industrial empowerment, iFLYTEK's AI Superbrain platform supports 420 robot companies and works closely with 15,000 robot developers. It collaborates widely with humanoid robot enterprises such as UBTECH, Unitree Robotics, Zhiyuan Robotics, and Yinxingtongyong, pointing to substantial room for growth and robust industrial momentum.
Beyond robotics, the large model wave is spreading to consumer electronics and home appliances.
The consumer electronics market, encompassing smartphones, smart notepads, and e-readers, is vast. The 2024 Digital Economy Report projects 39 billion IoT devices globally by 2029. Making each device smarter with large models can enhance user experiences and create larger market spaces. The China Chamber of Commerce for Import and Export of Machinery and Electronic Products predicts China's smart hardware market will reach 1.4031 trillion yuan in 2023 and 1.5033 trillion yuan in 2024.
Take iFLYTEK's Smart Notepad X3 as an example. Equipped with the latest Spark AI technology, it offers efficient office functions such as voice-to-text conversion, smart note organization, and multilingual translation. Users can instantly convert speech from meetings and talks into text records, significantly boosting productivity and fueling the growth of iFLYTEK's smart hardware business. According to its 2024 mid-year report, iFLYTEK's smart hardware revenue reached 900 million yuan, up 56.61% year-on-year, far outpacing the industry average.
As a new wave of trade-ins emerges, the home appliance market is seeing fresh growth. Integrating home appliances with large models can create smarter home lives and new growth opportunities for manufacturers and technology providers.
For example, a TV voice assistant equipped with the iFLYTEK Spark cognitive large model can be upgraded into an all-round family hub that easily handles tasks such as schedule management and smart home control. Children can also talk directly with the TV voice assistant to practice spoken English and acquire knowledge, creating new educational scenarios. By integrating the capabilities of the iFLYTEK Spark cognitive large model, Samsung has given its TV voice assistant deep understanding, content generation, and knowledge Q&A abilities, significantly enhancing the user experience.
Today, the Spark large model is becoming the preferred choice for deployment in key sectors such as education, healthcare, energy, automotive, home appliances, and robotics. It not only explores more possibilities for the entry points of the AI era but also applies these technologies to real-world scenarios, bringing tangible benefits to users and economic benefits to enterprises, while also fueling iFLYTEK's own development momentum.
04
Conclusion
The concept of "strategic potential energy" is introduced in the book "Basic Logic," which likens it to lifting a rock to the top of a mountain to store potential energy. When the rock rolls down, its potential energy is converted into kinetic energy.
Currently, iFLYTEK is at the stage of accumulating strategic potential energy, which requires overcoming difficulties and investing significantly. As its technologies are further developed and deployed, that potential energy will continue to be converted into development momentum, making the company's future one to watch.