December 25, 2025
Alibaba Tongyi has just launched Fun-Audio-Chat, a large audio language model built for natural, low-latency voice interaction.
The model uses a dual-resolution speech representation that substantially reduces computation without compromising speech quality, and a Core-Cocktail training approach that preserves the capabilities of the underlying text language model.
Fun-Audio-Chat excels on benchmark tests, showing strong performance in spoken question answering, audio understanding, speech function calling, voice instruction following, and emotional voice response.
Key highlights of Fun-Audio-Chat include:
Dual-resolution speech representation: Uses an efficient 5 Hz frame rate, cutting GPU runtime by nearly half while preserving high speech fidelity (see the back-of-the-envelope sketch after this list).
Outstanding performance: Among models of comparable size (around 8 billion parameters), it tops the charts in major benchmark evaluations.
Comprehensive features: Supports a wide range of functions, including speech Q&A, audio understanding, speech function calling, voice command execution, and emotional voice response.
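To see why a lower token rate translates into compute savings, here is a quick back-of-the-envelope sketch. The 12.5 Hz baseline rate and the one-minute utterance are assumptions chosen for illustration; the article itself only states the 5 Hz figure and the roughly halved GPU runtime.

```python
# Back-of-the-envelope: token count scales linearly with the speech frame rate,
# so a 5 Hz representation feeds far shorter sequences into the LLM.
# The 12.5 Hz baseline below is an assumed comparison point, not from the article.

def speech_tokens(duration_s: float, frame_rate_hz: float) -> int:
    """Number of speech tokens the LLM processes for one utterance."""
    return round(duration_s * frame_rate_hz)

duration = 60.0                              # one minute of audio
baseline = speech_tokens(duration, 12.5)     # hypothetical higher-rate tokenizer -> 750
low_rate = speech_tokens(duration, 5.0)      # Fun-Audio-Chat's 5 Hz tokens       -> 300

print(f"{baseline} tokens at 12.5 Hz vs {low_rate} tokens at 5 Hz "
      f"({baseline / low_rate:.1f}x shorter sequences)")
# Shorter sequences mean less attention and decoding work per second of audio,
# which is where the reported near-halving of end-to-end GPU runtime comes from.
```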
The Fun-Audio-Chat architecture comprises three core modules:
Audio input processing: The speech encoder and tokenizer transform raw audio signals into structured representations for both user and assistant.
Multimodal large language model (MLLM): Combines a shared LLM backbone with specialized text and speech refinement heads for token generation.
Speech reconstruction: The detokenizer reconstructs audio waveforms from the generated speech tokens.
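The data flow between these three modules can be summarized as a small skeleton. This is an illustrative sketch only: every class name, method signature, and the 3,200-samples-per-token ratio (5 Hz tokens over 16 kHz audio) are placeholders, not the actual Fun-Audio-Chat implementation or API.

```python
# Illustrative skeleton of the three-module pipeline described above. All names
# here are invented for this sketch; tensor shapes are faked with plain lists.
from dataclasses import dataclass

@dataclass
class SpeechTokens:
    ids: list[int]  # discrete speech tokens at a low frame rate

class AudioFrontEnd:
    """Module 1: speech encoder + tokenizer turning raw audio into representations."""
    def encode(self, waveform: list[float]) -> SpeechTokens:
        # 3,200 samples per token ~= 5 Hz tokens over 16 kHz audio (assumed rates).
        return SpeechTokens(ids=[0] * (len(waveform) // 3200))

class MultimodalLLM:
    """Module 2: shared LLM backbone with text and speech refinement heads."""
    def generate(self, speech: SpeechTokens) -> tuple[str, SpeechTokens]:
        text_out = "placeholder assistant reply"   # from the text head
        speech_out = SpeechTokens(ids=[1] * 10)    # from the speech refinement head
        return text_out, speech_out

class Detokenizer:
    """Module 3: reconstructs an audio waveform from generated speech tokens."""
    def synthesize(self, tokens: SpeechTokens) -> list[float]:
        return [0.0] * (len(tokens.ids) * 3200)

# End-to-end flow: user audio in, assistant text and audio out.
frontend, llm, vocoder = AudioFrontEnd(), MultimodalLLM(), Detokenizer()
user_audio = [0.0] * 16000                         # one second of dummy 16 kHz audio
text, speech = llm.generate(frontend.encode(user_audio))
assistant_audio = vocoder.synthesize(speech)
print(text, f"({len(assistant_audio)} output samples)")
```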
Fun-Audio-Chat builds on existing pre-trained models and goes through a multi-stage post-training regimen on millions of hours of speech data spanning many domains and tasks, including dialogue, multilingual speech, and audio for comprehension tasks, which gives it broad applicability across scenarios.
The training regimen encompasses:
Pre-alignment: Utilizes large-scale speech-text paired data to align the speech encoder, adapter, and speech refinement head.
Core-Cocktail (hybrid) training: Uses high-quality speech data synthesized from billions of text tokens for comprehensive supervised fine-tuning.
Multi-task DPO training: Leverages diverse real-world speech data to enhance robustness, audio-understanding and ASR data for stronger comprehension, instruction-following data (including emotional, stylistic, and prosodic control) to refine voice command execution, and voice empathy data to boost emotional understanding and empathetic response generation.
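For reference, the sketch below shows the standard pairwise DPO objective that a stage like this typically builds on. The multi-task specifics of Fun-Audio-Chat's DPO training (how the speech, ASR, instruction-following, and empathy data are mixed and weighted) are not spelled out here, so read this as the generic loss, not the model's exact recipe.

```python
# Standard pairwise DPO loss, shown for reference. This is the generic objective;
# the exact multi-task variant used in Fun-Audio-Chat's post-training is not detailed.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization on (chosen, rejected) response pairs.

    Each argument is the summed log-probability of a response under the policy
    or the frozen reference model; beta controls deviation from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of 4 preference pairs with random log-probabilities.
torch.manual_seed(0)
pc, pr, rc, rr = (torch.randn(4) for _ in range(4))
print(f"toy DPO loss: {dpo_loss(pc, pr, rc, rr).item():.4f}")
```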
The research team conducted a thorough evaluation of Fun-Audio-Chat's capabilities on widely recognized benchmark datasets, covering speech-to-text, speech-to-speech, audio comprehension, and speech recognition.
In terms of accuracy, among models with approximately 8 billion parameters, Fun-Audio-Chat-8B achieved the highest overall scores on OpenAudioBench (76.61%) and VoiceBench (83.21%).
For speech quality assessment, Fun-Audio-Chat-8B scored 4.37 on UTMOS, an automatic mean-opinion-score predictor, indicating excellent speech quality.
Fun-Audio-Chat excelled in comprehensive audio understanding benchmarks, including MMAU, MMAU-Pro, and MMSU, outperforming models such as Kimi-Audio.
Fun-Audio-Chat-30B-A3B and Fun-Audio-Chat-8B demonstrated strong competitiveness in voice command execution tasks, covering dimensions such as acoustic attributes, instruction adherence, role-playing, and empathy.
In speech recognition, Fun-Audio-Chat-8B performed exceptionally well in both English and Chinese, significantly outperforming open-source models such as Baichuan-Audio and Kimi-Audio while remaining competitive with commercial systems.
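Speech-recognition results of this kind are conventionally reported as word error rate (WER) for English and character error rate (CER) for Chinese. The snippet below is a generic reference implementation of those metrics, not code from the Fun-Audio-Chat evaluation, and the example sentence is made up.

```python
# Generic WER/CER reference implementation based on Levenshtein edit distance.

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions between sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (free if equal)
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference: str, hypothesis: str) -> float:
    ref_chars = list(reference.replace(" ", ""))
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / max(len(ref_chars), 1)

# One deletion ("the") and one substitution ("lights" -> "light") over 6 words: WER = 2/6.
print(wer("turn on the living room lights", "turn on living room light"))
```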
Despite its impressive performance across multiple benchmarks, Fun-Audio-Chat still faces some challenges that require further refinement. Firstly, in multi-turn dialogues involving complex questions, the model occasionally experiences context memory loss, struggling to retain information from previous exchanges consistently. This limitation is particularly pronounced in scenarios demanding long-context understanding and complex reasoning.
Secondly, the model's voice command execution capability exhibits some variability in expressiveness. While generally effective, there are instances where the generated speech may not fully capture the nuanced emotions, speaking styles, or prosodic variations specified in the instructions, affecting the naturalness and appropriateness of voice responses.
Thirdly, voice empathy remains inconsistent: the model's ability to recognize emotions and respond with appropriate empathy varies across scenarios and emotional contexts, limiting the reliability of empathetic responses in real-world applications where emotional understanding matters most.
Researchers noted that these limitations point to crucial areas for future research, including enhancing long-term context management in multi-turn dialogues, improving the stability and expressiveness of voice command execution, and developing stronger, more consistent voice empathy capabilities across diverse emotional scenarios.
References:
https://arxiv.org/pdf/2512.20156
https://huggingface.co/FunAudioLLM/Fun-Audio-Chat-8B