Interpretation: The Future of AI-Generated Content
Key Highlights
Klear Framework: Introduces a unified audio-video generation framework capable of handling both joint generation and unimodal generation tasks simultaneously.
Single-Tower Architecture: Utilizes a unified DiT (Diffusion Transformer) module and Omni-Full Attention mechanism to achieve tight alignment between audio and video.
Progressive Multi-Task Training: Introduces a training strategy ranging from random modality masking to joint optimization, along with multi-stage curriculum learning, enhancing model robustness and understanding of the physical world.
Large-Scale Densely Annotated Dataset: Constructs the first large-scale audio-video dataset with dense captions and introduces an automated data construction pipeline.
Figure 1. Klear, a unified audio-video generation framework, provides high fidelity, strong semantic and temporal alignment, and reliable instruction following in both joint and unimodal settings, with robust out-of-distribution (OOD) generalization. Across tasks (T2AV/TI2AV/TI2V/T2V/T2A), it stands out among open-source models, with performance comparable to Veo 3.
Problems Addressed
Audiovisual Asynchrony: Existing non-commercial models often exhibit asynchrony between sound and visuals (e.g., lip movements).
Unimodal Degradation: Joint generation often sacrifices the quality of individual modalities (video-only or audio-only).
Data Scarcity: Lack of high-quality, strictly aligned audio-video paired data with detailed descriptions.
Weak Instruction Following: Existing models lack flexibility in handling complex instructions.
Proposed Solutions
Architectural Design: Abandons traditional cascade or dual-tower designs, adopting a fully unified single-tower Transformer structure that enables interaction between audio and video tokens at all levels.
Data Engineering: Develops an automated pipeline including video/audio quality filtering, scene segmentation, human/non-human voice classification, and multi-model collaborative dense annotation (using tools like Whisper, SenseVoice, Qwen2.5-Omni).
Applied Technologies
Flow Matching: Used as a denoising objective for training generative models.
Omni-Full Attention: A full attention mechanism allowing complete visibility between audio and video tokens in the sequence, promoting deep fusion.
3D VAE & Audio VAE: Compresses video with a 3D variational autoencoder into 3 Hz latents and audio with an Audio-VAE into 43 Hz latents.
Multimodal RoPE: Multimodal rotary position encoding for handling positional information across different modalities.
Achieved Results
SOTA-Level Performance: Significantly outperforms existing methods (e.g., Universe-1, Ovi) on multiple tasks such as T2AV (text-to-audio-video) and TI2AV (image-to-audio-video).
Comparable to Commercial Models: Among open-source models, its performance rivals closed-source commercial models like Veo 3.
High-Quality Alignment: Achieves high-fidelity lip-sync and sound effects generation matching actions (e.g., musical instrument playing, singing).
Preliminaries
Problem Definition: The goal of this work is to achieve audio and video generation through a single model given various prior conditions. The denoising network is denoted as $v_\theta$ and the text condition as $c$. Let $z_t^a$ and $z_t^v$ represent the audio and video latent variables at time step $t$, respectively, where $t = T$ denotes the final time step of pure Gaussian noise. During inference, $v_\theta$ performs recursive denoising from $t = T$ to $t = 0$ to produce the final generated result $(z_0^a, z_0^v)$, as shown below:

$$(z_{t-\Delta t}^a,\, z_{t-\Delta t}^v) = (z_t^a,\, z_t^v) - \Delta t\, v_\theta(z_t^a, z_t^v, t, c), \qquad t: T \rightarrow 0.$$
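To make the recursive denoising concrete, below is a minimal sketch of Euler integration of a learned joint velocity field from pure noise to data. The function name `velocity_model`, the use of a normalized timestep in [0, 1], the tensor shapes, and the step count are illustrative assumptions, not the paper's implementation.

```python
import torch

@torch.no_grad()
def sample_joint_latents(velocity_model, text_cond, audio_shape, video_shape,
                         num_steps: int = 50, device: str = "cuda"):
    """Euler integration of the velocity field from pure noise (t=1) down to data (t=0)."""
    z_a = torch.randn(audio_shape, device=device)  # audio latents start as Gaussian noise
    z_v = torch.randn(video_shape, device=device)  # video latents start as Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((audio_shape[0],), 1.0 - i * dt, device=device)
        v_a, v_v = velocity_model(z_a, z_v, t, text_cond)  # predicted velocity per modality
        z_a = z_a - dt * v_a                               # move both latents a small step
        z_v = z_v - dt * v_v                               # toward the data distribution
    return z_a, z_v  # decode with the Audio-VAE / Video-VAE afterwards
```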
Conditional Flow Matching: This work adopts Flow Matching as the denoising objective. The model learns a velocity field that transports pure Gaussian noise $\epsilon$ toward the latent data distribution $z_0$. In practice, linear interpolation is used to construct the noisy latent $z_t$ at time step $t$. Given condition $c$, the model is trained to predict the target velocity $u_t$:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\, z_0,\, \epsilon}\left[\left\| v_\theta(z_t, t, c) - u_t \right\|_2^2\right],$$

where $z_t = (1 - t)\, z_0 + t\, \epsilon$, $u_t = \epsilon - z_0$, and $t \in [0, 1]$.
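In code, this objective reduces to a few lines. The sketch below assumes the velocity model takes both noisy latents plus the timestep and text condition; the function name and tensor layouts are illustrative.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, z0_audio, z0_video, text_cond):
    """Conditional flow-matching objective (a sketch; shapes and interfaces are assumed)."""
    batch = z0_audio.shape[0]
    t = torch.rand(batch, device=z0_audio.device)            # t ~ U[0, 1]
    t_a = t.view(-1, *([1] * (z0_audio.dim() - 1)))           # broadcast over audio dims
    t_v = t.view(-1, *([1] * (z0_video.dim() - 1)))           # broadcast over video dims

    eps_a, eps_v = torch.randn_like(z0_audio), torch.randn_like(z0_video)
    zt_a = (1 - t_a) * z0_audio + t_a * eps_a                 # linearly interpolated audio latent
    zt_v = (1 - t_v) * z0_video + t_v * eps_v                 # linearly interpolated video latent

    pred_a, pred_v = velocity_model(zt_a, zt_v, t, text_cond) # predicted velocities
    target_a, target_v = eps_a - z0_audio, eps_v - z0_video   # target velocities u_t
    return F.mse_loss(pred_a, target_a) + F.mse_loss(pred_v, target_v)
```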
Latent Encoding: The model accepts four inputs: video, video-related text, audio-related text, and audio. Video-related text represents video captions, while audio-related text includes audio captions and speech transcripts. The video is encoded using a 3D causal visual encoder from CogVideoX. We use Qwen3-8B Embedding as the encoder for audio and video descriptions.
Single-Tower Architecture with Full Attention
Figure 2. Overview of Klear. The model accepts four inputs: video, video-related text, audio-related text, and audio. Each input is encoded separately by its respective encoder and then fed into MM-DiT. The MM-DiT module outputs latent variables for video and audio, which are then decoded into video and audio, respectively.
Single-Tower DiT: To ensure thorough audio-video fusion, we adopt a single-tower architecture. As shown in Figure 2, following the design of Stable Diffusion 3, we employ a multimodal diffusion Transformer (MM-DiT) that takes sequences from all modalities as input and performs full attention. Specifically, there are four inputs: video, video-related text, audio-related text, and audio. Each input type is encoded separately by its respective encoder into latent variables, which are then input into MM-DiT. The MM-DiT module outputs latent variables for video and audio in two streams, which are subsequently decoded to complete video and audio generation.
Mixed-Dimension Rotary Position Embedding (MixD-RoPE): Another key architectural innovation is Mixed-Dimension Rotary Position Embedding (MixD-RoPE). As shown in Figure 2(d), to enhance positional information in videos introduced by various aspect ratios and durations, we apply 3D RoPE encoding to video embeddings across three dimensions (time, width, and height). This 3D RoPE combines absolute and relative positional dependencies in the video. For the audio modality, we adopt a compatible temporal 1D position encoding, with its position IDs initialized by adding one to the maximum temporal position ID of the video modality. Thus, we construct a MixD-RoPE sharing temporal position IDs between video and audio modalities.
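A small sketch of how the position IDs described above could be laid out before the rotary embedding is applied: video tokens receive (time, height, width) IDs for 3D RoPE, and audio tokens receive 1D temporal IDs starting one past the video's maximum temporal ID. The function name and shapes are assumptions for illustration.

```python
import torch

def build_mixd_rope_ids(video_t: int, video_h: int, video_w: int, audio_len: int):
    """Return (t, h, w) position IDs for video tokens and 1D temporal IDs for audio tokens."""
    # 3D grid of position IDs for video tokens, flattened to (T*H*W, 3).
    t_ids, h_ids, w_ids = torch.meshgrid(
        torch.arange(video_t), torch.arange(video_h), torch.arange(video_w),
        indexing="ij",
    )
    video_ids = torch.stack([t_ids, h_ids, w_ids], dim=-1).reshape(-1, 3)

    # Audio temporal IDs continue from the video's last temporal index plus one.
    audio_start = video_t  # max video temporal ID is video_t - 1
    audio_ids = torch.arange(audio_start, audio_start + audio_len)
    return video_ids, audio_ids
```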
Omni-Full Attention: Some previous works adopt separated spatial and temporal attention to reduce computational complexity, as in UniForm. However, as noted in CogVideoX, such separated attention requires significant implicit information transfer, which substantially increases learning complexity. Other works customize two Transformer towers for audio and video generation separately (e.g., AV-DiT, SyncFlow, JavisDiT, TAVGBench), but these typically rely on complex, resource-intensive multi-stage training: both towers must first be pre-trained separately and then fine-tuned together, increasing training time and resource consumption. To achieve more efficient training and effective modality fusion, we adopt a 3D text-video-audio hybrid full attention mechanism. As shown in Figure 2, within the MM-DiT module, the hidden states of video ($h_v$), video-related text ($h_{tv}$), audio-related text ($h_{ta}$), and audio ($h_a$) are first scaled and normalized, then concatenated for attention calculation:

$$h = [\,h_v;\, h_{tv};\, h_{ta};\, h_a\,], \qquad \mathrm{Attn}(h) = \mathrm{Softmax}\!\left(\frac{(hW_Q)(hW_K)^\top}{\sqrt{d}}\right)(hW_V).$$
The attention output is subsequently split back into independent per-modality hidden states, processed through scaling and normalization, residual connections, and feedforward networks, and then fed into the next MM-DiT module. In this way, a single joint full-attention operation unifies all input modalities.
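The sketch below illustrates this concatenate, attend, split pattern in simplified PyTorch. The shared QKV projection, the head count, and the omission of the per-stream modulation, residual, and feed-forward layers are simplifying assumptions, not the paper's exact block.

```python
import torch
import torch.nn.functional as F

def omni_full_attention(h_video, h_vtext, h_atext, h_audio, to_qkv, num_heads: int):
    """One joint full-attention pass over all four token streams (simplified sketch)."""
    streams = [h_video, h_vtext, h_atext, h_audio]
    lengths = [s.shape[1] for s in streams]        # token count per stream
    h = torch.cat(streams, dim=1)                  # (B, L_total, D)

    q, k, v = to_qkv(h).chunk(3, dim=-1)           # to_qkv: assumed nn.Linear(D, 3 * D)
    B, L, D = q.shape
    q, k, v = (x.view(B, L, num_heads, D // num_heads).transpose(1, 2) for x in (q, k, v))

    # Full attention: every audio/video/text token can attend to every other token.
    out = F.scaled_dot_product_attention(q, k, v)
    out = out.transpose(1, 2).reshape(B, L, D)

    # Split the fused sequence back into per-modality hidden states.
    return out.split(lengths, dim=1)
```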
Multi-Task Progressive Training Strategy
Random Modality Masking: To learn universal and robust audio-video representations for joint generation, we train the generative model on a broad task spectrum. Thus, we propose selectively adjusting the masks for the queries and keys of the audio and video modalities. If we restrict queries and keys to only video embeddings and video description embeddings, the model degrades into a T2V (text-to-video) model. Similarly, limiting queries and keys to audio embeddings and audio text embeddings yields a T2A (text-to-audio) model. Through this approach, the model can handle not only joint generation but also maintain unimodal generation capabilities. Considering the scarcity of high-quality audio-video paired data, this provides an alternative route for training T2AV models: Klear is first pre-trained on T2V and T2A tasks, then fine-tuned on audio-video paired data to construct a T2AV model. The learning objectives for audio and video generation are shown in Equations (7) and (8), respectively:

$$\mathcal{L}_{\mathrm{T2A}} = \mathbb{E}\left[\left\| M_a\!\left(v_\theta(z_t, t, c)\right) - u_t^a \right\|_2^2\right], \tag{7}$$

$$\mathcal{L}_{\mathrm{T2V}} = \mathbb{E}\left[\left\| M_v\!\left(v_\theta(z_t, t, c)\right) - u_t^v \right\|_2^2\right], \tag{8}$$

where $M_a(\cdot)$ extracts audio tokens from the combined noisy representation, while $M_v(\cdot)$ extracts visual tokens. In summary, $\mathcal{L}_{\mathrm{T2A}}$ and $\mathcal{L}_{\mathrm{T2V}}$ represent the unimodal T2A and T2V tasks. To learn generalizable and robust audio-visual correlational world knowledge, we also incorporate multiple tasks such as T2AV, I2V, and I2AV. Thus, the overall multi-task learning objective is:

$$\mathcal{L} = \sum_{\tau \in \{\mathrm{T2A},\, \mathrm{T2V},\, \mathrm{T2AV},\, \mathrm{I2V},\, \mathrm{I2AV}\}} \lambda_\tau\, \mathcal{L}_\tau.$$
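A hedged sketch of the random modality masking idea: sample a task per training example, then keep only the token streams that task needs (e.g., video plus video text for T2V), which is equivalent to masking the unused streams out of the joint full attention. The task names, stream names, and uniform sampling are illustrative assumptions.

```python
import random
import torch

# Assumed stream names; the order matches the model inputs described above.
TASK_STREAMS = {
    "T2AV": ("video", "video_text", "audio_text", "audio"),
    "T2V":  ("video", "video_text"),
    "T2A":  ("audio_text", "audio"),
}

def sample_training_task(streams: dict):
    """Randomly pick a task and keep only the token streams it needs.

    Dropping the unused streams plays the role of masking their queries and keys
    out of the joint full attention, so one model covers T2AV, T2V, and T2A.
    """
    task = random.choice(list(TASK_STREAMS))
    active = {name: h for name, h in streams.items() if name in TASK_STREAMS[task]}
    return task, active
```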
Progressive Training Strategy: To efficiently train AV joint generation, we adopt a progressive multi-task learning framework and apply random modality masking at all stages:
Stage I: Pre-training. The model is pre-trained on a large-scale, multi-scenario data corpus to acquire atomic generation capabilities across all tasks, including cross-modal semantic alignment, temporal synchronization, high-fidelity audio synthesis, and precise visual feature construction. This ensures basic capabilities for unimodal and joint generation and provides a solid foundation for subsequent post-training.
Stage II: Specialized Post-training. The model then undergoes specialized training for weaker capabilities and tasks. Guided by evaluation metrics, we adaptively rebalance data distributions across scenarios and tasks to strengthen underperforming capabilities while maintaining overall performance.
Stage III: Quality-Refined Post-training. Finally, the model is fine-tuned on manually curated high-quality datasets to refine generation fidelity and enhance robustness in complex scenarios, thereby improving perceptual realism and overall generation quality.
Dataset Construction
The dataset in this paper consists of automatically annotated samples covering single-speaker voice, multi-speaker voice, singing, and natural sound clips, with an overall retention rate of 27% after filtering.

Dataset Filtering
Video Filtering and Scene Segmentation: Video quality is filtered by modeling dynamic quality (subject motion ratio, camera stability), static quality (sharpness, aesthetics, color saturation), content naturalness (absence of excessive effects/watermarks), and safety. We discard videos with low resolution, low SNR/MOS, or more than 20% silence. Scene segmentation is then applied so that each sample contains only one scene.
Audio Filtering and Post-Processing: Audio data is filtered by removing samples with low SNR, MOS, abnormal clipping, distortion, or noise, ensuring less than 20% silence, high fidelity, and consistent formatting. Then, we evaluate audiovisual consistency using Synchformer for temporal alignment and ImageBind for semantic alignment, ensuring high synchronization in both temporal and semantic dimensions.
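As a rough illustration of the filtering gate described above, the sketch below combines the quality and audiovisual-consistency checks into a single predicate. Only the 20% silence cap comes from the paper; the remaining thresholds, the field names, and the function itself are placeholder assumptions.

```python
from dataclasses import dataclass

# Placeholder thresholds (assumptions); only the 20% silence cap is from the paper.
MIN_SNR, MIN_MOS, MIN_SYNC, MIN_SEM = 15.0, 3.5, 3.0, 0.3

@dataclass
class ClipStats:
    snr: float             # signal-to-noise ratio of the audio track
    mos: float             # predicted mean opinion score
    silence_ratio: float   # fraction of the clip that is silent
    sync_score: float      # Synchformer temporal-alignment confidence
    semantic_score: float  # ImageBind audio-visual semantic similarity

def keep_clip(s: ClipStats) -> bool:
    """Return True if the clip passes the audio-quality and AV-consistency gates."""
    return (s.snr >= MIN_SNR and s.mos >= MIN_MOS
            and s.silence_ratio <= 0.20
            and s.sync_score >= MIN_SYNC and s.semantic_score >= MIN_SEM)
```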
Audio-Guided Data Segmentation
The dataset is partitioned by audio type, separating human voice from non-human voice clips to form sound segments. From the voice subset, we create singing, single-speaker voice, and multi-speaker voice segments, then apply dense captions to each voice segment.
Dense Annotation and Integration
Dedicated models for speech transcripts, audio captions, and video captions are used to annotate each segment, including metadata and detailed content. For speech and singing, speaker attributes (e.g., gender, age) are extracted, while sound segments receive only audio captions. Transcription is performed using Whisper-Large-v3, SenseVoice, and Qwen2.5-Omni; audio captions using Qwen2.5-Omni and Gemini 2.5-Pro; and detailed video labels using video expert models. All annotations are merged into unified dense captions.
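To illustrate the final integration step, here is a minimal sketch of merging per-segment annotations into one dense caption. The caption template, field names, and function are assumptions; the paper does not specify the exact merge format.

```python
from typing import Optional

def merge_dense_caption(video_caption: str, audio_caption: str,
                        transcript: Optional[str] = None,
                        speaker_attrs: Optional[dict] = None) -> str:
    """Merge video caption, audio caption, speaker metadata, and transcript into one string."""
    parts = [f"Video: {video_caption}", f"Audio: {audio_caption}"]
    if speaker_attrs:  # e.g. {"gender": "female", "age": "adult"} for speech/singing segments
        attrs = ", ".join(f"{k}: {v}" for k, v in speaker_attrs.items())
        parts.append(f"Speaker: {attrs}")
    if transcript:     # speech or lyrics transcript from the ASR models
        parts.append(f"Transcript: {transcript}")
    return " ".join(parts)
```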
Experiment
Experimental Setup
Model Scale: Klear comprises 26 billion (26B) parameters, with a feedforward dimension of 4096 for flow matching.
Network Architecture: It consists of 32 Joint Diffusion Transformer layers incorporating multimodal RoPE.
Encoders: The text encoder is a 1024-dimensional TTS text encoder, while the caption encoder employs Qwen2.5-7B.
VAE Settings: The Audio-VAE processes 44.1 kHz input waveforms and produces 43 Hz embeddings (a 1024x downsampling relative to the input sampling rate). The Video-VAE handles videos of varying resolutions and frame rates, producing 3 Hz embeddings with 16x spatial compression in both height and width.
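As a quick arithmetic check of the compression rates quoted above, the snippet below works out the latent sizes for a short clip; the example resolution and duration are assumptions.

```python
# Audio: 44.1 kHz waveform downsampled 1024x in time.
audio_sr = 44_100
audio_latent_rate = audio_sr / 1024                 # ~43.07 Hz audio latent frames

# Video: 16x spatial compression, 3 Hz temporal latent rate.
video_height, video_width = 720, 1280               # example input resolution (assumed)
latent_h, latent_w = video_height // 16, video_width // 16   # 45 x 80 latent grid
video_latent_rate = 3                                # latent frames per second

# For a 5-second clip (assumed duration):
audio_frames = round(audio_latent_rate * 5)          # ~215 audio latent frames
video_tokens = video_latent_rate * 5 * latent_h * latent_w   # 54,000 video latent tokens
```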
Training Details: Uses the Adam optimizer.
Results Comparison and Qualitative Analysis
This section demonstrates Klear's advantages across multiple dimensions through qualitative and quantitative analyses:
Lip-Sync Accuracy: Klear generates lip movements tightly synchronized with speech, including natural matching of breathing patterns and facial expressions.
Emotional Expressiveness: The generated videos not only align lip movements but also exhibit emotions consistent with speech intonation (e.g., excitement, contemplation). In contrast, baseline models like Universe-1 and Ovi often exhibit distorted expressions.
Singing and Rap: In singing and rap scenarios, Klear precisely controls the alignment of pitch, rhythm, and breathing. For example, vibrato and melisma naturally match facial expressions.
AV Synchronization: Background music and sound effects (e.g., instrumental performances) are strictly aligned in time with video content, enhancing immersion.
Image to Audio-Video: In the TI2AV task, Klear maintains high identity consistency of the input image while generating reasonable camera movements, whereas baseline models often suffer from identity drift.
Quantitative Comparison: The Single Tower design (this work) significantly outperforms the Dual Tower architecture on identity retention, audio quality, and audiovisual consistency:

| Metric | Single Tower (ours) | Dual Tower |
| --- | --- | --- |
| ID retention | 0.80 | 0.62 |
| MOS | 93.11 | 62.02 |
| Sync-conf | 6.787 | 3.762 |
Ablation Experiments
Architecture Effectiveness: Comparing the Single Tower and Dual Tower architectures shows that feeding audio and video features into a unified MM-DiT branch with Omni-Full Attention significantly enhances inter-modal alignment.
Conclusion
This work presents Klear, a novel unified Transformer architecture for high-fidelity audiovisual joint generation. By introducing Omni-Full Attention, Klear seamlessly integrates video, audio, and their corresponding textual conditions within a single stream, achieving exceptional audiovisual synchronization and fine-grained semantic alignment. To facilitate robust multi-task learning, we designed a progressive training strategy incorporating random modality masking, enabling the model to flexibly switch between joint and unimodal generation (e.g., T2V, T2A, TI2AV) while maintaining high-quality outputs. Additionally, we constructed the first large-scale audiovisual dataset with detailed and strictly time-aligned descriptions, addressing the critical scarcity of high-quality paired data in this field. Extensive experiments demonstrate that Klear significantly outperforms existing open-source methods in generation quality, instruction-following ability, and cross-modal consistency, achieving performance comparable to state-of-the-art closed-source models (e.g., Veo 3). Our work paves the way for more unified, scalable, and semantically consistent multimodal generation systems.
References
[1] Klear: Unified Multi-Task Audio-Video Joint Generation