December 30, 2025
Interpretation: The AI-Generated Future 
Key Highlights
Joint Temporal-Spatial-Channel Modeling (TSCM): Enables infinite context generation while maintaining stable sampling speeds despite increasing context lengths.
Integration of Self-Forcing with TSCM: Accelerates Yume1.5's inference process while reducing error accumulation in long-sequence generation.
Superior Generation and Editing Performance: Through meticulous dataset construction and model architecture design, Yume1.5 achieves exceptional performance in world generation and editing tasks.
Challenges Addressed
Existing video generation and world simulation models face three major challenges:
Limited Universality: Most are trained on game data, making it difficult to generate realistic dynamic urban scenes.
High Generation Latency: The high computational cost of diffusion models restricts real-time continuous generation, hindering smooth infinite exploration.
Inadequate Text Control: Current methods typically support only keyboard/mouse control, lacking the ability to generate random events (e.g., "a ghost appears") via textual instructions.
Proposed Solution
Yume1.5 resolves these issues through systematic optimization across three core dimensions:
Long Video Generation Architecture: Adopts Joint Temporal-Spatial-Channel Modeling (TSCM) to effectively compress historical context.
Real-Time Acceleration Strategy: Combines bidirectional attention distillation (Self-Forcing) with enhanced text embedding schemes to significantly boost inference speed and reduce error accumulation.
Text-Controlled Event Generation: Achieves text-based event triggering through mixed-dataset training strategies and specialized architectural design.
Technologies Applied
Joint Temporal-Spatial-Channel Modeling (TSCM): Compresses historical frames in both temporal-spatial and channel dimensions for long video generation, reducing memory usage while maintaining inference speed.
Linear Attention: Processes channel-compressed features in DiT blocks to enhance computational efficiency.
Self-Forcing Distillation: Trains the model to predict using its own generated (error-prone) historical frames as conditions, improving robustness against error accumulation during inference.
Dual-Stream Text Encoding: Decomposes textual prompts into "event descriptions" and "action descriptions," processed separately to reduce computational overhead.
Achieved Results
Performance Improvement: On the Yume-Bench benchmark, the Instruction Following score reached 0.836, significantly outperforming Wan-2.1 and MatrixGame.
Real-Time Capability: Achieved a generation speed of 12 FPS at 540p resolution on a single A100 GPU.
Long-Term Consistency: Through TSCM and Self-Forcing, the model maintains more stable aesthetic and image quality in long-sequence generation, avoiding the sharp decline seen in traditional sliding window methods.
Methodology
This paper introduces a comprehensive framework that generates interactive, realistic, and temporally coherent video worlds through multi-dimensional systemic innovations. It establishes a unified foundation for joint Text-to-Video (T2V) and Image-to-Video (I2V) generation while addressing key challenges in long-term consistency and real-time performance.
Core contributions include: (1) a joint TSCM strategy for efficient long video generation; (2) a real-time acceleration framework combining TSCM and Self-Forcing; and (3) an alternating training paradigm that simultaneously achieves world generation and exploration capabilities. Collectively, these advancements facilitate the creation of dynamic, interactive environments suitable for exploring complex real-world scenarios.

Figure 1. The Yume1.5 framework supports three interactive generation modes: text-to-world generation from descriptions, image-to-world generation from static images, and text-based event editing. All modes are controlled via continuous keyboard input for character and camera movement, enabling autoregressive generation of explorable and persistent virtual worlds. Demonstration videos are included in the supplementary materials.
Preliminary Architecture
This paper establishes a foundational model for joint T2V and I2V generation following the methodology proposed by Wan. The generation process starts from noise: for text-to-video training, the noisy video latent $x_t$ and the text embedding $c_{\text{txt}}$ are fed into the DiT backbone network.
For image-to-video modeling, a given image or video condition $c$ is zero-padded to match the dimension of $x_t$. A binary mask $M$ is constructed (where 1 denotes preserved regions and 0 denotes regions to be generated). The conditional input is fused via concatenation, $X = \mathrm{Concat}(x_t, M, c)$, and then processed by the Wan DiT backbone network. Here, $X$ can be viewed as being composed of historical frames $X_{\text{his}}$ and predicted frames $X_{\text{pred}}$.
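A minimal sketch of this conditioning step, assuming channel-wise fusion and illustrative tensor shapes (the actual Wan implementation may differ):

```python
import torch

def build_i2v_condition(x_t: torch.Tensor, cond: torch.Tensor, n_cond_frames: int) -> torch.Tensor:
    """Fuse a noisy video latent with a zero-padded condition and a binary mask.

    x_t:  noisy video latent of shape (B, C, T, H, W)
    cond: conditioning latent for the first `n_cond_frames` frames, shape (B, C, T_c, H, W)
    """
    B, C, T, H, W = x_t.shape
    # Zero-pad the condition along the temporal axis so it matches x_t.
    c = torch.zeros_like(x_t)
    c[:, :, :n_cond_frames] = cond[:, :, :n_cond_frames]
    # Binary mask: 1 for preserved (conditioning) frames, 0 for frames to be generated.
    m = torch.zeros(B, 1, T, H, W, device=x_t.device, dtype=x_t.dtype)
    m[:, :, :n_cond_frames] = 1.0
    # Fuse by concatenation along the channel dimension before the DiT backbone.
    return torch.cat([x_t, m, c], dim=1)  # shape (B, 2C + 1, T, H, W)
```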
The text encoding strategy in this paper differs from Wan's approach. While Wan processes the entire caption directly through T5, this paper decomposes the caption into Event Description and Action Description, as shown in Figure 3(b), and inputs them separately into T5. The resulting embeddings are then concatenated. The Event Description specifies the target scene or event to be generated, while the Action Description defines keyboard and mouse controls. This method offers significant advantages: since the set of possible Action Descriptions is finite, they can be efficiently precomputed and cached. Meanwhile, the Event Description is processed only during the initial generation stage. As a result, this approach substantially reduces T5's computational overhead in subsequent video inference steps. The model is trained using Rectified Flow loss.
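To illustrate the caching benefit of the decomposed prompts, here is a minimal sketch; the encoder interface (`encode`) and cache sizes are assumptions, not the paper's actual code:

```python
from functools import lru_cache
import torch

class DualStreamTextEncoder:
    """Encode event and action descriptions separately, caching the finite action set."""

    def __init__(self, t5_encoder):
        self.t5 = t5_encoder  # assumed interface: encode(str) -> Tensor of shape (L, D)

    @lru_cache(maxsize=16)
    def _encode_event(self, event: str) -> torch.Tensor:
        # Encoded once at the initial generation stage, then reused.
        return self.t5.encode(event)

    @lru_cache(maxsize=256)
    def _encode_action(self, action: str) -> torch.Tensor:
        # The set of keyboard/mouse action descriptions is finite,
        # so their embeddings can be precomputed and cached.
        return self.t5.encode(action)

    def __call__(self, event: str, action: str) -> torch.Tensor:
        # Concatenate the two embeddings along the sequence dimension.
        return torch.cat([self._encode_event(event), self._encode_action(action)], dim=0)
```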
Figure 3. Core components of Yume1.5. (a) DiT block with linear attention for efficient feature fusion. (b) Training pipeline with decomposed event and action descriptions. (c) Adaptive historical token downsampling at varying compression rates based on temporal distance. (d) Block-based autoregressive inference with dual-compression memory management. Long video generation achieved through Joint Temporal-Spatial-Channel Modeling (TSCM).
As video inference proceeds over long durations, the number of frames in the video condition grows steadily, incurring substantial computational overhead. Including all contextual frames in the computation is impractical. Several existing methods aim to mitigate this issue:
Sliding Window: A widely adopted approach that selects contiguous recent frames within a window near the current predicted frame. However, this method often results in the loss of historical frame information.
Historical Frame Compression: Methods like FramePack and Yume compress historical frames, applying less compression to frames closer to the predicted frame and greater compression to more distant frames. This similarly increases the loss of information from more distant historical frames.
Camera Trajectory-Based Search: Approaches like World Memory utilize known camera trajectories to compute the field-of-view overlap between historical frames and the current frame to be predicted, selecting frames with the highest overlap. This method is incompatible with video models controlled via keyboard input. Even with predicted camera trajectories, accurately estimating trajectories under dynamic viewpoint changes remains challenging and often leads to significant errors.
To address these limitations, this paper proposes the Joint Temporal-Spatial-Channel Modeling (TSCM) method, implemented in two steps: temporal-spatial compression and channel-level compression are applied separately to the historical frames $X_{\text{his}}$.
Temporal-Spatial Compression
For the historical frames $X_{\text{his}}$, this paper first applies temporal and spatial compression: frames are randomly sampled at a 1/32 ratio and then passed through Patchify with a high compression ratio. The compression scheme operates as follows:
Each historical frame is downsampled at a rate $(r_t, r_h, r_w)$ along the temporal, height, and width dimensions of $X_{\text{his}}$, with progressively larger rates assigned to frames that are temporally farther from the predicted frames (Figure 3(c)). These varying downsampling rates are realized by interpolating the Patchify weights inside DiT. Compared to Yume, performing temporal random frame sampling reduces both the number of Patchify parameters and the model's computational load. The resulting compressed representation $X_{\text{his}}^{ts}$ is concatenated with the predicted frames $X_{\text{pred}}$, which are processed by the original Patchify at the base model's downsampling rate, and the combined tensor is input into the DiT block.
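A minimal sketch of this temporal-spatial compression path; the 1/32 sampling ratio comes from the text, while the patch sizes, tensor layout, and module structure are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TSPatchify(nn.Module):
    """Temporal-spatial compression: randomly sample historical frames, then patchify
    them with a coarse kernel while predicted frames use the base (fine) kernel."""

    def __init__(self, in_ch: int, hidden: int,
                 his_patch=(1, 8, 8), pred_patch=(1, 2, 2), sample_ratio=1 / 32):
        super().__init__()
        self.sample_ratio = sample_ratio
        # Patchify implemented as strided 3D convs projecting to the DiT hidden size.
        self.his_proj = nn.Conv3d(in_ch, hidden, kernel_size=his_patch, stride=his_patch)
        self.pred_proj = nn.Conv3d(in_ch, hidden, kernel_size=pred_patch, stride=pred_patch)

    def forward(self, x_his: torch.Tensor, x_pred: torch.Tensor) -> torch.Tensor:
        # x_his, x_pred: latents of shape (B, C, T, H, W).
        T = x_his.shape[2]
        keep = max(1, int(T * self.sample_ratio))
        idx = torch.randperm(T, device=x_his.device)[:keep].sort().values  # random temporal sampling
        x_his = x_his[:, :, idx]
        his_tokens = self.his_proj(x_his).flatten(2).transpose(1, 2)    # (B, N_his, D)
        pred_tokens = self.pred_proj(x_pred).flatten(2).transpose(1, 2)  # (B, N_pred, D)
        # Concatenate compressed history with predicted-frame tokens for the DiT blocks.
        return torch.cat([his_tokens, pred_tokens], dim=1)
```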
Channel Compression
In parallel, this paper further downsamples the historical frames by passing them through a separate Patchify that reduces the channel dimension to 96, yielding the channel-compressed tokens $X_{\text{his}}^{c}$. As shown in Figure 3(a), these compressed historical tokens are input into the DiT block. After the video tokens pass through the cross-attention layer of the DiT block, they are first channel-reduced via a fully connected (FC) layer. The predicted-frame tokens $h_{\text{pred}}$ are then extracted and concatenated with $X_{\text{his}}^{c}$. The combined tokens pass through a linear attention layer to produce the output $O$. Finally, another FC layer restores the channel dimension of the output tokens corresponding to the predicted frames, and the result is element-wise added to $h_{\text{pred}}$ for feature fusion: $h_{\text{pred}} \leftarrow h_{\text{pred}} + \mathrm{FC}(O_{1:N_{\text{pred}}})$, where $N_{\text{pred}}$ denotes the number of tokens in $X_{\text{pred}}$.
Linear Attention. This paper's design, shown in Figure 3(a), draws inspiration from linear attention and is straightforward to implement. The combined tokens are projected via fully connected layers to obtain query $Q$, key $K$, and value $V$ representations, and the exponential kernel of softmax attention is replaced with the dot product $\phi(Q)\phi(K)^\top$, where $\phi$ is the ReLU activation function. The computation is defined as follows:

$$O_i = \frac{\sum_{j=1}^{N} \phi(Q_i)\,\phi(K_j)^\top V_j}{\sum_{j=1}^{N} \phi(Q_i)\,\phi(K_j)^\top},$$

where $N$ denotes the number of tokens in the combined sequence. The kernel features $\phi(Q)$ and $\phi(K)$ are computed first, after which RoPE is applied to them, combined with a normalization layer to prevent gradient instability. Since the attention output would typically pass through a linear layer, this paper applies the normalization layer before that projection:

$$O' = \mathrm{FC}\big(\mathrm{Norm}(O)\big).$$
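A minimal sketch of the linear-attention layer together with the channel-fusion branch from the previous subsection; the ReLU kernel, pre-projection normalization, and 96-channel reduction follow the description above, while RoPE is omitted and module names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Linear attention with a ReLU feature map in place of the softmax kernel."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)
        self.norm = nn.LayerNorm(dim)     # applied before the output projection
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) concatenated predicted-frame and compressed history tokens.
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k = torch.relu(q), torch.relu(k)               # phi = ReLU kernel (RoPE omitted)
        kv = torch.einsum('bnd,bne->bde', k, v)           # sum_j phi(K_j)^T V_j, cost O(N*d^2)
        z = k.sum(dim=1)                                  # sum_j phi(K_j)
        num = torch.einsum('bnd,bde->bne', q, kv)         # phi(Q_i) sum_j phi(K_j)^T V_j
        den = torch.einsum('bnd,bd->bn', q, z).clamp_min(1e-6).unsqueeze(-1)
        out = num / den
        return self.proj(self.norm(out))                  # normalize, then output projection


class ChannelFusionBranch(nn.Module):
    """Fuse predicted-frame features with 96-channel compressed history (cf. Figure 3(a))."""

    def __init__(self, hidden: int, compressed: int = 96):
        super().__init__()
        self.down = nn.Linear(hidden, compressed)   # channel reduction after cross-attention
        self.attn = LinearAttention(compressed)
        self.up = nn.Linear(compressed, hidden)     # restore the channel dimension

    def forward(self, h_pred: torch.Tensor, his_c: torch.Tensor) -> torch.Tensor:
        # h_pred: (B, N_pred, hidden); his_c: (B, N_his, 96)
        n_pred = h_pred.shape[1]
        tokens = torch.cat([self.down(h_pred), his_c], dim=1)   # concat along the token axis
        o_pred = self.attn(tokens)[:, :n_pred]                   # keep predicted-frame outputs
        return h_pred + self.up(o_pred)                          # residual feature fusion
```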
Summary. Since the computational cost of standard attention is sensitive to the number of input tokens, historical frames are compressed via temporal-spatial compression and processed alongside predicted frames using standard attention in DiT blocks. Conversely, as linear attention is sensitive to the channel dimension, channel-level compression is applied to historical frames, which are then fused with predicted frames in the linear attention layer of DiT blocks. Through this approach, this paper achieves joint temporal-spatial-channel compression while maintaining generation quality.
Real-Time Acceleration
This paper initially trains a pre-trained diffusion model on a mixed dataset. An alternating training strategy is employed for T2V and I2V tasks. Specifically, the model trains on the T2V dataset in the current step and switches to the I2V dataset in the next step. This approach equips the model with comprehensive capabilities for world generation, editing, and exploration. The resulting model is referred to as the base model.
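A minimal sketch of the alternating schedule; the dataloaders and loss interface are illustrative assumptions:

```python
from itertools import cycle

def alternating_training(model, t2v_loader, i2v_loader, optimizer, num_steps: int):
    """Alternate between T2V and I2V batches on successive optimization steps."""
    t2v_iter, i2v_iter = cycle(t2v_loader), cycle(i2v_loader)
    for step in range(num_steps):
        # Even steps train on the T2V dataset, odd steps on the I2V dataset.
        batch = next(t2v_iter) if step % 2 == 0 else next(i2v_iter)
        loss = model.rectified_flow_loss(batch)   # assumed training-loss interface
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```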
As shown in Figure 4, this paper first initializes the generator $G$, the fake score model $s_{\text{fake}}$, and the real score model $s_{\text{real}}$ with the base model's weights. The generator samples previous frames from its own distribution and uses them as context to generate new predicted frames. This process iterates, sequentially generating and assembling frames into a coherent video sequence $x$. The multi-step diffusion model is then converted into a few-step generator by minimizing the expected KL divergence between the real data distribution and the generated data distribution across noise levels $t$:

$$\min_{G}\ \mathbb{E}_{t}\Big[\, D_{\mathrm{KL}}\big( q_t(p_{G}) \,\big\|\, q_t(p_{\mathrm{data}}) \big) \Big],$$

where $q_t(\cdot)$ denotes the forward diffusion at step $t$, $p_{G}$ is the generator's output distribution, and $p_{\mathrm{data}}$ is the real data distribution. The key distinction from DMD lies in using model-predicted data rather than real data as the video condition, thereby mitigating the training-inference discrepancy and its associated error accumulation.
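A minimal sketch of the distribution-matching update, using the common DMD-style approximation in which the KL gradient is estimated from the difference between the fake and real models' denoised predictions; the scheduler and model interfaces are assumptions:

```python
import torch
import torch.nn.functional as F

def dmd_style_loss(x_gen, real_model, fake_model, scheduler):
    """Approximate the KL-divergence gradient for generator-produced frames x_gen.

    x_gen:      frames sampled from the few-step generator (used as its own context upstream)
    real_model: frozen teacher approximating the real data distribution
    fake_model: score model tracking the generator's distribution
    scheduler:  assumed interface providing forward diffusion add_noise(x, noise, t)
    """
    t = torch.randint(0, scheduler.num_steps, (x_gen.shape[0],), device=x_gen.device)
    noise = torch.randn_like(x_gen)
    x_t = scheduler.add_noise(x_gen, noise, t)          # forward diffusion q_t
    with torch.no_grad():
        x0_real = real_model(x_t, t)                    # denoised estimate under the real model
        x0_fake = fake_model(x_t, t)                    # denoised estimate under the fake model
        grad = x0_fake - x0_real                        # proportional to grad KL(p_G || p_data)
        target = (x_gen - grad).detach()
    # Pulls the generator's samples along the negative KL-gradient direction.
    return F.mse_loss(x_gen, target)
```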
This paper's method differs from Self-Forcing by eliminating KV caching and introducing Temporal-Spatial-Channel Modeling (TSCM), enabling the utilization of longer contextual information.
Experiments
Experimental Setup:
Base Model: Uses Wan2.2-5B as the pre-trained model.
Training Parameters: Resolution 704×1280, 16 FPS, using NVIDIA A100 GPUs. Initial base training for 10,000 iterations, followed by 600 iterations of Self-Forcing + TSCM training.
Evaluation Metrics: Yume-Bench, including Instruction Following, Subject/Background Consistency, Motion Smoothness, Aesthetic Quality, and Imaging Quality.
Quantitative Results:

I2V Generation Comparison: Yume1.5 scored 0.836 in instruction-following ability, far surpassing Yume (0.657), MatrixGame (0.271), and Wan-2.1 (0.057).
Inference Speed: Generating one block takes only 8 seconds, while Wan-2.1 requires 611 seconds.
Long Video Generation Verification: Models with and without Self-Forcing + TSCM were compared. As the number of video clips grows over time, the aesthetic and imaging quality scores of the model using this technique remain stable through the 4th to 6th clips, while those of the model without it decline significantly.
Ablation Study:
Effectiveness of TSCM: After removing TSCM and replacing it with simple spatial compression, the instruction-following ability dropped from 0.836 to 0.767. Additionally, TSCM enables autoregressive inference time to remain stable as context increases (constant per-step inference time after 8 blocks), while the speed of the full-context input method drops sharply.
Conclusion
Yume1.5 is an interactive world generation model capable of generating infinite videos from a single input image through autoregressive synthesis, while supporting intuitive keyboard-based camera control. The framework in this paper addresses three fundamental challenges in dynamic world generation: limited cross-domain generality, high computational latency, and insufficient text control capability.
The key innovations of Yume1.5 include: (1) a Joint Temporal-Spatial-Channel Modeling (TSCM) method that enables efficient long video generation while maintaining temporal coherence; (2) an acceleration method that mitigates error accumulation during inference; and (3) text-controlled world event generation achieved through careful architectural design and mixed-dataset training.
Looking ahead, the authors aim to extend Yume1.5 to support more complex world interactions and to broaden its applications in virtual environments and simulation systems.