Xie Saining, Li Feifei, and Yann LeCun Jointly Propose a New Paradigm for Multimodal Large Models, Unlocking the Model's "Super Perception"

November 13, 2025

On November 7th, Xie Saining, an assistant professor at New York University, unveiled a new piece of work, Cambrian-S. Its co-authors include Li Feifei, often hailed as the "godmother of AI," and Turing Award laureate Yann LeCun.

Last year, Xie Saining and his research team developed the Cambrian-1 model, an open-ended exploration project focused on multimodal models for image applications. On social media, Xie Saining shared that before expanding the Cambrian series, they grappled with three fundamental questions:

What defines true multimodal intelligence?

Is the large language model (LLM) paradigm truly suitable for sensory modeling?

Why is human perception so seamless, intuitive, and yet remarkably powerful?

Xie Saining posits that current multimodal models are missing a crucial element: without first cultivating super perception, superintelligence cannot be achieved.

In his view, super perception does not hinge on advanced sensors or better cameras. Rather, it concerns how digital entities genuinely experience the world, continuously absorbing and learning from a stream of inputs.

The Xie Saining team also mapped out a developmental trajectory for multimodal intelligence, moving from today's semantic perception toward spatial cognition and, ultimately, predictive world models.

The team experimented with a novel prototype known as predictive sensing. They trained a Latent Frame Prediction (LFP) head on Cambrian-S and leveraged the model's degree of "surprise" in two distinct ways during inference (a minimal sketch follows the two items below):

Surprise-driven memory management: This approach involves compressing or skipping non-surprising frames and directing computational resources toward "surprising" frames.

Surprise-driven event segmentation: This technique utilizes peaks in "surprise" to identify event boundaries or scene transitions.
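The paper describes predictive sensing at the level of model components rather than as a published code listing, so the following is only a minimal sketch of the two inference-time uses of surprise. It assumes an `encoder` that maps frames to latents and an `lfp_head` that predicts the current latent from the previous one; the surprise definition, threshold, and stride values are invented for illustration.

```python
import numpy as np

def surprise(pred_latent: np.ndarray, obs_latent: np.ndarray) -> float:
    """Prediction error between predicted and observed frame latents;
    one plausible definition of 'surprise' (the paper may define it differently)."""
    return float(np.mean((pred_latent - obs_latent) ** 2))

def process_stream(frames, encoder, lfp_head,
                   surprise_threshold=0.5,   # hypothetical value
                   keep_every=4):            # keep 1 in 4 unsurprising frames
    """Surprise-driven memory management and event segmentation over a
    frame stream (illustrative only)."""
    memory, event_boundaries = [], []
    prev_latent = None
    for t, frame in enumerate(frames):
        latent = encoder(frame)
        if prev_latent is None:
            memory.append((t, latent))                 # always keep the first frame
        else:
            predicted = lfp_head(prev_latent)          # predict the current latent
            s = surprise(predicted, latent)
            if s > surprise_threshold:
                memory.append((t, latent))             # surprising: keep at full fidelity
                event_boundaries.append(t)             # surprise peak ~ event boundary
            elif t % keep_every == 0:
                memory.append((t, latent))             # unsurprising: keep a sparse subset
        prev_latent = latent
    return memory, event_boundaries
```

The appeal of this scheme is that memory grows slowly even on an unbounded stream, since unsurprising frames are mostly dropped, while surprise peaks provide natural segmentation points for downstream reasoning.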

Traditional video multimodal large language model (MLLM) benchmarks predominantly focus on language comprehension and semantic perception, overlooking the higher-level spatial and temporal reasoning that super perception requires.

To bridge this critical gap, the team introduced VSI-S, a new benchmark specifically designed to assess these more intricate and sustained aspects of spatial perception. It comprises two components:

VSI-S Recall: Evaluating long-term spatial observation and recall.

VSI-S Count: Assessing continuous counting under varying perspectives and scenes.

The researchers established several experimental conditions for feeding videos into the Cambrian-1 model (a short sampling sketch follows this list):

Multiple Frames: The model processes 32 frames uniformly sampled from a video clip. This method represents the standard approach for video input representation in the literature.

Single Frame: The model processes only the middle frame of a given video clip. This condition evaluates the model's reliance on minimal, contextually central visual information.

Frame Captions: Instead of receiving video frames, the model receives captions corresponding to the same 32 uniformly sampled images.
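To make the first two input conditions concrete, here is a small sketch of how one might select 32 uniformly spaced frames or the single middle frame of a decoded clip. The 32-frame default mirrors the description above; the function names and everything else are my own scaffolding, not details from the paper.

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int = 32) -> list[int]:
    """Indices of `num_samples` uniformly spaced frames (the 'Multiple Frames' condition)."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int).tolist()

def middle_frame_index(num_frames: int) -> int:
    """Index of the clip's middle frame (the 'Single Frame' condition)."""
    return num_frames // 2

# Example: a 900-frame clip (about 30 s at 30 fps)
print(uniform_frame_indices(900)[:4])   # [0, 29, 58, 87]
print(middle_frame_index(900))          # 450
```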

To put performance under these conditions in context, the team introduced two additional baselines (compared in the sketch after this list):

Blind Test: The model attempts the task using only the question text, with all visual input withheld. This baseline gauges how far pre-existing knowledge, linguistic priors, and biases inherent in the benchmark questions can carry the model.

Random Accuracy: This metric represents the accuracy achievable through random guessing under a specific task format, serving as the baseline performance standard.
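As a quick illustration of how these two baselines relate, the helper below computes chance accuracy for a multiple-choice format and scores a blind run in which the model sees only the question text; `ask_blind` is a hypothetical stand-in for querying the model.

```python
def random_accuracy(num_options: int) -> float:
    """Chance-level accuracy for a multiple-choice format."""
    return 1.0 / num_options

def blind_accuracy(questions, answers, ask_blind) -> float:
    """Accuracy when the model answers from the question text alone (no frames)."""
    correct = sum(ask_blind(q) == a for q, a in zip(questions, answers))
    return correct / len(questions)

# A 4-option benchmark has a 25% random floor; a blind score well above that
# suggests the questions can be answered from language priors alone.
print(f"{random_accuracy(4):.0%}")   # 25%
```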

The results revealed that Cambrian-1, an image-based multimodal large language model (MLLM), performed reasonably well across several benchmarks without any video-specific post-training. In some cases its accuracy exceeded the random baseline by 10 to 30 percentage points. This suggests that much of the knowledge these benchmarks target can be acquired through standard single-image instruction tuning.

However, on two existing benchmarks, VSI-Bench and Tomato, the model scored below the random baseline. VSI-Bench's spatial-understanding questions require genuine video perception along with targeted data curation and training, and Tomato demands fine-grained understanding of high-frame-rate video, so these results were expected.

Replacing the video frames with text captions also produced a substantial improvement in performance: on benchmarks such as EgoSchema, accuracy exceeded random guessing by more than 20 percentage points.

To enhance diversity, the researchers collected data from ten distinct sources of videos and annotations, yielding a dataset significantly stronger than an equally sized dataset drawn from a single source. The data fall into three categories:

Annotated Real Videos: real-world footage with spatial annotations; visual-spatial reasoning hinges on a solid grasp of 3D geometry and spatial relationships.

Simulated Data: embodied simulators were used to programmatically generate spatially grounded video trajectories and question-answer pairs, including 625 videos rendered in ProcTHOR scenes.

Unannotated Real Videos: Approximately 19,000 room tour videos were gathered from YouTube, and videos from robotic learning datasets were incorporated.

The data effectiveness ranking is as follows: Annotated Real Videos > Simulated Data > Pseudo-Annotated Images.
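The article lists the data categories and their relative effectiveness but not a concrete mixing recipe. The sketch below shows one plausible way to assemble mixed training batches from the three sources; the sampling weights are purely illustrative and merely echo the ranking above.

```python
import random

# Hypothetical sampling weights, loosely following the reported ranking
# (annotated real videos > simulated data > pseudo-annotated data).
SOURCE_WEIGHTS = {
    "annotated_real_videos": 0.5,
    "simulated_procthor":    0.3,
    "pseudo_annotated":      0.2,
}

def sample_mixed_batch(pools: dict[str, list], batch_size: int = 64) -> list:
    """Draw a batch of (video, question, answer) examples, choosing each
    example's source according to SOURCE_WEIGHTS."""
    names, weights = zip(*SOURCE_WEIGHTS.items())
    sources = random.choices(names, weights=weights, k=batch_size)
    return [random.choice(pools[name]) for name in sources]
```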

The results also underscore that stronger foundation models, particularly those exposed to a wider array of general video data, achieve better spatial perception after supervised fine-tuning (SFT).

The research team contends that to attain superintelligence, AI systems must move beyond the text-based knowledge and semantic perception emphasized by most multimodal large language models (MLLMs) and develop spatial cognition and predictive world models.

Although Cambrian-S performed admirably on standard benchmarks, its results on the VSI-S benchmark exposed the limits of the current MLLM paradigm. Using latent frame prediction and surprise estimation, the researchers built a predictive sensing prototype to handle unbounded visual streams, which improved Cambrian-S's performance on VSI-S.

The authors note that current benchmarks, datasets, and model designs still have limitations in quality, scale, and generalization, and that the prototype is only a proof of concept. Future work should explore more diverse and embodied scenarios and connect more closely with the latest advances in vision, language, and world modeling.

References:

https://arxiv.org/pdf/2511.04670
