DeepMind Releases Veo 3 Paper: Ushering in the GPT-3 Era for Visual Reasoning!

November 17, 2025

Generating a video frame by frame is analogous to the chain-of-thought mechanism in language models: just as chain-of-thought (CoT) lets language models carry out symbolic reasoning step by step, chain-of-frames (CoF) lets video models reason step by step across time and space.

DeepMind's latest Veo 3 paper is the first to introduce the chain-of-frames (CoF) concept.

The emerging zero-shot capabilities of Veo 3 indicate that video models are progressing towards becoming unified, general-purpose visual foundation models.

The zero-shot capabilities inherent in large language models (LLMs) have propelled the evolution of natural language processing (NLP), shifting it from task-specific models to unified, general-purpose foundation models.

This transformation rests on a single fundamental ingredient: large generative models trained on web-scale data.

Interestingly, that same ingredient now applies to today's generative video models. The Google DeepMind team has shown that Veo 3 can tackle a diverse array of tasks without explicit training for them, including object segmentation, edge detection, image editing, understanding of physical properties, recognition of object affordances, simulation of tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning, such as maze solving and symmetry detection.

Can video models attain universal visual understanding, akin to how LLMs have achieved universal language understanding?

DeepMind's response is affirmative.

They employed a straightforward method: providing Veo 3 with an initial input image and text instructions as prompts.
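As a rough illustration of that setup (a sketch only; the client, method names, and fields below are hypothetical placeholders, not the actual Veo API):

```python
# Hypothetical sketch of image+text prompting for a video model.
# `client.generate_video` and `video.frames` are illustrative
# placeholders, NOT a real DeepMind/Google interface.

from dataclasses import dataclass

@dataclass
class VideoTask:
    image_path: str  # initial frame that conditions the generation
    prompt: str      # text instruction describing the desired outcome

def solve_visual_task(client, task: VideoTask):
    """Frame a vision problem as video generation: condition on an
    input image, describe the goal in text, and read the answer off
    the generated frames (often just the final frame)."""
    video = client.generate_video(image=task.image_path,
                                  prompt=task.prompt)  # hypothetical call
    return video.frames[-1]                            # hypothetical field

# Example task framings in the spirit of the paper:
#   VideoTask("scene.png", "Trace all edges in the image in black on white.")
#   VideoTask("maze.png", "Move the red dot to the green exit, no walls.")
```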

In NLP, prompting has supplanted task-specific training and adaptation. Now, driven by video models, a comparable paradigm shift is on the verge of occurring in machine vision.

The team initially carried out a qualitative study of visual tasks to evaluate the potential of video models as visual foundation models. The findings were classified into four hierarchical levels, each building on the previous one.

These four levels are perception, modeling, manipulation, and reasoning. This hierarchical structure offers a framework for comprehending the emerging capabilities of video models.

The results reveal that Veo 3 exhibits emerging zero-shot perception capabilities that extend beyond its trained tasks, potentially rendering most custom models in computer vision obsolete.

The DeepMind team conducted quantitative evaluations of Veo on a variety of tasks, encompassing perception (edge detection, segmentation, and object extraction), manipulation (image editing performance), maze solving, visual symmetry, and visual analogy.

The results show a substantial performance jump from Veo 2 to Veo 3, with Veo 3 matching or even surpassing the performance of Nano Banana.

Despite receiving no specialized training, Veo 3 can be prompted to perceive edges, producing edge maps that are sometimes even more detailed than the ground-truth annotations.
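For context, edge maps are usually scored against ground truth with a boundary F-measure. A minimal sketch of such a scorer, assuming binary edge maps and a crude pixel-tolerance match rather than the proper bipartite matching used by benchmarks like BSDS500:

```python
import numpy as np

def dilate(mask: np.ndarray, r: int) -> np.ndarray:
    """Binary dilation by radius r via shifted ORs (wraps at the image
    border; good enough for a sketch)."""
    out = mask.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def edge_f1(pred: np.ndarray, gt: np.ndarray, tol: int = 1) -> float:
    """F-measure between binary edge maps, counting an edge pixel as
    matched if the other map has an edge within `tol` pixels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    precision = (pred & dilate(gt, tol)).sum() / max(pred.sum(), 1)
    recall = (gt & dilate(pred, tol)).sum() / max(gt.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```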

Through experiments, the team discovered that Veo 3 excels at preserving details and textures during the editing process. With improved control over factors such as camera movement or character animation, video models could evolve into powerful 3D-aware image and video editors.

Veo 3 demonstrates zero-shot maze-solving capabilities, outperforming Veo 2. In a 5×5 grid, Veo 3 achieved a 78% success rate, in contrast to Veo 2's 14%.
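Success on this task amounts to the traversed path being legal and reaching the exit. A minimal sketch of that check, assuming the agent's trajectory has already been extracted from the generated frames as a list of grid cells (the extraction itself is the hard part):

```python
import numpy as np

def solves_maze(grid: np.ndarray, path: list, start: tuple, goal: tuple) -> bool:
    """Check a candidate maze solution.
    grid: 2D array with 0 = free cell, 1 = wall; path: list of (row, col)."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    for r, c in path:
        # must stay inside the grid and off the walls
        if not (0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]):
            return False
        if grid[r, c] == 1:
            return False
    # each step must move exactly one cell up/down/left/right
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))
```

The reported success rate is then just the fraction of mazes whose extracted trajectory passes this check.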

In tests focused on visual symmetry solving, Veo 3 significantly outperformed both Veo 2 and Nano Banana.
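In a left/right symmetry-completion puzzle, the reference answer is deterministic: mirror one half onto the other. A minimal numpy sketch of producing that ground truth, which also serves as the correctness check for the model's output:

```python
import numpy as np

def complete_mirror(img: np.ndarray) -> np.ndarray:
    """Fill in the (missing) left half of an image by mirroring the
    right half across the vertical axis -- the deterministic reference
    answer for a left/right symmetry-completion puzzle."""
    w = img.shape[1]
    out = img.copy()
    out[:, : w // 2] = np.flip(img[:, w - w // 2 :], axis=1)
    return out
```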

In visual analogy tests, Veo 3 correctly completed examples involving color and resizing, showcasing its ability to comprehend changes and relationships between objects.
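To make the analogy setting concrete: for a color analogy, the rule carrying A to B is a recoloring, and the correct answer applies the same recoloring to C. A toy sketch of that rule, assuming pure recolorings with aligned layouts (resizing analogies would need a geometric rule instead):

```python
import numpy as np

def infer_recoloring(a: np.ndarray, b: np.ndarray) -> dict:
    """Learn the color substitution taking image A to image B
    (assumes A and B share the same layout and differ only in color)."""
    return {tuple(pa): tuple(pb)
            for pa, pb in zip(a.reshape(-1, 3), b.reshape(-1, 3))}

def apply_analogy(c: np.ndarray, mapping: dict) -> np.ndarray:
    """Answer 'A is to B as C is to ?' for a pure recoloring rule."""
    flat = c.reshape(-1, 3).copy()
    for i, px in enumerate(flat):
        flat[i] = mapping.get(tuple(px), tuple(px))
    return flat.reshape(c.shape)
```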

Driven by the emergent capabilities of large-scale video models, machine vision now stands at the brink of that same paradigm shift.

Veo 3 can solve a variety of tasks in a zero-shot manner, covering perception, modeling, manipulation, and even early forms of visual reasoning. Although its performance is not yet flawless, the significant and consistent improvements from Veo 2 to Veo 3 suggest that video models will eventually become universal visual foundation models, just as LLMs have for language.

Reference: https://arxiv.org/pdf/2509.20328
