Today, Alibaba's Tongyi Qianwen team unveiled the newly enhanced Qwen3-VL series, the most powerful vision-language model in the Qwen lineup to date.
Qwen3-VL delivers marked progress in text understanding and generation, visual perception and reasoning, long-context support, and agent interaction.
Alibaba has open-sourced Qwen3-VL-235B-A22B in both Instruct and Thinking variants. The Instruct version matches or surpasses Gemini 2.5 Pro on multiple mainstream visual-perception benchmarks, while the Thinking version achieves state-of-the-art (SOTA) results on numerous multimodal reasoning benchmarks.
Qwen3-VL aims to empower models not only to 'see' images or videos but also to comprehend events and initiate actions, transitioning from mere 'recognition' to 'reasoning and execution'.
Key features of the model include:
Visual Agent Capabilities: Qwen3-VL can seamlessly navigate computer and mobile interfaces, identify GUI elements, understand button functionalities, invoke tools, and execute tasks. This effectively boosts performance in fine-grained perception tasks through strategic tool invocation.
Enhanced Visual Coding Abilities: The model can generate code from images and videos, such as transforming design sketches into Draw.io/HTML/CSS/JS code.
Improved Spatial Awareness: It shifts 2D grounding from absolute to relative coordinates, supports judging object orientation, viewpoint changes, and occlusion relationships, and enables 3D grounding (a coordinate-conversion sketch follows this list).
Long Context Support and Extended Video Understanding: All model variants inherently support context lengths of 256K tokens, extendable up to 1 million tokens. Whether it's hundreds of pages of technical documents, entire textbooks, or videos spanning up to two hours, they can be fully ingested and retained.
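Since the text states only that 2D grounding moved from absolute pixel coordinates to relative coordinates, here is a minimal sketch of that conversion, assuming a 0-1000 relative grid (the exact scale Qwen3-VL uses is not specified here):

```python
def to_relative_bbox(bbox_px, img_w, img_h, scale=1000):
    """Convert an absolute pixel box (x1, y1, x2, y2) into
    resolution-independent relative coordinates on a 0..scale grid."""
    x1, y1, x2, y2 = bbox_px
    return [
        round(x1 / img_w * scale),
        round(y1 / img_h * scale),
        round(x2 / img_w * scale),
        round(y2 / img_h * scale),
    ]

# The same relative box describes the same region at any resolution.
print(to_relative_bbox((480, 270, 960, 540), 1920, 1080))  # [250, 250, 500, 500]
```

Relative coordinates make grounding outputs independent of the input resolution, which matters for a model that accepts images at native dynamic resolutions.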
Overall, Qwen3-VL-235B-A22B-Instruct excels in most metrics among non-reasoning models, outperforming closed-source models like Gemini 2.5 Pro and GPT-5. It demonstrates robust generalization capabilities and comprehensive performance in complex visual tasks.
Among reasoning models, Qwen3-VL-235B-A22B-Thinking surpasses Gemini 2.5 Pro on complex multimodal math benchmarks such as MathVision. While a gap remains against closed-source SOTA models on multidisciplinary problems, visual reasoning, and video understanding, it shows distinct advantages in agent capabilities, document understanding, and 2D/3D grounding.
For pure text tasks, the performance of both the Instruct and Thinking versions of Qwen3-VL-235B-A22B is on par with that of Qwen3-235B-A22B-2507.
Building on a native dynamic-resolution design, the team refined the architecture in three ways.
Firstly, MRoPE-Interleave distributes the time, height, and width components in an interleaved pattern across the rotary frequency channels, giving each axis coverage of the full frequency range. This more robust positional encoding improves long-video understanding while preserving image comprehension.
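To make the interleaving concrete, here is a toy sketch of how rotary frequency channels could be assigned to the time (t), height (h), and width (w) axes; the function and layout are illustrative assumptions, not Qwen3-VL's actual implementation:

```python
def mrope_axis_assignment(num_channels, interleave=True):
    """Assign each rotary frequency channel to one of three axes.

    Contiguous chunking confines each axis to a single frequency band;
    interleaving cycles t/h/w across channels, so every axis spans the
    full high-to-low frequency range.
    """
    if interleave:
        return ["thw"[i % 3] for i in range(num_channels)]
    third = num_channels // 3
    return ["t"] * third + ["h"] * third + ["w"] * (num_channels - 2 * third)

print(mrope_axis_assignment(12))         # ['t', 'h', 'w', 't', 'h', 'w', ...]
print(mrope_axis_assignment(12, False))  # ['t', 't', 't', 't', 'h', 'h', ...]
```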
Secondly, DeepStack technology is introduced, fusing multi-level ViT features to enhance visual detail capture and text-image alignment precision. The previous paradigm of single-layer visual token input in large multimodal models (LMMs) is replaced by injection across multiple layers of the large language model (LLM). This design effectively preserves rich visual information from low-level to high-level, boosting the model's visual understanding capabilities.
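The following PyTorch sketch illustrates the DeepStack idea as described above; the class name and fusion details are hypothetical, and the real design in Qwen3-VL may differ:

```python
import torch
import torch.nn as nn

class DeepStackInjector(nn.Module):
    """Sketch: project features from several ViT levels and add them to
    the hidden states of several LLM layers, instead of feeding a single
    layer's visual tokens only at the LLM input."""

    def __init__(self, vit_dim, llm_dim, num_levels):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Linear(vit_dim, llm_dim) for _ in range(num_levels)
        )

    def forward(self, llm_hiddens, vit_levels, visual_slice):
        # llm_hiddens: hidden states at each injection layer [B, T, llm_dim]
        # vit_levels: ViT features, low-level to high-level [B, N_vis, vit_dim]
        # visual_slice: positions of the visual tokens in the sequence
        for hidden, feats, proj in zip(llm_hiddens, vit_levels, self.projs):
            hidden[:, visual_slice] += proj(feats)
        return llm_hiddens

# Toy usage: inject 3 ViT levels into 3 decoder layers' hidden states.
B, T, N_vis = 1, 16, 4
hiddens = [torch.randn(B, T, 4096) for _ in range(3)]
vit_feats = [torch.randn(B, N_vis, 1024) for _ in range(3)]
out = DeepStackInjector(1024, 4096, 3)(hiddens, vit_feats, slice(0, N_vis))
```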
Thirdly, the original video temporal modeling mechanism T-RoPE is upgraded to a text timestamp alignment mechanism. This approach employs an interleaved 'timestamp-video frame' input format to achieve precise alignment of frame-level temporal information with visual content, improving semantic perception and temporal localization accuracy for actions and events in videos.
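A toy sketch of the interleaved 'timestamp-video frame' input described above; the textual marker format is an assumption for illustration:

```python
def interleave_timestamps(frame_tokens, fps):
    """Precede each frame's tokens with an explicit textual timestamp,
    so temporal position is aligned with visual content at frame level."""
    sequence = []
    for i, tokens in enumerate(frame_tokens):
        sequence.append(f"<{i / fps:.1f} seconds>")  # hypothetical marker
        sequence.append(tokens)
    return sequence

# Three frames sampled at 2 fps -> markers at 0.0s, 0.5s, 1.0s
print(interleave_timestamps(["<frame0>", "<frame1>", "<frame2>"], fps=2))
```

With explicit per-frame timestamps in the sequence, the model can localize an action or event to specific frames rather than inferring time from token position alone.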
Qwen3-VL can operate mobile phones and computers akin to human users, automatically completing numerous daily tasks such as opening apps, clicking buttons, and filling in information. This enables intelligent interaction and automated operations.
Qwen3-VL can scrutinize local details of images and engage in complex reasoning using tools.
Qwen3-VL integrates visual understanding and code generation capabilities for front-end development. For instance, it can convert hand-drawn sketches into web code or assist in debugging interface issues, thereby enhancing development efficiency.
Users can now call the officially provided API to try the Qwen3-VL series model Qwen3-VL-235B-A22B. A usage example follows:
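A minimal sketch using the OpenAI-compatible mode of Alibaba Cloud's DashScope service; the base URL follows DashScope's published compatible-mode convention, and the model identifier below is an assumption:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # set your own key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-vl-235b-a22b-instruct",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/demo.png"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(completion.choices[0].message.content)
```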
References: https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list