Analysis: Qwen3-VL-Embedding and Qwen3-VL-Reranker, a Unified Framework for Multimodal Retrieval and Ranking

Key Highlights

Figure 1: Depiction of a unified multimodal representation space. The Qwen3-VL-Embedding model series represents diverse data sources (text, images, visual documents, and videos) on a common manifold. By aligning semantic concepts across modalities (e.g., linking the text "urban architecture" with its corresponding image), the model achieves a comprehensive understanding of complex visual and textual information.
Problems Addressed: Multimodal relevance data are scarce and heavily imbalanced across modalities, tasks, and domains, and prior models struggle to represent text, images, visual documents, and videos in a single space.
Proposed Solution: Qwen3-VL-Embedding and Qwen3-VL-Reranker, built on the Qwen3-VL backbone, providing task-aware embeddings and pointwise reranking for multimodal inputs.
Applied Technologies: Data synthesis with Qwen3-VL-Instruct, two-stage hard negative mining, multi-stage training, knowledge distillation, Matryoshka Representation Learning (MRL), and Quantization-Aware Training (QAT).
Achieved Results: An overall score of 77.8 on MMEB-V2, leading results in visual document retrieval, 67.9 on the text-only MMTEB benchmark, and substantial reranking gains over baselines.
Model Architecture
Figure 2: Overview of Qwen3-VL-Embedding and Qwen3-VL-Reranker Architectures
Qwen3-VL-Embedding and Qwen3-VL-Reranker are designed to perform task-aware relevance judgments on multimodal instances.
Model Architecture Foundation: Both models are built on the Qwen3-VL backbone network and use causal attention. After training on large-scale multimodal, multi-task relevance data, they retain the backbone's world knowledge, multimodal perception, and instruction-following capabilities while acquiring relevance-assessment abilities. Models were trained at two sizes, 2B and 8B, as summarized in Table 1.

Embedding Method: The Embedding model extracts task-aware dense vectors for multimodal inputs; the input format follows the Qwen3-VL context structure.
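The template itself is not reproduced in this summary. As a rough illustration, here is a minimal sketch of what a task-aware multimodal input might look like, assuming a Qwen-style chat-message structure; the field names and the `Instruct:`/`Query:` prefixes are assumptions, not the official format:

```python
# Hypothetical sketch only: the field names and the "Instruct:"/"Query:"
# prefixes are assumptions modeled on Qwen-style chat messages, not the
# official Qwen3-VL-Embedding template.
def build_embedding_input(instruction: str, query: str, image: str | None = None) -> list:
    """Pack a task instruction, a query, and optional visual content
    into one multimodal message for the embedding model."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})  # local path or URL
    content.append({"type": "text", "text": f"Instruct: {instruction}\nQuery: {query}"})
    return [{"role": "user", "content": content}]
```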

Reranking Method: The Reranking model employs a pointwise ranking approach: given a query-document pair, it directly predicts a relevance judgment (see the Reranker loss function in the Implementation Details below).


Data
To equip the model with universal representation capabilities across different modalities, tasks, and domains, a large-scale dataset was curated. The distribution of different categories within the dataset is shown in Figure 3. However, both publicly available and proprietary internal data exhibit significant imbalances across these dimensions, with notable scarcity in specific scenarios. To address these challenges, data synthesis was employed to construct a balanced training corpus, ensuring robust coverage of all modalities, tasks, and domains.

Dataset Format
The complete dataset comprises multiple sub-datasets, each defined by a quadruple (broadly: a task instruction, a query, and the associated positive and negative documents).
Representative dataset examples are shown in Appendix A.
Data Synthesis
Data synthesis was employed to construct various sub-datasets. Specifically, the method introduced in Qwen3 Embedding was extended to the multimodal scenario. As shown in Figure 4, a diverse set of seed multimodal content (e.g., images and videos from the web) was first curated. Qwen3-VL-Instruct was then used to generate: (1) synthetic instructions, (2) synthetic queries, and (3) pseudo-relevance labels.
The specific process is as follows: (1) curate diverse seed multimodal content; (2) prompt Qwen3-VL-Instruct to generate a task instruction for each seed item; (3) generate a query conditioned on the instruction and the content; and (4) assign pseudo-relevance labels to form training pairs.
This synthetic approach enabled the creation of large-scale, diverse, and task-specific training data, addressing the scarcity of naturally occurring multimodal retrieval data.
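As an illustration of this loop, the sketch below shows the three generation steps under stated assumptions: `generate` stands in for a call to Qwen3-VL-Instruct, and every prompt string is hypothetical:

```python
# Illustrative sketch of the synthesis loop; `generate` stands in for a
# call to Qwen3-VL-Instruct and every prompt string is hypothetical.
def synthesize_example(seed_content, generate):
    """Turn one piece of seed web content (image/video) into a
    synthetic (instruction, query, document, relevance) record."""
    instruction = generate(
        "Write a retrieval-task instruction suited to this content.",
        media=seed_content)
    query = generate(
        "Following this instruction, write a query that this content should answer:\n"
        + instruction, media=seed_content)
    relevance = generate(
        "Rate the relevance of the content to this query from 0 to 2:\n" + query,
        media=seed_content)
    return {"instruction": instruction, "query": query,
            "document": seed_content, "relevance": relevance}
```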
Positive Sample Optimization and Hard Negative Mining
Hard negatives play a crucial role in contrastive representation learning. To improve the quality of positive pairs and identify effective hard negatives, an automated two-stage mining pipeline was implemented: Recall and Relevance Filtering.
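A minimal sketch of such a two-stage pipeline, assuming unit-normalized embeddings and a hypothetical `relevance_fn` scorer (e.g., a reranker); the `top_k` and `max_score` thresholds are illustrative:

```python
import numpy as np

def mine_hard_negatives(query_emb: np.ndarray, corpus_embs: np.ndarray,
                        positive_ids: set, relevance_fn,
                        top_k: int = 100, max_score: float = 0.4) -> list:
    """Stage 1 (Recall): take the query's top-k nearest neighbours in
    embedding space. Stage 2 (Relevance Filtering): drop known positives
    and any candidate the scorer judges too relevant, since those are
    likely false negatives. Embeddings are assumed unit-normalized."""
    sims = corpus_embs @ query_emb                 # cosine similarity per document
    candidates = np.argsort(-sims)[:top_k]         # recall stage
    return [int(i) for i in candidates             # filtering stage
            if int(i) not in positive_ids and relevance_fn(int(i)) < max_score]
```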
Training Strategy
To train the Qwen3-VL-Embedding and Qwen3-VL-Reranker models, a multi-stage training pipeline was adopted, as shown in Figure 5. This approach mitigates the imbalance between large volumes of weakly supervised data and scarce high-quality samples. The model was first pre-trained on large-scale, weakly supervised, noisy data to establish baseline relevance understanding and improve generalization. Fine-tuning was then performed on high-quality, task-specific datasets to guide the model toward more precise relevance scoring and fine-grained interactions. A further objective of the multi-stage strategy was to iteratively improve data quality and model performance: as training progressed through consecutive stages, the model's capabilities strengthened, which in turn enabled more effective data mining and higher-quality training data. This iterative cycle ultimately led to substantial overall performance gains.

Multi-Stage Training
The following three-stage training strategy was implemented: (1) large-scale pre-training on weakly supervised data; (2) supervised multi-task fine-tuning on high-quality data; and (3) a final stage of reranker-to-embedding distillation and model merging (the S0-S3 checkpoints referenced in the ablation studies below).
Implementation Details
The training process of the Embedding model utilized a multi-task learning approach, integrating various loss functions such as the InfoNCE loss, CoSent loss, MRL loss, binary quantization loss, and distillation loss.
Retrieval Tasks. For retrieval tasks, we opted for the InfoNCE loss. Given a query $q$ within a batch, its corresponding positive document $d^+$, and a set of negative documents $\mathbb{D}^-$, the loss is calculated as follows:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\big(\cos(q, d^+)/\tau\big)}{\exp\big(\cos(q, d^+)/\tau\big) + \sum_{d^- \in \mathbb{D}^-} \exp\big(\cos(q, d^-)/\tau\big)}$$

Here, $\cos$ denotes cosine similarity and $\tau$ is a temperature hyperparameter. We relied on in-batch negatives and supplemented them with hard negatives, mined as detailed in Section 3.3.
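For concreteness, here is a minimal PyTorch sketch of this loss with in-batch negatives only (explicit hard negatives would simply extend each row's candidate set); the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q: torch.Tensor, d: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives. q and d are (B, dim) query and
    document embeddings; row i of d is the positive for row i of q and
    every other row serves as a negative."""
    q = F.normalize(q, dim=-1)           # unit-norm so dot product = cosine
    d = F.normalize(d, dim=-1)
    logits = q @ d.T / tau               # (B, B) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)  # -log softmax over the diagonal
```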
Semantic Textual Similarity (STS) Tasks. For STS tasks, to capitalize on fine-grained similarity scores, we employed the CoSent loss:

$$\mathcal{L}_{\text{CoSent}} = \log\Bigg(1 + \sum_{(i,j) \in \mathcal{P}} \sum_{(k,l) \in \mathcal{N}} \exp\Big(\lambda\big(\cos(q_k, d_l) - \cos(q_i, d_j)\big)\Big)\Bigg)$$

In this formula, $\mathcal{P}$ and $\mathcal{N}$ denote the sets of positive and negative sample pairs, respectively, while $\lambda$ serves as a scaling factor.
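A small PyTorch sketch of this formula, assuming the positive- and negative-pair cosine similarities have already been computed; the scaling-factor value is illustrative:

```python
import torch

def cosent_loss(cos_pos: torch.Tensor, cos_neg: torch.Tensor, lam: float = 20.0) -> torch.Tensor:
    """CoSent loss: log(1 + sum over all (positive, negative) pair
    combinations of exp(lambda * (cos_neg - cos_pos))). cos_pos and
    cos_neg are 1-D tensors of cosine similarities for the positive and
    negative pairs, respectively."""
    diff = lam * (cos_neg.unsqueeze(0) - cos_pos.unsqueeze(1))  # (P, N) differences
    zero = torch.zeros(1, device=diff.device)                   # the "1 +" term as exp(0)
    return torch.logsumexp(torch.cat([zero, diff.flatten()]), dim=0)
```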
Classification Tasks. In classification tasks, we treated label descriptions as queries and inputs (such as images or videos) as documents. The loss function is similar to the previous one, but when constructing negative samples, we only included samples from different classes and excluded any samples from the same class to prevent false negatives.
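A sketch of this negative masking, under the same contrastive setup as above; shapes and the temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def classification_contrastive_loss(q: torch.Tensor, d: torch.Tensor,
                                    labels: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Contrastive loss for classification: label descriptions are the
    queries, inputs (images/videos) are the documents. Off-diagonal
    in-batch items sharing a class label are masked out so that no
    same-class sample is treated as a negative."""
    logits = F.normalize(q, dim=-1) @ F.normalize(d, dim=-1).T / tau  # (B, B)
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)           # (B, B) bool
    eye = torch.eye(labels.size(0), dtype=torch.bool, device=q.device)
    logits = logits.masked_fill(same_class & ~eye, float("-inf"))     # drop false negatives
    return F.cross_entropy(logits, torch.arange(labels.size(0), device=q.device))
```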
Knowledge Distillation. During Stage 3, knowledge distillation was implemented to align the score distributions of the Embedding model with those of the Reranker teacher model. For a given query and a set of candidate documents $\mathcal{D}$, the distillation loss is defined as:

$$\mathcal{L}_{\text{distill}} = \mathrm{KL}\big(p^{\text{teacher}} \,\|\, p^{\text{student}}\big) = \sum_{d \in \mathcal{D}} p^{\text{teacher}}(d) \log \frac{p^{\text{teacher}}(d)}{p^{\text{student}}(d)}$$

Here, $p^{\text{teacher}}$ and $p^{\text{student}}$ represent the softmax-normalized score distributions of the teacher and student models on the candidate documents, respectively.
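A minimal sketch of this alignment using PyTorch's KL divergence; the softening temperature is an assumption, not stated in the source:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor,
                 tau: float = 1.0) -> torch.Tensor:
    """KL divergence between the teacher's and student's softmax score
    distributions over one query's candidate documents. Inputs are
    (num_candidates,) raw score tensors."""
    p_teacher = F.softmax(teacher_scores / tau, dim=-1)
    log_p_student = F.log_softmax(student_scores / tau, dim=-1)
    # kl_div(input=log q, target=p) computes sum p * (log p - log q) = KL(p || q)
    return F.kl_div(log_p_student, p_teacher, reduction="sum")
```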
Matryoshka Representation Learning (MRL). To enable flexible embedding dimensions, we adopted MRL. For a set of nested dimensions $d_1 < d_2 < \cdots < d_K$, the total loss is computed as:

$$\mathcal{L}_{\text{MRL}} = \sum_{i=1}^{K} \alpha_i \, \mathcal{L}\big(\text{emb}_{d_i}\big)$$

In this equation, $\text{emb}_{d_i}$ denotes the embedding truncated to the first $d_i$ dimensions, and $\alpha_i$ is a weighting coefficient.
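A short sketch of this weighted sum, reusing the `info_nce_loss` function from the retrieval sketch above (which re-normalizes each truncated slice); the dimension list and weights are illustrative:

```python
def mrl_loss(q, d, dims=(256, 512, 1024), alphas=(1.0, 1.0, 1.0), tau=0.05):
    """Matryoshka loss: apply the same contrastive objective to the
    embeddings truncated to each nested dimension d_i and sum the
    results with weights alpha_i. Reuses info_nce_loss, which
    re-normalizes each truncated slice before scoring."""
    return sum(alpha * info_nce_loss(q[:, :k], d[:, :k], tau)
               for k, alpha in zip(dims, alphas))
```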
Quantization-Aware Training. To ensure high performance even after binary quantization, we incorporated a quantization loss during training. As suggested by Zhang et al. (2025c), instead of directly binarizing the embeddings, we used a pseudo-quantization regularization term to encourage binarization-friendliness:

$$\mathcal{L}_{\text{quant}} = \big\| \mathbf{e} - \text{sg}\big(\operatorname{sign}(\mathbf{e})\big) \big\|_2^2$$

Here, $\mathbf{e}$ represents the embedding vector, and $\text{sg}$ denotes the stop-gradient operation. This term encourages embedding vectors to cluster near the vertices of a hypercube.
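One common way to realize such a regularizer, sketched below; the exact form used in the report may differ:

```python
import torch

def pseudo_quantization_loss(e: torch.Tensor) -> torch.Tensor:
    """Pull each embedding toward its nearest hypercube vertex sign(e)
    while blocking gradients through the binarization (.detach() plays
    the role of the stop-gradient sg)."""
    target = torch.sign(e).detach()   # sg(sign(e)): a fixed {-1, +1} target
    return ((e - target) ** 2).mean()
```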
Loss Function for the Reranker Model
We framed re-ranking as a binary classification problem: given a query-document pair, the model predicts either a "yes" token (indicating relevance) or a "no" token (indicating non-relevance).
The loss function is given by:

$$\mathcal{L}_{\text{rerank}} = -\log P\big(y \mid \mathbf{q}, \mathbf{d}\big)$$

where $P(\text{yes} \mid \mathbf{q}, \mathbf{d})$ denotes the probability assigned by the VLM, and the label $y$ is set to "yes" for positive pairs and "no" for negative pairs. This loss encourages the model to assign higher probability to the correct label, thereby enhancing ranking performance (Dai et al., 2025).
During the inference stage, the final relevance score is computed by applying the sigmoid function to the difference between the logits of the "yes" and "no" tokens:

$$s(\mathbf{q}, \mathbf{d}) = \sigma\big(z_{\text{yes}} - z_{\text{no}}\big)$$

where $z_{\text{yes}}$ and $z_{\text{no}}$ are the logits of the respective label tokens.
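A one-line sketch of this scoring rule, assuming access to the final-position logits and the tokenizer ids of the two label tokens:

```python
import torch

def rerank_score(last_token_logits: torch.Tensor, yes_id: int, no_id: int) -> torch.Tensor:
    """Relevance score at inference time: sigmoid of the gap between
    the 'yes' and 'no' token logits. last_token_logits is the
    (vocab_size,) logit vector at the final position; yes_id and no_id
    are the tokenizer ids of the two label tokens."""
    return torch.sigmoid(last_token_logits[yes_id] - last_token_logits[no_id])
```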
Evaluation Results

MMEB-V2 Benchmark: Qwen3-VL-Embedding-8B achieved an overall score of 77.8, with strong performance across all subtasks, including image, video, and visual-document retrieval. It outperformed open-source models such as VLM2Vec and GME, as well as closed-source models like Google Gemini Embedding and OpenAI text-embedding-3-large.
Visual Document Retrieval: On multiple benchmarks, including VisRAG and ViDoRe, the Qwen3-VL-Embedding and Reranker series models showcased dominant performance, surpassing models like ColPali and ColQwen2.
Text Benchmarks: Despite being a multimodal model, Qwen3-VL-Embedding-8B achieved an average score of 67.9 on the MMTEB text-only benchmark, which is comparable to the performance of pure text Embedding models of similar scale.
Reranking Performance: Qwen3-VL-Reranker-8B significantly outperformed baseline models in most re-ranking tasks, showing substantial improvements over its 2B version.
Ablation Experiments
MRL and Quantization: Experiments revealed that performance declined as dimensions decreased. However, within a reasonable range (e.g., from 1024 to 512), the performance loss was minimal (~1.4%), while achieving a 50% reduction in storage. Int8 quantization incurred almost no precision loss, whereas binary quantization exhibited more noticeable performance degradation at lower dimensions.
Spatiotemporal Granularity Impact: Increasing the number of tokens for images and frames for videos improved performance, but with diminishing marginal returns. Excessively long contexts could even lead to a slight decline in performance.
Multi-Stage Training Effects: Ablation studies indicated that the transition from S0 to S1 (multi-task fine-tuning) yielded significant improvements. S2 (distillation) substantially enhanced retrieval task performance at the expense of some classification capabilities. The final S3 (merging) successfully balanced all capabilities, achieving optimal overall performance.

Conclusion
This report introduces Qwen3-VL-Embedding and Qwen3-VL-Reranker, a series of cutting-edge models designed for multimodal retrieval. By combining a multi-stage training pipeline with high-quality multimodal data and leveraging the multimodal knowledge and general understanding of the Qwen3-VL base model, the series achieves state-of-the-art performance across a wide range of multimodal retrieval benchmarks while maintaining strong text-only capabilities.
Furthermore, by incorporating Matryoshka Representation Learning (MRL) and Quantization-Aware Training (QAT), the Qwen3-VL-Embedding series offers strong practical deployment characteristics, significantly reducing downstream computational costs while preserving performance. Looking ahead, promising research directions include support for more modalities, more efficient training paradigms, stronger compositional reasoning, and more comprehensive evaluation protocols. The authors argue that these models represent a significant advancement in multimodal retrieval technology and hope they will drive further innovation in this rapidly evolving field.
References
[1] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking