Recently, ByteDance, in collaboration with a team from the University of Hong Kong, officially launched the Mini-o3 model. Touted as a viable alternative to OpenAI's o3 in visual reasoning, this vision-language model (VLM) can extend its thinking process to dozens of turns at test time, despite being trained with a cap of only six interaction turns.
The core innovations of Mini-o3 are the challenging Visual Probe dataset, an iterative cold-start data collection pipeline, and an over-turn masking strategy. Together, these support a variety of reasoning modes, including depth-first search. At test time, the number of interaction turns can be scaled to more than 32, with accuracy improving significantly as the number of turns increases.
At present, Mini-o3 has achieved top-tier results at the 7B-parameter scale on benchmarks such as VisualProbe, V* Bench, HR-Bench, and MME-Realworld. Furthermore, the training code, model weights, and the Visual Probe dataset, which contains 4,500 examples, have all been open-sourced for the broader research community.
Mini-o3 pushes the boundaries of interaction depth and reasoning modes by introducing an effective training scheme for multimodal agents. The scheme enables the agent to invoke image tools over many turns, improving its adaptability and reasoning diversity on vision-centric tasks.
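To make the setup concrete, here is a minimal Python sketch of such a multi-turn image-tool loop. The `vlm.generate` call and the `crop_and_zoom` tool are hypothetical stand-ins, not the released Mini-o3 interfaces: at each turn the model either requests a crop of the image or commits to a final answer, until a turn cap is hit.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    thought: str                 # model's reasoning text for this turn
    tool_call: Optional[dict]    # e.g. {"name": "crop", "box": [x1, y1, x2, y2]}
    answer: Optional[str]        # non-None only when the model decides to answer

def run_episode(vlm, image, question, max_turns: int = 6) -> Optional[str]:
    """Roll out one interactive trajectory.

    `vlm.generate` and `crop_and_zoom` are hypothetical helpers standing in
    for the model call and the image tool; the real interfaces may differ.
    """
    history: List[Turn] = []
    current_view = image
    for _ in range(max_turns):
        turn = vlm.generate(question, current_view, history)  # assumed API
        history.append(turn)
        if turn.answer is not None:          # model chose to answer
            return turn.answer
        if turn.tool_call is not None:       # model asked to inspect a region
            current_view = crop_and_zoom(image, turn.tool_call["box"])
    return None  # ran out of turns without a final answer (an "over-turn")
```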
The training process is divided into two key stages: a cold-start supervised fine-tuning (SFT) stage that activates multi-turn tool use, followed by reinforcement learning (RL).
The team curated a challenging visual search dataset known as the Visual Probe dataset. It comprises 4,000 visual question-answer pairs for training and 500 for testing. Its defining characteristic is difficulty: the questions generally cannot be resolved in a handful of turns and instead demand extended, trial-and-error exploration.
To produce high-quality, diverse multi-turn trajectories, the team retains only those trajectories whose final answer is correct. Through this filtering process, they amassed approximately 6,000 cold-start trajectories from six examples.
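As a rough illustration of this answer-filtered collection step (not the team's actual pipeline), the sketch below assumes a hypothetical `sample_trajectory` helper that rolls out one multi-turn episode and returns its trajectory and final answer; only trajectories whose answer matches the ground truth are kept for SFT.

```python
def collect_cold_start(vlm, dataset, samples_per_question: int = 8):
    """Rejection sampling: keep only trajectories that end in a correct answer.

    `sample_trajectory` is a hypothetical helper that rolls out one multi-turn
    episode and returns (trajectory, final_answer).
    """
    kept = []
    for example in dataset:
        for _ in range(samples_per_question):
            trajectory, answer = sample_trajectory(
                vlm, example["image"], example["question"])
            if answer is not None and answer.strip() == example["answer"].strip():
                kept.append(trajectory)  # correct final answer -> keep for SFT
    return kept
```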
To allow more turns in each session, the team capped the maximum pixel count per image at two million. This straightforward adjustment fits more turns into the same context budget, thereby improving the solve rate on long-horizon problems.
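The pixel cap itself amounts to a simple aspect-preserving resize. Here is a sketch using Pillow; only the two-million-pixel budget comes from the article, the rest is illustrative:

```python
from PIL import Image

MAX_PIXELS = 2_000_000  # per-image pixel budget mentioned in the article

def cap_pixels(img: Image.Image, max_pixels: int = MAX_PIXELS) -> Image.Image:
    """Downscale an image so that width * height <= max_pixels,
    preserving the aspect ratio. Smaller images are returned unchanged."""
    w, h = img.size
    if w * h <= max_pixels:
        return img
    scale = (max_pixels / (w * h)) ** 0.5
    new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
    return img.resize(new_size, Image.LANCZOS)
```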
To deter the model from adopting a 'premature answering' strategy, the team also introduced an over-turn masking technique, which avoids penalizing responses that are still 'unfinished' when the turn limit is reached. By masking out the loss on these over-turn trajectories, the technique stops punishing the model for running out of turns and instead encourages it to keep exploring.
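In policy-gradient terms, the idea can be sketched as follows; this is a toy illustration under assumed tensor shapes, not the team's implementation. Trajectories that hit the turn cap without answering simply contribute no gradient, rather than being scored as failures.

```python
import torch

def masked_policy_loss(logprobs, advantages, overturned):
    """Toy policy-gradient loss with over-turn masking.

    logprobs:   (batch,) summed log-probabilities of each sampled trajectory
    advantages: (batch,) reward-derived advantages (e.g. correctness-based)
    overturned: (batch,) bool, True if the trajectory hit the turn cap
                without producing a final answer
    """
    mask = (~overturned).float()          # drop over-turn trajectories from the loss
    per_traj = -logprobs * advantages     # standard policy-gradient term
    # Normalize by the number of *kept* trajectories so over-turn samples
    # neither reward nor penalize the policy.
    return (per_traj * mask).sum() / mask.sum().clamp(min=1.0)
```

Normalizing by the number of kept trajectories keeps the gradient scale stable regardless of how many samples in a batch are masked out.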
Notably, despite a relatively modest cap on the number of turns during training, test-time trajectories can extend to dozens of turns, with accuracy rising as the number of turns grows. Over-turn masking is therefore pivotal to unlocking this test-time scaling of the interaction-turn count.
The pivotal finding of this study is that, although Mini-o3 (the blue line) was trained with only a six-turn cap, its accuracy on VisualProbe-Hard climbed steadily from 38% to 48% as the test-time cap on interaction turns was raised from 4 to 32. This suggests the model has genuinely learned to 'think', and the more thoroughly it thinks, the better the outcome. In contrast, the model trained without the over-turn masking strategy (the red line) stopped improving beyond six turns.
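Measuring this behaviour only requires re-running the same frozen policy under different turn caps; a brief sketch, reusing the hypothetical `run_episode` helper from the earlier example:

```python
def accuracy_vs_turn_cap(vlm, test_set, caps=(4, 8, 16, 32)):
    """Evaluate accuracy as the interaction budget grows, model unchanged.

    Relies on the hypothetical `run_episode` rollout helper sketched above.
    """
    results = {}
    for cap in caps:
        correct = 0
        for ex in test_set:
            pred = run_episode(vlm, ex["image"], ex["question"], max_turns=cap)
            correct += int(pred is not None and pred.strip() == ex["answer"].strip())
        results[cap] = correct / len(test_set)
    return results
```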
Across multiple visual search benchmark tests, Mini-o3 has set new state-of-the-art (SOTA) records, substantially outperforming existing open-source models. Particularly on the most challenging VisualProbe-Hard task, Mini-o3 achieved an accuracy of 48.0%, a significant improvement over the previous best, DeepEyes (35.1%).
Ablation experiments further validate the design of Mini-o3: removing the Visual Probe dataset, the cold-start SFT, or over-turn masking each results in a substantial drop in model performance.
The Mini-o3 research team comprises six authors, with Lai Xin and Junyi Li serving as co-first authors of the project.
Publicly available information reveals that Lai Xin is a researcher at ByteDance, specializing in large multimodal models. He completed his undergraduate studies at Harbin Institute of Technology and subsequently earned a Ph.D. from the Chinese University of Hong Kong. During his doctoral studies, he contributed as the first author to the Step-DPO project, which attained accuracies of 70.8% and 94.0% on MATH and GSM8K, respectively.
Junyi Li pursued his studies at Huazhong University of Science and Technology and is currently a Ph.D. student at the University of Hong Kong, engaging in research at ByteDance. In 2024, his project PartGLEE, for which he was the first author, was accepted by ECCV.
In their work on Mini-o3, the team explored how vision-language models (VLMs) can use image-based tools over many turns, developing a comprehensive recipe that spans the Visual Probe dataset, the iterative cold-start data pipeline, and over-turn masking during reinforcement learning.
The research team stated that Mini-o3's technical solutions offer practical guidance for the development of multi-turn interactive multimodal models and the application of reinforcement learning.