12/01 2025
Meta's MSL (Meta Superintelligence Labs) has just unveiled its 3D reconstruction model, SAM 3D.
Now, the 'Segment Anything' capability can directly produce 3D models.
Recently, Meta released the SAM 3 paper, which details a model adept at detecting, segmenting, and tracking objects in both images and videos, and which can also be prompted with short text phrases and image exemplars.
SAM 3D now extends these capabilities into three-dimensional space, allowing precise reconstruction of 3D objects and human figures from a single 2D image.
This release encompasses two entirely new models:
SAM 3D Objects model: Designed for object and scene reconstruction.
SAM 3D Body model: Specialized for human body and shape estimation.
Both models excel at transforming static 2D images into detailed 3D reconstructions.
The SAM 3D Objects model facilitates stable, vision-based 3D reconstruction and object pose estimation from a single natural image. This enables the reconstruction of fine 3D shapes, textures, and layouts of objects captured in everyday images.
When dealing with small objects, unusual viewpoints, or occlusion, the model leverages contextual information to compensate for missing pixel evidence and improve the reconstruction.
With SAM 3D Objects, users can select any object from a single image and swiftly generate a 3D model. This empowers users to precisely manipulate individual objects within the reconstructed 3D scene and view them from various angles.
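As a rough sketch of what this workflow implies, the model takes an image plus a user-selected instance mask and returns a textured mesh with its own pose, so each object can be manipulated independently. The interface below is an illustrative assumption, not the released SAM 3D API.

```python
# Minimal, illustrative sketch of the single-image, single-object 3D workflow.
# The interface below is an assumption for illustration, not Meta's released API.
from dataclasses import dataclass
import numpy as np

@dataclass
class Object3D:
    vertices: np.ndarray   # (V, 3) mesh geometry
    faces: np.ndarray      # (F, 3) triangle indices
    texture: np.ndarray    # (H, W, 3) texture map
    pose: np.ndarray       # (4, 4) object-to-scene transform (layout)

def reconstruct_object(image: np.ndarray, mask: np.ndarray) -> Object3D:
    """Placeholder standing in for SAM 3D Objects: image + instance mask in,
    textured mesh + pose out."""
    return Object3D(
        vertices=np.zeros((1, 3)),
        faces=np.zeros((0, 3), dtype=np.int64),
        texture=np.zeros((256, 256, 3), dtype=np.uint8),
        pose=np.eye(4),
    )

image = np.zeros((480, 640, 3), dtype=np.uint8)      # an everyday photo
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 200:300] = True                        # the user-selected object
obj = reconstruct_object(image, mask)
print(obj.pose.shape)   # each object carries its own pose, so the scene can be rearranged
```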
Previous 3D models were constrained by data availability, with limited real-world data restricting their application to synthetic or artificially staged scenes.
To tackle more challenging scenarios commonly encountered in everyday environments, a novel approach is imperative.
SAM 3D Objects takes that approach, pairing a robust data annotation engine with a new 3D training scheme.
Unlike text, images, or videos, creating 3D real-world data from scratch demands highly specialized skills, and 3D data collection is both inefficient and costly.
However, verifying or ranking candidate meshes is comparatively easy, so annotation can be scaled by building a data engine: annotators score multiple candidate meshes generated by a suite of models running in the loop, while 3D artists handle the hardest cases to fill remaining data gaps.
In this manner, researchers can annotate images of the physical world at scale, covering the shape, texture, and layout of 3D objects. The SAM 3D team annotated nearly 1 million distinct images in this way, producing approximately 3.14 million model-in-the-loop meshes.
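A minimal sketch of how such a model-in-the-loop scoring pipeline might be organized follows; function names, the 0-to-1 scoring scale, and the acceptance threshold are assumptions for illustration, not Meta's actual pipeline.

```python
# Illustrative sketch of a model-in-the-loop annotation engine.
# Function names, scoring scale, and threshold are assumptions, not Meta's pipeline.
import random

def generate_candidates(image_id: str, n: int = 4) -> list[str]:
    """Stand-in for the suite of models proposing candidate meshes for one object."""
    return [f"{image_id}/candidate_{i}.mesh" for i in range(n)]

def annotator_score(mesh: str) -> float:
    """Stand-in for a human annotator rating mesh quality (random here)."""
    return random.random()

accepted, artist_queue = [], []
for image_id in ["img_000001", "img_000002", "img_000003"]:
    scores = {mesh: annotator_score(mesh) for mesh in generate_candidates(image_id)}
    best_mesh, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score >= 0.7:              # good enough: add to the training set
        accepted.append(best_mesh)
    else:                              # hard case: route to a 3D artist
        artist_queue.append(image_id)

print(f"{len(accepted)} meshes accepted, {len(artist_queue)} images sent to 3D artists")
```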
To make the model handle natural images, later post-training stages align it to real-world data, and the data engine also supplies data for this post-training. As model robustness and output quality improve, the data engine in turn produces better data, creating a positive feedback loop.
To establish a benchmark for single-image 3D reconstruction of physical-world objects under a natural image distribution, Meta has also constructed the SAM 3D Artist Objects dataset (SA-3DAO). It is the first dataset dedicated to evaluating vision-based 3D reconstruction on physical-world images, featuring more challenging and diverse images and objects.

SAM 3D Objects demonstrates strong generalization and supports dense scene reconstruction. In head-to-head human preference tests, it is chosen over other leading models at a win rate of at least 5:1. With diffusion shortcuts and engineering optimizations, high-quality reconstructions can be generated in seconds.
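For intuition on why shortcut-style sampling is fast: the denoiser is conditioned on the step size, so a handful of large steps replaces the many small steps a standard diffusion sampler would take. The snippet below is a generic illustration of this idea, not Meta's actual sampler.

```python
# Generic illustration of few-step ("shortcut"-style) diffusion sampling.
# Not Meta's sampler: the denoiser here is a toy stand-in conditioned on step size.
import torch

def denoiser(x: torch.Tensor, t: float, step_size: float) -> torch.Tensor:
    """Toy stand-in for a step-size-conditioned denoising network."""
    return -x  # toy dynamics pulling samples toward the data manifold (here: the origin)

def sample(shape, num_steps: int = 4) -> torch.Tensor:
    x = torch.randn(shape)             # start from pure noise
    step = 1.0 / num_steps             # large steps -> only a few network evaluations
    t = 1.0
    for _ in range(num_steps):
        x = x + step * denoiser(x, t, step)
        t -= step
    return x

latent = sample((1, 64))               # e.g. a latent later decoded into a textured mesh
print(latent.shape)                    # torch.Size([1, 64])
```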
Meta states that the next step involves enhancing output resolution and refining object layout prediction.
Currently, the output resolution is capped at a 'moderate' level, limiting the detail representation of complex objects and potentially causing distortions or loss of detail. Hence, increasing the output resolution is the natural next step.
SAM 3D Objects currently predicts only one object at a time and has not been trained to reason about physical interactions between objects. Predicting multiple objects jointly, with appropriate loss functions, would make joint reasoning about the objects in a scene feasible.
SAM 3D Body accurately estimates 3D human poses and shapes, even in complex scenarios involving unusual poses, partially occluded images, or multiple people.
The model employs a new open-source 3D mesh format, Meta Momentum Human Rig (MHR), which improves interpretability by decoupling the human skeletal structure from soft-tissue shape.
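Conceptually, this decoupling means pose and shape live in separate parameter blocks. The sketch below illustrates the idea with assumed field names and dimensions; it is not the actual MHR specification.

```python
# Illustrative sketch of the pose/shape decoupling MHR is described as providing.
# Field names and dimensions are assumptions, not the actual MHR specification.
from dataclasses import dataclass
import numpy as np

@dataclass
class MHRParams:
    joint_rotations: np.ndarray   # (num_joints, 3) skeletal pose (articulation)
    body_shape: np.ndarray        # (num_shape_coeffs,) soft-tissue / identity shape
    root_transform: np.ndarray    # (4, 4) global placement of the person

    def with_pose(self, new_rotations: np.ndarray) -> "MHRParams":
        """Because pose and shape are separate blocks, the pose can be edited
        without touching the person's body shape."""
        return MHRParams(new_rotations, self.body_shape, self.root_transform)

params = MHRParams(
    joint_rotations=np.zeros((52, 3)),
    body_shape=np.zeros(16),
    root_transform=np.eye(4),
)
reposed = params.with_pose(np.random.randn(52, 3) * 0.1)  # same person, new pose
```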
Based on a Transformer encoder-decoder architecture, the model can predict MHR mesh parameters. The image encoder adopts a multi-input design to capture high-resolution details of human body parts, while the mesh decoder is extended to support prompt-based predictions.
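A minimal PyTorch sketch of such an architecture follows; layer sizes, the crop branch, and the prompt mechanism are simplified assumptions, not the released model.

```python
# Simplified sketch of a prompt-able encoder-decoder regressing MHR mesh parameters.
# Layer sizes, the crop branch, and the prompt mechanism are illustrative assumptions.
import torch
import torch.nn as nn

class Sam3DBodySketch(nn.Module):
    def __init__(self, dim=256, n_params=52 * 3 + 16):
        super().__init__()
        # Multi-input image encoder: one branch for the full frame, one for a high-res crop.
        self.full_branch = nn.Sequential(nn.Conv2d(3, dim, 16, stride=16), nn.Flatten(2))
        self.crop_branch = nn.Sequential(nn.Conv2d(3, dim, 16, stride=16), nn.Flatten(2))
        # Prompt (e.g. which person to focus on) enters the decoder as a query token.
        self.prompt_embed = nn.Embedding(16, dim)
        decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.head = nn.Linear(dim, n_params)   # regresses MHR pose + shape parameters

    def forward(self, full_img, crop_img, prompt_id):
        tokens = torch.cat([self.full_branch(full_img), self.crop_branch(crop_img)], dim=2)
        tokens = tokens.transpose(1, 2)                     # (B, N, dim) image memory
        query = self.prompt_embed(prompt_id).unsqueeze(1)   # (B, 1, dim) prompt query
        out = self.decoder(query, tokens)
        return self.head(out.squeeze(1))                    # (B, n_params) MHR parameters

model = Sam3DBodySketch()
params = model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256), torch.tensor([0]))
print(params.shape)  # torch.Size([1, 172])
```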
SAM 3D Body leverages large-scale, high-quality data and a robust training strategy to deliver accurate and reliable 3D human pose and shape estimation.
Researchers constructed a vast dataset containing billions of images and utilized a scalable automated data engine to mine high-value images. They assembled a high-quality training dataset comprising approximately 8 million images to train the model, enabling it to handle occlusions, rare poses, and various clothing types.
Moving forward, the research team has stated that they will incorporate interactions between humans, objects, and the environment into model training. They will also continue to refine the performance of hand pose estimation, which currently lags behind specialized hand pose estimation methods.
References:
https://ai.meta.com/blog/sam-3d/?utm_source=twitter&utm_medium=organic_social&utm_content=video&utm_campaign=sam