
Meta AI and Carnegie Mellon launch MapAnything: the future of 3D reconstruction

Diego Cortés
Full Stack Developer & SEO Specialist

The collaboration between Meta Reality Labs and Carnegie Mellon University has produced MapAnything, a 3D reconstruction model that promises to transform the landscape of computer vision. Built on an end-to-end transformer architecture, the system generates metric 3D scene geometry from images and optional sensor inputs. Released under the Apache 2.0 license together with its training and benchmarking code, MapAnything marks a milestone by supporting more than 12 different 3D vision tasks in a single feed-forward model.

The Need for a Universal Model for 3D Reconstruction

Traditionally, image-based 3D reconstruction has relied on fragmented pipelines: feature detection, multi-view pose estimation, monocular depth inference, and so on. While these methods have proven effective, they require task-specific tuning, optimization, and extensive post-processing.

Recent models utilizing transformers, such as DUSt3R, MASt3R, and VGGT, have simplified aspects of this process, but still face limitations such as fixed view spaces, stringent assumptions about cameras, and reliance on representations that demand costly optimizations.

MapAnything stands out by overcoming these limitations. This model can take up to 2,000 input images in a single inference, allows for flexible incorporation of auxiliary data such as camera intrinsics and depth maps, and generates metric 3D reconstructions directly without the need for complicated adjustments. Its modular and generalized approach offers significant advancements over previous methods.
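Because the code and pre-trained weights are public, a typical call would look roughly like the sketch below. Note that the import path, class name, checkpoint identifier, and argument names here are illustrative assumptions rather than the project's documented API; consult the released repository for the real interface.

```python
# Illustrative sketch only: the import path, class name, checkpoint id, and
# argument names are assumptions, not the documented MapAnything interface.
import torch

from mapanything import MapAnything  # hypothetical import path

model = MapAnything.from_pretrained("map-anything-apache")  # hypothetical checkpoint id
model.eval()

images = torch.rand(8, 3, 518, 518)       # N input views
intrinsics = torch.rand(8, 3, 3)          # optional calibration prior
depth_prior = torch.rand(8, 1, 518, 518)  # optional metric depth prior

with torch.no_grad():
    # Auxiliary inputs are optional: images alone already yield a metric
    # reconstruction, while extra priors tighten the estimate.
    reconstruction = model(images=images, intrinsics=intrinsics, depth=depth_prior)
```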

Architecture and Representation

The architecture of MapAnything is based on a multi-view alternating attention transformer. Each input image is encoded using DINOv2 ViT-L features, while complementary inputs (rays, depth, and poses) are encoded into the same latent space through shallow CNNs or MLPs. A learnable scale token enables metric normalization across different views.
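To make the encoding stage concrete, the sketch below mirrors the description above with heavily simplified stand-ins: a patchifying convolution plays the role of the DINOv2 ViT-L backbone, the shallow encoders and the learnable scale token follow the text, and the multi-view alternating attention blocks are omitted. Shapes, layer sizes, and input layouts are assumptions, not the released implementation.

```python
# Simplified sketch of MapAnything's encoding stage; not the released code.
import torch
import torch.nn as nn

class MultiViewEncoderSketch(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        # Stand-in for the DINOv2 ViT-L backbone: a simple patchify convolution.
        self.image_backbone = nn.Conv2d(3, dim, kernel_size=14, stride=14)
        # Shallow encoders project optional geometric inputs into the same latent space.
        self.ray_encoder = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        self.depth_encoder = nn.Conv2d(1, dim, kernel_size=14, stride=14)
        self.pose_encoder = nn.Sequential(nn.Linear(12, dim), nn.GELU(), nn.Linear(dim, dim))
        # A single learnable token carries the metric scale across views.
        self.scale_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, images, rays=None, depth=None, poses=None):
        # images: (N, 3, H, W) for N input views
        n = images.shape[0]
        tokens = self.image_backbone(images).flatten(2).transpose(1, 2)  # (N, P, dim)
        if depth is not None:                      # depth: (N, 1, H, W)
            tokens = tokens + self.depth_encoder(depth).flatten(2).transpose(1, 2)
        if rays is not None:                       # rays: (N, P, 3), an assumed layout
            tokens = tokens + self.ray_encoder(rays)
        if poses is not None:                      # poses: (N, 12) flattened [R|t]
            tokens = tokens + self.pose_encoder(poses).unsqueeze(1)
        scale = self.scale_token.expand(n, -1, -1)
        # The concatenated tokens would next go through alternating attention blocks.
        return torch.cat([scale, tokens], dim=1)
```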

The network provides a factorized representation that includes:

  • Ray directions per view, facilitating camera calibration.
  • Depth along each ray, predicted up to scale.
  • Camera poses relative to a reference view.
  • A single learnable metric scale factor that transforms local reconstructions into a globally consistent system.

This explicit representation prevents redundancies and allows the model to address various tasks, from monocular depth estimation to structure-from-motion (SfM) and depth completion, without the need for specialized heads.
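To see why this factorization is convenient, it helps to write out how the pieces compose into a globally consistent metric point cloud. The sketch below assumes per-pixel unit ray directions and camera-to-reference poses of the form [R|t]; the released model's exact conventions may differ.

```python
# Composing the factorized outputs (rays, depth, pose, metric scale) into a
# global metric point cloud, under assumed conventions.
import torch

def compose_pointmap(rays, depth, rotation, translation, metric_scale):
    """rays: (H, W, 3) unit ray directions in the camera frame
    depth: (H, W) up-to-scale depth along each ray
    rotation: (3, 3), translation: (3,) pose of this view w.r.t. the reference view
    metric_scale: scalar that turns the up-to-scale reconstruction into metres."""
    points_cam = rays * depth.unsqueeze(-1)             # (H, W, 3) local geometry
    points_ref = points_cam @ rotation.T + translation  # into the reference frame
    return metric_scale * points_ref                    # globally consistent metric points

# Toy usage with random values, just to show the shapes involved.
H, W = 4, 6
rays = torch.nn.functional.normalize(torch.randn(H, W, 3), dim=-1)
depth = torch.rand(H, W)
R, t = torch.eye(3), torch.zeros(3)
points = compose_pointmap(rays, depth, R, t, metric_scale=torch.tensor(2.5))
print(points.shape)  # torch.Size([4, 6, 3])
```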

Training Strategy

MapAnything was trained using 13 varied datasets covering indoor, outdoor, and synthetic domains, including BlendedMVS, Mapillary Planet-Scale Depth, ScanNet++, and TartanAirV2. Two versions of the model have been released:

  • An Apache 2.0 licensed model trained on six datasets.
  • A CC BY-NC licensed model that has been trained on all thirteen datasets for superior performance.

Key training strategies include:

  • Probabilistic input dropout: During training, geometric inputs (rays, depth, and pose) are provided with variable probabilities, making the model robust to diverse input configurations (see the sketch after this list).
  • Covisibility-based sampling: Ensures that input views have significant overlap, allowing reconstruction from more than 100 views.
  • Logarithmic space factorized losses: Depth, scale, and pose are optimized using robust, scale-invariant regression losses, thereby improving the model's stability.
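A minimal sketch of the input-dropout idea is shown below; the dropout probabilities are illustrative, since the article only describes the mechanism in outline.

```python
# Sketch of probabilistic input dropout with illustrative probabilities;
# the actual schedules used to train MapAnything are not reproduced here.
import random

def sample_geometric_inputs(batch, p_rays=0.5, p_depth=0.5, p_pose=0.5):
    """Randomly withhold optional geometric inputs so the model learns to
    reconstruct from images alone as well as from rich sensor priors."""
    inputs = {"images": batch["images"]}
    if "rays" in batch and random.random() < p_rays:
        inputs["rays"] = batch["rays"]      # calibration prior
    if "depth" in batch and random.random() < p_depth:
        inputs["depth"] = batch["depth"]    # metric depth prior
    if "poses" in batch and random.random() < p_pose:
        inputs["poses"] = batch["poses"]    # relative pose prior
    return inputs
```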

Training was performed on 64 H200 GPUs using mixed precision, with gradient checkpointing and a curriculum schedule that progressed from 4 to 24 input views.
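The per-step mechanics of such a setup can be sketched as follows; the model, loss, and curriculum schedule are placeholders, and the multi-GPU orchestration across the 64 H200s is omitted.

```python
# Sketch of the training-loop mechanics described above: mixed precision,
# gradient checkpointing, and a view-count curriculum. Placeholder model/loss.
import torch
from torch.utils.checkpoint import checkpoint

def views_for_epoch(epoch: int, total_epochs: int, lo: int = 4, hi: int = 24) -> int:
    """Linearly grow the number of input views over the course of training."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return round(lo + frac * (hi - lo))

def train_step(model, optimizer, images):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        # Checkpointing the forward pass trades compute for memory, which is
        # what makes many high-resolution views fit on a single GPU.
        preds = checkpoint(model, images, use_reentrant=False)
        loss = preds.mean()  # stand-in for the factorized log-space losses
    loss.backward()
    optimizer.step()
    return loss.detach()
```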

Benchmarking Results

Dense Multi-View Reconstruction

In tests conducted on the ETH3D, ScanNet++ v2, and TartanAirV2-WB datasets, MapAnything achieved state-of-the-art (SoTA) results in terms of point clouds, depth estimation, poses, and rays. This model outperformed references like VGGT and Pow3R, even when limited to using images alone. Additionally, its performance significantly improves with calibration data or pose priors.

For example, the relative point-cloud error was reduced to 0.16 using images alone, improving on the 0.20 recorded by VGGT. When images are combined with intrinsics, poses, and depth, the error drops to 0.01, with inlier ratios above 90%.

Two-View Reconstruction

MapAnything consistently surpasses DUSt3R, MASt3R, and Pow3R in scale, depth, and pose accuracy. With additional priors, it achieves over 92% inlier ratios on two-view tasks, significantly outperforming previous feed-forward models.

Single-View Calibration

Although MapAnything was not specifically trained for single image calibration, it achieved an average angular error of 1.18°, outperforming AnyCalib (2.01°) and MoGe-2 (1.95°).

Depth Estimation

In robust multi-view depth (MVD) evaluations, MapAnything sets a new SoTA record for metric depth estimation across multiple views. With auxiliary inputs, its error rates rival or surpass those of specialized depth models such as MVSA and Metric3D v2.

Overall, the benchmarks show improvements of up to 2× over previous SoTA methods across a range of tasks, validating the benefits of a unified training approach.

Key Contributions

The research team attributes four main contributions to MapAnything:

  • A unified feed-forward model capable of handling over 12 problem configurations, ranging from monocular depth to SfM and stereo.
  • A factorized scene representation that allows for explicit separation of rays, depth, poses, and metric scale.
  • State-of-the-art performance across various benchmarks, with reduced redundancies and greater scalability.
  • Open-source publication, including data processing, training scripts, benchmarks, and pre-trained weights under the Apache 2.0 license.

Conclusion

MapAnything sets an innovative standard in 3D vision, unifying multiple reconstruction tasks, such as SfM, stereo, depth estimation, and calibration, within a single transformer-based model with a factorized scene representation. This model not only exceeds specialized methods across a range of benchmarks but also adapts efficiently to heterogeneous inputs, such as intrinsics, poses, and depth. With its open-source code, pre-trained models, and support for over 12 different tasks, MapAnything lays the foundation for a true general-purpose 3D reconstruction approach.

To learn more about advancements in technology and innovation, I invite you to continue exploring more content on my blog.
