Repurposing Marigold for Zero-Shot Metric Depth Estimation via Defocus Blur Cues

Abstract

Recent monocular metric depth estimation (MMDE) methods have made notable progress towards zero-shot generalization. However, they still exhibit a significant performance drop on out-of-distribution datasets. We address this limitation by injecting defocus blur cues at inference time into Marigold, a pre-trained diffusion model for zero-shot, scale-invariant monocular depth estimation (MDE). Our method effectively turns Marigold into a metric depth predictor in a training-free manner.

To incorporate defocus cues, we capture two images with a small and a large aperture from the same viewpoint. To recover metric depth, we then optimize the metric depth scaling parameters and the noise latents of Marigold at inference time using gradients from a loss function based on the defocus-blur image formation model. We compare our method against existing state-of-the-art zero-shot MMDE methods on a self-collected real dataset, showing quantitative and qualitative improvements.

Paper at a Glance

We capture two images of a scene from the same viewpoint using a camera focused at a fixed distance: a sharp, all-in-focus image (using a high F-stop number) and a defocused image (using a lower F-stop number to introduce blur). Given the sharp image and an initial learnable noise vector representing depth, the Marigold framework estimates a relative depth map. Importantly, Marigold itself is a fixed, training-free framework—its architecture and operations do not change during optimization.

The estimated relative depth is then converted to metric depth via an affine transformation with learnable parameters. Using the sharp image, the estimated metric depth, and known camera parameters, we synthesize a defocused image using a differentiable forward model of defocus blur.

Training is guided by minimizing the L2 loss between the synthesized defocused image and the actual captured defocused image. This loss is backpropagated to update the learnable noise vector and affine transformation parameters, enabling the recovery of scene depth without training the Marigold model itself.

Results

Our method consistently estimates accurate metric depth across all the scenes

**Table 1:** Our method with the Disc PSF outperforms all MMDE baselines averaged over all scenes in our dataset. The disc PSF, being more consistent with the real camera PSF, also outperforms the Gaussian PSF.
Method	RMSE ↓	REL ↓	log10 ↓	δ₁ ↑	δ₂ ↑	δ₃ ↑	CD ↓	FA ↑
MLPro	0.468	0.246	0.105	0.597	0.821	0.990	0.205	0.696
UniDepth	0.574	0.358	0.152	0.263	0.757	0.902	0.260	0.612
Metric3D	0.349	0.195	0.087	0.611	0.958	0.983	0.135	0.814
Ours - Gaussian	0.528	0.279	0.142	0.422	0.695	0.928	0.241	0.652
Ours - Disc	0.273	0.125	0.052	0.879	0.975	0.991	0.103	0.870

Dataset

The dataset comprises seven diverse real-world indoor scenes captured at multiple defocus blur levels. RGB images were acquired using a Canon EOS 5D Mark II with a 21MP sensor (5616 × 3744 resolution, 6.41 μm pixel pitch) and an RGGB Bayer pattern. A 50mm Canon lens was used, with F-stop settings ranging from f/1.4 to f/22, and the focus distance fixed at 80 cm. Ground truth depth was obtained using an Intel RealSense D435 stereo depth camera, which offers less than 2% depth error at 2 meters. To reduce noise such as flying pixels, each depth map is averaged over 60 frames. The dataset can be found at this link. Please see the Readme in the dataset as well.

BibTeX

@misc{talegaonkar2025repurposingmarigoldzeroshotmetric,
      title={Repurposing Marigold for Zero-Shot Metric Depth Estimation via Defocus Blur Cues}, 
      author={Chinmay Talegaonkar and Nikhil Gandudi Suresh and Zachary Novack and Yash Belhe and Priyanka Nagasamudra and Nicholas Antipa},
      year={2025},
      eprint={2505.17358},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.17358}, 

}