Depth Anything — A Foundation Model for Monocular Depth Estimation

Paper Walkthrough — Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Monocular depth estimation is the prediction of distance in 3D space from a single 2D image. This “ill-posed and inherently ambiguous problem”, as stated in literally every paper on depth estimation, is a fundamental problem in computer vision and robotics. At the same time, foundation models dominate the scene in deep-learning-based NLP and computer vision. Wouldn’t it be awesome if we could leverage their success for depth estimation too?

In today’s paper walkthrough we’ll dive into Depth Anything, a foundation model for monocular depth estimation. We will look at its architecture, the tricks used to train it, and how it is used for metric depth estimation.

Image by Sascha Kirch

Paper: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao, 19 Jan. 2024

Code: https://github.com/LiheYoung/Depth-Anything

Project Page: https://depth-anything.github.io/

Conference: CVPR 2024

Category: foundation models, monocular depth estimation

Other Walkthroughs: [BYOL], [CLIP], [GLIP], [SAM], [DINO]

Outline

  1. Context & Background
  2. Method
  3. Qualitative Results
  4. Experiments & Ablations
  5. Further Readings & Resources

Context & Background

Why is depth such an important modality, and why use deep learning for it?

Fig.1: Image and corresponding depth map. Image by Sascha Kirch and Depth Map created with Depth Anything Hugging Face Demo.

Put simply: to navigate through 3D space, one needs to know where all the stuff is and at what distance. Classical applications include collision avoidance, drivable-space detection, placing objects into virtual or augmented reality, creating 3D objects, navigating a robot to grab an object, and many…
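
If you want to try this yourself beyond the Hugging Face demo used for Fig. 1, the sketch below runs Depth Anything through the 🤗 transformers depth-estimation pipeline. Treat it as a minimal example under some assumptions: the checkpoint name `LiheYoung/depth-anything-small-hf` and the local file path `example.jpg` are placeholders you may need to swap for whatever checkpoint and image you actually use.

```python
# Minimal sketch: monocular depth estimation with a Depth Anything checkpoint
# via the Hugging Face depth-estimation pipeline.
# Assumes `transformers`, `torch`, and `Pillow` are installed, and that the
# checkpoint id "LiheYoung/depth-anything-small-hf" is the one you want.
from transformers import pipeline
from PIL import Image

# The pipeline wraps preprocessing, the model forward pass, and postprocessing.
depth_estimator = pipeline(
    task="depth-estimation",
    model="LiheYoung/depth-anything-small-hf",
)

image = Image.open("example.jpg")  # any RGB photo (hypothetical path)
result = depth_estimator(image)

# result["predicted_depth"] is the raw torch tensor predicted by the model,
# result["depth"] is a PIL image of the depth map, resized to the input size.
result["depth"].save("depth_map.png")
```

Note that this produces relative (affine-invariant) depth; metric depth estimation, which we cover later in this walkthrough, uses a separately fine-tuned variant of the model.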