Paper Walkthrough — Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Monocular depth estimation is the prediction of distance in 3D space from a single 2D image. This “ill-posed and inherently ambiguous problem”, as stated in literally every paper on depth estimation, is a fundamental task in computer vision and robotics. At the same time, foundation models dominate the scene in deep-learning-based NLP and computer vision. Wouldn’t it be awesome if we could leverage their success for depth estimation too?
In today’s paper walkthrough we’ll dive into Depth Anything, a foundation model for monocular depth estimation. We will cover its architecture, the tricks used to train it, and how it is used for metric depth estimation.
Paper: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao, 19 Jan. 2024
Code: https://github.com/LiheYoung/Depth-Anything
Project Page: https://depth-anything.github.io/
Conference: CVPR 2024
Category: foundation models, monocular depth estimation
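Before we dig into the paper, here is a minimal inference sketch of how one might try the model from Python. It assumes the Hugging Face `transformers` depth-estimation pipeline and the `LiheYoung/depth-anything-small-hf` checkpoint; these are assumptions on my part, and the official repository linked above also provides its own inference code.

```python
# Minimal sketch: relative depth prediction with a Depth Anything checkpoint.
# Assumes the "LiheYoung/depth-anything-small-hf" weights on the Hugging Face Hub
# and the transformers depth-estimation pipeline; adjust to your setup.
import requests
from PIL import Image
from transformers import pipeline

# Load any RGB image; this URL is just a placeholder example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build the pipeline and run inference.
depth_estimator = pipeline(
    task="depth-estimation",
    model="LiheYoung/depth-anything-small-hf",
)
result = depth_estimator(image)

# result["predicted_depth"] is the raw tensor output;
# result["depth"] is a PIL image of the (relative) depth map.
result["depth"].save("depth_map.png")
```

Note that this produces relative depth; metric depth estimation, which we return to later, requires fine-tuned checkpoints.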
Outline
- Context & Background
- Method
- Qualitative Results
- Experiments & Ablations
- Further Readings & Resources
Context & Background
Why is depth such an important modality, and why use deep learning for it?
Put simply: to navigate through 3D space, one needs to know where things are and at what distance. Classical applications include collision avoidance, drivable-space detection, placing objects into virtual or augmented reality, creating 3D objects, navigating a robot to grab an object, and many more.