Adapting Segment Anything Model for self-supervised monocular depth estimation
12 December 2024
Dongdong Zhang, Chunping Wang, Qiang Fu
Abstract

We address the problem of self-supervised monocular depth estimation. Given the importance of long-range correlations in depth estimation, we propose to use the Segment Anything Model (SAM) encoder, a specialized Vision Transformer (ViT), to model global context for accurate depth estimation. Because a ViT lacks the spatial inductive bias needed to capture local information, we also employ a convolutional neural network (CNN) encoder to help the network gather local detail. However, independent encoders lead to insufficient aggregation among features. To compensate for this deficiency, we design a heterogeneous fusion module that facilitates feature fusion by modeling the affinity among heterogeneous features. Because the original SAM imposes a prohibitive computational burden, we substitute a lightweight SAM variant to reduce the complexity of the entire network. Extensive experiments on the KITTI dataset show that the proposed model achieves highly competitive results.
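To make the described affinity-based fusion concrete, the following is a minimal sketch of how heterogeneous CNN (local) and ViT (global) features could be fused via cross-attention. The module name, layer choices, and tensor shapes are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class HeterogeneousFusion(nn.Module):
    """Hypothetical affinity-based fusion of CNN and ViT features.

    CNN features act as queries and ViT tokens as keys/values, so each spatial
    location can attend to global context. This is a sketch; the paper's actual
    heterogeneous fusion module may differ.
    """

    def __init__(self, cnn_channels: int, vit_channels: int,
                 embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Project both feature sources into a shared embedding space.
        self.cnn_proj = nn.Conv2d(cnn_channels, embed_dim, kernel_size=1)
        self.vit_proj = nn.Linear(vit_channels, embed_dim)
        # Affinity between heterogeneous features modeled with cross-attention.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.out_proj = nn.Conv2d(embed_dim, cnn_channels, kernel_size=1)

    def forward(self, cnn_feat: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # cnn_feat: (B, C_cnn, H, W) from the convolutional encoder
        # vit_tokens: (B, N, C_vit) from the SAM/ViT encoder
        b, _, h, w = cnn_feat.shape
        q = self.cnn_proj(cnn_feat).flatten(2).transpose(1, 2)  # (B, H*W, D)
        kv = self.vit_proj(vit_tokens)                          # (B, N, D)
        fused, _ = self.cross_attn(query=q, key=kv, value=kv)   # (B, H*W, D)
        fused = self.norm(fused + q)                            # residual + norm
        fused = fused.transpose(1, 2).reshape(b, -1, h, w)      # back to (B, D, H, W)
        return cnn_feat + self.out_proj(fused)                  # residual fusion

if __name__ == "__main__":
    # Toy shapes: a downsampled CNN feature map and 14x14 ViT tokens.
    fusion = HeterogeneousFusion(cnn_channels=64, vit_channels=768)
    cnn_feat = torch.randn(2, 64, 24, 80)
    vit_tokens = torch.randn(2, 14 * 14, 768)
    print(fusion(cnn_feat, vit_tokens).shape)  # torch.Size([2, 64, 24, 80])
```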

© 2024 SPIE and IS&T
Dongdong Zhang, Chunping Wang, and Qiang Fu "Adapting Segment Anything Model for self-supervised monocular depth estimation," Journal of Electronic Imaging 33(6), 063045 (12 December 2024). https://doi.org/10.1117/1.JEI.33.6.063045
Received: 14 April 2024; Accepted: 27 November 2024; Published: 12 December 2024
KEYWORDS: Image segmentation, Visual process modeling, Feature fusion, Education and training, Modeling, Semantics, Data modeling
