We address the problem of self-supervised depth estimation. Given the importance of long-range correlations in depth estimation, we propose to use the encoder of the Segment Anything Model (SAM), a Vision Transformer (ViT), to model global context for accurate depth estimation. We also employ a convolutional neural network (CNN) encoder to gather local information, since the ViT lacks the spatial inductive bias to model it. However, two independent encoders leave their features insufficiently aggregated. To compensate for this deficiency, we design a heterogeneous fusion module that fuses the two streams by modeling the affinity among their heterogeneous features. Because the original SAM imposes a prohibitive computational burden, we substitute a lightweight SAM variant to reduce the complexity of the entire network. Extensive experiments on the KITTI dataset show that our model achieves highly competitive results.
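As a rough illustration of the affinity-based fusion described above, the following PyTorch sketch cross-attends CNN features (as queries) to ViT tokens (as keys and values), so that each local position is refined by globally aggregated context. All names, dimensions, and design details here are hypothetical assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of the heterogeneous fusion module, assuming PyTorch.
# Class and parameter names (HeterogeneousFusion, vit_dim, cnn_dim, ...)
# are illustrative; the paper's real design may differ.
import torch
import torch.nn as nn


class HeterogeneousFusion(nn.Module):
    """Fuse ViT (global) and CNN (local) features via a learned affinity
    matrix, i.e. cross-attention between the two heterogeneous streams."""

    def __init__(self, vit_dim: int, cnn_dim: int, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(cnn_dim, dim)   # queries from local CNN features
        self.k = nn.Linear(vit_dim, dim)   # keys from global ViT tokens
        self.v = nn.Linear(vit_dim, dim)   # values from global ViT tokens
        self.proj = nn.Linear(dim + cnn_dim, cnn_dim)

    def forward(self, vit_tokens: torch.Tensor, cnn_feat: torch.Tensor):
        # vit_tokens: (B, N, vit_dim); cnn_feat: (B, cnn_dim, H, W)
        B, C, H, W = cnn_feat.shape
        local = cnn_feat.flatten(2).transpose(1, 2)          # (B, HW, cnn_dim)
        q, k, v = self.q(local), self.k(vit_tokens), self.v(vit_tokens)
        # Affinity between every local position and every global token.
        affinity = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        # Inject globally aggregated context back into the local stream.
        fused = self.proj(torch.cat([affinity @ v, local], dim=-1))
        return fused.transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    fusion = HeterogeneousFusion(vit_dim=384, cnn_dim=64)
    out = fusion(torch.randn(2, 196, 384), torch.randn(2, 64, 28, 28))
    print(out.shape)  # torch.Size([2, 64, 28, 28])
```

Concatenating the attended global context with the untouched local features, rather than replacing them, is one plausible way to let the module preserve the CNN's spatial detail while adding long-range information; the paper's actual fusion rule is not specified in the abstract.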