RGB-D salient object detection is a challenging task in computer vision, and deep architectures have been widely adopted in previous studies. However, current convolutional neural network (CNN)-based models struggle to capture global long-distance features efficiently, whereas transformer-based methods are computationally intensive. To address these limitations, we propose a nonconvolutional, involution-based feature encoder. This encoder captures long-distance dependencies at reduced computational cost, making it a potential alternative to CNNs and transformers. Additionally, we introduce a spatial information enhancing mechanism that compensates for the weakening of local information that occurs when capturing long-range dependencies. This mechanism balances local and global information at different expansion rates by exploring multiscale feature fusion in the feature maps. Furthermore, we introduce a spatial information sensing module to enhance the compatibility of multimodal features in long-range dependencies and to extract informative cues from depth features. Through comprehensive experiments on four widely used datasets, we demonstrate that the proposed involution encoder significantly outperforms previous state-of-the-art CNN-based RGB-D salient object detection methods on four key metrics. Compared with transformer-based methods, our approach achieves a favorable balance between speed and efficiency.
RGB color model
Object detection
Feature fusion
Lithium
Convolution
Feature extraction
Performance modeling