Saliency maps have been increasingly employed in detection and recognition tasks, since they highlight the areas of an image that attract human attention. Fusing a saliency map with the original image can ease the detection of objects in complex scenes or under heavy occlusion. However, we find that the existing fusion techniques of element-wise addition and element-wise multiplication directly alter image pixels, which may destroy the shape or texture information of the target and thus degrade detection performance. Therefore, we propose a saliency-guided RT-DETR for object detection, which promotes the integration of the original image and the saliency map while preserving object details. Specifically, we design a dual-stream fusion module that feeds the saliency-weighted image and the original image to the backbone as dual-stream inputs for independent analysis, and employs cross-attention enhancement feature units to align the two feature streams, thereby facilitating feature interaction and enhancement. We then apply a hybrid feature fusion module to effectively fuse multi-scale features and capture target information comprehensively. The proposed method achieves the best values on all metrics of the newly designed COCO dataset, improving accuracy by 1.4% and 0.9% over element-wise addition fusion and element-wise multiplication fusion, respectively. Additionally, in the saliency-weighted object precision test, our model demonstrates superior performance across all metrics.
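To make the dual-stream design concrete, the following is a minimal PyTorch sketch of one possible cross-attention enhancement feature unit. The module name, layer choices, and dimensions are illustrative assumptions for exposition, not the authors' released implementation: each stream queries the other, residual connections preserve the original detail, and a linear layer merges the two aligned streams.

```python
import torch
import torch.nn as nn

class CrossAttentionFusionUnit(nn.Module):
    """Hypothetical sketch of a cross-attention enhancement feature unit:
    features from the original-image stream attend to features from the
    saliency-weighted stream (and vice versa), then the two are fused."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn_o2s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_s2o = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_o = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)  # merge the two aligned streams

    def forward(self, feat_orig: torch.Tensor, feat_sal: torch.Tensor) -> torch.Tensor:
        # feat_orig, feat_sal: (B, N, C) token sequences from the two backbone streams.
        # Original stream queries the saliency stream for attention-guided cues.
        o, _ = self.attn_o2s(feat_orig, feat_sal, feat_sal)
        # Saliency stream queries the original stream to recover fine detail.
        s, _ = self.attn_s2o(feat_sal, feat_orig, feat_orig)
        o = self.norm_o(feat_orig + o)  # residual keeps original shape/texture info
        s = self.norm_s(feat_sal + s)
        return self.fuse(torch.cat([o, s], dim=-1))  # fused feature for later fusion

# Example: fuse 1024 tokens of 256-dim features from each stream.
unit = CrossAttentionFusionUnit(dim=256)
f_orig = torch.randn(2, 1024, 256)
f_sal = torch.randn(2, 1024, 256)
fused = unit(f_orig, f_sal)  # (2, 1024, 256)
```

Unlike pixel-level addition or multiplication, a unit of this kind operates in feature space, which is one way the approach can exchange saliency cues without overwriting the target's shape or texture in the input image.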