1. INTRODUCTION

As an important and challenging topic in computer vision, semantic segmentation assigns a semantic label to every pixel in an image. Like many other applications, semantic segmentation has achieved impressive progress in recent years, benefiting from the success of Deep Neural Networks (DNNs)1,2. Shelhamer et al.3 proposed the Fully Convolutional Network (FCN), a pioneering work in semantic segmentation using DNNs. Since then, FCN-based approaches2,4 have been used in various segmentation scenarios. Relying on learnable convolutions, this kind of method can capture rich semantic information. However, the results are still not satisfactory. An important reason is that the locality of the convolution operation prevents it from utilizing the global information in the image under study. To address this problem, inspired by techniques from Natural Language Processing (NLP), Wang et al.5 designed a simple and efficient Non-local Block (NLB) that successfully combines non-local means6 with CNNs. This work first introduced self-attention into computer vision and thus became a milestone in semantic segmentation. Moreover, the building block, i.e., the NLB, can be plugged into many existing DNNs to improve their performance, so researchers have devoted increasing attention to it. Subsequent works mainly focus on reducing the complexity of the block7,8. In this paper, motivated by an earlier work9 on the General Non-Local denoising model based on Multi-Kernel-Induced Measures (GNLMKIM), we design a novel non-local block called the Multi-kernel Non-local Block (MKNLB). With the multi-kernel strategy, MKNLB detects edges more effectively and thus achieves stronger segmentation performance. Additionally, thanks to the distributive law of matrix multiplication, the computational burden of MKNLB is comparable to that of the standard NLB.
The effectiveness of the method is investigated on two benchmark semantic segmentation datasets (Cityscapes10 and ADE20K11). Measured by the mean intersection over union (mIoU), our approach significantly outperforms the methods using the standard NLB.

2. RELATED WORKS

2.1 Multi-kernel model for non-local means

Non-local Means (NLM)6 is a classical filter that uses a dissimilarity measure between patches to operate over a non-local area (even the entire image). Actually, the mathematical model behind non-local means is not unique. For example, in Reference 9, GNLMKIM employs multiple Gaussian kernels to define the measure and applies a Shannon regularizer to balance the linear combination of the kernels. The model can be defined by

\[
\min_{\lambda}\ \sum_{i,j\in\Omega}\sum_{t=1}^{k}\lambda_t\big(1-G_t(x_i,x_j)\big)
+ p\sum_{t=1}^{k}\lambda_t\ln\lambda_t,
\quad\text{s.t.}\ \sum_{t=1}^{k}\lambda_t=1,\ \lambda_t\ge 0,
\tag{1}
\]

where i and j denote pixel positions from the definition domain Ω of the image x; xi and xj are the two corresponding image patches; Gt (t = 1, …, k) is the Gaussian kernel used to measure the similarity between the two patches, so 1 − Gt(xi, xj) naturally becomes the dissimilarity between them. λt can be viewed as the importance of the single kernel Gt, and p is the regularization parameter that trades off the two terms of the model. As stated in Reference 9, the output of NLM can be derived from this optimization model in the single-kernel case. Meanwhile, with the multi-kernel strategy, the filters derived from the model usually have a stronger ability to detect edges. This fact motivates us to modify the NLB with a multi-kernel strategy.

2.2 Non-local block

The non-local block, i.e., NLB5, captures the long-range dependencies between pixels and has thus become key to semantic segmentation. Specifically, the block is defined as

\[
z_i = W_z\!\left(\frac{1}{S(x)}\sum_{j\in\Omega} f(x_i,x_j)\,W_g x_j\right) + x_i,
\tag{2}
\]

where f(xi, xj) denotes the similarity between positions i and j in the input and S(x) is the normalization factor; Wz and Wg are two 1 × 1 convolutions. For the similarity f, we take the embedded Gaussian as an example to describe the definition in detail.
Under this condition, f(xi, xj) = exp(θ(xi)ᵀφ(xj)) with θ(xi) = Wθxi and φ(xj) = Wφxj, S(x) = Σj∈Ω f(xi, xj), and Wφ and Wθ are two 1 × 1 convolutions that map the input to C′ channels. C′, H and W respectively indicate the embedding channel number, the input height and the input width. For an input x ∈ ℝ^{C×H×W} (C indicates the input channel number), the standard NLB with the embedded Gaussian is shown in Figure 1a. Here, we intend to design a multi-kernel version of the NLB which, compared with the original one, achieves stronger performance in semantic segmentation.

3. MULTI-KERNEL NON-LOCAL BLOCK

We first give the definition of our multi-kernel non-local block (i.e., MKNLB) in Section 3.1. Then, by analyzing its complexity, we design an efficient implementation in Section 3.2.

3.1 Initial definition

As mentioned above, the standard NLB is based on a single Gaussian kernel, and existing work indicates that multiple Gaussian kernels generally behave better at edges9. This fact motivates us to extend the NLB to a multi-kernel version (named MKNLB) to improve its segmentation ability. For simplicity, we take two kernels in our MKNLB. The definition is

\[
z_i = W_z\!\left(\frac{1}{S(x)}\sum_{j\in\Omega} f_1(x_i,x_j)\,f_2(x_i,x_j)\,W_g x_j\right) + x_i,
\tag{3}
\]

where f1(xi, xj) = exp(θ(xi)ᵀφ(xj)) and f2(xi, xj) = exp(θ(xi)ᵀσ(xj)) are the two Gaussian kernels, σ(xj) = Wσxj, Wσ is a 1 × 1 convolution, and all other symbols have the same meanings as in the NLB formulated in equation (2). For clarity, the definition of MKNLB is illustrated in Figure 1b. Its effectiveness in semantic segmentation will be validated in the experiments. Since the multi-kernel strategy may appear to increase the computational burden at first glance, we next design an efficient implementation of MKNLB whose complexity is comparable to that of the standard NLB.

3.2 Efficient implementation

As shown in Figure 1a, the similarity calculation via matrix multiplication, i.e., θ(xi) × φ(xj), is the main computational burden of the non-local block. As in Reference 7, the operation can be rewritten in matrix form as

\[
\theta(x)\,\varphi(x)^{\top},\qquad \theta(x),\ \varphi(x)\in\mathbb{R}^{N\times C'},
\tag{4}
\]

where N = H × W. Therefore, this matrix multiplication has a complexity of O(C′N²).
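As a concrete sketch of the computation described above, the NumPy fragment below flattens a feature map to N × C′ matrices, builds the two-kernel similarity logits θφᵀ + θσᵀ, and checks that the distributive law yields the same result with a single N × N multiplication. All sizes, the random weights, and the softmax normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, Cp = 8, 8, 16, 8          # illustrative sizes; C' = Cp embedding channels
N = H * W                          # number of spatial positions after flattening

x = rng.standard_normal((N, C))    # flattened input feature map
# On flattened features, 1x1 convolutions reduce to plain matrix multiplies
W_theta = rng.standard_normal((C, Cp))
W_phi = rng.standard_normal((C, Cp))
W_sigma = rng.standard_normal((C, Cp))

theta = x @ W_theta                # (N, C')
phi = x @ W_phi                    # (N, C')
sigma = x @ W_sigma                # (N, C')

# Naive two-kernel logits: two N x N multiplications
logits_naive = theta @ phi.T + theta @ sigma.T

# Distributive law: one cheap matrix addition, then one N x N multiplication
logits_fast = theta @ (phi + sigma).T

assert np.allclose(logits_naive, logits_fast)

# Row-wise softmax plays the role of the normalization factor S(x)
attn = np.exp(logits_fast - logits_fast.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
```

The equivalence holds for any θ, φ, σ, which is why the trick removes one of the two expensive N × N products regardless of the input.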
In our MKNLB defined in Figure 1b, due to the participation of the two kernels, the similarity part becomes θ(xi) × φ(xj) + θ(xi) × σ(xj). At first glance, the computational burden seems to increase. Fortunately, it can be reduced with the distributive law of matrix multiplication:

\[
\theta(x)\,\varphi(x)^{\top} + \theta(x)\,\sigma(x)^{\top}
= \theta(x)\,\big(\varphi(x)+\sigma(x)\big)^{\top}.
\tag{5}
\]

Equation (5) indicates that MKNLB can be implemented by first performing one matrix addition and then a single matrix multiplication. Since the complexity of the addition is distinctly lower than that of the multiplication, the complexity of this implementation of MKNLB can still be approximated as O(C′N²). That is, the complexity of our MKNLB is comparable to that of the standard NLB. For clarity, the implementation is illustrated in Figure 1c.

4. EXPERIMENTS

To evaluate the MKNLB, we conduct semantic segmentation experiments on two benchmark datasets: Cityscapes10 and ADE20K11.

4.1 Datasets and evaluation metrics

Cityscapes: It consists of 5000 images from 50 different cities, annotated with 19 categories. The images are divided into 2975, 500 and 1525 for training, validation and testing, respectively.

ADE20K: The dataset contains 20210 training images with 150 semantic classes; 2000 images make up the validation set and 3352 the test set. The dataset is particularly challenging among semantic segmentation datasets due to its complex scenarios.

Metrics: The mean intersection over union (mIoU) is used for evaluation on both datasets.

4.2 Training details

During training, our code follows the standard pipeline of the open-source semantic segmentation library MMSegmentation12. Two Quadro RTX 6000 GPUs are used for all experiments. We apply stochastic gradient descent (SGD) with a weight decay of 0.0005. The initial learning rate γ0 = 0.01 is decayed following the poly learning rate policy, where γ0 is multiplied by (1 − iter/max_iter)^0.9. For Cityscapes, we set the batch size to 4 and randomly crop the input images to 512 × 512.
For ADE20K, we set the batch size to 8 and randomly crop the input images to 512 × 1024. For both datasets, we apply random flipping and random scaling within [0.5, 2]. For all experiments, we select a pre-trained ResNet-101 as the backbone.

4.3 Comparisons with other methods

In this section, we analyze the results on the two datasets (Cityscapes10 and ADE20K11). On the one hand, we compare the proposed multi-kernel non-local network with five other methods on the Cityscapes validation set. Table 1 reports the mIoU results; all methods were trained for 8K iterations. Based on the same backbone, the multi-kernel non-local network attains 77.59% mIoU, which is 2.68% higher than the original non-local network. Our method also outperforms the other listed methods by more than 0.71% mIoU.

Table 1. Comparisons with the state-of-the-art methods on the Cityscapes validation set.
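As a small illustration of the poly learning-rate policy used in the training details, the helper below decays the initial rate γ0 = 0.01 over an 8K-iteration budget. The power value 0.9 follows the common MMSegmentation default and is an assumption here, not a value stated by the authors beyond the schedule's form.

```python
def poly_lr(base_lr: float, iteration: int, max_iter: int, power: float = 0.9) -> float:
    """Poly policy: multiply base_lr by (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# Learning rate at the start, halfway, and near the end of an 8K-iteration run
schedule = [poly_lr(0.01, it, 8000) for it in (0, 4000, 7999)]
```

The schedule starts at the full base rate and decays smoothly toward zero as training approaches the iteration budget.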
On the other hand, we evaluate our method on the ADE20K validation set. The mIoU results are shown in Table 2; we again trained for 8K iterations to compare with the other methods. The dataset is challenging to train on due to the variety of image sizes, complex semantic information, and the difference between the training and validation sets. Even under these conditions, our method achieves 41.35% mIoU, which is still 1.03% higher than the original non-local network and also surpasses the other listed methods.

Table 2. Comparisons with the state-of-the-art methods on the ADE20K validation set.
5. CONCLUSION

In this paper, we design an efficient block called the Multi-kernel Non-local Block (MKNLB) for semantic segmentation. In contrast to the standard Non-local Block (NLB), the proposed MKNLB detects edges more effectively and thus achieves better performance in image segmentation. Meanwhile, with the distributive law of matrix multiplication, we design an efficient implementation of MKNLB whose complexity is comparable to that of the standard NLB. Segmentation experiments conducted on benchmark datasets (Cityscapes and ADE20K) validate its effectiveness. For future work, we would like to expand the applications of MKNLB to further vision tasks.

ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China under Grant 11801249, in part by the Natural Science Foundation of Shandong Province, China under Grant ZR2020MF040, and in part by the Open Project of Liaocheng University under Grant 319312101-01.

REFERENCES

1. Badrinarayanan, V., Kendall, A. and Cipolla, R., "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 2481–2495 (2017).
2. Chen, L. C., Papandreou, G., Kokkinos, I., et al., "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 834–848 (2018). https://doi.org/10.1109/TPAMI.2017.2699184
3. Shelhamer, E., Long, J. and Darrell, T., "Fully convolutional networks for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 3431–3440 (2017). https://doi.org/10.1109/TPAMI.2016.2572683
4. Lin, G., Milan, A., et al., "RefineNet: Multi-path refinement networks for high-resolution semantic segmentation," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1925–1934 (2017).
5. Wang, X., Girshick, R., He, K., et al., "Non-local neural networks," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 7794–7803 (2018).
6. Buades, A., Coll, B. and Morel, J. M., "A non-local algorithm for image denoising," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 60–65 (2005).
7. Zhu, Z., Xu, M., Bai, X., et al., "Asymmetric non-local neural networks for semantic segmentation," in Proc. IEEE/CVF Int. Conf. on Computer Vision, 593–602 (2019).
8. Cao, Y., Xu, J., Lin, S., et al., "GCNet: Non-local networks meet squeeze-excitation networks and beyond," in Proc. IEEE/CVF Int. Conf. on Computer Vision Workshops (2019).
9. Sun, Z., Chen, S. and Qiao, L., "A general non-local denoising model using multi-kernel-induced measures," Pattern Recognition, 47, 1751–1763 (2014). https://doi.org/10.1016/j.patcog.2013.11.003
10. Cordts, M., Omran, M., Rehfeld, T., et al., "The Cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 3213–3223 (2016).
11. Zhou, B., Zhao, H., Puig, X., et al., "Scene parsing through ADE20K dataset," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 5122–5130 (2017).
12. MMSegmentation Contributors, "MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark," (2020). https://github.com/open-mmlab/mmsegmentation
13. Yin, M., Yao, Z., Cao, Y., et al., "Disentangled non-local neural networks," in European Conference on Computer Vision, 191–207 (2020).
14. Huang, Z., Wang, X., Huang, L., et al., "CCNet: Criss-cross attention for semantic segmentation," in Proc. IEEE/CVF Int. Conf. on Computer Vision, 603–612 (2019).