Open Access Paper
Bidirectional sampling method for imbalanced data
11 September 2023
Junjie Shi, Deyu Song, Shengyao Zheng, Yueming Hu, Shuangshuang Chen, Fengque Pei
Proceedings Volume 12779, Seventh International Conference on Mechatronics and Intelligent Robotics (ICMIR 2023); 127792H (2023) https://doi.org/10.1117/12.2688900
Event: Seventh International Conference on Mechatronics and Intelligent Robotics (ICMIR 2023), 2023, Kunming, China
Abstract
Traditional over-sampling and under-sampling algorithms suffer from overfitting and high noise when the class distribution of the sample set is imbalanced. To improve classifier performance, this study proposes the SMOTECU algorithm, which combines SMOTE over-sampling with ClusterCentroids under-sampling. It absorbs the advantages of both algorithms and avoids generating or discarding excessive samples, effectively reducing the harmful effects of overfitting and noise. We experiment on 16 imbalanced standard datasets with three classifiers: RF, RBFNN, and RBFSVM. Comparing three evaluation metrics (F1-score, AUC, and running time), the results demonstrate that the SMOTECU-based random forest model performs best and that, compared with SMOTE and ClusterCentroids, SMOTECU effectively avoids overfitting and saves running time.

1. INTRODUCTION

Classification learning is an important research direction in machine learning. However, in real production and inspection settings, datasets are often imbalanced. A classifier built on an imbalanced dataset biases its predictions toward the majority class, while the minority-class samples are often the important research objects. Reducing or eliminating this imbalance problem is therefore of great significance.

In current related research, the main data-level approaches to the imbalance problem are to oversample the minority class and undersample the majority class. The most widely used oversampling algorithm is SMOTE, proposed by Chawla et al.1, which can effectively oversample the minority class and balance the sample counts, but it selects neighbors somewhat blindly. Han et al.2 proposed an improved variant of SMOTE, the Borderline-SMOTE algorithm, which synthesizes new samples only from minority-class samples on the boundary, effectively mitigating SMOTE's tendency toward highly repetitive samples; however, boundary samples can be difficult to identify. Bao et al.3 proposed two new SMOTE algorithms, CP-SMOTE and IO-SMOTE. CP-SMOTE obtains center points by clustering and generates new samples as linear combinations of minority-class samples and center points. IO-SMOTE divides samples into internal and external samples so that more internal samples are used when generating new samples. Both algorithms keep samples away from the classification boundary and obtain better classification performance.

For undersampling methods, He Yunbin et al.4 proposed a clustering-based weighted boundary-point ensemble undersampling algorithm, which effectively improved execution efficiency and classification accuracy; however, within some data-ratio ranges, much of the original data-distribution information is lost. Zhou Qian et al.5 proposed a distance-weighted undersampling algorithm based on adaptive k-means clustering: they cluster the majority-class samples with k-means, eliminate outliers, sort the data, and sample the majority class in order, effectively mitigating the impact of imbalanced data on classification accuracy; still, the algorithm is quite limited for multi-class problems. Wang Lei et al.6 proposed CUS-WRF, a cluster-undersampling weighted random forest algorithm, using clustering-based undersampling on the data side and a weighted random forest on the algorithm side to obtain better classification results, but its time complexity and boundary-sample recognition still require further research. Cui Caixia et al.7 proposed an adaptive undersampling method based on density-peak clustering that accounts for overlapping regions, noise, inter-class and intra-class imbalance, and sample sparsity, improving classification accuracy for minority-class samples; however, the method is not suitable for multi-class imbalance problems.

This research proposes SMOTECU (Synthetic Minority Over-sampling Technique Cluster Undersampling), an algorithm combining undersampling and oversampling. First, ClusterCentroids undersampling is performed on the majority class with the average sample number as the target, reducing the majority-class count while retaining feature information. Then, SMOTE oversampling is performed on the minority class, reducing the number of synthetic minority samples required and thus the model's computational complexity and noise interference, finally yielding a balanced dataset.

2. ALGORITHM INTRODUCTION

2.1 ClusterCentroids algorithm introduction

The ClusterCentroids algorithm is an under-sampling method that compresses the majority class: it divides the majority-class samples into K clusters using the k-means++ algorithm and replaces them with the K cluster centers, thereby shrinking the majority class to K samples. ClusterCentroids can reduce the number of samples very efficiently; still, when the data imbalance rate is high, the number of cluster centers becomes too small, critical information is likely to be lost, and overfitting results.
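For illustration (this is a minimal sketch, not the paper's code), the step can be reproduced with imbalanced-learn's ClusterCentroids, whose default estimator is scikit-learn's k-means with k-means++ initialization; the toy dataset and the target count of 200 below are our assumptions.

```python
# Minimal ClusterCentroids sketch using imbalanced-learn (assumed dependency).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Toy imbalanced dataset, roughly 10:1 majority-to-minority (illustrative only).
X, y = make_classification(n_samples=2200, weights=[0.91, 0.09], random_state=0)
print(Counter(y))

# Replace the majority class (label 0) with K = 200 k-means cluster centers.
cc = ClusterCentroids(sampling_strategy={0: 200}, random_state=0)
X_res, y_res = cc.fit_resample(X, y)
print(Counter(y_res))  # majority class shrunk to 200 centroid samples
```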

2.2 SMOTE algorithm introduction

SMOTE is a common sampling method that addresses sample imbalance by creating synthetic minority-class samples: it interpolates between each minority-class sample and its nearest neighbors to increase the minority count and balance the class sizes. When the imbalance ratio is large, SMOTE may synthesize too much data, resulting in high computation cost, low information gain, and noise amplification, because the newly generated data may overlap heavily with the existing data.
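The core interpolation can be written in a few lines. The sketch below is illustrative only (the helper name, k, and seed are our assumptions): it synthesizes one sample on the segment between a minority point x and a randomly drawn nearest neighbor x_n, i.e. x_new = x + rand(0, 1) · (x_n − x).

```python
# Illustrative core of SMOTE interpolation (numpy and scikit-learn only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_one(x, X_minority, k=5, seed=0):
    """Synthesize one sample between x and a random one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)  # +1: x is its own neighbor
    _, idx = nn.kneighbors(x.reshape(1, -1))
    x_n = X_minority[rng.choice(idx[0][1:])]                  # skip x itself
    return x + rng.random() * (x_n - x)                       # x_new = x + lambda * (x_n - x)
```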

2.3 SMOTECU algorithm

To overcome the limitations of these two algorithms, this study proposes a SMOTECU algorithm that combines them: first, ClusterCentroids is used to under-sample the majority class by replacing the original data with the cluster cores after clustering, which reduces their number while preserving the feature information of the sample set; then, SMOTE is used to oversample the minority class by synthesizing new samples between neighboring ones, which increases their number. Finally, sample quantity balance is achieved, as shown in Figure 1.

Figure 1. SMOTECU algorithm.

Algorithm steps:

  • 1) Divide the samples into the minority class set Sminority and the majority class set Smost, taking the average m = (Nmax + Nmin)/2 of the maximum class size Nmax and the minimum class size Nmin as the boundary;

  • 2) Set the average value m as the target sample number for both the minority class and the majority class;

  • 3) Use the k-means++ algorithm to cluster the majority class sample set Smost into m clusters: C=(C1, C2,…,Cm);

  • 4) Keep the cluster cores c=(c1, c2,…, cm), remove other data, and get the adjusted majority class sample set NewSmost;

  • 5) For the minority class samples Sminority, find the K nearest neighbors of each sample point according to the Euclidean distance (K should not be too large, to prevent the line between a sample point and its nearest neighbor from crossing the majority class sample space; this study takes K = 10);

  • 6) Randomly draw t minority class samples and, according to the gap between the sample count NumSminority and the target number m, set their sampling rate to n1, randomly drawing n1 times from the K nearest neighbors of each such sample; set the sampling rate of the remaining NumSminority − t minority class samples to n2, randomly drawing n2 times from the K nearest neighbors of each sample:

    n2 = ⌊(m − NumSminority) / NumSminority⌋
    n1 = n2 + 1
    t = (m − NumSminority) − n2 · NumSminority
    xnew = x + rand(0, 1) × (xn − x)

  • 7) Generate a new sample xnew randomly on the line between the minority class sample point x and the nearest neighbor xn drawn each time, per the last formula above;

  • 8) Add the generated sample points to the original sample set to obtain the adjusted minority class sample set NewSminority;

  • 9) Balanced sample set NewS={NewSmost, NewSminority}.

The algorithm combines the ideas of ClusterCentroids under-sampling and SMOTE over-sampling, using the advantages of both to efficiently decrease or increase the number of samples. At the same time, it uses the average of the majority- and minority-class sizes as the target number, avoiding generating or eliminating too many samples. Compared with ClusterCentroids under-sampling alone, SMOTECU sets more cluster centers, which retains more features and reduces the risk of overfitting. Compared with SMOTE over-sampling alone, SMOTECU reduces the number of samples the minority class must synthesize, shortens the model's computation time, and avoids overly dense minority-class sample points, thereby reducing the risk of generating meaningless or noisy data.
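Under the assumption that imbalanced-learn's ClusterCentroids and SMOTE are acceptable building blocks, steps 1)–9) can be sketched as follows. This is a minimal reading of the procedure, not the authors' released code; the function name smotecu, k_neighbors = 10 (the K of step 5), and the dict-based per-class targets are our choices.

```python
# Minimal SMOTECU sketch built on imbalanced-learn's ClusterCentroids and SMOTE.
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import ClusterCentroids

def smotecu(X, y, k_neighbors=10, random_state=0):
    counts = Counter(y)
    n_max, n_min = max(counts.values()), min(counts.values())
    m = (n_max + n_min) // 2  # steps 1-2: target = average of largest/smallest class

    # Steps 3-4: compress every class larger than m down to m k-means centroids.
    under = {c: m for c, n in counts.items() if n > m}
    if under:
        X, y = ClusterCentroids(sampling_strategy=under,
                                random_state=random_state).fit_resample(X, y)

    # Steps 5-8: inflate every class smaller than m up to m via SMOTE interpolation.
    # (k_neighbors must stay below the smallest class size.)
    over = {c: m for c, n in counts.items() if n < m}
    if over:
        X, y = SMOTE(sampling_strategy=over, k_neighbors=k_neighbors,
                     random_state=random_state).fit_resample(X, y)
    return X, y  # step 9: balanced set NewS
```

Because both resamplers accept per-class target counts, the same sketch extends beyond the binary case: every class above m is compressed to m centroids and every class below m is inflated to m.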

3. EXPERIMENTAL RESULTS AND ANALYSIS

3.1 Dataset introduction

To verify the effectiveness of the SMOTECU algorithm, this study collected 16 standard datasets with imbalanced samples. Table 1 lists the information and the imbalance rate (ubrate) of these datasets.
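The dataset names in Table 1 coincide with imbalanced-learn's benchmark collection, so they can be loaded as below; this is an assumption about a convenient data source, not a statement of where the authors obtained the data.

```python
# Hedged loading sketch via imbalanced-learn's benchmark bundle.
from collections import Counter
from imblearn.datasets import fetch_datasets

datasets = fetch_datasets()          # downloads the benchmark bundle on first use
bunch = datasets["car_eval_34"]      # any Table 1 name, e.g. "mammography"
X, y = bunch.data, bunch.target      # y uses {-1, 1}; 1 marks the minority class
print(X.shape, Counter(y))
```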

Table 1. Imbalanced dataset information

| Data set       | Features | Samples | ubrate | Data set       | Features | Samples | ubrate |
| car_eval_34    | 21       | 1728    | 11.90% | us_crime       | 100      | 1994    | 12.29% |
| abalone        | 10       | 4177    | 9.68%  | spectrometer   | 93       | 531     | 10.80% |
| arrhythmia     | 278      | 452     | 17.08% | thyroid_sick   | 52       | 3772    | 15.33% |
| sick_euthyroid | 42       | 3163    | 9.80%  | mammography    | 6        | 11183   | 42.01% |
| satimage       | 36       | 6435    | 9.28%  | oil            | 49       | 937     | 21.85% |
| scene          | 294      | 2407    | 12.60% | optical_digits | 64       | 5620    | 9.14%  |
| solar_flare_m0 | 32       | 1389    | 19.43% | ozone_level    | 72       | 2536    | 33.74% |
| wine_quality   | 11       | 4898    | 25.77% | pen_digits     | 16       | 10992   | 9.42%  |

3.2 Performance comparison of different algorithms

For imbalanced datasets, classification results tend to be biased toward the majority class, so relying solely on accuracy to evaluate classification performance is one-sided and cannot accurately measure the generalization ability of the classification model. In this study, the standard metrics for classification problems, AUC and F1-score, were used to evaluate the classifiers. We calculate the AUC8 and F1-score values from the confusion matrix of the classification results; the closer the values are to 1, the better the classification performance.

Table 2. Confusion matrix

| Confusion matrix    | Predicted 1 (Positive) | Predicted 0 (Negative) |
| Actual 1 (Positive) | TP (True Positive)     | FN (False Negative)    |
| Actual 0 (Negative) | FP (False Positive)    | TN (True Negative)     |
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × Precision × Recall / (Precision + Recall)
AUC = (TPR + TNR) / 2, where TPR = TP / (TP + FN) and TNR = TN / (TN + FP)
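As a quick check of these confusion-matrix formulas (our reconstruction; the paper renders the exact equations as images), the snippet below computes them for hypothetical counts.

```python
# Confusion-matrix metrics as reconstructed above (counts are hypothetical).
tp, fn, fp, tn = 90, 10, 20, 80

precision = tp / (tp + fp)
recall = tp / (tp + fn)                      # = TPR
f1 = 2 * precision * recall / (precision + recall)
auc = (recall + tn / (tn + fp)) / 2          # single-point AUC = (TPR + TNR) / 2

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```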

Then, this research uses Random Forest (RF), an RBF neural network (RBFNN), and a support vector machine with an RBF kernel (RBFSVM) to compare the classification effects on the 16 datasets processed by the SMOTE, ClusterCentroids, and SMOTECU algorithms. The data are split into training and test sets at a 7:3 ratio, and each experiment is repeated ten times; a sketch of this protocol follows. The average F1-score and AUC values are shown in Table 3.
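The paper does not state whether resampling precedes or follows the split; the hedged sketch below applies it to the training fold only (standard practice). It assumes the smotecu() helper from the Section 2.3 sketch and scikit-learn classifiers; the RBF neural network is omitted because scikit-learn has no direct equivalent.

```python
# Hedged sketch of the evaluation protocol: 7:3 split repeated ten times.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

def evaluate(X, y, make_clf, n_repeats=10):
    """Mean F1 and AUC on the test fold over repeated 7:3 splits."""
    f1s, aucs = [], []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed)
        X_bal, y_bal = smotecu(X_tr, y_tr, random_state=seed)  # resample training data only
        clf = make_clf().fit(X_bal, y_bal)
        f1s.append(f1_score(y_te, clf.predict(X_te)))
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return np.mean(f1s), np.mean(aucs)

# X, y: one of the Table 1 benchmarks, e.g. loaded via fetch_datasets() above.
# rf_scores  = evaluate(X, y, lambda: RandomForestClassifier(random_state=0))
# svm_scores = evaluate(X, y, lambda: SVC(kernel="rbf", probability=True, random_state=0))
```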

Table 3. Comparison of F1-score and AUC values for some of the datasets

| Dataset        | Classifier | SMOTE F1 | SMOTE AUC | ClusterCentroids F1 | ClusterCentroids AUC | SMOTECU F1 | SMOTECU AUC |
| car_eval_34    | RF         | 0.9952   | 0.9944    | 1.0000              | 1.0000               | 1.0000     | 1.0000      |
|                | RBFNN      | 0.9833   | 0.9816    | 1.0000              | 1.0000               | 0.9949     | 0.9939      |
|                | RBFSVM     | 0.9757   | 0.9749    | 0.9877              | 0.9875               | 0.9842     | 0.9826      |
| sick_euthyroid | RF         | 0.9811   | 0.9811    | 0.9560              | 0.9545               | 0.9773     | 0.9762      |
|                | RBFNN      | 0.9352   | 0.9291    | 0.8229              | 0.8242               | 0.8713     | 0.8473      |
|                | RBFSVM     | 0.8199   | 0.7898    | 0.8400              | 0.8182               | 0.8287     | 0.7809      |
| satimage       | RF         | 0.9769   | 0.9756    | 0.8947              | 0.8939               | 0.9627     | 0.9574      |
|                | RBFNN      | 0.9188   | 0.9075    | 0.7365              | 0.7516               | 0.8802     | 0.8507      |
|                | RBFSVM     | 0.8663   | 0.8414    | 0.8125              | 0.7766               | 0.8557     | 0.8076      |
| scene          | RF         | 0.9309   | 0.9275    | 0.8545              | 0.8501               | 0.9401     | 0.9397      |
|                | RBFNN      | 0.9028   | 0.8832    | 0.7321              | 0.7254               | 0.9171     | 0.9101      |
|                | RBFSVM     | 0.8049   | 0.7801    | 0.7769              | 0.7453               | 0.8299     | 0.8088      |
| wine_quality   | RF         | 0.9677   | 0.9668    | 0.9434              | 0.9457               | 0.9465     | 0.9454      |
|                | RBFNN      | 0.8873   | 0.8869    | 0.6512              | 0.7166               | 0.8745     | 0.8676      |
|                | RBFSVM     | 0.7414   | 0.7549    | 0.4722              | 0.6545               | 0.7387     | 0.7470      |
| mammography    | RF         | 0.9770   | 0.9768    | 0.8903              | 0.8911               | 0.9733     | 0.9724      |
|                | RBFNN      | 0.9223   | 0.9239    | 0.8690              | 0.8790               | 0.9170     | 0.9164      |
|                | RBFSVM     | 0.9034   | 0.9050    | 0.7630              | 0.7372               | 0.8746     | 0.8800      |
| oil            | RF         | 0.9908   | 0.9908    | 0.9091              | 0.9161               | 0.9735     | 0.9717      |
|                | RBFNN      | 0.9674   | 0.9630    | 0.6667              | 0.7222               | 0.9610     | 0.9492      |
|                | RBFSVM     | 0.9137   | 0.9110    | 0.7333              | 0.6667               | 0.8797     | 0.8668      |
| ozone_level    | RF         | 0.9807   | 0.9804    | 0.9756              | 0.9762               | 0.9746     | 0.9742      |
|                | RBFNN      | 0.8993   | 0.8797    | 0.8889              | 0.8831               | 0.8981     | 0.8835      |
|                | RBFSVM     | 0.8768   | 0.8704    | 0.8444              | 0.8409               | 0.8662     | 0.8570      |

According to the test results in Table 3, Random Forest has the best classification performance, followed by the RBF Neural Network, while RBFSVM performs the worst. Regarding runtime, the RBF Neural Network takes much longer than Random Forest and RBFSVM.

On the car_eval_34 and scene datasets, the SMOTECU-based models outperform those based on SMOTE and ClusterCentroids; on the other datasets, the SMOTECU-based results are slightly worse than SMOTE's. With the random forest classifier, SMOTECU reduces computational complexity while maintaining high classification performance compared with SMOTE over-sampling and, compared with ClusterCentroids, effectively avoids the overfitting phenomenon (100% accuracy with less than 2 seconds of running time). Moreover, the SMOTECU-based RBF Neural Network model significantly reduces runtime while maintaining high accuracy.

3.3 Analysis of dataset feature space

To further investigate the applicability of the SMOTECU algorithm, we use two dimensionality-reduction algorithms, t-SNE9 and UMAP10, to reduce the high-dimensional feature space of the 16 datasets to two dimensions, and then compare classification performance to further analyze the results. The dimensionality-reduction diagrams of some datasets are shown in Figure 2.

Figure 2. t-SNE and UMAP dimensionality reduction for some of the datasets.

Comparing the reduced feature spaces of the tested datasets shows that SMOTECU does well on datasets whose sample points are relatively dispersed in the t-SNE map and whose minority and majority classes form separate, highly distinguishable clusters in the UMAP map. By contrast, SMOTECU-based classification is significantly worse than SMOTE-based classification on datasets whose sample points lie along linear structures in the t-SNE map and whose differently labeled points are mixed and poorly separated in the UMAP map.
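For reference, such maps can be produced with scikit-learn's TSNE and the umap-learn package; the sketch below is illustrative (the plotting details and the umap dependency are our assumptions, not the authors' tooling).

```python
# Illustrative 2-D feature-space inspection with t-SNE and UMAP (umap-learn assumed).
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import umap

def plot_embeddings(X, y, title=""):
    embeddings = {
        "t-SNE": TSNE(n_components=2, random_state=0).fit_transform(X),
        "UMAP": umap.UMAP(n_components=2, random_state=0).fit_transform(X),
    }
    _, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, (name, emb) in zip(axes, embeddings.items()):
        for label in (0, 1):  # majority vs. minority class
            mask = y == label
            ax.scatter(emb[mask, 0], emb[mask, 1], s=8, label=f"class {label}")
        ax.set_title(f"{name} {title}".strip())
        ax.legend()
    plt.show()
```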

4. CONCLUSION

This study addresses the problem of imbalanced-data classification and proposes the SMOTECU algorithm from the data perspective. The algorithm combines over-sampling and under-sampling to balance the data. First, the majority-class samples are clustered with k-means++ and replaced with the cluster centroids, reducing the number of samples while retaining the main features of the data. Then, SMOTE over-sampling is applied to the minority samples, reducing the number of generated sample points and mitigating the excessive sample generation and noise amplification of traditional SMOTE over-sampling. By adjusting the minority- and majority-class counts simultaneously, the algorithm makes the dataset structure more reasonable and effectively reduces the risk of overfitting. Tests on standard datasets demonstrate good classification performance on highly distinguishable data and with random forest classification models, as well as reduced runtime for RBF neural networks. However, the advantages of SMOTECU are insignificant on imbalanced datasets with low distinguishability, and further research is needed in this regard.

REFERENCES

[1] Chawla, N. V., Bowyer, K. W., Hall, L. O., et al., "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, 16, 321–357 (2002). https://doi.org/10.1613/jair.953

[2] Han, H., Wang, W. Y., Mao, B. H., "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning," in Advances in Intelligent Computing: International Conference on Intelligent Computing (ICIC 2005), Proceedings, Part I, 878–887 (2005).

[3] Bao, Y., Yang, S. B., "Two Novel SMOTE Methods for Solving Imbalanced Classification Problems," IEEE Access, 11, 5816–5823 (2023). https://doi.org/10.1109/ACCESS.2023.3236794

[4] He, Y., Leng, X., Wan, J., "Unbalanced data weighted boundary point integration undersampling method," Journal of Xidian University, 48 (4), 176 (2021).

[5] Zhou, Q., Yao, Z., Sun, B., "Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering," Data Analysis and Knowledge Discovery, 6 (5), 127–136 (2022).

[6] Wang, L., Liu, Y., Liu, Z., et al., "Clustering under-sampling weighted random forest algorithm for processing unbalanced data," Application Research of Computers, 38 (5), 1398–1402 (2021).

[7] Cui, C., Cao, F., Liang, J., "Adaptive Undersampling Based on Density Peak Clustering," Pattern Recognition and Artificial Intelligence, 33 (9), 811–819 (2020).

[8] Shen, Z., Hua, X., Jinhai, C., "Resampling Algorithms for Imbalanced Data," Journal of Small and Microcomputer Systems, 1–8 (2023).

[9] Wei, V., Ivkin, N., Braverman, V., et al., "Sketch and Scale: Geo-distributed tSNE and UMAP," in IEEE International Conference on Big Data, 996–1003 (2020).

[10] Sainburg, T., McInnes, L., Gentner, T. Q., "Parametric UMAP Embeddings for Representation and Semisupervised Learning," Neural Computation, 33 (11), 2881–2907 (2021).
© (2023) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.

Junjie Shi, Deyu Song, Shengyao Zheng, Yueming Hu, Shuangshuang Chen, and Fengque Pei, "Bidirectional sampling method for imbalanced data," Proc. SPIE 12779, Seventh International Conference on Mechatronics and Intelligent Robotics (ICMIR 2023), 127792H (11 September 2023); https://doi.org/10.1117/12.2688900
KEYWORDS: Detection and tracking algorithms, Data modeling, Random forests, Overfitting, Evolutionary algorithms, Neural networks, Performance modeling