In bioinformatics, batch effect detection is a challenging task that has mostly been approached with clustering methods. In this study, we propose a novel approach to identifying and visualizing batch effects with unsupervised analysis methods. From the 35,238 genes in a human-liver RNA-seq dataset, we selected subsets of the 500, 1500, and 2500 most variable genes, ranked by standard deviation (SD). The skmeans (spherical k-means) and kmeans methods were explored on the selected gene subsets. Principal component analysis (PCA) was then used to embed the data in a 10-dimensional subspace. Finally, Uniform Manifold Approximation and Projection (UMAP) was applied to cluster and visualize the outputs. The experimental results show that features extracted from the 1500-gene subset give the most robust representation and the best clustering and visualization. These findings are useful not only for batch effect detection and removal but also for labeling new samples to train supervised machine learning methods.
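The SD-based gene selection step can be illustrated with a minimal sketch. It is not the study's implementation: the function name `top_variable_genes`, the dictionary layout, and the toy values are assumptions made for illustration. It uses the sample standard deviation from the Python standard library. The later steps (skmeans or kmeans, PCA, UMAP) are not shown.

```python
import statistics


def top_variable_genes(expr, k):
    """Return the names of the k genes with the largest standard deviation.

    expr maps a gene name to its expression values across samples
    (hypothetical layout, chosen for this sketch).
    """
    # Sample standard deviation per gene (needs at least two samples per gene).
    sd = {gene: statistics.stdev(values) for gene, values in expr.items()}
    # Rank by descending SD; ties are broken by gene name for reproducibility.
    ranked = sorted(sd, key=lambda gene: (-sd[gene], gene))
    return ranked[:k]


# Toy data: four hypothetical genes measured in four samples.
toy = {
    "geneA": [5.0, 5.1, 4.9, 5.0],  # nearly constant
    "geneB": [1.0, 9.0, 2.0, 8.0],  # highly variable
    "geneC": [3.0, 3.0, 3.0, 3.0],  # constant
    "geneD": [2.0, 6.0, 2.5, 5.5],  # moderately variable
}

selected = top_variable_genes(toy, k=2)
print(selected)
```

On the real dataset, `k` would take the values 500, 1500, and 2500, and the resulting subsets would be passed on to the clustering and embedding steps.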