In typical classification problems, the data used to train a model for each class are often correctly labeled, so that fully
supervised learning can be used. For example, many illustrative labeled data sets can be found at sources such as the
UCI Repository for Machine Learning (http://archive.ics.uci.edu/ml/), or at the Keel Data Set Repository
(http://www.keel.es). However, a growing number of real-world classification problems involve data that contain both
labeled and unlabeled samples. The unlabeled samples are assumed to be missing all class label information,
and when used as training data they are considered to be of unknown origin (i.e., to the learning system, actual
class membership is completely unknown). When presented with a classification problem containing both
labeled and unlabeled training samples, a common practice is to discard the unlabeled data. In other words,
the unlabeled data are not combined with the existing labeled data for learning, which can result in a poorly trained
classifier that does not reach its full performance potential. In most cases, the primary reason that unlabeled data are
not used for training is that, depending on the classifier, the optimal model for semi-supervised
classification (i.e., a classifier that learns class membership using both labeled and unlabeled samples) can be far too
complicated to develop.
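The contrast between discarding unlabeled samples and using them can be illustrated with a minimal sketch. It is not the method developed in this work: it uses a nearest-centroid classifier with a generic self-training loop on synthetic two-class data, chosen only because it is short. The marker UNLABELED, the function names, and the data are all assumptions made for illustration, and the sketch assumes every class has at least one labeled sample.

```python
import numpy as np

UNLABELED = -1  # hypothetical marker for samples with no class label


def fit_centroids(X, y, n_classes):
    # Mean feature vector of each class (assumes every class is present in y).
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def predict(centroids, X):
    # Assign each sample to the class of its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def self_train(X, y, n_classes, n_iter=10):
    # Generic self-training: start from the labeled samples only, then
    # repeatedly pseudo-label the unlabeled samples and refit the centroids
    # on the labeled and pseudo-labeled samples together.
    labeled = y != UNLABELED
    centroids = fit_centroids(X[labeled], y[labeled], n_classes)
    for _ in range(n_iter):
        y_work = y.copy()
        y_work[~labeled] = predict(centroids, X[~labeled])
        centroids = fit_centroids(X, y_work, n_classes)
    return centroids


# Synthetic data (an assumption): two Gaussian classes, 100 samples each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(-2.0, 0.0), scale=1.0, size=(100, 2)),
    rng.normal(loc=(2.0, 0.0), scale=1.0, size=(100, 2)),
])
y_true = np.repeat([0, 1], 100)

# Keep labels on only three samples per class; the rest are unlabeled.
y = np.full(200, UNLABELED)
y[:3] = 0
y[100:103] = 1

# Baseline: discard the unlabeled data and train on six labeled samples.
labeled = y != UNLABELED
baseline = fit_centroids(X[labeled], y[labeled], n_classes=2)

# Semi-supervised: use the labeled and unlabeled samples together.
semi = self_train(X, y, n_classes=2)

print("baseline accuracy:", (predict(baseline, X) == y_true).mean())
print("self-training accuracy:", (predict(semi, X) == y_true).mean())
```

For a classifier this simple the semi-supervised update is easy to write down. For more complex classifiers the corresponding model can be much harder to derive, which is the difficulty noted above.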
In previous work, results were shown for the fusion of binary classifiers to improve performance in multiclass
classification problems. In that work, Bayesian methods were used to fuse binary classifier outputs, while selecting
the most relevant classifier pairs to improve the overall classifier decision space. Here, this work is extended by
developing new algorithms for improving semi-supervised classification performance. Results are demonstrated with
real data from the UCI and Keel Repositories.