Proceedings Article | 3 March 2009
KEYWORDS: Tissues, Databases, Feature extraction, Breast, Image classification, Cancer, Image retrieval, Breast cancer, Content based image retrieval, Lithium
Distance metrics are often used as a way to compare the similarity of two objects, each represented by a set of
features in high-dimensional space. The Euclidean metric is a popular distance metric, employed for a variety of
applications. Non-Euclidean distance metrics have also been proposed, and the choice of distance metric for any
specific application or domain is a non-trivial task. Furthermore, most distance metrics treat each dimension or
object feature as having the same relative importance in determining object similarity. In many applications,
such as in Content-Based Image Retrieval (CBIR), where images are quantified and then compared according
to their image content, it may be beneficial to utilize a similarity metric where features are weighted according
to their ability to distinguish between object classes. In the CBIR paradigm, every image is represented as
a vector of quantitative feature values derived from the image content, and a similarity measure is applied
to determine which of the database images is most similar to the query. In this work, we present a boosted
distance metric (BDM), where individual features are weighted according to their discriminatory power, and
compare the performance of this metric to 9 other traditional distance metrics in a CBIR system for digital
histopathology. We apply our system to three different breast tissue histology cohorts: (1) 54 breast histology
studies corresponding to benign and cancerous images, (2) 36 breast cancer studies corresponding to low and
high Bloom-Richardson (BR) grades, and (3) 41 breast cancer studies with high and low levels of lymphocytic
infiltration. Across all 3 data cohorts, the BDM outperforms the 9 traditional metrics, yielding a greater
area under the precision-recall curve. In addition, we performed SVM classification using the BDM along with
the traditional metrics, and found that the boosted metric achieves a higher classification accuracy (over 96%)
in distinguishing between the tissue classes in each of the 3 data cohorts considered. The 10 different similarity
metrics were also used to generate similarity matrices between all samples in each of the 3 cohorts. For each
cohort, each of the 10 similarity matrices was subjected to normalized cuts, resulting in a reduced-dimensional
representation of the data samples. The BDM resulted in the best discrimination between tissue classes in the
reduced embedding space.
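
The idea of weighting features by discriminatory power before ranking database images can be illustrated with a minimal sketch. The abstract does not give the functional form of the BDM or how its weights are learned, so everything here is an assumption for illustration only: a weighted Euclidean distance stands in for the boosted metric, the weight vectors are hand-picked rather than produced by boosting, and the names weighted_distance and rank_database are hypothetical.

```python
import numpy as np


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance between two feature vectors.

    Assumed form for illustration; the abstract does not specify
    how the BDM combines its feature weights.
    """
    x, y, weights = (np.asarray(a, dtype=float) for a in (x, y, weights))
    return float(np.sqrt(np.sum(weights * (x - y) ** 2)))


def rank_database(query, database, weights):
    """Return database indices ordered from most to least similar to the query."""
    database = np.asarray(database, dtype=float)
    diffs = database - np.asarray(query, dtype=float)
    dists = np.sqrt(np.sum(np.asarray(weights, dtype=float) * diffs ** 2, axis=1))
    # A stable sort keeps the original order when distances tie.
    return np.argsort(dists, kind="stable"), dists


# Toy data: feature 0 is treated as discriminative, feature 1 as noise.
database = np.array([[0.0, 5.0], [1.0, 0.0], [3.0, 0.1]])
query = np.array([0.0, 0.0])

uniform = np.array([1.0, 1.0])  # every feature counts equally
boosted = np.array([1.0, 0.0])  # hypothetical weights, not learned by boosting

order_uniform, _ = rank_database(query, database, uniform)
order_boosted, _ = rank_database(query, database, boosted)
```

With uniform weights the noisy second feature pushes the first database entry to the bottom of the ranking. With the hypothetical weights that ignore it, the same entry becomes the closest match. In a real system the weights would come from a training procedure on labeled images rather than being set by hand.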