To evaluate the performance of speaker recognition systems, a detection cost function defined as a
weighted sum of the probabilities of type I and type II errors is employed. The speaker datasets may exhibit data dependency because the same subjects are used multiple times. Using the standard errors of the
detection cost function computed by means of the two-layer nonparametric two-sample bootstrap
method, a significance test is performed to determine whether the difference between the measured
performance levels of two speaker recognition algorithms is statistically significant. While
conducting the significance test, the correlation coefficient between two systems’ detection cost
functions is taken into account. Examples are provided.
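The significance test described above can be sketched as a standard Z-test on the difference of two estimated detection cost functions, where the correlation coefficient rho between the two systems enters the standard error of the difference. This is a minimal illustration, not the paper's implementation; all numeric values below are made up for the example.

```python
import math

def z_statistic(dcf_a, dcf_b, se_a, se_b, rho):
    """Z statistic for the difference of two correlated DCF estimates.

    The SE of the difference accounts for the correlation coefficient rho
    between the two systems' detection cost functions:
    SE_diff = sqrt(se_a^2 + se_b^2 - 2 * rho * se_a * se_b).
    """
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2 - 2.0 * rho * se_a * se_b)
    return (dcf_a - dcf_b) / se_diff

# Two-sided test at the 5% significance level: reject H0 (equal performance)
# when |z| exceeds the critical value 1.96.  Illustrative numbers only.
z = z_statistic(0.120, 0.105, 0.006, 0.005, rho=0.6)
significant = abs(z) > 1.96
```

Note that a positive correlation between the two systems shrinks the SE of the difference, making a given gap in DCF easier to detect than under an independence assumption.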
The National Institute of Standards and Technology conducts an ongoing series of Speaker
Recognition Evaluations (SRE). Speaker detection performance is measured using a detection cost
function defined as a weighted sum of the probabilities of type I and type II errors. The sampling
variability can result in measurement uncertainties. In our prior study, data independence was assumed when using the nonparametric two-sample bootstrap method to compute the standard errors
(SE) of the detection cost function based on our extensive bootstrap variability studies in ROC
analysis on large datasets. In this article, the data dependency caused by multiple uses of the same
subjects is taken into account. The data are grouped into target sets and non-target sets, and each set
contains multiple scores. One-layer and two-layer bootstrap methods are proposed, depending on whether the two-sample bootstrap resampling takes place only on the target sets and non-target sets, or also, subsequently, on the target scores and non-target scores within each set. The SEs of the
detection cost function using these two methods along with those with the assumption of data
independency are compared. It is found that the data dependency increases both estimated SEs and
the variations of SEs. Some suggestions regarding the test design are provided.
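A minimal sketch of the two-layer resampling: layer 1 draws whole subject sets with replacement, layer 2 redraws scores within each chosen set, so the dependency from reused subjects is preserved in every replicate. The threshold-based DCF helper and the cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01, the common NIST SRE values) are assumptions for illustration, not taken from the abstract.

```python
import random
import statistics

def dcf(target_scores, nontarget_scores, threshold,
        c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Detection cost function: weighted sum of the miss and false-alarm
    probabilities (the type I and type II errors) at a decision threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def two_layer_bootstrap_se(target_sets, nontarget_sets, threshold,
                           n_boot=500, seed=0):
    """Two-layer nonparametric two-sample bootstrap SE of the DCF.

    Layer 1 resamples whole sets (subjects) with replacement; layer 2
    resamples scores within each selected set.  The SE is the standard
    deviation of the bootstrap replicates of the DCF."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        tgt, non = [], []
        for _ in range(len(target_sets)):
            s = rng.choice(target_sets)            # layer 1: pick a set
            tgt.extend(rng.choices(s, k=len(s)))   # layer 2: scores in the set
        for _ in range(len(nontarget_sets)):
            s = rng.choice(nontarget_sets)
            non.extend(rng.choices(s, k=len(s)))
        reps.append(dcf(tgt, non, threshold))
    return statistics.stdev(reps)
```

The one-layer variant would stop after layer 1, carrying each chosen set's scores over unchanged.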
The National Institute of Standards and Technology (NIST) conducts an ongoing series of Speaker Recognition Evaluations (SRE). In the NIST SRE, speaker detection
performance is measured using a detection cost function, which is defined as a weighted sum of
probabilities of type I error and type II error. The sampling variability can result in measurement
uncertainties of the detection cost function. Hence, while evaluating and comparing the
performances of speaker recognition systems, the uncertainties of measures must be taken into
account. In this article, the uncertainties of detection cost functions in terms of standard errors (SE)
and confidence intervals are computed using the nonparametric two-sample bootstrap methods based
on our extensive bootstrap variability studies on large datasets conducted before. The data
independence is assumed because the bootstrap results of SEs matched very well with the analytical
results of SEs using the Mann-Whitney statistic for independent and identically distributed samples
if the metric of area under a receiver operating characteristic curve is employed. Examples are
provided.
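Under the data-independence assumption, the SE and confidence interval computation reduces to a plain two-sample bootstrap: resample the target and non-target scores separately and recompute the DCF each time. The DCF helper and its cost parameters below are illustrative assumptions (the common NIST values), not definitions from the abstract.

```python
import random
import statistics

def dcf(target_scores, nontarget_scores, threshold,
        c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Weighted sum of the miss and false-alarm probabilities at a threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def bootstrap_se_ci(target_scores, nontarget_scores, threshold,
                    n_boot=1000, alpha=0.05, seed=0):
    """Nonparametric two-sample bootstrap under data independence.

    Returns the SE (standard deviation of the replicates) and the
    (1 - alpha) percentile confidence interval of the DCF."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        t = rng.choices(target_scores, k=len(target_scores))
        n = rng.choices(nontarget_scores, k=len(nontarget_scores))
        reps.append(dcf(t, n, threshold))
    reps.sort()
    se = statistics.stdev(reps)
    lo = reps[int(alpha / 2 * n_boot)]
    hi = reps[min(n_boot - 1, int((1.0 - alpha / 2) * n_boot))]
    return se, (lo, hi)
```

The percentile interval is the simplest bootstrap CI; the articles' variability studies are what justify treating its endpoints as stable on large datasets.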
To evaluate the performance of fingerprint-image matching algorithms on large datasets, a receiver operating characteristic (ROC) curve is employed. From the operational perspective, the true accept
rate (TAR) of the genuine scores at a specified false accept rate (FAR) of the impostor scores and/or
the equal error rate (EER) are often employed. Using the standard errors of these metrics computed
using the nonparametric two-sample bootstrap based on our studies of bootstrap variability on large
fingerprint datasets, a significance test is performed to determine whether the difference between the performance of one algorithm and a hypothesized value, or the difference between the performances of two algorithms, where the correlation is taken into account, is statistically significant. If the alternative hypothesis is accepted, the sign of the difference determines which algorithm performs better. Examples are provided.
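The two operational metrics named above can be computed directly from the empirical score distributions. This is a minimal sketch under simple count-based conventions (the threshold rules below are assumptions, not the paper's exact estimators).

```python
def tar_at_far(genuine, impostor, far_target):
    """True accept rate of the genuine scores at a specified false accept
    rate of the impostor scores.  The threshold is chosen so that at most
    far_target of the impostor scores are accepted (count-based quantile)."""
    imp = sorted(impostor, reverse=True)
    m = int(far_target * len(imp))          # impostors allowed above threshold
    threshold = imp[m] if m < len(imp) else float("-inf")
    return sum(g > threshold for g in genuine) / len(genuine)

def eer(genuine, impostor):
    """Equal error rate: sweep the observed scores as candidate thresholds
    and return the mean of FAR and FRR at the threshold where the two
    empirical error rates are closest."""
    best_gap, best = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, best = abs(far - frr), (far + frr) / 2.0
    return best
```

The bootstrap SE of either metric follows the same recipe as for the DCF: resample the genuine and impostor scores separately and take the standard deviation of the replicated metric.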
From 1996 through 2008, the NIST Speaker Recognition Evaluations have focused on the task of automatic speaker
detection based on recorded segments of spontaneous conversational speech. Earlier evaluations were limited to English
language telephone speech. More recent evaluations (2004-2008) have included some conversational telephone speech in
multiple languages, with the 2008 evaluation including 24 different languages. These recent evaluations have also
explored cross-channel effects by including phone conversations recorded over multiple microphone channels, and the 2008 evaluation also examined interview-type speech recorded over multiple microphone channels. The considerable
progress observed over the period of these evaluations has made the technology potentially useful for detecting
individuals of interest in certain applications. Performance capability is measurably affected by a number of situational
factors, including the number and duration of the training speech segments available, the durations of the test speech
segments available, the language(s) spoken in these segments, and the types and variability of the recording channels
involved.