Open Access
16 November 2023 Sequestration of imaging studies in MIDRC: stratified sampling to balance demographic characteristics of patients in a multi-institutional data commons
Abstract

Purpose

The Medical Imaging and Data Resource Center (MIDRC) is a multi-institutional effort to accelerate medical imaging machine intelligence research and create a publicly available image repository/commons as well as a sequestered commons for performance evaluation and benchmarking of algorithms. After de-identification, approximately 80% of the medical images and associated metadata become part of the open commons and 20% are sequestered from the open commons. To ensure that both commons are representative of the population available, we introduced a stratified sampling method to balance the demographic characteristics across the two datasets.

Approach

Our method uses multi-dimensional stratified sampling, in which several demographic variables of interest are applied sequentially to separate the data into individual strata, each representing a unique combination of variable values. Within each resulting stratum, patients are assigned to the open or sequestered commons. We applied the algorithm to an example dataset of 5000 patients, stratifying on race, age, sex at birth, ethnicity, COVID-19 status, and image modality, and compared the resulting demographic distributions with those obtained from naïve random sampling of the same dataset over 2000 independent trials.
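The partitioning step described above can be illustrated with a minimal Python sketch. It is not the MIDRC implementation: the function name `stratified_split`, the record layout (dictionaries with an `id` key), and the rule for carrying rounding remainders from one stratum to the next are all assumptions made for illustration, since the abstract does not say how very small strata are handled.

```python
import random
from collections import defaultdict


def stratified_split(patients, variables, sequestered_fraction=0.2, seed=0):
    """Assign each patient to 'open' or 'sequestered' within strata.

    patients: list of dicts, each with an 'id' key plus one key per variable
              (hypothetical layout; values are assumed to be strings).
    variables: names of the stratification variables.
    """
    rng = random.Random(seed)

    # One stratum per unique combination of variable values.
    strata = defaultdict(list)
    for patient in patients:
        key = tuple(patient[v] for v in variables)
        strata[key].append(patient)

    assignment = {}
    carry = 0.0  # assumption: fractional remainder carried to the next stratum
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        exact = len(members) * sequestered_fraction + carry
        n_sequestered = min(len(members), max(0, round(exact)))
        carry = exact - n_sequestered
        for i, patient in enumerate(members):
            label = "sequestered" if i < n_sequestered else "open"
            assignment[patient["id"]] = label
    return assignment


if __name__ == "__main__":
    demo_rng = random.Random(1)
    demo = [
        {"id": i,
         "sex": demo_rng.choice(["F", "M"]),
         "age": demo_rng.choice(["0-17", "18-49", "50+"])}
        for i in range(50)
    ]
    labels = stratified_split(demo, ["sex", "age"])
    print(sum(1 for v in labels.values() if v == "sequestered"), "of", len(demo))
```

Carrying the remainder forward keeps the overall sequestered share close to the target even when many strata are too small to split 80/20 on their own. Other choices, such as randomized rounding within each stratum, would also be reasonable.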

Results

The resulting prevalence of each demographic variable matched its prevalence in the input dataset to within one standard deviation. Mann–Whitney U test results supported the hypothesis that sequestration by stratified sampling provided more balanced subsets than naïve randomization, except for demographic subcategories with very low prevalence.
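As a rough illustration of the comparison reported above, the following sketch computes a Mann–Whitney U statistic and a one-sided p-value for two samples of per-trial imbalance scores. The abstract does not specify the quantity that was tested, so the inputs here are an assumption (for example, the absolute difference in prevalence between the two commons in each trial). The function name `mann_whitney_u` is hypothetical, and the p-value uses the plain normal approximation without tie or continuity correction.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, p) for the alternative that values in x tend to be smaller than in y.

    U counts the (x_i, y_j) pairs with x_i > y_j, with ties counted as 0.5.
    p is a one-sided normal-approximation p-value (no tie or continuity
    correction), so it is only a rough guide for small or heavily tied samples.
    """
    n, m = len(x), len(y)
    u = 0.0
    for xi in x:  # O(n * m) pairwise comparison; fine for a sketch
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    mean_u = n * m / 2.0
    sd_u = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z
    return u, p


if __name__ == "__main__":
    # Hypothetical imbalance scores, one per trial.
    stratified = [0.001, 0.002, 0.001, 0.003]
    naive = [0.010, 0.008, 0.012, 0.009]
    print(mann_whitney_u(stratified, naive))
```

A small p-value here would support the claim that the first sampling method yields smaller imbalance scores than the second. For real analyses, a tested statistics library is preferable to this hand-rolled version.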

Conclusions

The multi-dimensional stratified sampling algorithm developed here can partition a large dataset while maintaining balance across several variables, and it achieves better balance than naïve randomization.

CC BY: © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 International License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
Natalie Baughan, Heather M. Whitney, Karen Drukker, Berkman Sahiner, Tingting Hu, Hyun J. Grace Kim, Michael F. McNitt-Gray, Kyle J. Myers, and Maryellen L. Giger "Sequestration of imaging studies in MIDRC: stratified sampling to balance demographic characteristics of patients in a multi-institutional data commons," Journal of Medical Imaging 10(6), 064501 (16 November 2023). https://doi.org/10.1117/1.JMI.10.6.064501
Received: 25 January 2023; Accepted: 25 October 2023; Published: 16 November 2023
KEYWORDS
Algorithm development

COVID 19

Medical imaging

Evolutionary algorithms

Histograms

Artificial intelligence

Databases
