Real-world clinical health outcomes are usually not balanced or evenly distributed; for example, the incidence of severe adverse drug events is often below 5%. Imbalanced outcome classes introduce significant bias when developing machine-learning (ML) prediction models. We developed a novel artificial intelligence (AI) algorithm (SynSam) that generates synthetic samples to boost the representation of infrequent health outcomes. The algorithm handles clinical health data with both continuous and categorical predictors. In this study, we compared the prediction performance of an ML model trained with SynSam sampling against random over-sampling (bootstrap) and no over-sampling (naïve).
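The abstract does not specify SynSam's internals. As a rough illustration of the general technique, the following is a minimal SMOTE-style sketch of synthetic minority over-sampling: continuous features are interpolated between a minority sample and its nearest neighbor, while categorical features are copied from one of the paired samples. The function name `synthetic_oversample` and all implementation choices here are hypothetical, not the authors' method.

```python
import numpy as np

def synthetic_oversample(X_cont, X_cat, n_new, rng=None):
    """SMOTE-style sketch: generate n_new synthetic minority samples.

    Continuous features are interpolated between a base sample and its
    nearest neighbor; categorical features are copied unchanged from
    either the base sample or the neighbor.
    """
    rng = np.random.default_rng(rng)
    n = X_cont.shape[0]
    # Pairwise distances on continuous features only
    d = np.linalg.norm(X_cont[:, None, :] - X_cont[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                   # nearest neighbor of each sample
    idx = rng.integers(0, n, size=n_new)    # random base samples to perturb
    lam = rng.random((n_new, 1))            # interpolation weights in [0, 1]
    new_cont = X_cont[idx] + lam * (X_cont[nn[idx]] - X_cont[idx])
    pick = rng.random(n_new) < 0.5          # copy categoricals from base or neighbor
    new_cat = np.where(pick[:, None], X_cat[idx], X_cat[nn[idx]])
    return new_cont, new_cat
```

Because categorical values are copied rather than interpolated, the synthetic samples remain valid on mixed continuous/categorical predictor sets, which is the setting the abstract describes.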
We simulated a virtual patient cohort (N=50,000) with a 1% adverse drug event occurrence. Using NHANES data, we also assembled a cohort of adults with asthma (N=6,177) for predicting emergency department visits due to asthma (event rate = 9%). With a split-validation design, we set aside a random 20% sample of each cohort as an independent validation dataset and used the remaining 80% for prediction model training with the Extreme Gradient Boosting (XGBoost) algorithm. We applied 5-fold cross-validation for final model selection and compared the performance of the final prediction models under each over-sampling approach on the validation datasets.
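The split-validation design above can be sketched as index bookkeeping: an 80/20 split followed by 5-fold cross-validation indices over the training portion. The helper `split_and_folds` is a hypothetical illustration; the actual XGBoost model fitting from the study is not reproduced here.

```python
import numpy as np

def split_and_folds(n, valid_frac=0.2, k=5, seed=0):
    """Sketch of the study's validation design (assumed, not from the paper).

    Returns (train_idx, valid_idx, folds): a random 20% hold-out for
    independent validation, plus k-fold CV index pairs over the
    remaining 80% for model selection.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_valid = int(round(n * valid_frac))
    valid_idx, train_idx = perm[:n_valid], perm[n_valid:]
    # Shuffle the training indices, then cut into k roughly equal folds
    parts = np.array_split(rng.permutation(train_idx), k)
    folds = [(np.concatenate(parts[:i] + parts[i + 1:]), parts[i])
             for i in range(k)]
    return train_idx, valid_idx, folds
```

In the study's workflow, each candidate model would be fit on the CV-train indices and scored on the CV-test indices, with the selected model then evaluated once on the held-out validation set.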
In the virtual patient cohort, the naïve model correctly predicted 55% of events (sensitivity) while correctly predicting all non-events (100% specificity). Applying the bootstrap increased sensitivity to 70%. The SynSam approach increased sensitivity to 90% while maintaining specificity at 96%. In the NHANES cohort, the naïve model had 19% sensitivity and 95% specificity. The bootstrap approach increased sensitivity to 24%. The SynSam approach achieved 51% sensitivity with 88% specificity.
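The sensitivity and specificity figures reported above are the per-class recall rates from a confusion matrix: the fraction of true events correctly flagged, and the fraction of true non-events correctly cleared. A minimal illustration (the helper name is ours, not the study's):

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)
```

The trade-off visible in the results follows directly from these definitions: over-sampling raises the model's event-detection rate (sensitivity) at the cost of some false positives, which lowers specificity.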
The AI-based SynSam approach may be useful for boosting ML prediction model performance on infrequent health outcomes.
© 2020 Published by Elsevier Inc.