Diagnosis code-based algorithms for comorbidity identification previously validated in administrative claims data have different sensitivities and positive predictive values (PPV) when applied in electronic health records (EHR). Novel algorithms leveraging natural language processing (NLP) of unstructured EHR data may improve accuracy beyond billing codes. We aimed to 1) define the design process for efficient comorbidity classification algorithms leveraging NLP and ICD codes, and 2) apply it to a case study of HIV status in an oncology EHR.
We developed a framework to optimize an NLP classification algorithm: aim to identify more potential cases (nNLP) than ICD codes alone (ncodes), pre-specify a minimum PPV threshold, iteratively test combinations of phrases that identify the comorbidity, validate with manual chart abstraction, and assess PPV. This proof-of-concept study applied the framework to classify HIV status among 2.2 million oncology patients in the Flatiron Health EHR-derived database. Iterations continued until PPV >70% and nNLP > ncodes. For internal validation, manual chart abstraction confirmed the HIV status of a random sample of patients with HIV diagnosis codes, and NLP classification sensitivity was assessed within this sample.
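The iterative loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the toy notes, the candidate phrase sets, and the use of simple case-insensitive phrase matching are all hypothetical, and the chart-review labels stand in for manual abstraction.

```python
import re

def flag_patients(notes, phrases):
    """Return IDs of patients whose note text contains any phrase (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    return {pid for pid, text in notes.items() if pattern.search(text)}

def ppv(flagged, chart_labels):
    """Share of flagged, chart-reviewed patients confirmed as true cases."""
    reviewed = [pid for pid in flagged if pid in chart_labels]
    if not reviewed:
        return 0.0
    return sum(chart_labels[pid] for pid in reviewed) / len(reviewed)

def stopping_rule_met(ppv_value, n_nlp, n_codes, min_ppv=0.70):
    """Stopping rule from the text: PPV above threshold and nNLP > ncodes."""
    return ppv_value > min_ppv and n_nlp > n_codes

def iterate(notes, chart_labels, n_codes, candidate_phrase_sets, min_ppv=0.70):
    """Try phrase sets in order; return (cycle, phrases, PPV) for the first that passes."""
    for cycle, phrases in enumerate(candidate_phrase_sets, start=1):
        flagged = flag_patients(notes, phrases)
        value = ppv(flagged, chart_labels)
        if stopping_rule_met(value, len(flagged), n_codes, min_ppv):
            return cycle, phrases, value
    return None

# Hypothetical toy data, for illustration only.
notes = {
    "p1": "History of HIV infection, on antiretroviral therapy.",
    "p2": "HIV test negative.",
    "p3": "Patient is HIV positive.",
    "p4": "No relevant history.",
}
chart_labels = {"p1": True, "p2": False, "p3": True, "p4": False}  # stand-in for chart abstraction
candidate_phrase_sets = [["HIV"], ["HIV infection", "HIV positive"]]

result = iterate(notes, chart_labels, n_codes=1, candidate_phrase_sets=candidate_phrase_sets)
print(result)
```

In this toy run the broad phrase "HIV" also flags a negative test result and falls below the 70% PPV threshold, so the loop moves on to the more specific phrase set. A real pipeline would need negation handling and a sampling scheme for chart review, neither of which is shown here.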
Five iteration cycles optimized an NLP algorithm using 9 core phrases with 40 permutations. Overall (n=2.2 million), NLP classified more potential HIV cases (n=11,063) than diagnosis codes (n=4,592), with 3,452 patients classified by both approaches. Internal validation estimated a PPV of 77% (69/90) for the NLP algorithm and >99% (15/15) for the ICD code algorithm. Applied to the ICD code-identified cohort, the NLP algorithm sensitivity was 75%.
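The reported counts can be checked with simple arithmetic. The sketch below uses only the figures stated above; the variable names are illustrative. The overlap fraction is offered as a consistency check and rests on an assumption: it happens to agree with the reported 75% sensitivity, but the abstract does not say that sensitivity was calculated this way.

```python
# Counts reported in the results (whole-database classification).
n_nlp = 11_063    # flagged by the NLP algorithm
n_codes = 4_592   # flagged by ICD diagnosis codes
n_both = 3_452    # flagged by both approaches

# Derived partition of the flagged population.
nlp_only = n_nlp - n_both          # flagged by NLP but not by codes
codes_only = n_codes - n_both      # flagged by codes but not by NLP
either = n_nlp + n_codes - n_both  # flagged by at least one approach

# PPV from the chart-validated NLP sample: confirmed / reviewed.
ppv_nlp = 69 / 90

# Assumption: share of code-identified patients also flagged by NLP,
# used here only as a check against the reported sensitivity.
overlap_fraction = n_both / n_codes

print(f"NLP only: {nlp_only}, codes only: {codes_only}, either: {either}")
print(f"NLP PPV: {ppv_nlp:.1%}, overlap with code cohort: {overlap_fraction:.1%}")
```

The partition shows that most NLP-flagged patients have no HIV diagnosis code, which is where the PPV of the NLP algorithm matters most.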
These findings suggest that NLP can supplement ICD codes to improve the sensitivity of EHR comorbidity classification algorithms, albeit with lower PPV. This framework needs further internal and external validation to evaluate specificity and sensitivity. These results highlight the potential value of NLP approaches in defining research cohorts in EHR populations.
© 2019 Published by Elsevier Inc.