Scientists use machine learning models to help identify long COVID patients
18 May 2022
A study shows how the National COVID Cohort Collaborative used XGBoost machine learning models to better define long COVID and identify potential long-COVID patients with a high degree of accuracy.
Clinical scientists used machine learning (ML) models to explore de-identified electronic health record (EHR) data in the National COVID Cohort Collaborative (N3C), a National Institutes of Health-funded national clinical database, to help discern characteristics of people with long-COVID and factors that may help identify such patients using data from medical records.
The findings, published in The Lancet Digital Health, have the potential to improve clinical research on long COVID and inform a more standardised care regimen for the condition.
“Characterising, diagnosing, treating and caring for long-COVID patients has proven to be a challenge due to the list of characteristic symptoms continuously evolving over time,” said first author Emily R. Pfaff, PhD, Assistant Professor in the Division of Endocrinology and Metabolism at the UNC School of Medicine.
“We needed to gain a better understanding of the complexities of long-COVID, and for that, it made sense to take advantage of modern data analysis tools and a unique big data resource like N3C, where many features of long COVID are represented.”
Sponsored by the National Institutes of Health’s National Center for Advancing Translational Sciences (NCATS), the N3C data enclave currently includes information representing more than 13 million people from 72 sites nationwide, including nearly five million COVID-19-positive cases. The resource enables rapid research on emerging questions about COVID-19 vaccines, therapies, risk factors and health outcomes.
This new research is part of the National Institutes of Health’s Researching COVID to Enhance Recovery (RECOVER) initiative, which has been recruiting thousands of participants nationwide to answer critical research questions about the syndrome, including how to accurately identify who has long-COVID, what its risk factors are, and which interventions and treatments may help.
Using the N3C, researchers developed XGBoost machine learning (ML) models to understand patient characteristics and better identify potential long-COVID patients.
Researchers examined demographics, healthcare utilisation, diagnoses, and medications for 97,995 adult COVID-19 patients. Using these features for nearly 600 long-COVID patients from three long-COVID speciality clinics, they trained and tested three ML models, each focused on identifying potential long-COVID patients in one of three groups: all COVID-19 patients, patients hospitalised with COVID-19, and patients who had COVID-19 but were not hospitalised.
The models proved to be accurate in identifying potential long-COVID patients. Patients flagged by the models can be interpreted as “patients warranting care at a long-COVID speciality clinic.”
The models also showed many important features that differentiate potential long-COVID patients from non-long-COVID patients.
The researchers focused on patients with a positive COVID diagnosis who were at least 90 days out from their acute infection. Features more commonly identified among potential long-COVID patients include post-COVID respiratory symptoms and associated treatments, non-respiratory symptoms widely reported as part of long COVID (such as sleep disorders, anxiety, malaise, chest pain, and constipation), pre-existing risk factors for greater acute COVID severity (such as chronic pulmonary disease, diabetes, and chronic kidney disease), and proxies for hospitalisation, suggesting greater severity of acute COVID.
The study also points out that it is plausible that long-COVID will not ultimately have a single definition, and may be better described as a set of related conditions with their own symptoms, trajectories, and treatments.
Josh Fessel, MD, PhD, Senior Clinical Advisor at NCATS and a Scientific Program Lead in RECOVER, added, “Once you’re able to determine who has long COVID in a large database of people, you can begin to ask questions about those people. Was there something different about those people before they developed long COVID? Did they have certain risk factors? Was there something about how they were treated during acute COVID that might have increased or decreased their risk for long COVID?”
The study also acknowledged that electronic health record (EHR) data is skewed toward patients who make more use of healthcare systems. Pfaff says that it is essential to acknowledge whose data is less likely to be represented: uninsured patients, patients with limited access to or ability to pay for care, or patients seeking care at small practices or community hospitals with limited data exchange capabilities.
“Electronic Health Records (EHRs) only have information for people who go to the doctor,” said Pfaff, who is also Co-Director of the NC TraCS Informatics and Data Science (IDSci) Program. “They also have more information on people who go to the doctor a lot. So, people who don’t have good access to care or people who don’t go to the doctor, we’re just not going to have information about them. So this is a caveat that I offer with every EHR based study that I do. We need to recognise who’s not in the dataset.”
The N3C team continues to refine its models as more real-world data emerges. Their longitudinal data for COVID-19 patients can provide a comprehensive foundation for the development of ML models to identify potential long-COVID patients.
As larger cohorts of long-COVID patients are established, future work will include research to identify subtypes of long-COVID, making the condition easier to study and treat.