Machine Learning for Healthcare Research at the University of Chicago
Heterogeneity in complex conditions suggests both differing underlying biology or disease etiologies and environmental effects. We aim to use machine learning methods to better characterize phenotypes, exploiting this heterogeneity to better understand the biological and environmental factors of disease.
The rapid advancement of AI tools in clinical care demands robust strategies in implementation science, effective deployment, enhanced physician-computer interaction, and thorough impact assessment. As the creation of cutting-edge models becomes more prevalent, there's a tendency to overvalue their positive effects without adequately weighing the potential drawbacks, such as physician burnout and shifts in health equity. Our group is committed to devising tactics to refine models for specific healthcare environments, ensuring their seamless integration into clinical routines, and meticulously gauging their efficacy. Without rigorous assessments, including strategies like randomization, we cannot be sure these tools are leading to better healthcare. We are developing methods to address these and related problems.
Elizabeth Geena Woo*, Michael C Burkhart*, Emily Alsentzer, Brett Beaulieu-Jones. Preprint (2024)
*co-first authors
Our team demonstrated that synthetic data distillation can fine-tune smaller, open-source large language models (LLMs) to achieve performance similar to larger models in extracting clinical information. The fine-tuned smaller model outperforms its base version and sometimes even the larger model. This approach will enable more scalable and cost-efficient clinical information extraction, improving tasks like patient phenotyping.
Brett K Beaulieu-Jones, Francesca Frau, Sylvie Bozzi, Karen J Chandross, M Judith Peterschmitt, Caroline Cohen, Catherine Coulovrat, Dinesh Kumar, Mark J Kruger, Scott L Lipnick, Lane Fitzsimmons, Isaac S Kohane, Clemens R Scherzer. npj Parkinson's Disease (2024)
Our team compared Parkinson's disease (PD) progression across research and real-world populations, using real-world data (RWD) and large language models for detailed characterization. We found that patients in real-world settings are diagnosed later and start treatment later than those in research populations, and that motor and cognitive progression is faster in real-world cohorts. The study highlights the differences between research and real-world populations, emphasizing the need to use diverse data sources and account for biases in clinical trial design and analyses.
Beaulieu-Jones, Brett K., Mauricio F Villamar, Phil Scordis, Ana Paula Bartmann, Waqar Ali, Benjamin D Wissel, Emily Alsentzer, Johann de Jong, Arijit Patra, Isaac Kohane. Lancet Digital Health (2023)
Our team demonstrated that machine learning models, particularly large language models pre-trained on domain-specific data, are highly effective in predicting seizure recurrence in children after an initial seizure-like event. These models outperformed traditional structured data approaches, indicating that clinical notes contain significant information useful for predicting seizure recurrence.
Lane Fitzsimmons, Undiagnosed Diseases Network, Brett Beaulieu-Jones*, Shilpa Nadimpalli Kobren*. Preprint (in press) (2024)
*co-corresponding
The biological mechanisms causing extreme symptoms in rare disease patients are complex and often elusive. This study analyzes genotype and phenotype data from the UK Biobank to understand the pathways leading to seizures in undiagnosed patients from the Undiagnosed Diseases Network. By examining milder, related symptoms in UK Biobank participants with similar genetic variants, the study aims to shed light on the molecular mechanisms behind these rare conditions.
Lane Fitzsimmons, Francesca Frau, Sylvie Bozzi, Karen Chandross, Brett Beaulieu-Jones. Preprint (2024)
This study analyzed Parkinson's disease (PD) progression by examining clinical events across different Hoehn & Yahr (H&Y) stages extracted using natural language processing. It provides a view of healthcare utilization at different H&Y stages, models expected H&Y progression, and demonstrates the potential value of a therapeutic that would slow progression.
Beaulieu-Jones, Brett K., William Yuan, Gabriel A. Brat, Andrew L. Beam, Griffin Weber, Marshall Ruffin, and Isaac S. Kohane. npj Digital Medicine (2021)
We trained deep learning models on clinician-initiated administrative data for 42.9 million admissions and found performance close to full EMR-based benchmarks for inpatient outcomes. These models rely heavily on clinical behavior and should not be used for individualized clinical decision making. For meaningful clinical guidance, models should outperform these benchmarks using data sources that capture patient state rather than clinician actions (i.e., rather than looking over the clinician's shoulder).
Beaulieu‐Jones, Brett K., Samuel G. Finlayson, William Yuan, Russ B. Altman, Isaac S. Kohane, Vinay Prasad, and Kun‐Hsing Yu. Clinical Pharmacology & Therapeutics (2020)
The 21st Century Cures Act requires the US FDA to create guidelines for using real-world evidence (RWE) in the regulatory process. While RWE has led to crucial medical findings, it faces challenges in proving treatment efficacy compared to randomized controlled trials. In this review article, we summarized the advantages and limitations of RWE, identified the key opportunities for RWE, and pointed the way forward to maximize the potential of RWE for regulatory purposes.
Beaulieu-Jones, Brett K., Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P. Bhavnani, James Brian Byrd, and Casey S. Greene. Circulation: Cardiovascular Quality and Outcomes (2019)
Our team has developed a method using deep neural networks to generate synthetic data that closely resembles real participants from the SPRINT trial, ensuring privacy while maintaining the utility of the data for research. This technique allows for the sharing of clinical data with researchers for secondary analysis without risking patient privacy.
Beaulieu-Jones, Brett K., and Casey S. Greene. Nature Biotechnology (2017)
Continuous analysis is a workflow that integrates Docker container technology with continuous integration to automatically rerun computational analyses upon any changes in source code or data. This approach facilitates effortless reproducibility of research results for peers and provides an audit trail for data analyses, enhancing transparency and reliability in scientific studies.
Beaulieu-Jones, Brett K., and Casey S. Greene. Journal of Biomedical Informatics (2016)
We developed a semi-supervised learning technique to improve the extraction of phenotypes from electronic health records, aiding in the identification of disease subtypes and genetic associations. This method has shown promise in enhancing classification accuracy and predicting patient outcomes, even with limited high-quality data.
We're always looking to add talented & curious students, post-docs, programmers and data scientists.
Assistant Professor
Sections of Biomedical Data Science & Genetic Medicine
Research Associate
Research associate examining the impact and potential of diverging predictions between different large-scale clinical AI models for the same patients.
Attending Physician
Bashar provides the clinical expertise for numerous projects in the group - especially in evaluating divergence between predictive models.
Data Scientist
David is working on "Argos," a project developing tools and methods to make the evaluation and monitoring of clinical AI easier.
Principal Researcher (Center for Applied AI - Booth)
Anna is working closely with the lab on a multi-site project to evaluate the performance and fairness of machine learning models in the clinic and to develop scalable frameworks for this going forward.
PhD Candidate (co-advised with Gilad Lab)
Geena is using machine learning and causal inference approaches to better understand the impact of maternal exposures on children.
Senior Data Scientist
Michael is working on multiple projects to improve clinical information extraction and to better understand foundational clinical AI models.
Currently: Med Student
Rowan is working to harmonize governance frameworks in order to provide a practical and actionable roadmap for health systems.
Currently: Med Student
Tom is working to evaluate the performance and failure cases of deployed generative AI systems in the health system.
Currently: Med Student
Sahil is working to build predictive models to support the operational side of hospitals (e.g., unexpected payor denials).
Currently: MS Student - Biomedical Data Science
Krish is developing machine learning and statistical methods to better understand disease progression.
Currently: Completing the MD portion of her MD/PhD at UChicago
Sylvia completed her PhD with Dr. Naoum Issa and collaborates closely with the lab on her projects developing methods to improve seizure detection in neonates.
Lane was a research associate and continues to be a collaborator. She is currently a medical student at the Renaissance School of Medicine at Stony Brook University.
Temi is a PhD Candidate in the Genetics, Genomics and Systems Biology Program at UChicago with Dr. Haky Im, and I'm the chair of his thesis committee.
I served as Jess's external examiner for her thesis on "Leveraging Electronic Health Records and Electrocardiograms for Disease Phenotyping" in 2023. She is currently a Machine Learning Scientist at Tempus Labs, Inc.
Yidi was a research associate and is currently a PhD Student in Genomics and Computational Biology at the University of Pennsylvania, the program I graduated from.
Mohammed was a research associate and is currently a PhD Student at the University of Pennsylvania, where he's working on neuroimaging.
We are actively recruiting multiple students, postdocs and/or data scientists. Get in touch if you're driven to work on these problems, or propose your own ideas tied to our research interests. When reaching out, it is incredibly helpful to specify what led you to contact me and how our interests overlap. I'm open to creative extensions of the research interests listed above (e.g., if you are interested in a different area but would be using similar methods).
We are extremely grateful to the organizations that support and have supported our work!