Primary Research Interests

1. Precision Phenotyping of Complex, Heterogeneous Conditions

Heterogeneity in complex conditions suggests both differing underlying biology or disease etiologies and environmental effects. We aim to use machine learning methods to characterize phenotypes more precisely, exploiting this heterogeneity to better understand the biological and environmental factors of disease.

2. Safe and Effective Deployment of Machine Learning in the Clinic

The rapid advancement of AI tools in clinical care demands robust strategies in implementation sciences, effective deployment, enhanced physician-computer interaction, and thorough impact assessment. As the creation of cutting-edge models becomes more prevalent, there's a tendency to overvalue their positive effects without adequately weighing the potential drawbacks, such as physician burnout and shifts in health equity. Our group is committed to devising tactics to refine models for specific healthcare environments, ensuring their seamless integration into clinical routines, and meticulously gauging their efficacy. Without rigorous assessments, including strategies like randomization, we cannot be sure these tools are leading to better healthcare. We are developing methods to address these and related problems.

Select Publications

See a full list on Google Scholar

Predicting seizure recurrence after an initial seizure-like episode from routine clinical notes using large language models: a retrospective cohort study

Beaulieu-Jones, Brett K., Mauricio F Villamar, Phil Scordis, Ana Paula Bartmann, Waqar Ali, Benjamin D Wissel, Emily Alsentzer, Johann de Jong, Arijit Patra, Isaac Kohane. Lancet Digital Health (in press) (2023)

Our team demonstrated that machine learning models, particularly large language models pre-trained on domain-specific data, are highly effective in predicting seizure recurrence in children after an initial seizure-like event. These models outperformed traditional structured data approaches and indicate that clinical notes contain significant information useful for the prediction of seizure recurrence.

Machine learning for patient risk stratification: standing on, or looking over, the shoulders of clinicians?

Beaulieu-Jones, Brett K., William Yuan, Gabriel A. Brat, Andrew L. Beam, Griffin Weber, Marshall Ruffin, and Isaac S. Kohane. npj Digital Medicine (2021)

We trained deep learning models on clinician-initiated administrative data for 42.9 million admissions and found performance close to full EMR-based benchmarks for inpatient outcomes. These models rely heavily on clinical behavior, and should not be used for individualized clinical decision making. For meaningful clinical guidance, models should outperform these benchmarks using data sources that capture patient state rather than clinician actions (i.e., looking over their shoulder).

Examining the use of real-world evidence in the regulatory process

Beaulieu-Jones, Brett K., Samuel G. Finlayson, William Yuan, Russ B. Altman, Isaac S. Kohane, Vinay Prasad, and Kun-Hsing Yu. Clinical Pharmacology & Therapeutics (2020)

The 21st Century Cures Act requires the US FDA to create guidelines for using real-world evidence (RWE) in the regulatory process. While RWE has led to crucial medical findings, it faces challenges in proving treatment efficacy compared to randomized controlled trials. In this review article, we summarized the advantages and limitations of RWE, identified the key opportunities for RWE, and pointed the way forward to maximize the potential of RWE for regulatory purposes.

Privacy-preserving generative deep neural networks support clinical data sharing

Beaulieu-Jones, Brett K., Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P. Bhavnani, James Brian Byrd, and Casey S. Greene. Circulation: Cardiovascular Quality and Outcomes (2019)

Our team has developed a method using deep neural networks to generate synthetic data that closely resembles real participants from the SPRINT trial, ensuring privacy while maintaining the utility of the data for research. This technique allows for the sharing of clinical data with researchers for secondary analysis without risking patient privacy.

Reproducibility of computational workflows is automated using continuous analysis

Beaulieu-Jones, Brett K., and Casey S. Greene. Nature Biotechnology (2017)

Continuous analysis is a workflow that integrates Docker container technology with continuous integration to automatically rerun computational analyses upon any changes in source code or data. This approach facilitates effortless reproducibility of research results for peers and provides an audit trail for data analyses, enhancing transparency and reliability in scientific studies.

Semi-supervised learning of the electronic health record for phenotype stratification

Beaulieu-Jones, Brett K., and Casey S. Greene. Journal of Biomedical Informatics (2016)

We developed a semi-supervised learning technique to improve the extraction of phenotypes from electronic health records, aiding in the identification of disease subtypes and genetic associations. This method has shown promise in enhancing classification accuracy and predicting patient outcomes, even with limited high-quality data.


We're always looking to add talented & curious students, postdocs, programmers and data scientists.

Brett Beaulieu-Jones, PhD

Assistant Professor

Sections of Biomedical Data Science & Genetic Medicine

Nafiseh (Cati) Mollaei, PhD

Postdoctoral Fellow

Cati is working on methods to improve large language models for phenotyping and disease subtyping.

Anna Zink, PhD

Principal Researcher (Center for Applied AI - Booth)

Anna is working closely with the lab on a multi-site project to evaluate the performance and fairness of machine learning models in the clinic and to develop scalable frameworks for this going forward.

Ming-Chieh (Eddie) Liu, BBA

MS Student

Eddie is a research associate building new methods to extract phenotypic signal from neuroimaging and diagnostic data.

Sylvia Edoigiawerie, PhD

Currently: MD / PhD Student

Sylvia completed her PhD with Dr. Naoum Issa and collaborates closely with the lab on projects developing methods to improve seizure detection in neonates.

David Chen

MS Student

David is working on a project developing tools and methods to monitor deployed AI in healthcare.

Ike Bowen

Undergraduate Student

Ike is working on a project to extract structured phenotypic information from clinical notes.

Alumni, Close Collaborators & Thesis Examinees

Lane Fitzsimmons, BS

Lane was a research associate and continues to be a collaborator. She is currently a medical student at the Renaissance School of Medicine at Stony Brook University.

Temidayo Adeluwa, MS

Temi is a PhD candidate in the Genetics, Genomics and Systems Biology program at UChicago, working with Dr. Haky Im. I chair his thesis committee.

Jessica De Freitas, PhD

I served as Jess's external examiner for her thesis on "Leveraging Electronic Health Records and Electrocardiograms for Disease Phenotyping" in 2023. She is currently a Machine Learning Scientist at Tempus Labs, Inc.

Yidi Huang, MS

Yidi was a research associate and is currently a PhD student in Genomics and Computational Biology at the University of Pennsylvania, the program I graduated from.

Mohammed Saqib, BS

Mohammed was a research associate and is currently a PhD Student at the University of Pennsylvania, where he's working on neuroimaging.

Open Positions

We are actively recruiting multiple students, postdocs and/or data scientists. Get in touch if you're driven to work on these problems, or propose your own ideas tied to our research interests. When reaching out, it is incredibly helpful to say what led you to get in touch and how our interests overlap. I'm open to creative extensions of the research interests listed above (e.g., if you are interested in a different area but would use similar methods).

Active / Recent Funding and Support

We are extremely grateful to the organizations that support and have supported our work!

Get in touch