Thesis Defence: Extraction of Social Determinants of Health from Electronic Health Records using Natural Language Processing

December 13, 2023 at 2:00 pm - 6:00 pm

Zhenghua (Mavis) Chen will defend their thesis.

Zhenghua (Mavis) Chen, supervised by Dr. Rasika Rajapakshe and Dr. Patricia Lasserre, will defend their thesis titled “Extraction of Social Determinants of Health from Electronic Health Records using Natural Language Processing” in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

An abstract for Zhenghua (Mavis) Chen’s thesis is included below.

Defences are open to all members of the campus community as well as the general public. Registration is not required for in-person defences.


Purpose: Social Determinants of Health (SDoH) have a significant impact on human health outcomes and disparities. Collecting SDoH from electronic health records can facilitate decision-making and downstream research. With thousands of clinical records to process, automated extraction using Artificial Intelligence (AI) would be more efficient and cost-effective than manual review. This study aims to automatically extract comprehensive SDoH details from Electronic Health Records (EHR) using a Natural Language Processing (NLP) based AI pipeline.

Methods: One thousand documents from BC Cancer with concentrated SDoH information were carefully selected and labeled to provide the ground truth for training and assessing the NLP models. Two pipelines were applied for SDoH extraction: an open-source pipeline trained on the BC Cancer dataset and an industrial pre-trained solution used as a benchmark. To optimize the performance of the first pipeline, three experiments were conducted to evaluate how including subtype word positions during training affects extraction performance. The results of the two pipelines were compared, and the better-performing one was then used to extract SDoH information from a total of 13,258 oncology documents.

Results: The open-source pipeline achieved an average F1 score of 0.88 on the validation dataset for extracting 13 SDoH factors, outperforming the benchmark by 5%. This pipeline was also notably better than the benchmark at extracting detailed subtypes. The benchmark, however, was better at identifying SDoH types that are rarely documented in the data used in this work. In total, 60,717 SDoH factors and associated details were extracted from the oncology documents in the BC Cancer EHR. The most frequently extracted SDoH were Tobacco Use, Employment Status, Marital Status, Alcohol Consumption, and Living Status, each occurring between 8,000 and 12,000 times.

Conclusion: The NLP pipeline successfully extracted a wide array of SDoH factors from clinical notes, achieving commendable performance despite being trained on a relatively small labeled dataset.



Arts and Sciences Centre (ASC)
3187 University Way
Kelowna, BC V1V 1V7 Canada

Additional Info

Event Type: Thesis Defence
Health, Research and Innovation, Science, Technology and Engineering
Alumni, Community, Faculty, Staff, Families, Partners and Industry, Students, Postdoctoral Fellows and Research Associates