Machine learning can fill in missing information in public health data, U of A research shows

Research shows program can find ethnicity and Indigenous status data

New research from a University of Alberta epidemiologist shows how machine learning can help fill in gaps in Canadian public health data. (David Bajer/CBC)

A new machine learning tool could help fill significant gaps in Canada's public health data, according to research released this week.

The program is a machine learning framework that infers ethnicity and Indigenous status from Canadian health records. It has been used to analyze records of 4.8 million people surveyed in the 1901 Canadian census, drawing on respondents' names, spelling, phonetics and location, among other details, to predict their ethnicity.
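For readers curious how a classifier can turn names, spelling, phonetics and location into a prediction, the following is a minimal sketch only, not the method used in the research. It assumes a simple naive Bayes model over character trigrams, a crude home-made phonetic key and a location token. The function and class names, the invented surnames, the place names and the placeholder labels `group_a` and `group_b` are all hypothetical.

```python
from collections import Counter, defaultdict
import math


def phonetic_key(name: str) -> str:
    """Crude phonetic key (an assumption, not a standard algorithm):
    keep the first letter, drop later vowels and h/w/y, collapse repeats."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c in "aeiouyhw":
            continue
        if c != key[-1]:
            key.append(c)
    return "".join(key)


def features(name: str, location: str) -> list[str]:
    """Turn one record into character trigrams, a phonetic key and a location token."""
    padded = f"^{name.lower()}$"
    grams = [f"tri={padded[i:i + 3]}" for i in range(len(padded) - 2)]
    return grams + [f"phon={phonetic_key(name)}", f"loc={location.lower()}"]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, records, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for (name, location), label in zip(records, labels):
            for f in features(name, location):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, name, location):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in features(name, location):
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training records with placeholder labels (not real data).
records = [
    ("Tavensk", "northville"),
    ("Tavenska", "northville"),
    ("Orilumo", "southport"),
    ("Orilum", "southport"),
]
labels = ["group_a", "group_a", "group_b", "group_b"]

model = NaiveBayes().fit(records, labels)
prediction = model.predict("Tavenski", "northville")
```

A spelling variant such as "Tavenski" shares most of its trigrams and its phonetic key with the `group_a` training names, which is why a model of this kind can label records it has never seen. A real system would need far more data and careful validation of its accuracy.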

Kai On Wong, senior data scientist at the Northern Alberta Clinical Trials and Research Centre, said he was inspired to create the program after seeing gaps in Canadian health data while working in government and for various research groups. 

"If we don't have that information, we cannot study this kind of record and we cannot tell in a consistent and timely fashion across most of these databases which ethnic groups are experiencing worse health outcomes," said Wong, an epidemiologist at the University of Alberta.

Filling in the gaps in health records could allow researchers and public health officials to better investigate, monitor and track health outcomes in different segments of the population, Wong said.

Throughout his research work, Wong noticed that data on ethnicity and Indigenous status isn't collected as consistently in Canada as in American health records. That information can often go unreported in databases tracking acute and chronic diseases.

"It's really a hindrance in terms of how much we can know about it and how much we can piece out the ethnicity, which is a very important dimension for health research information," Wong said.

Using machine learning to fill in the gaps lets researchers learn more from existing records rather than having to carry out more expensive and time-consuming population-level surveys, Wong said.

Looking ahead, Wong said he recommends updating the tool using more recent census information and testing its accuracy when applied to other health records.