
BioMe Used to Identify New Subpopulations and Disease Risks

Accurately understanding populations and their associated disease risks is important for directing resources and treatments effectively. Knowing where someone comes from can help predict their disease burden. Previously, populations have been divided by race and ethnicity, categories that are often too broad and inaccurate.

A recent study used electronic health records (EHRs) and genetic ancestry to re-examine population data from a Mount Sinai biobank – BioMe – and found 17 distinct communities with different risks and ancestry. This finding differed significantly from the self-reported race and ethnicity (R/E).

Disease Burden

Disease burden refers to the different rates of disease that result from a population's genetic and environmental factors. For example, cardiovascular diseases are more common in the Western world, whereas cancers are more common in Australasia. Using population data to infer disease risk is nothing new. However, relying on self-reported R/E can be misleading, as people may not identify themselves as belonging to a certain group. These categories can also be too broad for calculating disease burden.

Genetic Ancestry

Another way to divide populations is to use genetic ancestry. By comparing DNA from individuals, researchers can infer their last common ancestor from genetic differences. This can be done via principal-component analysis (PCA) or through identity-by-descent (IBD) methods. The researchers in this study used the latter to identify more recent common ancestors.

IBD works on the theory that we all have a common ancestor in the distant past and thus share a proportion of our DNA with each other. The further back in time that ancestor lived, the more recombination has occurred and the smaller the shared IBD segments are in our current genomes.
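To make the link between ancestor age and segment size concrete, here is a minimal sketch based on a common textbook approximation rather than anything taken from the study itself. It assumes that two people whose shared ancestor lived g generations ago are separated by 2g meioses, so the mean length of a shared segment is roughly 100 / (2g) centimorgans. The function name is hypothetical.

```python
def expected_ibd_segment_cm(generations_to_ancestor: int) -> float:
    """Approximate mean IBD segment length, in centimorgans (cM).

    Assumption (not from the study): two descendants of an ancestor who
    lived g generations ago are separated by 2g meioses, and segment
    lengths average about 100 / (2g) cM.
    """
    if generations_to_ancestor < 1:
        raise ValueError("generations_to_ancestor must be at least 1")
    return 100.0 / (2 * generations_to_ancestor)


# Older ancestors leave shorter shared segments.
for g in (2, 5, 10, 25):
    print(f"{g:>2} generations: ~{expected_ibd_segment_cm(g):.1f} cM")
```

Under this approximation, segments inherited from an ancestor 25 generations back average about 2 cM, which is why long shared segments point to recent common ancestry.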

Figure 1 Pedigree, recombination and resulting IBD segments, schematic representation – Diagram courtesy of Gklambauer, Wikimedia Commons.

BioMe – The Use of Electronic Health Records

Using real-world evidence to supplement existing datasets and clinical trials is becoming increasingly common. It helps to reduce costs and time, and it provides valuable insights into patient subgroups.

This study used over 36,000 EHRs from BioMe, a biobank in New York City, to identify population sub-sets and associated Mendelian disease risks.

BioMe belongs to the Mount Sinai Health System and holds a diverse range of samples from multiple ethnicities. When comparing the EHRs to the self-reported R/E, the researchers found substantial disagreement within the data. This could be due to reporting errors from the health system or the individual, or due to a lack of appropriate categories.

By using IBD and an unsupervised, scale-free network modelling method, the team found 17 distinct subpopulations, several of which had the signatures of founder populations. Founder populations point to specific migration events and are also associated with higher frequencies of variants that cause genetic disorders.
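The general idea of turning pairwise IBD sharing into communities can be illustrated with a toy example. The sketch below is not the method used in the study: it swaps in plain connected components as a much simpler stand-in for the network modelling the researchers applied. The individuals, the sharing values and the 20 cM threshold are all made up for illustration.

```python
def ibd_communities(pairwise_ibd_cm, min_shared_cm=20.0):
    """Group individuals into connected components of an IBD-sharing graph.

    pairwise_ibd_cm maps (person_a, person_b) to the total IBD length
    they share in cM. min_shared_cm is a hypothetical cut-off.
    """
    # Build an undirected graph: an edge joins two people whose shared
    # IBD length meets the threshold.
    neighbours = {}
    for (a, b), shared_cm in pairwise_ibd_cm.items():
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if shared_cm >= min_shared_cm:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Depth-first search collects each connected component in turn.
    communities, seen = [], set()
    for person in sorted(neighbours):
        if person in seen:
            continue
        stack, members = [person], set()
        while stack:
            current = stack.pop()
            if current in members:
                continue
            members.add(current)
            stack.extend(neighbours[current] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


# Made-up sharing values for five hypothetical individuals.
toy_sharing = {
    ("A", "B"): 45.0,
    ("B", "C"): 30.0,
    ("C", "D"): 2.0,
    ("D", "E"): 60.0,
    ("A", "E"): 0.0,
}
communities = ibd_communities(toy_sharing)
print(communities)  # [['A', 'B', 'C'], ['D', 'E']]
```

Real biobank data form far denser networks than this, which is why a dedicated community-detection method is needed rather than a simple threshold.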

In addition, the researchers linked these communities to 1,700 health outcomes and found over 1,100 examples of significantly different disease risks between the populations.


The use of genetic data and EHRs results in a finer population map of an area – which differs significantly from what individuals report. These subpopulations were found to be genetically distinct and have diverse disease risks that would have previously been ignored.

By using these large datasets and genetics to split populations, some of the controversies around using race/ethnicity as delimiters are avoided – such as not being biologically meaningful, or the presence of harmful biases. As experts capture more EHRs and data, we will be able to better monitor health over time and more accurately predict our risk of disease.
