Original article written by Shannon Gunn, August 2020. Updated by Lyndsey Fletcher, July 2023.
Polygenic risk scores are a particularly hot topic in the clinical genomics world. They can be used to predict an individual’s risk of disease, but many view their use as controversial. In this feature, we take a look at the fundamental aspects of polygenic risk score analysis and what to consider when calculating them.
A 2020 article published in Nature Protocols comprehensively highlighted key issues related to polygenic risk score analyses and provided a starting point and reference guide for researchers on how to perform polygenic score analyses. Given the limitations that many associate with polygenic scoring, it is vital that scientists know how to perform the calculations correctly.
A polygenic risk score (PRS) is a single-value estimate of an individual’s genetic liability to a trait or disease. It is calculated as the sum of an individual’s risk alleles, weighted by risk allele effect sizes derived from genome-wide association study (GWAS) data.
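As a rough illustration of that weighted sum, the sketch below scores a single individual. The SNP identifiers, effect sizes and allele counts are invented for illustration, and the function name `polygenic_score` is hypothetical rather than taken from any PRS software.

```python
def polygenic_score(allele_counts, effect_sizes):
    """Sum of effect-allele counts weighted by GWAS effect sizes.

    allele_counts: dict mapping SNP id -> number of effect alleles (0, 1 or 2)
    effect_sizes:  dict mapping SNP id -> per-allele effect size
                   (for example a log odds ratio from the base GWAS)
    Only SNPs present in both inputs contribute to the score.
    """
    shared = allele_counts.keys() & effect_sizes.keys()
    return sum(allele_counts[snp] * effect_sizes[snp] for snp in shared)


# Hypothetical values, not real GWAS results.
effect_sizes = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}
individual = {"rs1": 2, "rs2": 1, "rs3": 0}

score = polygenic_score(individual, effect_sizes)
print(round(score, 3))
```

Dedicated PRS tools perform the same sum across many SNPs and individuals at once, after the QC and adjustment steps described in the rest of this article.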
In recent years, PRS has generated a lot of excitement in the field due to its potential applications in precision medicine. A PRS can represent individual genetic predictions of phenotypes; in other words, it can predict a person’s individual risk of developing a particular disease.
The 2020 guide is split into three main sections – QC of base and target data, calculation of PRSs and interpretation and presentation of results – providing recommendations for best practice in PRS analyses (summarised in the Figure below).
Figure: Summary of the key features of a PRS analysis
QC of base and target data
PRS analyses require two main input data sets: (1) Base data (GWAS) – contains summary statistics of single-nucleotide variants; and (2) Target data – consists of genotypes and usually phenotype(s) in individuals from a sample independent of the GWAS sample. The quality of these data sets determines the power and validity of the PRS analyses, so they must undergo several quality control (QC) steps. The authors’ recommendations for QC are outlined below:
QC relevant to base data only:
- Heritability check – only perform PRS analyses on GWAS data with a SNP heritability (h²SNP) > 0.05.
- Effect allele – the identity of the effect allele must be obtained from GWAS investigators.
QC relevant to target data only:
- Sample size – to minimise misleading results, only perform PRS analyses that involve association testing on target samples of ≥100 individuals.
QC relevant to both base and target data:
- File transfer – ensure files have not been corrupted during transfer.
- Genome build – ensure that SNPs from both datasets have genomic positions assigned to the same build.
- Standard GWAS QC – follow established guidelines to perform standard GWAS QC.
- Ambiguous SNPs – remove all ambiguous SNPs (those with complementary alleles, i.e. A/T or C/G) to avoid introducing systematic errors.
- Mismatching SNPs – strand-flip the alleles; most PRS software performs strand-flipping automatically for SNPs that are resolvable and removes those that are not.
- Duplicate SNPs – ensure there are no duplicates to avoid errors and software crashes.
- Sex chromosomes – remove SNPs on the sex chromosomes if the analysis concerns autosomal genetics only.
- Sample overlap – remove samples that appear in both the base and target data to avoid inflated results; the authors recommend judicious use of target samples.
- Relatedness – remove any closely related individuals to avoid inflation.
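The ambiguous-SNP and strand-flipping checks above can be sketched in a few lines. This is a minimal illustration for single-nucleotide alleles only; the function names and the category labels it returns are hypothetical, and real PRS software applies its own, more complete rules.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose strand cannot be told from the alleles."""
    return COMPLEMENT[a1] == a2


def harmonise(base_alleles, target_alleles):
    """Classify how a target SNP's alleles relate to the base (GWAS) alleles.

    Each argument is an (effect_allele, other_allele) tuple of single bases.
    Returns "ambiguous", "match", "swap", "flip", "flip_swap" or "mismatch".
    """
    b1, b2 = base_alleles
    t1, t2 = target_alleles
    if is_ambiguous(b1, b2):
        return "ambiguous"          # candidate for removal
    if (t1, t2) == (b1, b2):
        return "match"
    if (t1, t2) == (b2, b1):
        return "swap"               # same strand, allele order reversed
    flipped = (COMPLEMENT[t1], COMPLEMENT[t2])
    if flipped == (b1, b2):
        return "flip"               # resolvable by strand-flipping
    if flipped == (b2, b1):
        return "flip_swap"          # strand-flip, then reverse allele order
    return "mismatch"               # not resolvable, candidate for removal


print(harmonise(("A", "G"), ("T", "C")))
```

SNPs classified as "swap" or "flip_swap" need the sign of their effect size reversed (or the alleles recoded) so that the effect allele is counted consistently in both data sets.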
Calculation of PRSs
After QC, the next step is to calculate PRSs for all individuals in the target sample. The authors highlight key things to consider in the development of methods for calculating PRSs, as summarised below:
Adjustment of GWAS estimated effect sizes via, for example, shrinkage:
- Unadjusted SNP effect size estimates tend to generate poorly estimated PRSs. Two broad strategies have been adopted to address this:
  - Shrinking all of the SNP effect estimates using standard or tailored statistical techniques, e.g., LASSO.
  - Using P-value selection thresholds as inclusion criteria for SNPs into the score.
Accounting for LD:
- If the investigator is using a one-SNP-at-a-time based GWAS, there are two main ways of approximating the PRS:
  - Clumping SNPs, so that the retained SNPs are largely independent of each other.
  - Including all SNPs, accounting for the linkage disequilibrium (LD) between them.
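P-value thresholding and clumping are often combined. The sketch below is a simplified, greedy version of that idea, assuming pairwise LD (r²) values have already been computed. The function name, the default thresholds and the example values are hypothetical, and real tools also restrict clumping to a physical distance window, which is omitted here.

```python
def clump_and_threshold(p_values, r2, p_threshold=0.05, r2_threshold=0.1):
    """Greedy clumping combined with P-value thresholding.

    p_values: dict mapping SNP id -> GWAS P-value
    r2:       dict mapping frozenset({snp_a, snp_b}) -> pairwise LD (r squared);
              pairs missing from the dict are treated as unlinked (r2 = 0)
    Returns the retained index SNPs, most significant first.
    """
    # Keep only SNPs passing the P-value threshold, most significant first.
    candidates = sorted(
        (snp for snp, p in p_values.items() if p <= p_threshold),
        key=lambda snp: p_values[snp],
    )
    retained = []
    for snp in candidates:
        # Drop the SNP if it is in LD with a more significant retained SNP.
        in_ld = any(
            r2.get(frozenset((snp, kept)), 0.0) >= r2_threshold
            for kept in retained
        )
        if not in_ld:
            retained.append(snp)
    return retained


# Hypothetical summary statistics and LD values.
p_values = {"rs1": 1e-8, "rs2": 1e-6, "rs3": 0.01, "rs4": 0.5}
r2 = {frozenset(("rs1", "rs2")): 0.8, frozenset(("rs2", "rs3")): 0.3}

kept = clump_and_threshold(p_values, r2)
print(kept)
```

In this toy example rs2 is dropped because it is in LD with the more significant rs1, rs4 fails the P-value threshold, and rs3 is kept because the only SNP it is linked to (rs2) was itself removed.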
Tailoring of PRSs to target populations:
- It is important to remember that the units of the PRS are determined by the units of the GWAS effect sizes. In addition, PRS values only provide a relative estimate of risk, because they are computed relative to a hypothetical reference individual.
- Uncorrected confounding from population genetic structure can produce false-positive genotype-phenotype associations and inflated estimates. Stringent adjustment should be applied, e.g., via genetic principal components or using family data.
- Published studies sometimes involve a target phenotype that is different from the one on which the original PRS was based. Using multiple PRSs as predictors raises concerns due to the SNPs being inherently correlated. Therefore, risks of overfitting and multicollinearity should be minimised using techniques such as shrinkage.
Interpretation and presentation of results
Once PRSs are calculated, a regression is typically performed. Considerations for how results from PRS analyses are interpreted and presented are summarised below:
Measuring and plotting of analyses:
- Typical PRS studies measure the association between a PRS and one or more traits, for example, using standard association or goodness-of-fit metrics. When the disease prevalence can be well approximated, the authors recommend using the pseudo-R² metric developed by Lee et al. (2012).
- When the classic PRS method is used, results of PRS association tests are typically displayed as a bar plot. When inspecting how trait values vary with increasing PRS, a quantile plot can also be used for visualisation.
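The numbers behind a quantile plot can be computed directly. The following is a minimal sketch, assuming one PRS and one trait value per individual; the function name and the example data are hypothetical, and in practice each quantile would usually be plotted with a confidence interval.

```python
from statistics import mean


def trait_mean_by_quantile(scores, traits, n_quantiles=4):
    """Mean trait value within each PRS quantile, lowest scores first.

    scores, traits: equal-length sequences, one entry per individual.
    Individuals are ranked by PRS and split into n_quantiles groups of
    near-equal size; tied scores are ordered by input position.
    """
    if len(scores) != len(traits):
        raise ValueError("scores and traits must have the same length")
    if len(scores) < n_quantiles:
        raise ValueError("need at least one individual per quantile")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    means = []
    for q in range(n_quantiles):
        start = q * n // n_quantiles
        end = (q + 1) * n // n_quantiles
        means.append(mean(traits[i] for i in order[start:end]))
    return means


# Hypothetical scores and trait values for eight individuals.
scores = [0.1, 0.4, 0.2, 0.9, 0.5, 0.7, 0.3, 0.8]
traits = [1, 4, 2, 9, 5, 7, 3, 8]

quartile_means = trait_mean_by_quantile(scores, traits)
print(quartile_means)
```

Plotting these means against the quantile index gives the basic shape of a quantile plot.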
How to avoid overfitting:
- Out-of-sample prediction is the gold-standard strategy for guarding against overfitted prediction models.
- The central limit theorem suggests that the PRSs in a sample should be approximately normally (Gaussian) distributed. Therefore, inspecting the PRS distribution can reveal potential errors.
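One simple way to inspect a PRS distribution is to compute its skewness and excess kurtosis, both of which are close to zero for a Gaussian distribution. This is only a rough sketch of such a check: the function name is hypothetical, no cut-off for "too far from zero" is implied, and a histogram or Q-Q plot would normally accompany it.

```python
from statistics import fmean, pstdev


def skewness_and_excess_kurtosis(scores):
    """Population skewness and excess kurtosis of a list of scores.

    Both values are approximately 0 for normally distributed data.
    """
    centre = fmean(scores)
    spread = pstdev(scores)
    if spread == 0:
        raise ValueError("scores have zero variance")
    z = [(s - centre) / spread for s in scores]
    skewness = fmean([v ** 3 for v in z])
    excess_kurtosis = fmean([v ** 4 for v in z]) - 3.0
    return skewness, excess_kurtosis


# Hypothetical, perfectly symmetric scores: skewness is 0.
skew, kurt = skewness_and_excess_kurtosis([-2.0, -1.0, 0.0, 1.0, 2.0])
print(skew, kurt)
```

Strongly skewed or multi-modal score distributions can point to problems such as population structure or allele-coding errors and are worth investigating before any association testing.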
Interpreting results for clinical utility:
- PRSs have potential utility in diagnosis and precision medicine.
- Individual-level PRSs have the potential to offer higher predictive power than family history alone. Nevertheless, present PRSs often have low predictive power, which has generated a lot of debate over their direct clinical utility.
- Researchers suggest there is a critical need for cost-benefit analyses.
Interpreting results for PRS-trait associations:
- PRSs may be useful in establishing the strength of genetic associations across a range of traits.
- At present, PRSs for many traits explain only a very small amount of phenotypic variance. PRS results from association studies with very small estimated effects should be treated with caution, because they may reflect uncorrected confounding.
Predictive accuracy and power of PRS analyses:
- The power of PRS association studies is optimised by using equal-sized base and target samples.
Using UK Biobank data and validating existing PRSs
In the years since the article’s publication, there have been numerous developments regarding the use of PRSs and even more guidance published on how to calculate and apply them. In a 2022 article, the authors highlighted the key points to note when validating pre-existing scores, specifically when using UK Biobank data.
- UK Biobank imputed genotype data come in BGEN v1.2 format, which can be read by the programs bgenix, QCTOOL or PLINK 2. The authors recommend obtaining summary statistics using PLINK 2 as a first step.
- Choosing which PRS to work with should be based on your research objectives, the performance of the PRS and any technical considerations such as sample overlap.
- Extracting SNP data is a crucial early step – the authors recommend using bgenix for this purpose.
- The next step is to align your base and validation data, for which you will need to know the genome build used in the original calculation.
- You must perform both SNP and sample QC to assess the validity of your experiment.
- Once the steps above are complete, the final step before calculating the PRS is to compute allele dosages.
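For imputed data stored as genotype probabilities, the dosage of an allele is its expected count. The sketch below assumes three probabilities per SNP per individual (0, 1 or 2 copies of the alternate allele); the function name, the tolerance and the `effect_is_alt` flag are hypothetical, and tools such as PLINK 2 compute dosages directly from BGEN files.

```python
def effect_allele_dosage(p_hom_ref, p_het, p_hom_alt, effect_is_alt=True):
    """Expected number of effect alleles from three genotype probabilities.

    p_hom_ref, p_het, p_hom_alt: probabilities of carrying 0, 1 or 2 copies
    of the alternate allele; they should sum to (approximately) 1.
    effect_is_alt: set to False when the GWAS effect allele is the
    reference allele, so that the dosage is counted for the correct allele.
    """
    total = p_hom_ref + p_het + p_hom_alt
    if abs(total - 1.0) > 1e-3:
        raise ValueError("genotype probabilities must sum to 1")
    alt_dosage = p_het + 2.0 * p_hom_alt
    return alt_dosage if effect_is_alt else 2.0 - alt_dosage


dosage = effect_allele_dosage(0.1, 0.3, 0.6)
print(round(dosage, 3))
```

The resulting dosages take the place of the hard 0/1/2 allele counts when the weighted sum is computed.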
The comprehensive guide to validating pre-existing scores is Collister et al. (2022), listed in the references below.
Since the initial publication of this article, there have been significant developments regarding the use of PRSs in the clinic. In 2021, the NHS announced a trial that would use PRSs to calculate a patient’s risk of cardiovascular disease, with over 1,000 patients providing blood samples in the pilot study. Over a million genetic data points were analysed from each patient’s sample and, in combination with known markers for other cardiovascular risk factors, a polygenic score was calculated. Results from November 2022 indicated that the trial had positive outcomes.
However, the announcement was met with controversy. PRSs have faced scrutiny due to their potential limitations on an individual level, despite their relative accuracy on a population level. Moreover, much of the data used to calculate the scores are homogeneous, and a lack of diversity in the samples can lead to inaccurate scoring for some individuals.
That said, some believe that the use of PRSs could enhance clinical trials by facilitating the selection of individuals who would benefit from a certain treatment prior to testing, removing the element of randomness. However, a 2023 review of the present and future uses of PRSs in the clinic suggested that the clinical applications are still unclear, and that diversity issues persist. The authors concluded that PRSs could have a useful clinical impact, provided that research into the true value of the scores continues and researchers strive to create and use more ethnically diverse data.
Choi, S.W., Mak, T.S.H. and O’Reilly, P.F., 2020. Tutorial: a guide to performing polygenic risk score analyses. Nature Protocols, 15(9), pp.2759-2772.
Collister, J.A., Liu, X. and Clifton, L., 2022. Calculating polygenic risk scores (PRS) in UK Biobank: a practical guide for epidemiologists. Frontiers in Genetics, 13, p.818574.
Fahed, A.C., Philippakis, A.A. and Khera, A.V., 2022. The potential of polygenic scores to improve cost and efficiency of clinical trials. Nature Communications, 13(1), p.2922.
Sud, A., Horton, R.H., Hingorani, A.D., Tzoulaki, I., Turnbull, C., Houlston, R.S. and Lucassen, A., 2023. Realistic expectations are key to realising the benefits of polygenic scores. BMJ, 380.