A recent article published in Nature Protocols highlights key issues related to polygenic risk score analyses and provides a starting point and reference guide for researchers on how to perform polygenic score analyses.
A polygenic risk score (PRS) is a single-value estimate of an individual’s genetic liability to a trait or disease. It is calculated as the sum of an individual’s risk alleles, weighted by risk allele effect sizes derived from genome-wide association study (GWAS) data.
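As a minimal sketch of this definition (not the implementation used by any particular PRS software), each individual's score can be computed as the dot product of their effect-allele counts and the GWAS effect sizes. The function name `polygenic_score` and all values below are made up for illustration.

```python
import numpy as np

def polygenic_score(dosages, betas):
    """Weighted sum of effect-allele counts for each individual.

    dosages: (n_individuals, n_snps) array of effect-allele counts (0, 1 or 2)
    betas:   (n_snps,) array of GWAS effect sizes (e.g. log odds ratios)
    """
    dosages = np.asarray(dosages, dtype=float)
    betas = np.asarray(betas, dtype=float)
    return dosages @ betas

# Toy data: 3 individuals, 4 SNPs (illustrative values only)
dosages = np.array([
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
])
betas = np.array([0.10, -0.05, 0.20, 0.02])

scores = polygenic_score(dosages, betas)
print(scores)
```

Some tools also divide the sum by the number of alleles included; that normalisation is omitted here to keep the sketch close to the definition above.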
In recent years, PRSs have generated a lot of excitement in the field due to their potential applications in precision medicine. A PRS can represent an individual-level genetic prediction of a phenotype; in other words, it can be used to estimate a person’s genetic risk of developing a particular disease.
The recent article is split into three main sections – QC of base and target data, Calculation of PRSs and Interpretation and presentation of results – providing recommendations for best practice in PRS analyses (summarised in the Figure below).
QC of base and target data
PRS analyses require two main input datasets: (1) Base data (GWAS) – contains summary statistics of single-nucleotide variants; and (2) Target data – consists of genotypes and usually phenotype(s) in individuals from a sample independent of the GWAS sample. The quality of these datasets determines the power and validity of the PRS analyses, so they must undergo several quality control (QC) steps. The authors’ recommendations for QC are outlined below:
QC relevant to base data only:
- Heritability check – only perform PRS analyses on GWAS data with a SNP heritability (h2SNP) greater than 0.05.
- Effect allele – the identity of the effect allele must be obtained from GWAS investigators.
QC relevant to target data only:
- Sample size – to minimise the risk of misleading results, only perform PRS analyses that involve association testing on target samples of at least 100 individuals.
QC relevant to both base and target data:
- File transfer – ensure files have not been corrupted during transfer.
- Genome build – ensure that SNPs from both datasets have genomic positions assigned to the same build.
- Standard GWAS QC – follow established guidelines to perform standard GWAS QC.
- Ambiguous SNPs – remove all ambiguous SNPs to avoid introducing systematic errors.
- Mismatching SNPs – strand-flip the alleles; most PRS software performs strand-flipping automatically for SNPs that are resolvable and removes those that are not.
- Duplicate SNPs – ensure there are no duplicates, to avoid errors and software crashes.
- Sex chromosomes – remove SNPs on the sex chromosomes if the analysis concerns autosomal genetics only.
- Sample overlap – remove samples that appear in both the base and target data to avoid inflating the association between the PRS and the target trait; the authors recommend judicious use of target samples.
- Relatedness – remove any closely related individuals to avoid inflation.
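The ambiguous and mismatching SNP checks above can be sketched as follows. This is an illustrative simplification, not the logic of any specific PRS tool: the function names `is_ambiguous` and `harmonise` and the returned labels are hypothetical, and only single-base alleles (A, C, G, T) are handled.

```python
# Complementary bases on the opposite DNA strand
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_ambiguous(a1, a2):
    """A SNP is strand-ambiguous when its alleles are complements (A/T or C/G)."""
    return COMPLEMENT[a1] == a2

def harmonise(base_alleles, target_alleles):
    """Compare (effect, other) alleles between base and target data.

    Returns 'ambiguous', 'match', 'swap', 'flip', 'flip_swap' or 'mismatch'.
    """
    b1, b2 = base_alleles
    t1, t2 = target_alleles
    if is_ambiguous(b1, b2):
        return "ambiguous"      # strand cannot be inferred from alleles: remove
    if (t1, t2) == (b1, b2):
        return "match"          # same alleles, same order
    if (t1, t2) == (b2, b1):
        return "swap"           # same strand, effect allele is the other allele
    f1, f2 = COMPLEMENT[t1], COMPLEMENT[t2]
    if (f1, f2) == (b1, b2):
        return "flip"           # resolvable by strand-flipping
    if (f1, f2) == (b2, b1):
        return "flip_swap"      # strand-flip, then swap the effect allele
    return "mismatch"           # not resolvable: remove

print(harmonise(("A", "G"), ("T", "C")))
```

SNPs labelled 'ambiguous' or 'mismatch' would be dropped, while the other labels indicate how the target alleles can be aligned with the base data.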
Calculation of PRSs
After QC, the next step is to calculate PRSs for all individuals in the target sample. The authors highlight key things to consider in the development of methods for calculating PRSs, as summarised below:
Adjustment of GWAS estimated effect sizes via, for example, shrinkage:
- Unadjusted SNP effect size estimates can produce poorly estimated PRSs. Two broad strategies have been adopted to address this:
  - Shrinking all of the SNP effect estimates using standard or tailored statistical techniques, e.g. LASSO.
  - Using P-value selection thresholds as inclusion criteria for SNPs into the score.
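The two strategies above can be illustrated on toy summary statistics. This is a sketch under stated assumptions: the values are invented, the function names are hypothetical, and soft-thresholding is used only as a simple stand-in for shrinkage (it is the operation LASSO reduces to under an orthonormal design, not a full LASSO fit on genotype data).

```python
import numpy as np

def p_value_threshold(betas, p_values, threshold):
    """Keep effect sizes of SNPs with P below the threshold; set the rest to zero."""
    betas = np.asarray(betas, dtype=float)
    keep = np.asarray(p_values) < threshold
    return np.where(keep, betas, 0.0)

def soft_threshold(betas, penalty):
    """Shrink every effect size towards zero by `penalty`, truncating at zero."""
    betas = np.asarray(betas, dtype=float)
    return np.sign(betas) * np.maximum(np.abs(betas) - penalty, 0.0)

# Toy GWAS summary statistics (illustrative values only)
betas = np.array([0.30, -0.12, 0.05, -0.02])
p_values = np.array([1e-9, 3e-4, 0.04, 0.6])

thresholded = p_value_threshold(betas, p_values, threshold=0.05)
shrunk = soft_threshold(betas, penalty=0.05)
print(thresholded)
print(shrunk)
```

P-value thresholding leaves the retained effects unchanged and drops the rest, whereas shrinkage adjusts every estimate, pulling small effects all the way to zero.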
Accounting for LD:
- If the GWAS tested one SNP at a time, there are two main ways of approximating the PRS:
  - Clumping SNPs, so that the retained SNPs are largely independent of each other.
  - Including all SNPs, accounting for the linkage disequilibrium (LD) between them.
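Clumping can be sketched as a greedy procedure: take the most significant remaining SNP as an index SNP, discard every SNP in LD with it above some r2 threshold, and repeat. The function `clump`, the threshold value and the toy LD matrix below are assumptions for illustration; real tools also restrict clumps to a physical distance window, which is omitted here.

```python
import numpy as np

def clump(p_values, r2, r2_threshold=0.1):
    """Greedy clumping: return indices of retained (index) SNPs.

    p_values: (n_snps,) GWAS P-values
    r2:       (n_snps, n_snps) pairwise squared correlations between SNPs
    """
    p_values = np.asarray(p_values, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    removed = np.zeros(len(p_values), dtype=bool)
    kept = []
    for i in np.argsort(p_values):      # most significant SNP first
        if removed[i]:
            continue                    # already clumped under an earlier index SNP
        kept.append(int(i))
        removed |= r2[i] > r2_threshold # drop SNPs in LD with this index SNP
        removed[i] = True               # the index SNP itself is now processed
    return sorted(kept)

# Toy example: SNPs 0 and 1 are in strong LD, as are SNPs 2 and 3
p_values = [1e-8, 1e-6, 0.01, 0.2]
r2 = np.array([
    [1.00, 0.80, 0.01, 0.01],
    [0.80, 1.00, 0.01, 0.01],
    [0.01, 0.01, 1.00, 0.50],
    [0.01, 0.01, 0.50, 1.00],
])
retained = clump(p_values, r2)
print(retained)
```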
Tailoring of PRSs to target populations:
- It is important to remember that the units of the PRS are determined by the units of the GWAS effect sizes. In addition, PRS values only provide a relative estimate of risk as they are computed based on a hypothetical individual.
- If confounding effects from population genetic structure are not corrected for, this can result in false positive genotype-phenotype associations, resulting in inflated estimates. Stringent adjustment of effects should be applied, e.g. via genetic principal components or using family data.
- Published studies sometimes involve a target phenotype that is different from the one on which the original PRS was based. Using multiple PRSs as predictors raises concerns due to the SNPs being inherently correlated. Therefore, risks of overfitting and multicollinearity should be minimised using techniques such as shrinkage.
Interpretation and presentation of results
Once PRSs are calculated, a regression is typically performed. Considerations for how results from PRS analyses are interpreted and presented are summarised below:
Measuring and plotting of analyses:
- Typical PRS studies involve measuring the association between a PRS and one or more traits, for example, using standard association or goodness-of-fit metrics. When the disease prevalence can be well approximated, the authors recommend using the pseudo-R2 metric developed by Lee et al. (2012).
- When the classic PRS method is used, results of PRS association tests are typically displayed as a bar plot. When inspecting how trait values vary with increasing PRS, a quantile plot can also be used for visualisation.
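For a quantitative trait, one common way to summarise the PRS-trait association is the incremental R2: the variance explained by a model with covariates plus the PRS, minus that of a covariates-only model. The sketch below uses simulated data and hypothetical function names; it is not the liability-scale metric of Lee et al., and the covariates stand in for genetic principal components. It also computes the mean trait value per PRS quantile, which is what a quantile plot displays.

```python
import numpy as np

def r_squared(X, y):
    """R2 of an ordinary least squares fit of y on X (intercept added)."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def incremental_r2(prs, covariates, y):
    """Variance explained by the PRS over and above the covariates."""
    null_r2 = r_squared(covariates, y)
    full_r2 = r_squared(np.column_stack([covariates, prs]), y)
    return full_r2 - null_r2

def quantile_means(prs, y, n_quantiles=5):
    """Mean trait value within each PRS quantile (lowest to highest)."""
    edges = np.quantile(prs, np.linspace(0, 1, n_quantiles + 1))
    bins = np.clip(np.searchsorted(edges, prs, side="right") - 1, 0, n_quantiles - 1)
    return np.array([y[bins == q].mean() for q in range(n_quantiles)])

# Simulated data (illustrative only): 2 principal components as covariates
rng = np.random.default_rng(0)
n = 500
pcs = rng.normal(size=(n, 2))
prs = rng.normal(size=n)
y = 0.3 * prs + 0.5 * pcs[:, 0] + rng.normal(size=n)

inc = incremental_r2(prs, pcs, y)
means = quantile_means(prs, y)
print(inc)
print(means)
```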
How to avoid overfitting:
- Performing out-of-sample prediction is the gold-standard strategy for protecting against generating overfit prediction models.
- The central limit theorem suggests that the PRSs in a sample should be approximately normally (Gaussian) distributed. Therefore, inspecting PRS distributions can reveal potential errors.
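Out-of-sample prediction can be sketched as follows: any tuning (here, choosing among PRSs computed at different P-value thresholds) is done in one part of the target sample, and performance is reported only in the held-out part. The function names, the 50/50 split and the simulated data are assumptions for illustration.

```python
import numpy as np

def squared_correlation(x, y):
    """Squared Pearson correlation between two vectors."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)

def out_of_sample_r2(prs_by_threshold, y, train_fraction=0.5, seed=0):
    """Pick the best PRS column in a training split, then score it on held-out data.

    prs_by_threshold: (n_individuals, n_thresholds) array, one PRS per threshold
    Returns (index of selected column, squared correlation in the test split).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(len(y) * train_fraction)
    train, test = order[:n_train], order[n_train:]
    train_r2 = [
        squared_correlation(prs_by_threshold[train, j], y[train])
        for j in range(prs_by_threshold.shape[1])
    ]
    best = int(np.argmax(train_r2))     # tuning uses training data only
    return best, squared_correlation(prs_by_threshold[test, best], y[test])

# Simulated data (illustrative only): only the second PRS carries signal
rng = np.random.default_rng(1)
prs_by_threshold = rng.normal(size=(400, 3))
y = 0.5 * prs_by_threshold[:, 1] + rng.normal(size=400)

best, test_r2 = out_of_sample_r2(prs_by_threshold, y)
print(best, test_r2)
```

Reporting the training-split R2 of the selected column instead would tend to be optimistic, because the same data were used to choose it.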
Interpreting results for clinical utility:
- PRSs have potential utility in diagnosis and precision medicine.
- Individual-level PRSs have the potential to offer higher predictive power than family history alone. Nevertheless, present PRSs often have low predictive power, which has generated a lot of debate over their direct clinical utility.
- Researchers suggest there is a critical need for cost-benefit analyses.
Interpreting results for PRS-trait associations:
- PRSs may be useful in establishing the strength of genetic associations across a range of traits.
- At present, PRSs for many traits explain only a very small amount of phenotypic variance. PRS results from association studies with very small estimated effects should be treated with caution, as they may reflect uncorrected confounding.
Predictive accuracy and power of PRS analyses:
- The power of PRS association studies is optimised by using equal-sized base and target samples.
The authors conclude by emphasising the necessity of developing methods to exploit, analyse and interpret PRSs effectively.
For a step-by-step guide to performing basic PRS analyses see the online tutorial.