Genome-wide association studies (GWAS) are commonly used in genomics to identify variants associated with a trait or disease. During a GWAS, researchers identify regions of the genome that show a strong signal linked to the trait in question; that is, regions or specific markers at which, in the discovery cohort, a particular variant is more common in cases than in controls.
Following the completion of the Human Genome Project, a number of tools were created to make this approach more feasible. Previously, candidate gene studies typically centred around areas of the genome that were already suspected of having an impact, and identifying those candidate genes was a laborious process. In a GWAS, discovery is entirely agnostic, which allows previously unknown areas of significance to be identified in a relatively simple fashion.
Over the years, insights gained from GWAS have had huge implications for disease research. However, the associations often have small effect sizes, requiring large cohorts to achieve statistical significance, meaning high computing power is required. They also only establish correlation, not causation, so there are several steps that must be carried out post-GWAS to firmly establish the true molecular underpinnings of a trait. Nonetheless, GWAS are a crucial tool in deciphering the intricate relationship between genetics and complex traits, and they remain a cornerstone in improving healthcare strategies and targeted therapies.
Choosing your cohort
There are a number of databases available to researchers that contain genome-wide data for use in GWAS. One of the most commonly used is UK Biobank. The larger the cohort the better; there should usually be at least 200 cases, although more is desirable. To confirm true associations within your discovery cohort, you should have significantly more controls than cases. If you choose to establish your own cohort, you can still use resources like UK Biobank to obtain your controls. See here for a more comprehensive list of resources.
Get the right software
The most commonly used GWAS software is PLINK, a command line program that can run association analyses and also perform quality control and regression steps, among other useful features. Given the size and nature of genomic data, it is likely that you will be using PLINK via remote access to a high-performance computer (HPC), in which case you will probably need to download the Linux version of the program. Follow your HPC’s protocol for installing programs to the shell. Alternatively, you can download PLINK for Windows or Mac, and launch it from the command prompt. There are also alternatives to PLINK, such as SNPTEST, which can calculate single SNP associations.
Build your command
PLINK is a command line program, so you need to know how to build a script. How you begin this script will vary depending on your host computer.
To signify that you want to use PLINK, you would simply start your GWAS command with the word ‘plink’.
PLINK commands use ‘flags’ – two hyphens followed by a relevant term that tells the program to carry out a non-default action. In some cases, a flag should be followed by a number or another term such as a filename, to specify exactly what the flag should do. For a comprehensive list of PLINK flags, click here.
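As a minimal sketch of how flags and their values fit together, the Python snippet below assembles a PLINK command as a list of arguments and prints it without running anything. The flags shown (--file, --geno, --mind, --hwe, --assoc and --out) are standard PLINK 1.x flags, but the function name, file prefixes and threshold values are placeholders chosen for illustration, not recommendations.

```python
import shlex


def build_plink_command(input_prefix, output_prefix,
                        geno=None, mind=None, hwe=None, assoc=True):
    """Assemble a PLINK-style command as a list of arguments (nothing is run)."""
    cmd = ["plink", "--file", input_prefix]
    # Flags that take a value are followed by that value as a separate argument.
    for flag, value in (("--geno", geno), ("--mind", mind), ("--hwe", hwe)):
        if value is not None:
            cmd += [flag, str(value)]
    # --assoc can be used on its own, without a value.
    if assoc:
        cmd.append("--assoc")
    cmd += ["--out", output_prefix]
    return cmd


# Hypothetical file prefixes and thresholds, for illustration only.
cmd = build_plink_command("mydata", "mydata_assoc",
                          geno=0.02, mind=0.02, hwe=1e-6)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag next to its value, and the printed line can be pasted into whatever job script your host requires.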
Commands can include all of your needs in one line. However, if you want to permanently create files with certain data filtered out or otherwise transformed, then it can be helpful to run a number of lines, each with their own explicit output. Some essential aspects of running a PLINK command are detailed below.
- Choose your files – the file you wish to work on can be identified using the flag --file. You can jump straight into using this file in your analysis, or transform the data first, for example by creating compact binary bed files (--make-bed).
- Quality control – one of the most important steps, quality control can be included in your PLINK command. This can involve excluding SNPs with too much missing data, removing individuals who are missing a certain amount of marker data, or eliminating SNPs that are not in Hardy-Weinberg equilibrium. These thresholds can be set using the relevant flags (for example, --geno for missing SNP data, --mind for missing data per individual, or --hwe for Hardy-Weinberg equilibrium) followed by your chosen threshold. This step is particularly important, as including either markers or individuals with a high level of missing data can skew the results of an association test. Hardy-Weinberg equilibrium is more nuanced. A deviation from equilibrium commonly indicates genotyping errors, but in cases it can also reflect a true association with the trait, so an overly strict threshold risks discarding genuine signals. This means you must carefully choose the threshold you set in your quality control.
- Choose other parameters – PLINK can be used to carry out a number of other actions, such as checking for population stratification and other types of quality control and data organisation. Pick whichever are relevant to your experiment and include the flags in your commands.
- Run association tests – PLINK can carry out a number of different association tests. The most basic of these tests for an association between each marker and case/control status, and uses the flag --assoc. This flag can also be used to test for basic associations with a quantitative trait. Other options include Fisher’s exact test, and linear and logistic regressions that can account for covariates such as sex.
- Once you have built the perfect command – run your script according to the rules of your host.
- Name your output file – you can set the prefix of the output files with the flag --out. Following your association test, PLINK will generate files containing the association statistics and P-values for each marker, which you can take forward for further analysis.
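To make the quality control step in the list above more concrete, here is a minimal Python sketch of the two quantities being thresholded: the missing-data rate of a marker and a test for deviation from Hardy-Weinberg equilibrium. It is not PLINK's implementation; PLINK's --hwe filter is based on an exact test, whereas this sketch uses the simpler chi-square approximation. The function names and genotype counts are made up for illustration.

```python
import math


def missing_rate(genotypes):
    """Fraction of missing calls (None) for one marker across individuals."""
    return sum(g is None for g in genotypes) / len(genotypes)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """1-df chi-square statistic and P-value for deviation from HWE."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0, 1.0  # monomorphic marker: nothing to test
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical marker: genotypes coded 0/1/2, None for a missing call.
calls = [0, 1, None, 2, 1, 1, 0, None]
print("missing rate:", missing_rate(calls))

# Hypothetical genotype counts (AA, AB, BB) with an excess of homozygotes.
chi2, p_value = hwe_chi_square(400, 200, 400)
print("HWE chi-square:", round(chi2, 1), "P-value:", p_value)
```

A marker would then be dropped if its missing rate is above, or its HWE P-value below, the threshold you have chosen.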
Example PLINK command:
plink --file filename --assoc
This command will run a basic association test on the given file, without data filtering or regression.
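For intuition about what a basic case/control association test computes for each marker, the Python sketch below runs a 1-degree-of-freedom allelic chi-square test on a single marker. It is a simplified illustration rather than PLINK's own code, and the function name and genotype counts are hypothetical.

```python
import math


def allelic_association(case_counts, control_counts):
    """Allelic chi-square test from (AA, AB, BB) genotype counts.

    Returns the chi-square statistic, its P-value and the odds ratio for allele A.
    """
    # Convert genotype counts into allele counts (each person carries two alleles).
    a = 2 * case_counts[0] + case_counts[1]        # A alleles in cases
    b = 2 * case_counts[2] + case_counts[1]        # B alleles in cases
    c = 2 * control_counts[0] + control_counts[1]  # A alleles in controls
    d = 2 * control_counts[2] + control_counts[1]  # B alleles in controls
    n = a + b + c + d
    # Standard chi-square statistic for a 2x2 table.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival, 1 df
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Hypothetical counts for one marker: 100 cases and 100 controls.
chi2, p_value, odds_ratio = allelic_association((30, 50, 20), (20, 50, 30))
print(round(chi2, 2), round(p_value, 4), round(odds_ratio, 2))
```

In a real GWAS this test is repeated for every marker, which is why the P-value threshold for significance has to be so strict.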
Visualise genome-wide significance
The most common way to visualise GWAS results is using a Manhattan plot. Once PLINK has generated your data and a significance score (P-value) for each marker, you can generate these plots to assess your data. It is likely that you will use R to do this. The most commonly used R package is qqman.
Figure 1. Example of a genome-wide Manhattan plot. Each chromosome is shown in a different colour. Markers that have reached genome-wide significance appear above the dotted line, which indicates the threshold. Credit: Ikram et al. (2010).
The simple qqman command manhattan(results), where results is a data frame of your association results (by default with the columns CHR, BP, P and SNP), will generate a Manhattan plot. Additional options to show specific chromosomes and other visual effects are included in the package. Typically, markers that are at genome-wide significance (usually p < 5 × 10⁻⁸) will appear above a line denoting this threshold (see Figure 1).
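The genome-wide threshold is commonly explained as a Bonferroni correction of 0.05 for roughly one million independent tests. The Python sketch below works through that arithmetic and the -log10 transformation used for the y-axis of a Manhattan plot. The marker names, positions and P-values are invented for illustration, and no plotting library is assumed.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical results: (marker ID, chromosome, base-pair position, P-value).
results = [
    ("snp_a", 1, 1_200_000, 3.0e-9),
    ("snp_b", 1, 5_400_000, 0.02),
    ("snp_c", 2, 800_000, 4.0e-6),
]

threshold = bonferroni_threshold(0.05, 1_000_000)
line_height = -math.log10(threshold)  # height of the threshold line, about 7.3

for snp, chrom, pos, p in results:
    height = -math.log10(p)  # smaller P-values sit higher on the plot
    status = "significant" if p < threshold else "not significant"
    print(snp, chrom, pos, round(height, 2), status)

hits = [snp for snp, _, _, p in results if p < threshold]
```

Only markers whose height exceeds the threshold line would be carried forward as genome-wide significant.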
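Alternatively, you can assess your results with a QQ plot, which qqman can also generate. Rather than ordering markers by position like a Manhattan plot, these figures order P-values by significance and plot the observed values against those expected by chance. This makes it easier to see whether your results are systematically inflated, for example by population stratification, and which markers deviate most strongly from expectation.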
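A QQ plot is usually built by pairing sorted observed P-values with the values expected under the null hypothesis, and it is often summarised by the genomic inflation factor (lambda). The Python sketch below shows one common way to compute both using only the standard library. The function names are illustrative, and the input is an evenly spaced set of P-values standing in for real results.

```python
import math
from statistics import NormalDist, median


def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs for a QQ plot."""
    n = len(p_values)
    observed = sorted(p_values)
    # The i-th smallest P-value is compared with the quantile (i + 0.5) / n.
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed)]


def genomic_inflation(p_values):
    """Median observed chi-square (1 df) divided by the median expected under the null."""
    normal = NormalDist()
    # Convert each two-sided P-value back into a 1-df chi-square statistic.
    chi2 = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    null_median = normal.inv_cdf(0.75) ** 2  # about 0.455
    return median(chi2) / null_median


# Stand-in data: perfectly uniform P-values, as expected with no inflation.
n = 1001
p_values = [(i + 0.5) / n for i in range(n)]

points = qq_points(p_values)
lam = genomic_inflation(p_values)
print("lambda:", round(lam, 3))
```

A lambda close to 1 suggests little inflation, whereas values well above 1 suggest that something other than true association, such as population structure, is affecting the test statistics.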
SNPs could reach genome-wide significance because they are causative or because they are in linkage disequilibrium with the causative variant. Therefore, after obtaining candidate regions from your GWAS, the work isn’t over. You need to find the causative variant, and sometimes this is not specifically identified in the GWAS itself. This is because only a limited number of genome markers are included in a typical study.
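Linkage disequilibrium between two markers is often measured with r², which is why a non-causative SNP can carry the signal of a nearby causative variant. The Python sketch below computes r² from phased haplotype counts. The function name and the counts are hypothetical, and real analyses would normally use a dedicated tool rather than this hand calculation.

```python
def ld_r_squared(n_AB, n_Ab, n_aB, n_ab):
    """r-squared between two biallelic markers from phased haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n   # frequency of allele A at the first marker
    p_B = (n_AB + n_aB) / n   # frequency of allele B at the second marker
    p_AB = n_AB / n           # frequency of the A-B haplotype
    d = p_AB - p_A * p_B      # disequilibrium coefficient D
    return d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))


# Hypothetical haplotype counts (A-B, A-b, a-B, a-b).
print(ld_r_squared(50, 0, 0, 50))    # alleles always travel together
print(ld_r_squared(25, 25, 25, 25))  # alleles are independent
print(ld_r_squared(40, 10, 10, 40))  # partial linkage disequilibrium
```

The higher the r² between a significant marker and a candidate variant, the harder it is to tell from association statistics alone which of the two is causative.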
Functional analysis with online databases
You can use online tools like HapMap to locate genes in the significant region, allowing you to perform functional analysis. There are also tools such as the Ensembl Variant Effect Predictor that can tell you whether a variant is likely to be pathogenic, which can help to further narrow down the likelihood of a certain variant causing a trait or disease.
After you successfully identify the relevant gene, next steps can include pathway analysis and the use of cell culture or animal knockout studies to assess the impact of the variant. These results can then be used to inform health decisions and diagnostics. However, it’s important to note that finding a causative variant does not always lead to a cure for a disease. Therefore, careful messaging when presenting results is important.
See this resource from Nature for more help with GWAS.
Click here to read about some GWAS success stories from over the years.
Ikram, M.K., Xueling, S., Jensen, R.A., Cotch, M.F., Hewitt, A.W., Ikram, M.A., Wang, J.J., Klein, R., Klein, B.E., Breteler, M.M., Cheung, N. 2010. Four novel Loci (19q13, 6q24, 12q24, and 5q14) influence the microcirculation in vivo. PLoS genetics, 6(10), p.e1001184.
Marees, A.T., de Kluiver, H., Stringer, S., Vorspan, F., Curis, E., Marie‐Claire, C. and Derks, E.M., 2018. A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis. International journal of methods in psychiatric research, 27(2), p.e1608.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., De Bakker, P.I., Daly, M.J., Sham, P.C. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), pp.559-575.
Uffelmann, E., Huang, Q.Q., Munung, N.S., De Vries, J., Okada, Y., Martin, A.R., Martin, H.C., Lappalainen, T. and Posthuma, D. 2021. Genome-wide association studies. Nature Reviews Methods Primers, 1(1), p.59.