Researchers have introduced a new short-read mapping tool, called Giraffe, that can efficiently map new genome sequences to a pangenome reference.
Since the human genome was first sequenced more than 20 years ago, the study of genomics has relied almost exclusively on a single reference genome to which others are compared to identify genetic variations. However, it’s now recognised that a single reference genome cannot capture the complete diversity within a whole human population, or even a single person. This is because when a person’s genome differs from the reference by a structural variation, the reference may not contain a location to correctly map the corresponding reads.
The pangenome represents the entire set of genes within a species. It consists of a core genome, which contains sequences shared between all individuals of the species, and a variable set of genetic material, called the dispensable genome. Relationships between genomes sequenced with short-read technology are usually studied computationally, by mapping the reads to a reference genome.
Benedict Paten, Associate Professor of Biomolecular Engineering at UC Santa Cruz and Associate Director of the Genomics Institute, explained:
“The workhorses of genomics have been Single Nucleotide Variants (SNVs) and short indels, because structural variants have been hidden from view. Pangenomics is making structural variants visible so we can study them the same way we do SNVs and short indels. There are a lot of structural variants and they can have a big impact, so this is critical for the future of genetic studies of disease.”
Recently, a team of researchers at the UC Santa Cruz Genomics Institute introduced a new short-read mapping tool, called Giraffe, that can efficiently map new genome sequences to a pangenome reference. The novel bioinformatic tool was developed to improve mapping to pangenomes in polymorphic regions of the genome, making large-scale genomic analyses more accessible.
The researchers reported that Giraffe could map reads to thousands of genomes embedded in a pangenome reference as quickly and as accurately as existing tools map to a single reference genome. In contrast to previous bioinformatic tools, Giraffe focuses on mapping to the reference haplotypes, which are observed in individuals’ genomes. This is an advantageous approach because it prioritises alignments that are consistent with known sequences. In addition, it reduces the size of the problem by limiting the sequence space to which the reads can be aligned. Therefore, mapping with Giraffe to a pangenome, rather than to a single reference genome, reduces mapping bias: reads that differ from the single reference are less likely to be placed incorrectly.
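The idea of restricting alignment to known haplotypes can be illustrated with a toy sketch. This is not how Giraffe is implemented; it is a minimal Python example, assuming ungapped matching scored by mismatch count, with made-up sequences and hypothetical function names (`best_hit`, `map_to_haplotypes`). It shows why a read carrying a known variant matches a haplotype that contains the variant better than it matches a single reference.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read: str, sequence: str):
    """Return (mismatches, offset) of the best ungapped placement of read.

    Ties are resolved in favour of the leftmost offset.
    Returns None if the read is longer than the sequence.
    """
    best = None
    for offset in range(len(sequence) - len(read) + 1):
        mismatches = hamming(read, sequence[offset:offset + len(read)])
        if best is None or mismatches < best[0]:
            best = (mismatches, offset)
    return best


def map_to_haplotypes(read: str, haplotypes: dict):
    """Consider only the known haplotype sequences as alignment targets.

    Returns (mismatches, haplotype_name, offset) for the best placement.
    """
    hits = []
    for name, sequence in haplotypes.items():
        hit = best_hit(read, sequence)
        if hit is not None:
            hits.append((hit[0], name, hit[1]))
    return min(hits)


# Made-up toy data: "alt" differs from "ref" at two nearby positions.
ref = "TTGACCGATAGGCTAACGT"
alt = "TTGACCGTTTGGCTAACGT"
read = "CCGTTTGGCT"  # a read sampled from a person carrying the alt haplotype

single_reference_hit = best_hit(read, ref)
pangenome_hit = map_to_haplotypes(read, {"ref": ref, "alt": alt})
print(single_reference_hit)  # (2, 4): two mismatches against the single reference
print(pangenome_hit)         # (0, 'alt', 4): exact match on the alt haplotype
```

Against the single reference the read can only be placed with mismatches, whereas the haplotype set contains a sequence it matches exactly. Because only observed haplotypes are searched, combinations of variants that have never been seen in a genome are not considered as targets.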
To confirm Giraffe’s usefulness, the team used mappings from the tool to genotype 167,000 structural variations in short-read samples of 5,202 people. They were able to estimate the frequency of different versions of the structural variations in whole human populations and subpopulations for an average computational cost of $1.50 per sample. Additionally, they identified thousands of the structural variations as expression quantitative trait loci (eQTLs), which are associated with gene-expression levels.
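Estimating how common a variant is from genotype calls comes down to simple counting. The sketch below is not the study's pipeline; it is a minimal Python illustration that assumes diploid samples, genotypes encoded as the number of copies of the variant allele (0, 1 or 2), and invented sample names and population labels. The last line multiplies the reported per-sample cost by the reported number of samples.

```python
from collections import defaultdict


def allele_frequencies(genotypes: dict, populations: dict) -> dict:
    """Estimate variant allele frequency overall and per subpopulation.

    genotypes:   sample name -> copies of the variant allele (0, 1 or 2)
    populations: sample name -> subpopulation label
    Each diploid sample contributes two alleles to the denominator.
    """
    variant_alleles = defaultdict(int)
    total_alleles = defaultdict(int)
    for sample, copies in genotypes.items():
        for group in ("ALL", populations[sample]):
            variant_alleles[group] += copies
            total_alleles[group] += 2
    return {group: variant_alleles[group] / total_alleles[group]
            for group in total_alleles}


# Invented toy genotypes for one structural variant in four samples.
genotypes = {"s1": 0, "s2": 1, "s3": 2, "s4": 1}
populations = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}

frequencies = allele_frequencies(genotypes, populations)
print(frequencies)  # {'ALL': 0.5, 'A': 0.25, 'B': 0.75}

# Figures from the article: 5,202 samples at an average of $1.50 each.
total_cost = 5202 * 1.50
print(total_cost)  # 7803.0
```

At the reported average, genotyping all 5,202 samples would cost roughly $7,800 in compute.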
Benedict Paten said:
“We’ve been working toward this for years, and now for the first time we have something practical that works fast and works better than the single reference genome. It’s important for the future of biomedicine that genomics helps everyone equally, so we need tools that account for the diversity of human populations and are not biased.”
Efficiently mapping haplotypes
The researchers in this study showed that Giraffe allows single-nucleotide variations, short indels and structural variations to be genotyped from short-read data more accurately than with previous methods. Overall, Giraffe demonstrates the practicality and efficiency of a pangenomic approach to short-read mapping. Moreover, by making more broadly representative pangenome references practical, it aims to make genomics more inclusive.
Jean Monlong, co-first author of this study, explained:
“A lot of structural variants have been discovered recently using long-read sequencing. With pangenomes, we can look for these structural variants in large datasets of short-read sequencing. It’s exciting because this will allow us to study those new structural variants across many people and ask questions about their functional impact, association with disease, or role in evolution.”
Image credit: Harvard Gazette