The Human Pangenome Reference Consortium has presented the first draft of the human pangenome reference sequence. Described in a paper published in Nature, the draft pangenome contains 47 phased, diploid assemblies from a genetically diverse cohort, providing a richer reference map to improve the application of genomics for all populations.
Gaps in the genome
Since the completion of the initial draft around 20 years ago, the human reference genome has formed the backbone of human genomics. Yet the current GRCh38 human reference genome still contains 210 Mb of gap, unknown or computationally simulated sequence, together constituting 6.7% of the reference. These missing reference sequences can create observational bias, confining studies to the boundaries of the reference.
Recent studies have addressed gaps in the reference genome. For example, the Telomere-to-Telomere (T2T) consortium completed the full sequence of a haploid human genome, T2T-CHM13, which provides a continuous representation of each autosome and chromosome X. Although this is a significant achievement, no single genome can represent the genetic diversity found within humans. To overcome the bias associated with a single reference genome, a transition to a pangenome reference is required. Leading the transition is the Human Pangenome Reference Consortium (HPRC).
From genome to pangenome
In the current study, the HPRC sequenced and assembled a set of diverse individual genomes to present the draft human pangenome. This represents a subset of the planned HPRC panel, which aims to capture global genomic diversity through 700 haplotypes from 350 individuals.
To draft the pangenome reference sequence, the HPRC first assembled 47 human genomes, selected to represent global genetic diversity. The assemblies included 29 samples with long-read and linked-read sequencing data generated by the HPRC and 18 samples sequenced by others. All assemblies, data and analyses have been made publicly available. Before constructing the pangenome, the 47 genomes were annotated using a newly developed Ensembl mapping pipeline. Finally, the draft pangenome was constructed using a sequence graph representation, in which nodes correspond to segments of DNA and each assembled haplotype can be traced as a path through those nodes.
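To make the sequence graph idea concrete, here is a minimal Python sketch. It is an illustration only and not the method the HPRC used to build the draft pangenome: the class `SequenceGraph`, the helper `build_toy_graph`, the node IDs, the DNA segments and the haplotype names are all hypothetical. It also ignores strand orientation. The toy graph holds two haplotypes that differ at a single base, so the shared flanking sequence is stored once and the variant appears as two alternative nodes.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Toy sequence graph: nodes hold DNA segments, paths spell out haplotypes."""

    nodes: dict = field(default_factory=dict)  # node id -> DNA segment
    edges: set = field(default_factory=set)    # (from id, to id) adjacencies
    paths: dict = field(default_factory=dict)  # haplotype name -> list of node ids

    def add_node(self, node_id: int, segment: str) -> None:
        self.nodes[node_id] = segment

    def add_path(self, name: str, walk: list) -> None:
        # Every node on the walk must already exist in the graph.
        for node_id in walk:
            if node_id not in self.nodes:
                raise KeyError(f"unknown node {node_id}")
        self.paths[name] = list(walk)
        # Each pair of consecutive nodes on the walk becomes an edge.
        self.edges.update(zip(walk, walk[1:]))

    def spell(self, name: str) -> str:
        # Concatenate the segments along the path to recover the haplotype.
        return "".join(self.nodes[node_id] for node_id in self.paths[name])


def build_toy_graph() -> SequenceGraph:
    """Hypothetical example: two haplotypes that differ by one base."""
    graph = SequenceGraph()
    graph.add_node(1, "ACGT")  # shared left flank
    graph.add_node(2, "A")     # first allele
    graph.add_node(3, "G")     # second allele
    graph.add_node(4, "TTCA")  # shared right flank
    graph.add_path("hap1", [1, 2, 4])
    graph.add_path("hap2", [1, 3, 4])
    return graph


graph = build_toy_graph()
for name in graph.paths:
    print(name, graph.spell(name))
```

Because every haplotype is stored as a path, adding another genome to a graph like this only adds nodes where its sequence differs from what is already present, which is the property that lets a pangenome represent many individuals without favouring one of them.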
The authors presented a number of applications of the pangenome reference sequence. These include short variant discovery, reducing the mapping biases inherent in single linear reference genomes, and providing a community pangenome variant resource. They also highlighted improved tandem repeat representation, RNA sequence mapping, and chromatin immunoprecipitation and sequencing (ChIP-seq) analysis.
Building on the draft
The HPRC has presented openly accessible, diverse assemblies and pangenome graphs, forming a draft pangenome reference. The authors recognise that this reference is very much a draft, with many challenges remaining in its growth and refinement. The near-term goals are to expand to a diverse cohort of 350 individuals, push towards T2T genomes and refine the alignment methods so that telomere-to-telomere alignment is possible.
With these developments, a more diverse human reference genome can improve understanding of genomics and the ability to predict, diagnose and treat disease. The diversity of the reference pangenome should also help to ensure that the application of genomics and precision medicine is effective for all populations.