Written by Vered Smith, Science Writer
A paper published in Science presents the sequence of the entire human genome, including the 8% that could not be accessed until now. Using Oxford Nanopore sequencing and PacBio HiFi reads, researchers from the Telomere-to-Telomere (T2T) Consortium sequenced the complete 3.055 billion–base pair human genome, including centromeres and telomeres. This will enable studies of variation across the entire human genome, driving new discoveries in genomic medicine.
Human Genome Reference
The current human reference genome (GRCh38) was released in 2013. It originated from the Human Genome Project and has been continually improved since then. It was assembled from sequenced bacterial artificial chromosomes that were ordered and oriented along the human genome using radiation hybrid, genetic linkage, and fingerprint maps. However, these methods had limitations, including underrepresentation of repetitive sequences.
As a result, GRCh38 still contains large gaps, including human satellite repeat arrays and the short arms of all five acrocentric chromosomes. To complete the reference genome, scientists sequenced and assembled the genome of the CHM13hTERT cell line.
This new research was enabled by PacBio HiFi and Oxford Nanopore sequencing. These long-read shotgun sequencing methods read long fragments of DNA: HiFi reads are highly accurate, while Nanopore reads can span very long stretches of sequence. Together they overcome past limitations.
Results of CHM13
The new CHM13 genome contains 3,054,815,472 base pairs of nuclear DNA and 16,569 base pairs of mitochondrial DNA. Of this, 182 mega–base pairs (Mbp) of sequence had no primary alignments to GRCh38 and were exclusive to CHM13. Of the new genes in CHM13, 140 were predicted to be protein-coding based on their GENCODE paralogs. CHM13 also uncovered the genomic structure of the short arms of the five acrocentric chromosomes, which until now had been mostly unsequenced despite their importance in cellular function.
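To put these figures in proportion, here is a rough back-of-the-envelope calculation in Python. It uses only the numbers quoted above. The variable and function names are illustrative, and the 182 Mbp figure is treated as a rounded value, so the resulting percentage is approximate.

```python
# Figures quoted in the article for the CHM13 assembly.
NUCLEAR_BP = 3_054_815_472   # nuclear DNA, in base pairs
MITO_BP = 16_569             # mitochondrial DNA, in base pairs
NOVEL_BP = 182_000_000       # ~182 Mbp with no primary alignment to GRCh38 (rounded)


def total_assembly_bp(nuclear_bp: int, mito_bp: int) -> int:
    """Total assembly length: nuclear plus mitochondrial base pairs."""
    return nuclear_bp + mito_bp


def share_of_nuclear(bp: int, nuclear_bp: int) -> float:
    """Fraction of the nuclear genome represented by `bp` base pairs."""
    return bp / nuclear_bp


total = total_assembly_bp(NUCLEAR_BP, MITO_BP)
novel_share = share_of_nuclear(NOVEL_BP, NUCLEAR_BP)

print(f"Total assembly length: {total:,} bp")            # 3,054,832,041 bp
print(f"Share with no GRCh38 alignment: {novel_share:.1%}")  # about 6.0%
```

By this measure, the sequence with no primary alignment to GRCh38 makes up roughly 6% of the nuclear genome.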
The scientists also reanalysed 3,202 short-read datasets and demonstrated that using CHM13 as the reference reduced false-negative and false-positive variant calls. This improvement is due to the addition of 182 Mbp of sequence and the exclusion of 1.2 Mbp of falsely duplicated sequence in GRCh38.
The researchers then produced a collection of annotations and omics datasets for CHM13, including RNA sequencing, Iso-Seq, precision run-on sequencing (PRO-seq), cleavage under targets and release using nuclease (CUT&RUN), and Oxford Nanopore (ONT) methylation experiments. These datasets are available via a centralized genome browser.
One example of improved research opportunities the scientists noted is FRG1, a duplicated gene linked to facioscapulohumeral muscular dystrophy (FSHD). The CHM13 reference has 23 copies of FRG1, compared to only nine copies in GRCh38. This additional information provides a better foundation for researching this disease.
It should be noted that this genome is not based on a typical diploid human cell, but on a cell line derived from a complete hydatidiform mole, whose genome is effectively haploid. Nevertheless, it is a very exciting development and a huge leap forward. As Gordon Sanghera, CEO of Oxford Nanopore Technologies, commented, “The complete, telomere-to-telomere assembly of a human genome marks the next era of genomics and opens up huge research potential in human health and disease.”
Image Credit: Canva