Different sequencing platforms produce reads of different lengths. Short-read and long-read sequencing each have their own benefits and drawbacks, depending on what the experiment aims to accomplish.
Short-read sequencing
Short-read technologies carry out sequencing by synthesis or sequencing by ligation. These strategies use DNA polymerase or DNA ligase, respectively, to extend numerous DNA strands in parallel. Nucleotides can either be provided one type at a time or be modified with identifying tags.
Short-read sequencing technologies can be further categorized as either single molecule-based, involving the sequencing of a single molecule, or ensemble-based, which is the sequencing of multiple identical copies of a DNA molecule that have usually been amplified together on isolated beads.
Furthermore, these methods can be either real-time or synchronous-controlled. Real-time short-read sequencing uses a free-running DNA polymerase that is supplied with all four nucleotides at once. This method therefore has to identify each newly incorporated nucleotide as it is added, without interrupting the synthesis process. This can be done using optical or physical techniques that distinguish tagged nucleotides that are free from those that are bound; the bound nucleotides are assumed to be involved in DNA synthesis. Synchronous-controlled approaches instead carry out identification in an interrupted, stepwise fashion, using prior knowledge of what was supplied at each step. This can be achieved by adding a single type of nucleotide at a time or by using reversible-terminator nucleotides.
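The single-nucleotide-at-a-time idea can be illustrated with a toy simulation. The sketch below is not any vendor's algorithm; the function names, the `TACG` flow order and the example template are assumptions made for illustration, and strand orientation, signal noise and incomplete incorporation are all ignored. Each flow offers one nucleotide type, and the signal is the number of bases incorporated during that flow.

```python
# Toy model of synchronous-controlled sequencing with single-nucleotide flows.
# Hypothetical names and flow order; strand direction and noise are ignored.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flowgram(template: str, flow_order: str = "TACG", n_cycles: int = 4) -> list[tuple[str, int]]:
    """Return (flowed nucleotide, number incorporated) for every flow."""
    pos = 0  # next template position waiting to be copied
    signals = []
    for _ in range(n_cycles):
        for base in flow_order:
            count = 0
            # The polymerase keeps extending while the flowed nucleotide
            # pairs with the next template base (this handles homopolymers).
            while pos < len(template) and COMPLEMENT[template[pos]] == base:
                count += 1
                pos += 1
            signals.append((base, count))
    return signals


def call_bases(signals: list[tuple[str, int]]) -> str:
    """Rebuild the synthesized strand from the per-flow signals."""
    return "".join(base * count for base, count in signals)


signals = flowgram("ATTGC", n_cycles=1)
print(signals)              # [('T', 1), ('A', 2), ('C', 1), ('G', 1)]
print(call_bases(signals))  # TAACG
```

Because the instrument knows which nucleotide was flowed, only the size of the signal needs to be measured. The run of two A's shows why homopolymers are harder to call: the signal has to be quantified, not just detected.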
Here are some examples of short-read sequencing technologies:
- Illumina – DNA fragments bound to a flow cell are clonally amplified by bridge amplification to form clusters. Fluorescently labelled reversible-terminator deoxynucleoside triphosphates are then added, and each incorporated base is identified by imaging.
- 454 pyrosequencing – Clonal amplification is done by emulsion PCR, generating microbead-bound DNA clones. Emulsion PCR amplifies DNA in physically separated water-in-oil droplets, avoiding unwanted reactions between similar DNA sequences. DNA polymerase is added and nucleotides are flowed over the beads one type at a time, with a wash between flows. Deoxynucleoside triphosphate incorporation is then detected through the pyrophosphate that is released.
- Ion Torrent – Clonal amplification is done by emulsion PCR, generating microbead-bound DNA clones. DNA polymerase is added, and nucleotides are flowed in one type at a time, with a wash between flows. Deoxynucleoside triphosphate incorporation is then detected by a pH sensor as protons are released.
- SOLiD – Clonal amplification is done by emulsion PCR. The microbead-bound DNA template is flanked by adapters; a primer hybridizes to the adapter, and DNA ligase joins fluorescently labelled probes to the growing complementary strand.
- cPAL – The anchor sequence and probes hybridize to the DNA template in a series of ligation reactions taking place on a nanoball.
All of these technologies share a common limitation: the inability to sequence long stretches of DNA. To sequence a large stretch of DNA using NGS, such as a human genome, the DNA has to be fragmented and amplified. Computer programs are then used to assemble the resulting reads into a continuous sequence. Unfortunately, the amplification steps can introduce biases into the samples. Short reads can also fail to overlap sufficiently with one another, which leaves the assembler unable to join them. Overall, this means that sequencing a highly complex and repetitive genome, like that of a human, can be challenging using these technologies.
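To make the overlap problem concrete, here is a minimal greedy overlap assembler. It is a teaching sketch under strong assumptions (error-free reads, a single strand, exact suffix-prefix matches) and is not how production assemblers work; the function names and example reads are hypothetical.

```python
# Minimal greedy overlap assembly. Assumes error-free, same-strand reads.
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left that overlaps enough to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads collapse into one contig.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
# Without the middle read there is no sufficient overlap, so two contigs remain.
print(greedy_assemble(["ATGGCGT", "TACCTGA"]))
```

The second call shows the failure mode described above: when reads do not overlap by at least `min_len` bases, the sequence stays fragmented. Repeats longer than the read length cause a related problem, because several different merges look equally valid.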
The multistep library preparation process is also a burden. For ensemble-based short-read sequencing, sample preparation usually involves:
Step #1: Extraction and purification of the DNA from the samples
Step #2: Fragmentation of the DNA
Step #3: Repair of frayed ends of the DNA
Step #4: Addition of adapters with ligases or transferases for solid-phase attachment
Step #5: Amplification of a single DNA molecule to generate millions of identical copies
Emulsion PCR and magnetic bead strategies help streamline this laborious process, but labs do not yet fully exploit them because of the high costs.
In the last few years, there have been significant breakthroughs in the short-read space, including the 2022 announcement by Ultima Genomics that it had achieved the $100 genome, just eight years after Illumina’s HiSeq X Ten Sequencer breakthrough in 2014. Illumina has since countered with the launch of the NovaSeq X Series, which promises to generate more than 20,000 whole genomes per year. Interestingly, PacBio has also recently entered the short-read space with the Onso system. This system is based on its sequencing by binding (SBB) technology and adds yet another element of competition to a market once dominated by Illumina.
Long-read sequencing
Long-read sequencing technologies can produce much longer reads, typically between 5,000 and 30,000 base pairs. They therefore immediately address one of the main challenges faced by short-read sequencing. They sequence a single molecule, which eliminates amplification bias, and they generate reads long enough to overlap with one another, which improves sequence assembly.
There are two predominant long-read sequencing technologies. Pacific Biosciences has developed a Single Molecule Real-Time (SMRT) sequencer that generates reads that can exceed 10,000 bases in less than two hours. The sequencing takes place on a chip of zero-mode waveguides, tiny structures that create highly confined optical observation volumes, each with a DNA polymerase fixed at the bottom. Before sequencing, hairpin adapters are attached to both ends of the double-stranded DNA molecule, forming a closed template that is topologically a single-stranded circle. DNA polymerase synthesizes a complementary strand, and the fluorescence emitted as each nucleotide is incorporated is measured to identify it. In October 2022, PacBio unveiled its new long-read sequencing system, Revio, which builds on SMRT technology and, according to the company, delivers 15 times more HiFi data and human genomes at scale for less than $1,000.
Oxford Nanopore Technologies’ platform is capable of producing reads of up to 1 million base pairs! This method relies on changes in the ion flow as nucleotides pass through a nanopore. Essentially, the DNA molecule is threaded through a bioengineered channel in a biological membrane – the electrical current across the channel is dependent on the specific nucleotide that is passing through at that time. This is then used to determine the base sequence.
A downside to long-read sequencing is that the accuracy per read can be much lower than that of short-read sequencing. The high error rate of nanopore technology is largely due to the inability to control the speed of the DNA molecules through the pore; these errors are systematic. Errors in SMRT sequencing, by contrast, are largely random. They can be reduced by circular consensus sequencing, in which the polymerase reads the same circular DNA molecule several times and the passes are combined into a consensus. By doing this, SMRT can generate reads with an accuracy of at least 99.8%, similar to short-read NGS platforms.
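Why repeated passes help with random errors can be shown with a simple probability model. The sketch below assumes that every pass miscalls a given base independently with the same probability, and that the consensus is a plain majority vote that fails only when more than half of the passes are wrong. Real consensus callers are more sophisticated, and the 10% per-pass error used here is an illustrative number, not a platform specification.

```python
# Majority-vote model of consensus accuracy. All parameters are illustrative assumptions.
from math import comb


def consensus_error(per_pass_error: float, passes: int) -> float:
    """Probability that a majority of independent passes miscall a base."""
    if passes < 1 or passes % 2 == 0:
        raise ValueError("use an odd number of passes so the vote cannot tie")
    p = per_pass_error
    need = passes // 2 + 1  # wrong calls needed to outvote the correct ones
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(need, passes + 1)
    )


for n in (1, 3, 5, 9):
    print(f"{n} passes: consensus accuracy {1 - consensus_error(0.10, n):.4%}")
```

Under these assumptions the per-base error drops from 10% with one pass to 2.8% with three, and it keeps falling as passes are added. The same argument does not apply to systematic errors, which recur in every pass and therefore survive the vote.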
There are also problems with applying long-read sequencing to different genome lengths, as the data processing takes longer for organisms with larger genomes.
Despite this, progress is being made in the long-read space. PacBio describes its HiFi sequencing method as a new type of long-read sequencing technology with an accuracy of 99.9%, on par with short reads and Sanger sequencing. This method is now available on its Revio long-read sequencing platform. In January 2022, Illumina stepped into the long-read market, unveiling a new, high-performance long-read assay. In September 2022, Illumina launched the Illumina Complete Long-Reads system (formerly known as “Infinity”), which promises to accelerate access to the remaining 5% of the genome without the need for a new platform. It is designed for WGS and is compatible with all NovaSeq systems. Illumina’s move into the long-read space certainly indicates exciting times are ahead.
Combining sequencing technologies
Some challenges can be resolved by using a combination of short-read and long-read technologies. Silvia Fuselli, Assistant Professor of Genetics at the University of Ferrara in Italy, explained: “The assembly of short-reads for the major histocompatibility complex II DRB is almost impossible, especially in non-model species with no genomic reference available”. Genes in this complex encode molecules that are important in initiating immune responses, and they are highly variable. According to Fuselli, Oxford Nanopore sequencers were used to cover the whole region without interruption; however, the error rate of the long reads was too high for reliable variant calling. The team therefore used short-read sequencing to correct errors in certain regions of the long reads. “In our specific experiment, the two technologies compensate each other and using both dramatically increased the efficiency of our approach in terms of time and produced information”, she said.
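The general idea of using accurate short reads to correct a noisy long read can be sketched as a per-position majority vote. This is not the pipeline Fuselli's team used, which the article does not describe. The sketch assumes the short reads have already been aligned to the long read at known offsets with no insertions or deletions; the `polish` function, its `min_depth` parameter and the example sequences are hypothetical.

```python
# Toy short-read polishing of a long read. Assumes gap-free alignments at known offsets.
from collections import Counter


def polish(long_read: str, aligned_short_reads: list[tuple[int, str]], min_depth: int = 2) -> str:
    """Replace long-read bases with the short-read majority where coverage allows."""
    columns = [Counter() for _ in long_read]
    for offset, read in aligned_short_reads:
        for i, base in enumerate(read):
            if 0 <= offset + i < len(long_read):
                columns[offset + i][base] += 1

    polished = []
    for original, votes in zip(long_read, columns):
        depth = sum(votes.values())
        if depth >= min_depth:
            base, count = votes.most_common(1)[0]
            # Only overwrite when a strict majority of short reads agree.
            polished.append(base if 2 * count > depth else original)
        else:
            polished.append(original)  # too little short-read evidence
    return "".join(polished)


noisy_long_read = "ACGTTAGCA"  # position 4 carries an error in this example
short_reads = [(0, "ACGTA"), (2, "GTAAG"), (4, "AAGCA")]
print(polish(noisy_long_read, short_reads))  # ACGTAAGCA
```

Positions covered by fewer than `min_depth` short reads keep the long-read base, which reflects the trade-off in hybrid approaches: the long read supplies the uninterrupted scaffold, and the short reads supply accuracy only where they can be placed reliably.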
Both long-read and short-read methods of sequencing have pros and cons. Therefore, achieving the best results may sometimes require using both types of technologies. 10x Genomics have developed a method called ‘Linked-Reads’, which essentially provides long-range information from short-read sequencing data. It uses molecular barcodes to tag short-reads that originate from the same long DNA fragment, making it possible to link all of the short-reads together.
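The barcode-linking step can be pictured as a simple grouping operation. The sketch below assumes each short read arrives as a (barcode, sequence) pair; the barcode strings, read sequences and function name are made up for illustration and do not reflect the 10x Genomics data format.

```python
# Grouping short reads by molecular barcode. Hypothetical data layout and names.
from collections import defaultdict


def group_by_barcode(reads: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Collect the short reads that share a barcode, i.e. the same source fragment."""
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


reads = [
    ("BC01", "ATGGCGT"),
    ("BC02", "TTAGGCA"),
    ("BC01", "TACCTGA"),
    ("BC02", "GGCATTC"),
]
for barcode, members in group_by_barcode(reads).items():
    print(barcode, members)
```

Reads that share a barcode are inferred to come from the same long DNA fragment, so they can be treated as neighbours during assembly or phasing even when their sequences do not overlap.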
Linked-reads allow for significantly increased physical coverage with only a slight increase in the amount of standard sequencing, ultimately providing better access to typically inaccessible parts of the genome and overcoming several of the limitations of short-read sequencing. 10x Genomics discontinued its kits for generating linked-reads in 2020. However, the concept of combining short- and long-read technologies to gain a deeper understanding of the genome still stands.