Thanks to the introduction of high-throughput next-generation sequencing (NGS) methods, generating nucleic acid sequencing data has never been easier. DNA and RNA samples can now be sequenced within hours, producing millions of short sequences (known as reads) that can then be aligned to a reference genome.
Quality control is an essential step in any NGS workflow, allowing the integrity and quality of the data to be checked before downstream analysis and interpretation. Depending on the sequencing technology selected, additional data processing steps may also be required to remove low-quality reads from the dataset.
What causes poor-quality NGS data?
Several biological and technical factors may influence the quality and read depth of sequencing data. These may include:
- Poor quality of starting material – Assessing the quality of the starting material is a critical first step of the NGS workflow. Sample concentration and purity will affect the downstream library preparation and sequencing steps, so ensuring that the starting material is high quality is vital to NGS success. This is especially important in RNA sequencing, where sample contamination is a common issue.
Nucleic acid concentration can be measured directly after extraction using a spectrophotometer, such as the Thermo Scientific NanoDrop. By measuring the UV absorbance of the sample, spectrophotometers also provide an A260/A280 ratio, which indicates whether the sample is contaminated. Generally, a ratio of ~1.8 or higher indicates a high-purity DNA sample, whereas for RNA a ratio of ~2.0 is desirable [1].
Alternatively, nucleic acids can be quantified and assessed using electrophoresis-based methods. A popular instrument for this is the Agilent TapeStation, which can produce an RNA integrity number (RIN) [2]. Ranging from 1 (low integrity) to 10 (high integrity), this score provides an assessment of RNA quality, identifies sample degradation and provides a standardized way to compare samples.
- Improper library preparation – After nucleic acid is extracted, the next step of the NGS workflow is library preparation. Protocols for library preparation vary depending on the sample type, sequencing method and NGS platform used. However, all library preparation methods benefit from careful quality control checks. This ensures that samples meet the specific requirements set by the NGS provider.
Specifically, library preparation quality control checks are used to determine the size, distribution and integrity of the library before sequencing. Selecting an NGS library preparation kit that is compatible with both the sample and the downstream sequencing method is also essential. Some RNA sequencing pipelines require certain RNA species to be removed or enriched during library preparation (e.g. rRNA depletion or poly(A) selection), so that the RNA population of interest can be sequenced more readily.
Sample contamination is another pressing concern in NGS. When sequencing multiple samples, cross-contamination during library preparation can significantly impact the downstream sequencing analysis. Contamination can be minimized by reducing the number of library preparation steps and introducing automated library preparation where possible.
- Technical sequencing errors – No sequencing instrument is perfect. Errors can occur with sequencing instruments or the associated hardware mid-run, which could affect the quality of the read data produced. If this should occur, contact your sequencing platform provider for tailored troubleshooting advice.
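The A260/A280 purity check described under the first factor above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the `tolerance` parameter are assumptions made for this example, and the target ratios (~1.8 for DNA, ~2.0 for RNA) are the ones quoted in the text.

```python
def a260_a280_ratio(a260: float, a280: float) -> float:
    """Return the A260/A280 absorbance ratio for one sample."""
    if a280 <= 0:
        raise ValueError("A280 must be a positive absorbance value")
    return a260 / a280


def looks_pure(a260: float, a280: float, sample_type: str,
               tolerance: float = 0.1) -> bool:
    """Check a sample against the approximate target ratio for its type.

    The targets come from the article text; `tolerance` is a hypothetical
    allowance added here because the targets are only approximate.
    """
    targets = {"DNA": 1.8, "RNA": 2.0}
    ratio = a260_a280_ratio(a260, a280)
    return ratio >= targets[sample_type] - tolerance


# Example absorbance readings (made-up values for illustration).
print(looks_pure(1.85, 1.0, "DNA"))
print(looks_pure(1.50, 1.0, "RNA"))
```

In practice the acceptance range should follow the requirements of the sequencing provider rather than a fixed tolerance.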
Quality control metrics
There are numerous metrics that can be used to assess the quality of raw sequencing reads before secondary analysis:
- Yield: The total number of reads per run.
- Read analysis: Read length, GC content, adapter content and duplication can be measured to give an indication of data quality.
- Q score: A score that expresses the probability (P) that a base was called incorrectly during the run, calculated as Q = –10 log10(P). A Q score above 30 (less than a 1 in 1,000 chance of an incorrect base call) is generally considered good quality for most sequencing experiments.
- Error rate (%): The percentage of bases incorrectly called during one cycle. As read length increases, error rate also increases.
- Clusters Passing Filter (%): Illumina sequencers provide an indication of signal purity. The clusters passing filter (PF) percentage refers to the proportion of clusters that passed the “chastity filter” of the instrument [3]. Generally, a lower PF percentage is associated with lower yield.
- Phas/Prephas (%): Another commonly used Illumina metric, referring to the percentage of base signal lost in each sequencing cycle. This is calculated by analysing the number of clusters that fall behind (phasing) or move ahead (prephasing) during sequencing.
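The Q score relationship listed above, Q = –10 log10 P, can be made concrete with a short sketch. The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Q score into the probability of an incorrect base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability into a Q score: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("P must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the error probability and base call accuracy for common thresholds.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.4%}, accuracy {1 - p:.2%}")
```

Because the scale is logarithmic, each increase of 10 in the Q score corresponds to a tenfold drop in the error probability: Q20 is a 1 in 100 chance of an incorrect call, and Q30 is 1 in 1,000.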
In addition to the quality control tools that may be integrated into sequencing instruments, there are a number of computational tools that are freely available for raw read data quality analysis. These tools may be run from the command line, or alternatively, many popular quality control tools are available to use on web-based platforms such as Galaxy.
The basics of FASTQ format
Commonly, sequencing instruments produce raw read data in FASTQ format (.fastq). Like a FASTA file, FASTQ files contain nucleotide sequence information, presented as a line of A, T, G or C bases with N representing an unknown base. However, in addition to sequence information, FASTQ files also contain the quality score for each base in the sequence (see Figure 1). These numbers are represented using ASCII characters, ranging from ! (quality score of 0) to K (quality score of 42).
Figure 1: An example of a typical FASTQ file. Image sourced from Hosseini et al., 2016 [4].
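To illustrate how the ASCII quality characters map back to numbers, here is a minimal FASTQ reader in Python. It assumes the common four-line record layout (no wrapped sequence lines) and an ASCII offset of 33, which is consistent with ! encoding a score of 0 and K encoding 42; the function names and the example record are made up for this sketch.

```python
import io
from typing import Iterator, List, TextIO, Tuple

PHRED_OFFSET = 33  # assumption: '!' (ASCII 33) encodes a quality score of 0


def decode_quality(quality_string: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Turn a FASTQ quality string into a list of integer quality scores."""
    return [ord(char) - offset for char in quality_string]


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        yield header[1:], sequence, decode_quality(quality)


# A single made-up record, read from memory instead of a file on disk.
example = io.StringIO("@read1\nGATTACAN\n+\nIIIIKK5!\n")
for name, seq, quals in read_fastq(example):
    print(name, seq, quals)
```

For real datasets, an established parser (for example, the one in Biopython) is usually a safer choice than a hand-written reader.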
FastQC: A quick look at the raw read data
One of the most well-known quality analysis tools is FastQC, a program designed to spot potential problems in raw read data from high-throughput sequencing. Once sequencing data is imported as a BAM, SAM or FASTQ file, FastQC generates summary data about read length, quality and base content. This information is displayed as graphs and plots, which provide a quick overview of the quality of the data and possible areas of concern.
In particular, the “per base sequence quality” graph is a helpful indicator of overall read quality (see Figure 2). This graph shows the distribution of quality scores (ranging from 0 to >40) at each position in the read, across all reads generated. Read quality scores of >20 are generally considered acceptable for most applications. Typically, read quality decreases as the length of the read increases. Any abnormal decrease in read quality could be indicative of an error with the sequencing instrument during the run.
Figure 2: An example of a FastQC report showing poor per base sequence quality. Image sourced from Andrews, 2023 [5].
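The idea behind a per base sequence quality summary can be sketched as follows. This is not FastQC's implementation, only a simplified illustration that computes the mean quality at each read position and flags positions that fall below a threshold of 20. The function names are hypothetical and an ASCII offset of 33 is assumed.

```python
from statistics import mean
from typing import List


def per_position_mean_quality(quality_strings: List[str],
                              offset: int = 33) -> List[float]:
    """Mean quality score at each read position, across all reads."""
    columns: List[List[int]] = []
    for quality in quality_strings:
        for position, char in enumerate(quality):
            if position == len(columns):
                columns.append([])  # first read that reaches this position
            columns[position].append(ord(char) - offset)
    return [mean(column) for column in columns]


def positions_below(means: List[float], threshold: float = 20.0) -> List[int]:
    """Return 1-based read positions whose mean quality is below threshold."""
    return [index + 1 for index, value in enumerate(means) if value < threshold]


# Three made-up quality strings whose scores fall towards the 3' end.
qualities = ["IIII5", "III5+", "II5++"]
means = per_position_mean_quality(qualities)
print(means)
print(positions_below(means))
```

FastQC itself reports a fuller distribution at each position, so a mean alone should be treated as a rough first look.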
Read trimming and filtering
If the initial quality control readouts look acceptable, a common next step is to trim and filter the reads to remove low-quality data. Typically, base quality decreases towards the 3’ end of the read (occasionally referred to as the “tail”) due to the sequencing process. Including poor-quality bases in the dataset could reduce the accuracy of the downstream mapping algorithms. Reads with poor-quality tails should therefore be trimmed, leaving only high-quality sequence for alignment. This helps to maximise the number of reads that can be successfully aligned to a reference genome.
Various tools can be used to trim and filter low-quality reads. Popular packages include Cutadapt and FASTQ Quality Trimmer, both of which can be used in Galaxy by specifying a quality threshold (commonly set to 20). Bases with a quality score below this threshold are then trimmed from the ends of each read. Subsequently, the reads can be filtered to remove any that fall below a certain length (e.g. <20 bases).
Additionally, adapter sequences may be present within the read data and need to be removed. For example, Illumina sequencing utilises short adapter sequences ligated to the 3’ and 5’ ends of the DNA fragment, which are a necessary component of the sequencing reaction. Adapter contamination can occur when the DNA fragment being sequenced is shorter than the read length, resulting in the adapter sequence being incorporated into the read.
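The trim-then-filter logic can be sketched in Python as shown here. This is a deliberately simple illustration, not the algorithm used by any particular tool: it removes bases from the 3’ end until it reaches one that meets the threshold, then discards reads that have become too short. Real trimmers use more robust approaches that differ between tools. The function names are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, List[int]]  # (sequence, per-base quality scores)


def trim_3prime(sequence: str, qualities: List[int],
                threshold: int = 20) -> Read:
    """Remove trailing bases whose quality score is below the threshold."""
    end = len(qualities)
    while end > 0 and qualities[end - 1] < threshold:
        end -= 1
    return sequence[:end], qualities[:end]


def trim_and_filter(reads: List[Read], threshold: int = 20,
                    min_length: int = 20) -> List[Read]:
    """Trim every read, then keep only reads of at least min_length bases."""
    kept = []
    for sequence, qualities in reads:
        trimmed = trim_3prime(sequence, qualities, threshold)
        if len(trimmed[0]) >= min_length:
            kept.append(trimmed)
    return kept


# Made-up reads; min_length is lowered so the short example survives.
reads = [("ACGTAC", [30, 30, 30, 30, 10, 5]), ("ACG", [5, 5, 5])]
print(trim_and_filter(reads, threshold=20, min_length=4))
```

Note that this sketch only trims the 3’ end, so a low-quality base in the middle of a read is left in place.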
Removal of adapter sequences is a relatively easy process. If the adapter sequence is known, tools such as Cutadapt or Trimmomatic can be run to remove unwanted adapter regions from the read data. The adapter sequences used in Illumina sequencing are listed in Illumina’s adapter sequences documentation. Such tools can also be used to trim other undesirable sequences from the reads, including poly(A) tails and primers. Note that the order of quality trimming and adapter removal depends on the tool: Cutadapt, for example, performs quality trimming before adapter removal when both are requested in a single run.
As a final step, the cleaned read files can be analysed again using FastQC to determine whether read quality has improved. Importantly, no adapter dimers should remain in the dataset.
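The principle of 3’ adapter removal can be illustrated with a short Python sketch. It handles two cases: the full adapter appearing inside the read, and only the start of the adapter appearing at the very end of the read. It uses exact matching only, whereas dedicated tools also tolerate sequencing errors in the adapter. The function name, the `min_overlap` parameter and the adapter string are illustrative assumptions; always take the actual adapter sequence from the kit or platform documentation.

```python
def remove_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter (exact matches only) and everything after it."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")

    # Case 1: the complete adapter occurs somewhere inside the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]

    # Case 2: the read ends part-way through the adapter, so only a prefix
    # of the adapter is present. Try the longest possible overlap first.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]

    return read  # no adapter detected


ADAPTER = "AGATCGGAAGAGC"  # illustrative value; check your kit documentation

print(remove_3prime_adapter("ACGTACGT" + ADAPTER + "TTT", ADAPTER))
print(remove_3prime_adapter("ACGTACGTAGATC", ADAPTER))
print(remove_3prime_adapter("ACGTTTTT", ADAPTER))
```

The `min_overlap` setting is a trade-off: a small value removes more partial adapters but also removes a few genuine bases that happen to match the start of the adapter by chance.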
Long-read quality control
Long reads generated on platforms such as those from Oxford Nanopore Technologies are best analysed with dedicated packages that can handle their long and variable lengths. One such quality control tool, NanoPlot, is a popular way to generate high-quality plots that visualise long-read quality and length. NanoPlot also provides a statistical summary that outlines the key features of the dataset. Similarly, pycoQC is a tool designed specifically for Oxford Nanopore data that produces interactive and highly customisable quality control plots for long-read datasets.
To trim and filter long reads, NanoFilt (or its updated counterpart, Chopper) is a helpful tool that can be run straight from the command line. If using Galaxy, a number of tools are available for Oxford Nanopore data (including Porechop for adapter removal and Filtlong for read filtering), all of which are included in the NanoGalaxy workflow.
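As a rough idea of what a read-length summary for a long-read dataset contains, the sketch below computes a few common statistics from a list of read lengths. It is not the output format of any specific tool. The N50 statistic (the length at which reads of that size or longer contain at least half of all sequenced bases) is included here as an assumption about what such summaries typically report, and the function names are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length L such that reads of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("At least one read length is required")


def summarise_lengths(lengths: List[int]) -> Dict[str, float]:
    """Collect simple summary statistics for a set of read lengths."""
    return {
        "reads": len(lengths),
        "total_bases": sum(lengths),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "n50": n50(lengths),
    }


# Made-up read lengths (in bases) showing the variability of long reads.
print(summarise_lengths([500, 1200, 3500, 8000, 15000, 42000]))
```

Because long-read length distributions are often highly skewed, the median and N50 tend to be more informative than the mean alone.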
References
1. Matlock, B. Assessment of nucleic acid purity. Thermo Fisher Scientific. https://assets.thermofisher.com/TFS-Assets/CAD/Product-Bulletins/TN52646-E-0215M-NucleicAcid.pdf.
2. Mueller, O., Lightfoot, S. & Schroeder, A. RNA integrity number (RIN): standardization of RNA quality control. https://www.agilent.com/cs/library/applications/5989-1165EN.pdf (2006).
3. Illumina Inc. Optimizing cluster density on Illumina sequencing systems. https://www.illumina.com/content/dam/illumina-marketing/documents/products/other/miseq-overclustering-primer-770-2014-038.pdf (2016).
4. Hosseini, M., Pratas, D. & Pinho, A. J. A survey on data compression methods for biological sequences. Information 7(4), 56 (2016).
5. Andrews, S. FastQC: a quality control tool for high throughput sequence data. Babraham Bioinformatics. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (accessed 13 October 2023).