RNA sequencing is an NGS technique that enables investigation of the transcriptome – the total cellular content of RNAs, including messenger RNA (mRNA), ribosomal RNA (rRNA) and transfer RNA (tRNA). Exploring the transcriptome is crucial for understanding the connection between the genome and its functional protein expression. RNA sequencing can reveal which genes are turned on in a cell, what their level of expression is and when they are activated. This allows scientists to assess changes in cell biology that may cause disease. In addition, the transcriptome captures information about splicing events and post-transcriptional modifications that occur during mRNA processing, such as polyadenylation and 5’ capping, none of which can be detected by DNA sequencing.
Recent advances in the RNA sequencing workflow, from sample preparation to data analysis, have enabled researchers to further investigate the functional complexity of transcription. Typically, mRNA molecules are the most studied RNA species because they encode proteins. However, recent advances in NGS technology have also drawn attention to non-coding RNAs, such as:
- rRNAs and tRNAs (involved in mRNA translation).
- Small nuclear RNAs (involved in splicing).
- Small nucleolar RNAs (involved in the modification of rRNAs).
Even more recently, novel classes of small non-coding RNAs, such as microRNA (miRNA) and piwi-interacting RNA (piRNA), have been discovered and are now being examined. Both miRNAs and piRNAs regulate gene expression at the post-transcriptional level. Long non-coding RNAs (lncRNAs) are another newly studied group of non-coding RNAs, with functions including chromatin remodelling, transcriptional control and post-transcriptional processing.
The first step in transcriptome sequencing is the isolation of RNA from a biological sample. Next, the RNA sequencing library must be constructed – the process can vary greatly depending on which RNA species are selected and which NGS platform is used. Typically, library construction consists of reverse transcribing the desired RNA molecules into complementary DNA (cDNA), followed by fragmentation and amplification of the cDNA and attachment of adapters.
Several NGS platforms are commercially available for RNA sequencing. Currently, the Illumina HiSeq platform is the most commonly applied NGS technology for RNA sequencing – the sequencing takes between 1.5 and 12 days to complete, depending on the total read length of the library. Recently, Illumina released the MiSeq, which is a desktop sequencer that generates around 30 million paired-end reads in 24 hours, offering a rapid turnaround for transcriptome sequencing.
Transcriptome analysis is necessary to transform a collection of short sequencing reads into a set of full-length transcripts that can be studied to provide an insight into the huge complexity of transcriptomes. Theoretically, longer sequencing reads make it simpler to assemble transcripts and enable the accurate detection of alternative splicing isoforms, which are different transcripts produced from the same gene when its exons are joined in different combinations.
Typically, following RNA sequencing, reads are aligned to a reference genome and then assembled into transcripts using reference transcript annotations. Then, the expression level of each gene is estimated by counting the number of reads that align to each exon or full-length transcript.
Steps of transcriptome analysis:
Step #1: Read alignment
Conventional read mapping algorithms that are used for mapping DNA sequencing reads are not recommended for RNA sequencing reads because they cannot handle spliced transcripts. Instead, reads are usually mapped with a ‘splicing-aware’ aligner, many of which have been developed specifically for mapping transcriptome data. Some examples include GSNAP, MapSplice, RUM, STAR and TopHat. Each has different advantages, depending on the overall objectives of the RNA sequencing study.
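To illustrate what a splicing-aware aligner reports, the sketch below reads the CIGAR string of one aligned read in SAM format. In SAM, the `N` operation marks a region of the reference that the read skips, which for RNA reads aligned to a genome usually corresponds to an intron. The function name `splice_junctions` and the example coordinates are assumptions made for illustration; this is not code from any of the aligners named above.

```python
import re

# Matches one CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive, for one alignment.

    pos   -- 1-based leftmost reference position (SAM POS field)
    cigar -- CIGAR string (SAM CIGAR field)
    """
    junctions = []
    ref = pos  # current position on the reference
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Skipped reference region: the read spans an intron here.
            junctions.append((ref, ref + length - 1))
        if op in "MDN=X":
            # Only these operations advance along the reference.
            ref += length
    return junctions


# Hypothetical read: 50 bases in one exon, a 200-base intron, 50 bases in the next exon.
print(splice_junctions(100, "50M200N50M"))
```

A read with no `N` operation returns an empty list, which is how an unspliced alignment would look.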
Step #2: Transcript assembly
After the RNA sequences have been aligned, the mapped reads are assembled into transcripts using computational programmes that infer transcript models from a reference genome. Reconstructing transcripts from short-read data is challenging because of the complexity of the transcriptome, and this affects assembly quality. Software is then used to estimate gene expression levels. Cufflinks and MISO are computational tools that quantify expression by counting the number of reads that map to full-length transcripts. HTSeq quantifies expression by counting the number of reads that map to an exon.
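The idea of exon-based counting can be sketched in a few lines. The example below is a simplified illustration only, not the algorithm used by HTSeq or any other tool: it assumes reads and exons are given as plain coordinate tuples, counts a read for a gene if it overlaps any of that gene's exons, and discards reads that overlap exons of more than one gene. The function name `count_reads_per_gene` and all coordinates are hypothetical.

```python
from collections import Counter


def count_reads_per_gene(exons, reads):
    """Count reads overlapping the exons of exactly one gene.

    exons -- list of (gene_id, chrom, start, end), 1-based inclusive
    reads -- list of (chrom, start, end) aligned read intervals, 1-based inclusive
    """
    counts = Counter({gene: 0 for gene, _, _, _ in exons})
    for chrom, r_start, r_end in reads:
        # Collect every gene with at least one exon overlapping this read.
        hits = {
            gene
            for gene, e_chrom, e_start, e_end in exons
            if e_chrom == chrom and r_start <= e_end and e_start <= r_end
        }
        if len(hits) == 1:  # skip reads with no gene or an ambiguous assignment
            counts[hits.pop()] += 1
    return dict(counts)


# Hypothetical annotation and alignments.
exons = [
    ("geneA", "chr1", 100, 200),
    ("geneA", "chr1", 300, 400),
    ("geneB", "chr1", 380, 500),
]
reads = [
    ("chr1", 150, 199),  # inside the first geneA exon
    ("chr1", 190, 310),  # touches both geneA exons
    ("chr1", 390, 420),  # overlaps geneA and geneB: ambiguous, not counted
    ("chr1", 450, 480),  # geneB only
    ("chr2", 150, 199),  # different chromosome, not counted
]
print(count_reads_per_gene(exons, reads))
```

This linear scan over all exons is fine for a toy example; real tools index the annotation so that each read is only compared with nearby features.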
Step #3: Quality assessment
Bias can arise during RNA extraction, sample preparation, library construction, sequencing and read mapping. Therefore, the quality of data should be evaluated during these various stages. The raw sequence data should be checked against numerous parameters, such as the sequence diversity of reads, adapter contamination, base qualities and nucleotide composition. After aligning the reads, additional parameters should be assessed, including percentage of reads mapped to the transcriptome, percentage of reads with a mapped mate pair and the chromosomal distribution of reads.
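A few of the raw-read checks mentioned above can be sketched as follows. This is a minimal illustration that assumes four-line FASTQ records and Phred+33 quality encoding; the function names, the summary keys and the short adapter string in the usage example are hypothetical placeholders, not values from the text. Dedicated quality control tools report many more metrics than this.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def qc_summary(handle, adapter, phred_offset=33):
    """Summarise base quality, nucleotide composition and adapter contamination."""
    n_reads = n_bases = gc = qual_sum = with_adapter = 0
    for _, seq, qual in read_fastq(handle):
        n_reads += 1
        n_bases += len(seq)
        gc += sum(base in "GC" for base in seq.upper())
        # Convert each quality character to its Phred score.
        qual_sum += sum(ord(ch) - phred_offset for ch in qual)
        with_adapter += adapter in seq
    if n_reads == 0 or n_bases == 0:
        raise ValueError("no reads found")
    return {
        "reads": n_reads,
        "mean_base_quality": qual_sum / n_bases,
        "gc_fraction": gc / n_bases,
        "adapter_fraction": with_adapter / n_reads,
    }


# Two made-up reads; "AGATCG" stands in for a real adapter sequence.
example = "@r1\nGGCC\n+\nIIII\n@r2\nATAGATCG\n+\n!!!!!!!!\n"
print(qc_summary(io.StringIO(example), adapter="AGATCG"))
```

Passing an open file handle instead of `io.StringIO` would apply the same summary to a FASTQ file on disk.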
Step #4: Expression analysis
To detect transcripts that are differentially expressed across conditions, a variety of statistical methods have been designed, each making specific assumptions about RNA sequencing data. Allele-specific expression can be detected by counting the number of reads containing each allele and applying a statistical test. RNA sequencing technologies can also be used to analyse expression quantitative trait loci (eQTLs) – genetic loci that explain a fraction of gene expression variation. The goal of eQTL analysis is to uncover underlying biological processes and to discover genetic variants that cause disease. Many eQTL mapping methods have been applied to RNA sequencing data, enhancing the ability to identify causal variants.
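As one possible example of "counting the number of reads containing each allele and applying a statistical test", the sketch below uses an exact two-sided binomial test against the assumption that both alleles are expressed equally (an expected reference fraction of 0.5). The choice of test, the function name `allelic_imbalance_p` and the read counts are assumptions for illustration; the text does not specify which test is used, and published methods often account for extra sources of variation such as mapping bias.

```python
from math import comb


def allelic_imbalance_p(ref_count, alt_count, p_ref=0.5):
    """Exact two-sided binomial p-value for allelic imbalance at one heterozygous site.

    ref_count, alt_count -- reads carrying the reference / alternative allele
    p_ref                -- expected fraction of reference reads under balance
    """
    n = ref_count + alt_count
    if n == 0:
        raise ValueError("no reads cover this site")
    # Probability of every possible reference-read count from 0 to n.
    probs = [comb(n, k) * p_ref**k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = probs[ref_count]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one
    # (small relative tolerance guards against floating-point ties).
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-7)))


# Hypothetical counts at a heterozygous site.
print(allelic_imbalance_p(8, 2))
print(allelic_imbalance_p(5, 5))
```

This direct enumeration is only suitable for modest read depths; for deep coverage, a statistics library routine such as `scipy.stats.binomtest` would be a more robust choice.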
Overall, RNA sequencing has a wide variety of applications, and no single analysis pipeline can be used in every case. Experimental design, quality control, read alignment, differential gene expression, alternative splicing and eQTL mapping steps all differ depending on the application of the sequencing data and the RNA species being analysed. Therefore, scientists plan experiments and adopt different analysis strategies based on the organism being studied and their research goals.
Image credit: Spectrum