
Shedding Light on the ‘Dark Proteome’

Written by Kirsty Oswald, Science Writer 

Omics-based studies have revealed that genomes, including our own, contain a vast number of unannotated open reading frames (ORFs), meaning that we may have significantly underestimated their protein-coding potential. 

These stretches of DNA have the potential to encode ‘hidden proteins’, in particular microproteins, which could have important biological activity. But the function of most of these proteins remains uncharacterised and they are generally under-recognised in the scientific community. Hence, collectively, they have gained the moniker the ‘dark proteome’. 

Research, particularly mass spectrometry-based proteomics and ribosome profiling, has helped increase recognition of the dark proteome and is also changing our understanding of fundamental gene biology.

In a recent paper in Trends in Cell Biology, Bradley Wright, from the University of Texas, Dallas, and colleagues sum up the current state of knowledge in this emerging area.

Figure from Wright et al. Trends Cell Biol 2022;32:243-258. Licensed under CC BY-NC-ND 4.0. 

Short ORFs 

Short ORFs (sORFs) are an important example of noncanonical ORFs. These encode proteins that may have only 40 or 50 amino acids in their sequence and, historically, were thought unlikely to be functional. But the past few years have revealed important roles for sORFs in physiology. The microproteins they encode are well suited to regulatory functions, such as forming regulatory subunits of protein complexes or acting as signalling molecules.
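To make the idea of an sORF concrete, the short Python sketch below scans one strand of a sequence for ATG-to-stop reading frames and keeps only the short ones. It is an illustration only, not the method used by Wright et al.: the function name `find_sorfs`, the 100-amino-acid cut-off and the restriction to AUG start codons and the standard stop codons are all assumptions made for this example.

```python
# Illustrative sketch only: names and thresholds are assumptions, not from the paper.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def find_sorfs(seq, max_aa=100):
    """Return (start, end, n_aa) for short ATG-to-stop ORFs on the given strand.

    start/end are 0-based, end-exclusive positions (end includes the stop codon).
    Only the first ATG in each open frame is reported; ORFs with no stop codon
    before the end of the sequence are ignored.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a new candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3  # residues encoded, stop codon excluded
                if n_aa <= max_aa:
                    hits.append((start, i + 3, n_aa))
                start = None  # close the ORF and keep scanning this frame
    return sorted(hits)


print(find_sorfs("CCATGGCTTAAGG"))  # [(2, 11, 2)]
```

Real pipelines also consider the reverse strand, non-AUG start codons and experimental evidence such as ribosome profiling, none of which this toy scan attempts.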

Paradoxically, a particularly rich source of microprotein-encoding sORFs has been long noncoding RNA (lncRNA; Fig 1A). Examples of lncRNA-located sORFs illustrate that transcripts can have dual noncoding and coding functions.

Another instance in which RNA thought to be noncoding has proved otherwise is that of circular RNAs (circRNA; Fig 1B). These are closed-loop RNA molecules that, lacking the 5’ cap critical for translation, were assumed to be noncoding. However, it has emerged that they can encode functioning microproteins thanks to cap-independent initiation mechanisms.

Untranslated regions 

sORFs in the 5’-UTR are known as upstream ORFs (uORFs; Fig 1C). The canonical start codon in mammals is usually AUG; in uORFs, an additional start codon sits upstream of it, which can produce a reading frame that overlaps the canonical ORF. uORFs have been recognised for decades as important regulatory elements that can fine-tune the translation of downstream coding sequences but, more recently, examples have emerged of them producing functional peptides.

sORFs derived from the 3’-UTR, known as downstream ORFs (dORFs; Fig 1C), are less common, and whether they encode functional microproteins is yet to be established.

Overlapping ORFs 

Overlapping ORFs (Fig 1D) are another example of noncanonical ORFs, in which alternative start codons lie upstream or downstream of the canonical AUG start codon. For example, ORFs have been found fully nested within annotated coding sequences, and these have been linked to human disease. In-frame overlapping leads to truncated or extended versions of the canonical protein with altered function, while out-of-frame overlapping creates proteins with different functions.
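The in-frame versus out-of-frame distinction comes down to simple arithmetic: codons are three nucleotides long, so an alternative start codon shares the canonical reading frame only if its distance from the canonical AUG is a multiple of three. The sketch below illustrates this under simplifying assumptions. The function name `classify_alt_start` is hypothetical, both positions are assumed to be coordinates on the same mature transcript, and the "extended" label assumes there is no stop codon between the upstream start and the canonical AUG.

```python
# Illustrative sketch only: the function name and labels are assumptions.
def classify_alt_start(canonical_start, alt_start):
    """Classify an alternative start codon relative to the canonical AUG.

    Both arguments are nucleotide positions on the same mature transcript.
    """
    offset = alt_start - canonical_start
    if offset == 0:
        return "canonical"
    if offset % 3 != 0:
        # Shifted frame: every codon differs, so the product is a different protein.
        return "out-of-frame"
    # Same frame: translation reads the same codons as the canonical ORF.
    # An upstream start adds residues (assuming no intervening stop codon);
    # a downstream start drops residues from the N-terminus.
    return "in-frame, extended" if offset < 0 else "in-frame, truncated"


print(classify_alt_start(100, 94))   # in-frame, extended
print(classify_alt_start(100, 130))  # in-frame, truncated
print(classify_alt_start(100, 101))  # out-of-frame
```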

Only the beginning 

The researchers highlight that recent advances have transformed our understanding of translation in eukaryotes, but that much more is still to come. The many examples discovered so far, showing the roles of noncanonical coding sequences in humans, “should emphasize that we still do not have an accurate estimation of the protein-coding capacity of the human genome,” the team write. As further research is completed, they predict that more exciting discoveries about translation phenomena will come to the forefront.

Image Credit: Canva

