Image credit: Ardern, Wei and Nelson. 2021. ‘SARS-CoV-2: don’t ignore non-canonical genes’

Categories: Sanger Science

10 August 2023

What is a gene?

A gene is a string of DNA letters that encodes a protein - one of the building blocks of our cells and bodies. One gene encodes one protein. Simple, right? But scratch the surface and look into the molecular details, and it soon gets complicated. Ongoing research, including work by scientists at the Wellcome Sanger Institute, is unpacking some of this complexity across diverse organisms - and viruses - and helping to reshape what we mean by a “gene”.

Zachary Ardern, a Sanger Epidemiological and Evolutionary Dynamics (SEED) postdoctoral fellow, aims to understand the fundamental questions of biology - “what is a gene?” and “what does it do?”. Here, he introduces us to his work and explores how the answers to these questions will impact research on evolution, COVID and beyond.

Biologists have long known that all organisms have DNA sequences that do not encode proteins but still produce functional RNA molecules, called non-coding RNAs. There are several different kinds of non-coding RNA across different organisms, but many are poorly understood - and whether they should be called “genes” is unclear. Sometimes a sequence can even have one function as an RNA and a different function when turned into protein; these are called “dual-coding” or “dual-functioning” RNAs(1).

How DNA is transcribed into RNA and then translated into protein

Diagram of how DNA is transcribed into a single messenger RNA strand that is then translated into a string of amino acids to make a protein

Image credit: yourgenome.org

Examples of different types of RNA

Diagram showing a coding RNA strand that is translated into an amino acid sequence to produce a protein, and two non-coding RNA strands: ribosomal RNA, which sits inside ribosomes in the cell and helps them do their job, and transfer RNA, whose shape allows it to grab amino acids and line them up along messenger RNA strands to make a protein

Image credit: yourgenome.org

Less well known than the existence of non-coding RNAs is that a single string of DNA can encode more than one gene - overlapping genes. This is possible because protein instructions are read from the nucleotides in triplets of letters; for example, the DNA letters ‘ATG’ encode the protein letter methionine. Shifting the starting point by one or two nucleotides, or reading the sequence from the opposite DNA strand, results in a completely different protein sequence. Overlapping genes were first discovered in virus genomes in the 1970s, and recent work has shown that they also exist in bacteria and beyond(2–4). As well as forming stable overlapping genes, such “alternative frame sequences” can be incorporated into new proteins in a range of different ways by evolutionary processes(5).

Overlapping genes example

Diagram of the two strands of DNA, showing how two genes are coded for in one direction on the top strand, and how two genes are coded for in exactly the same stretch of DNA by transcribing the second strand of DNA in the opposite direction

Image credit: Vanderhaeghen S et al. 2018. Scientific Reports 8 (1): 17875.
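
The effect of shifting the reading frame can be tried out in a few lines of Python. The sketch below is only an illustration, not a method used in the research described here: the example sequence is invented, the function names are hypothetical, and the standard genetic code is assumed. It reads the same stretch of DNA in all six possible frames, three on each strand.

```python
from itertools import product

# Standard genetic code, with amino acids listed in T, C, A, G order for each
# codon position ("*" marks a stop codon). Assumes uppercase A/C/G/T input only.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read in its own direction."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset=0):
    """Read the DNA in triplets, starting `offset` letters in."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    """Translate all three frames on each of the two strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq, offset)
    return frames


# Invented example sequence, not taken from any real genome.
EXAMPLE = "ATGGCCATTGTAATGGGCCGCTGA"

for name, protein in six_frames(EXAMPLE).items():
    print(name, protein)
```

For this made-up sequence, frame +1 reads MAIVMGR* while frame +2, shifted by a single letter, reads WPL*WAA: the same DNA letters, but a completely different string of protein letters.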

It is difficult to work out which nucleotide letters encode proteins. The bacterial genomes that my research is focused on contain a wealth of hidden small proteins, which remain largely undiscovered(6). Even for SARS-CoV-2, the extremely well-studied virus that causes COVID-19, we surprisingly don’t know how many genes are encoded, and a number of sequences have an ambiguous status somewhere between genes and non-genes(7). A PhD student at the Sanger Institute, IChing Tseng, is extending a line of research on overlapping genes in the virus which was started early in the pandemic(8). Understanding how these sequences evolve could bring new insights into how the virus interacts with human immune systems, and into the differences between the variants of SARS-CoV-2 that have evolved(9).

How two different proteins are coded for in the same messenger RNA strand in SARS-CoV-2

Diagram showing how one strand of messenger RNA can produce two completely different proteins simply by starting translation into protein just one RNA base further along

Image credit: Ardern, Wei and Nelson. 2021. 'SARS-CoV-2: don't ignore non-canonical genes'
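
A naive scan for candidate coding stretches gives a feel for the problem. The following Python sketch is a deliberately simple illustration under stated assumptions, not a real gene-finding method: it looks only at the forward strand, treats every ‘ATG’ as a possible start, uses the three standard stop codons, and runs on an invented sequence. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """List (frame, start, end) for each ATG-to-stop stretch on the forward strand.

    `end` is the position just after the stop codon. Only the first ATG after
    the previous stop is reported in each frame.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate at the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # close the candidate and keep scanning
    return orfs


# Invented example sequence, not taken from SARS-CoV-2 or any real genome.
EXAMPLE = "ATGAATGGCCTAAGCTGA"

print(find_orfs(EXAMPLE))
```

On this toy sequence the scan reports two candidates: one spanning positions 0 to 18 in the first frame, and a shorter one spanning positions 4 to 13 in the second frame, sitting entirely inside the first. A scan like this only lists possibilities; it cannot say which of them, if any, is actually made into protein.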

The concept of sequences of ambiguous coding status is gaining increasing attention across diverse biological systems, under names such as the “ghost proteome”(10), the “dark proteome”(11) or the “noncanonical translatome”(12–14). In the human genome, researchers have shown that these sequences play a role in cancer(15–18). Relatively little work has been done in bacteria(19,20), where my work is focused.

A related area of research is investigating what these non-canonical proteins, and other poorly understood genes, actually do in the cell. What, if anything, is their function? This is a question that can increasingly be addressed with the kinds of high-throughput biological data which the Sanger Institute excels at producing. With the aid of new methods in artificial intelligence, the massive datasets available can be leveraged for new functional insight. This includes studying bacterial proteins, where often a quarter or more of the genes in a genome have no known function(21).

The apparently very basic questions of “what is a gene?” and “what does it do?” remain under active investigation, and will have implications for some of the most important topics in biological research.

Find out more

References

1. Neuhaus, K. et al. Differentiation of ncRNAs from small mRNAs in Escherichia coli O157:H7 EDL933 (EHEC) by combined RNAseq and RIBOseq - ryhB encodes the regulatory RNA RyhB and a peptide, RyhP. BMC Genomics 18: 216 (2017).
2. Kreitmeier, M. et al. Spotlight on alternative frame coding: Two long overlapping genes in Pseudomonas aeruginosa are translated and under purifying selection. iScience 25: 103844 (2022).
3. Ardern, Z., Neuhaus, K. & Scherer, S. Are Antisense Proteins in Prokaryotes Functional? Front Mol Biosci 7: 187 (2020).
4. Wright, B. W., Molloy, M. P. & Jaschke, P. R. Overlapping genes in natural and engineered genomes. Nat. Rev. Genet. 23: 154–168 (2022).
5. Ardern, Z. Alternative Reading Frames are an Underappreciated Source of Protein Sequence Novelty. J. Mol. Evol. (2023) doi:10.1007/s00239-023-10122-3.
6. Ardern, Z. Small proteins: overcoming size restrictions. Nat. Rev. Microbiol. 20: 65 (2022).
7. Jungreis, I. et al. Conflicting and ambiguous names of overlapping ORFs in the SARS-CoV-2 genome: A homology-based resolution. Virology 558: 145–151 (2021).
8. Nelson, C. W. et al. Dynamically evolving novel overlapping gene as a factor in the SARS-CoV-2 pandemic. Elife 9: (2020).
9. Ardern, Z., Wei, X. & Nelson, C. W. SARS-CoV-2: don’t ignore non-canonical genes. Virological https://virological.org/t/sars-cov-2-dont-ignore-non-canonical-genes/740/2 (2021).
10. Cardon, T., Fournier, I. & Salzet, M. Shedding Light on the Ghost Proteome. Trends Biochem. Sci. 46: 239–250 (2021).
11. Wright, B. W., Yi, Z., Weissman, J. S. & Chen, J. The dark proteome: translation from noncanonical open reading frames. Trends Cell Biol. 32: 243–258 (2022).
12. Wacholder, A. et al. A vast evolutionarily transient translatome contributes to phenotype and fitness. Cell Syst 14: 363–381.e8 (2023).
13. Rich, A., Acar, O. & Carvunis, A.-R. Exploring the noncanonical translatome using massively integrated coexpression analysis. bioRxiv 2023.03.16.533058 (2023) doi:10.1101/2023.03.16.533058.
14. Ardern, Z. & Uz-Zaman, M. H. Between noise and function: Toward a taxonomy of the non-canonical translatome. Cell Syst 14: 343–345 (2023).
15. Othoum, G. & Maher, C. A. CrypticProteinDB: an integrated database of proteome and immunopeptidome derived non-canonical cancer proteins. NAR Cancer 5: zcad024 (2023).
16. Posner, Z., Yannuzzi, I. & Prensner, J. R. Shining a light on the dark proteome: Non-canonical open reading frames and their encoded MiniProteins as a new frontier in cancer biology. Protein Sci. e4708 (2023).
17. Ouspenskaia, T. et al. Unannotated proteins expand the MHC-I-restricted immunopeptidome in cancer. Nat. Biotechnol. 40: 209–217 (2022).
18. Hofman, D. A. et al. Translation of non-canonical open reading frames as a cancer cell survival mechanism in childhood medulloblastoma. bioRxiv (2023) doi:10.1101/2023.05.04.539399.
19. Smith, C. et al. Pervasive translation in Mycobacterium tuberculosis. Elife 11: (2022).
20. Zehentner, B., Ardern, Z., Kreitmeier, M., Scherer, S. & Neuhaus, K. Evidence for Numerous Embedded Antisense Overlapping Genes in Diverse E. coli Strains. bioRxiv 2020.11.18.388249 (2020) doi:10.1101/2020.11.18.388249.
21. Ardern, Z., Chakraborty, S., Lenk, F. & Kaster, A.-K. Elucidating the functional roles of prokaryotic proteins using big data and artificial intelligence. FEMS Microbiol. Rev. 47: (2023).