
How to: name a gene

A recent article in Nature Genetics has set out the HUGO Gene Nomenclature Committee (HGNC) guidelines for human gene nomenclature. 

The first guidelines for human gene nomenclature were published in 1979. Several revisions have since been published to reflect major changes and the growth in knowledge and data. To date, the HGNC have named over 40,000 human loci, half of which are protein-coding genes. Researchers have also made substantial progress in naming other regions of the genome, including different classes of RNA genes and pseudogenes.

With evolving technology and new functional information, stable gene symbols are critical. In the Nature Genetics article, the HGNC present guidelines and recommendations for gene nomenclature. They note, however, that exceptions may be considered if submitters provide sufficient evidence for deviating from these guidelines.

A summary of the guidelines is below:

Gene naming

  • Gene symbols are short, memorable and pronounceable. They should convey something about the character or function of the gene products.
  • All genes are assigned a unique symbol, HGNC ID and descriptive name.
  • Gene symbols must:
    • Only contain uppercase Latin letters and Arabic numerals
    • Not be offensive or pejorative
    • Avoid being the same as a commonly used abbreviation
    • Not reference species or use ‘G’ for gene
  • HGNC do not routinely name isoforms. If authors use their own isoform annotation, they should clearly denote this annotation and then quote the HGNC symbol for that gene.
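
The symbol rules above lend themselves to a quick automated check. The following is a minimal sketch in Python, not an HGNC tool: the function name `check_symbol` and the `common_abbreviations` argument are hypothetical, and it covers only the character rule and the abbreviation rule as summarised here. Approved symbols may include exceptions that this sketch does not model, and rules such as "not offensive" still need human judgement.

```python
import re
from typing import Iterable, List

# Rule from the summary: uppercase Latin letters and Arabic numerals only.
ALLOWED_CHARACTERS = re.compile(r"[A-Z0-9]+")


def check_symbol(symbol: str, common_abbreviations: Iterable[str] = ()) -> List[str]:
    """Return the rules a proposed symbol appears to break (empty list if none).

    `common_abbreviations` is a caller-supplied, illustrative list; the HGNC
    do not publish such a list in this summary.
    """
    problems = []
    if not ALLOWED_CHARACTERS.fullmatch(symbol):
        problems.append("contains characters other than A-Z and 0-9")
    if symbol in set(common_abbreviations):
        problems.append("matches a commonly used abbreviation")
    return problems


print(check_symbol("KLF1"))
print(check_symbol("klf1"))
print(check_symbol("DNA", common_abbreviations={"DNA", "RNA"}))
```

An empty list means only that no problem was detected by these two checks, not that the HGNC would approve the symbol.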

Gene naming by biotype

  • Protein-coding genes:
    • Named on the basis of a key normal function of the gene product.
    • Gene group members should be designated by Arabic numerals placed immediately after the root symbol, e.g. KLF1, KLF2 and KLF3. Some large families may include a variety of number and letter combinations to indicate subgroupings.
    • In the absence of functional data, naming may be based on: (1) Recognised structural domains and motifs, (2) Homologous genes within the human genome, (3) Homologous genes from another species and (4) The presence of an open reading frame.
  • Pseudogenes:
    • HGNC only name pseudogenes that retain homology to a substantial proportion of the functional ancestral gene.
    • Most pseudogenes are named after a specific parent gene, e.g. DPP3P1, ‘DPP3 pseudogene 1’.
    • Pseudogenes that retain most of the coding sequence, as compared with other family members, are named as a new family member with a ‘P’ suffix, e.g. CBWD4P, ‘COBW domain containing 4, pseudogene’. This naming format is also used for genes that are pseudogenised relative to their functional ortholog in another species.
  • Non-coding RNA genes:
    • Named according to their RNA type.
    • Long non-coding RNAs, wherever possible, are named on the basis of a key function or characteristic.
  • Readthrough transcripts:
    • HGNC only name those that are consistently annotated by the RefSeq annotators at the National Center for Biotechnology Information (NCBI) and the GENCODE annotators at Ensembl.
  • Gene segments:
    • For specific complex loci, HGNC will assign symbols to individual segments, solely on the basis of community request.
  • Genomic regions:
    • HGNC no longer routinely provide symbols for genomic regions.
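
The naming patterns for gene family members and pseudogenes can be sketched as simple string construction. This is an illustration only, based on the examples quoted in this summary: the function names are hypothetical, and actual symbols are assigned by the HGNC rather than generated mechanically.

```python
from typing import Tuple


def family_symbol(root: str, number: int) -> str:
    """Root symbol followed immediately by an Arabic numeral, e.g. KLF + 2."""
    return f"{root}{number}"


def pseudogene_of_parent(parent_symbol: str, number: int) -> Tuple[str, str]:
    """Symbol and name for a pseudogene named after a specific parent gene."""
    symbol = f"{parent_symbol}P{number}"
    name = f"{parent_symbol} pseudogene {number}"
    return symbol, name


def pseudogenised_family_member(member_symbol: str, member_name: str) -> Tuple[str, str]:
    """Symbol and name for a pseudogene named as a new family member ('P' suffix)."""
    return f"{member_symbol}P", f"{member_name}, pseudogene"


print(family_symbol("KLF", 2))
print(pseudogene_of_parent("DPP3", 1))
print(pseudogenised_family_member("CBWD4", "COBW domain containing 4"))
```

The reverse direction is not safe: a symbol ending in ‘P’ is not necessarily a pseudogene, so this sketch builds names rather than trying to classify existing ones.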

Other considerations

Every HGNC gene record has a status. Symbols that are no longer in use have the status ‘entry withdrawn’. To avoid confusion, researchers should not use symbols with this status.

When naming across vertebrates, HGNC recommend that orthologous genes should have the same symbol. The Vertebrate Gene Nomenclature Committee, an extension of the HGNC, assign standardized names to genes in vertebrate species that currently lack a nomenclature committee.

The HGNC endorse the use of italics to denote genes, alleles and RNAs to distinguish them from proteins. They advise authors to quote the approved gene symbol for all genes at least once in the abstract of any publication to avoid ambiguity. Genes with an approved symbol also have a unique HGNC ID in the format HGNC:number, e.g. BRAF, HGNC:1097.
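
Because the HGNC ID has a fixed format, it is easy to validate in analysis scripts. Below is a minimal Python sketch, assuming only the HGNC:number format described above; the function name `parse_hgnc_id` is hypothetical, and the check confirms the format only, not that the ID exists in the HGNC database.

```python
import re
from typing import Optional

# "HGNC:" followed by one or more ASCII digits, e.g. HGNC:1097.
HGNC_ID = re.compile(r"HGNC:([0-9]+)")


def parse_hgnc_id(text: str) -> Optional[int]:
    """Return the numeric part of a well-formed HGNC ID, or None otherwise."""
    match = HGNC_ID.fullmatch(text.strip())
    return int(match.group(1)) if match else None


print(parse_hgnc_id("HGNC:1097"))
print(parse_hgnc_id("BRAF"))
```

Storing the ID alongside the symbol gives a stable key if the symbol is later updated.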

The HGNC aim to minimise symbol changes but acknowledge that some updates will still be appropriate. For example, symbols that Microsoft Excel automatically converts to dates have been changed (e.g. MARCH1 is now MARCHF1). The HGNC consider such requests on a case-by-case basis, with community consultation.
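
Spreadsheet software such as Excel can silently turn a symbol like MARCH1 into a date, which is why such symbols were changed. The sketch below flags symbols in a gene list that look date-like before they are opened in a spreadsheet. It is a rough heuristic under stated assumptions: the function name and the month pattern are hypothetical and do not reproduce Excel's actual parsing rules.

```python
import re

# Heuristic: an English month name or abbreviation followed by one or two digits.
DATE_LIKE = re.compile(
    r"(JAN|FEB|MAR|MARCH|APR|APRIL|MAY|JUN|JUNE|JUL|JULY|AUG|SEP|SEPT|OCT|NOV|DEC)"
    r"[0-9]{1,2}"
)


def looks_like_date(symbol: str) -> bool:
    """Return True if the symbol resembles something a spreadsheet may read as a date."""
    return DATE_LIKE.fullmatch(symbol.upper()) is not None


symbols = ["MARCH1", "MARCHF1", "BRAF"]
flagged = [s for s in symbols if looks_like_date(s)]
print(flagged)
```

Any flagged symbol is worth checking against the current HGNC record, since an outdated symbol may already have an approved replacement.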

Image credit: https://www.flaticon.com/authors/freepik from https://www.flaticon.com  

