- The Answer Depends on the Definition of a Gene
- Originally, the Term Gene Meant Protein-Coding mRNA, But Now it is Expanding to Include Long Non-Coding RNA
- Comparative Genomics Puts the Human Gene Count Between That of a Chicken and a Grape
The title of this blog is a seemingly simple rhetorical question. One might think it ought to have a definite answer, given the fact that it’s been roughly 20 years since the human genome was first sequenced. Despite this and the significant advances in sequencing since then, the question has remained one of interest for more than 50 years, as told by James D. Watson in his book titled DNA: The Story of the Genetic Revolution, which inspired me to select this blog topic.
In researching the literature, I found numerous publications by Steven L. Salzberg, a Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University School of Medicine in Baltimore, MD. One of the many research interests listed on Salzberg’s website is “the development of new computational methods for analysis of DNA from the latest sequencing technologies…and applied software to many problems in gene finding, genome assembly, comparative genomics, [and] evolutionary genomics…”. Below is some background information on gene finding that I selected from Salzberg’s published work, followed by his recent results for the number of human genes, and then some comparative genomic facts that might surprise you.
A used postage stamp from the UK commemorating the completion of the Human Genome Project, circa 2003.
According to Salzberg and coworkers, in the decade preceding the initial publication of the human genome, multiple estimates of the number of human genes were made based on sequencing of short messenger RNA (mRNA) fragments. Most of these estimates fell in the range of 50,000–100,000 genes. When the human genome was published in 2001, the estimates of the gene count were dramatically lower: the public effort by the International Human Genome Sequencing Consortium (IHGSC) reported 31,000 genes, while the private effort led by Craig Venter at Celera reported 26,588 genes plus ~12,000 genes with “weak supporting evidence.”
As the genome became gradually more complete and the annotation was improved, the number of human genes continued to fall: when the first major genome update was published by the IHGSC in 2004, the estimated gene count was revised to 24,000. Later efforts suggested that the true number of protein-coding genes was even smaller: a 2007 comparative genomics analysis suggested 20,500 genes, and a proteomics-based study in 2014 estimated 19,000 genes. Salzberg and coworkers further note that “[o]ne striking feature of most early attempts to catalog all human genes was their lack of precision. Most estimates have only one to two significant digits, indicating major uncertainty about the exact number.” For example, a gene count reported as 20,000 could carry an uncertainty anywhere from ±200 to ±2,000.
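To put the downward trend in perspective, here is a minimal Python sketch that tabulates the estimates quoted above and expresses each one relative to the 2001 IHGSC figure. The choice of that figure as the baseline, the labels, and the variable names are my own assumptions for illustration; the numbers are the ones cited in the text.

```python
# Published estimates of the human gene count, as quoted in the text above.
# Labels and layout are illustrative; Celera's additional ~12,000 genes with
# "weak supporting evidence" are deliberately left out of its figure.
estimates = [
    (2001, "IHGSC draft genome", 31_000),
    (2001, "Celera draft genome", 26_588),
    (2004, "IHGSC genome update", 24_000),
    (2007, "comparative genomics analysis", 20_500),
    (2014, "proteomics-based study", 19_000),
]

# Assumption: use the 2001 IHGSC estimate as the reference point.
baseline = estimates[0][2]


def percent_change(count, reference=baseline):
    """Return the change of `count` relative to `reference`, in percent."""
    return 100.0 * (count - reference) / reference


for year, source, count in estimates:
    print(f"{year}  {source:<32} {count:>7,}  {percent_change(count):+6.1f}%")
```

By this measure, the 2014 estimate sits a bit less than 40% below the 2001 IHGSC figure.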
Original vs. Current Definition of a Gene
In the past, textbooks typically defined a gene as a segment of DNA that is transcribed into mRNA, which is in turn translated into the genetically encoded protein, i.e. the “central dogma of molecular biology” attributed to Francis Crick in 1957. During the Human Genome Project about 40 years later, most efforts to estimate and annotate genes still focused on protein-coding genes. At that time, most scientists assumed that non-coding genes represented only a very small portion of the functional elements of the human genome, and that most RNA genes (e.g., transfer RNAs and ribosomal RNA genes) were already known.
Taken from commons.wikimedia.org and free to use.
However, a few years after the initial publication of the human genome, scientists began to uncover a large and previously unappreciated complement of long non-coding RNA (lncRNA) genes, depicted here, which quickly grew to include thousands of novel genes. These genes have a wide range of functions that are just as vital to human biology as many protein-coding genes, “and any comprehensive list of human genes should include them,” according to Salzberg and coworkers. The importance of lncRNA can be appreciated by consulting a paper by Kung et al. titled Long Noncoding RNAs: Past, Present, and Future.
Why Gene Count Matters
Before getting to the latest count of human genes that includes mRNA and lncRNA, it is worth recognizing, according to Salzberg and coworkers, why “the human gene list has a tremendous impact on biomedical research.” A huge and still growing number of genetic studies depend on this list, including:
- Exome sequencing projects use exon capture kits that target most “known” exons. Any exons that are not listed in standard human annotation are ignored.
- Genome-wide association studies (GWAS) attempt to link genetic variants to nearby genes, relying on standard catalogs of human genes.
- Many software packages that analyze RNA sequencing (RNA-seq) experiments, which measure gene expression, rely on a database of known genes and cannot measure genes or splice variants unless they are included in the database.
- Efforts to identify cancer-causing mutations usually focus on mutations that involve known genes, ignoring mutations that occur in other regions.
They add that these and other examples encompass thousands of experiments and an enormous investment of time and effort. The creation of a more complete, accurate human gene catalog has an impact on many of these studies. For example, exome sequencing studies targeting Mendelian diseases, which should be the easiest diseases to solve, have reported diagnostic success in only about 25% of cases, perhaps because many exons and genes are excluded from exome capture kits, according to Salzberg and coworkers. They conclude, therefore, that “[a] better gene list may also help explain the genetic causes of the many complex diseases that have thus far remained largely unexplained, despite hundreds of large GWAS and other experiments.”
For Salzberg and coworkers, the operational definition of a “gene” includes “any interval along the chromosomal DNA that is transcribed and then translated into a functional protein, or that is transcribed into a functional RNA molecule.” By ‘functional,’ they “mean to include any gene that appears to perform a biological function, even one that might not be essential.” They acknowledge that the proper determination of function can be a lengthy, complex process, and that at present, the function of many human genes is unknown or only partially understood. Importantly, their definition intentionally excludes pseudogenes, which are gene-like sequences that may arise through DNA duplication events, as depicted here, or through reverse transcription of processed mRNA transcripts.
Taken from commons.wikimedia.org and free to use.
In summary, the total gene count for Salzberg and coworkers corresponds to the “total number of distinct chromosomal intervals, or loci, that encode either proteins or noncoding RNAs; in addition, [they] report the total number of gene variants, which includes all alternative transcripts expressed at each locus.” This alternative splicing is shown here as a representative example.
Taken from commons.wikimedia.org and free to use.
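The difference between counting genes (distinct loci) and counting gene variants (all transcripts at those loci) can be made concrete with a small Python sketch. Everything in it is hypothetical: the records, the identifiers, and the biotype labels are invented toy data, and the biotype label is only a crude stand-in for the judgment of whether a locus is functional. It is not the procedure used by Salzberg and coworkers, just one way to picture their definition, including the exclusion of pseudogenes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    locus: str          # hypothetical locus identifier
    biotype: str        # "protein_coding", "lncRNA", or "pseudogene"
    transcript_id: str  # hypothetical transcript identifier


# Toy annotation, invented purely for illustration.
annotation = [
    Transcript("locus_A", "protein_coding", "A.1"),
    Transcript("locus_A", "protein_coding", "A.2"),  # alternative splice form
    Transcript("locus_A", "protein_coding", "A.3"),  # alternative splice form
    Transcript("locus_B", "lncRNA", "B.1"),
    Transcript("locus_C", "pseudogene", "C.1"),      # excluded by definition
    Transcript("locus_D", "lncRNA", "D.1"),
    Transcript("locus_D", "lncRNA", "D.2"),
]

# Assumption: only these biotypes count as genes in this toy model.
COUNTED_BIOTYPES = {"protein_coding", "lncRNA"}


def count_genes_and_variants(transcripts):
    """Return (number of distinct loci, number of transcripts) after
    dropping biotypes that the definition excludes."""
    kept = [t for t in transcripts if t.biotype in COUNTED_BIOTYPES]
    loci = {t.locus for t in kept}
    return len(loci), len(kept)


genes, variants = count_genes_and_variants(annotation)
print(f"genes (distinct loci): {genes}, gene variants (transcripts): {variants}")
```

In this toy data, three loci survive the filter and carry six transcripts between them, which is why a catalog's gene count and transcript count can differ so widely.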
CHESS, which stands for Comprehensive Human Expressed Sequences, is the name of the latest human gene catalog assembled by Salzberg and coworkers. CHESS is a new analysis of a large, comprehensive survey of gene expression in human tissues, i.e. the genotype-tissue expression (GTEx) study, which includes samples from dozens of tissues collected from hundreds of individuals. All of these samples were subjected to deep RNA-sequencing, with tens of millions of sequences (“reads”) captured from each sample, in an effort to measure gene expression levels across a broad range of human cell types.
This exceptionally large set of transcript data—just under 900 billion reads—provided an opportunity to construct a new set of human genes and transcripts. Analysis of this massive amount of transcript data was accomplished by assembling all of the samples, merging the results, and applying a series of computational filters to remove transcripts with insufficient evidence. The results are summarized in Table 1 shown here.
Taken from Pertea et al. Genome Biology 2018 19:208 © The Author(s). Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium.
Interested readers can consult Salzberg and coworkers for details and a full discussion, but for now, I will just touch on some of the essential findings. The new CHESS database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding. The database also contains a total of 323,258 transcripts. These totals include 224 novel protein-coding genes and 116,156 novel transcripts. Over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealed “a heretofore unappreciated amount of transcriptional noise in human cells.”
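As a quick sanity check on these figures, the following Python sketch confirms that the protein-coding and noncoding counts add up to the reported total, and derives a few ratios from them. The ratios (transcripts per gene, share of novel transcripts, share of noncoding genes) are my own back-of-the-envelope calculations from the quoted numbers, not statistics reported in this form by the authors.

```python
# CHESS figures as quoted in the text above.
protein_coding = 20_352
noncoding = 22_259
total_genes = 42_611
total_transcripts = 323_258
novel_transcripts = 116_156

# The two gene categories should account for the whole catalog.
assert protein_coding + noncoding == total_genes

# Derived ratios (illustrative calculations, not figures from the paper).
transcripts_per_gene = total_transcripts / total_genes
novel_share = novel_transcripts / total_transcripts
noncoding_share = noncoding / total_genes

print(f"average transcripts per gene: {transcripts_per_gene:.2f}")
print(f"share of transcripts that are novel: {novel_share:.1%}")
print(f"share of genes that are noncoding: {noncoding_share:.1%}")
```

On these numbers, each gene has between seven and eight transcripts on average, roughly a third of all transcripts are novel, and slightly more than half of the catalogued genes are noncoding.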
As previously argued by Palazzo and Lee, “although there are undoubtedly many more functional ncRNAs yet to be discovered and characterized, it is also likely that many of these transcripts are simply junk.” Salzberg and coworkers add that “the mere fact that a sequence is transcribed is insufficient evidence to conclude that it is a gene, despite the fact that early genomics studies made precisely that assumption. It appears instead that 95% of the transcribed locations in the human genome are merely transcriptional noise, explained by the nonspecific binding of RNA polymerase to random or very weak binding sites in the genome.” To me, these molecular events are formally analogous to the random digital signal error pictured here.
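For a rough feel of where a figure like 95% could come from, here is a hedged Python sketch using only numbers quoted in this post. It rests on two assumptions of mine: that the 42,611 catalogued gene loci and the "more than 650,000" noisy loci do not overlap, and that together they make up all transcribed locations. Because 650,000 is a lower bound, the result is a lower bound as well, and it should not be read as the authors' actual calculation.

```python
# Numbers quoted in this post.
catalogued_genes = 42_611         # loci retained as genes in CHESS
noise_loci_lower_bound = 650_000  # "more than 650,000" loci, so a lower bound

# Assumption: gene loci and noisy loci are disjoint and together
# make up every transcribed location.
transcribed_loci = catalogued_genes + noise_loci_lower_bound
noise_fraction = noise_loci_lower_bound / transcribed_loci

print(f"noise share of transcribed loci: at least {noise_fraction:.1%}")
```

Under these assumptions the noise share comes out a little under 94%, in the same ballpark as the 95% quoted above.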
After reading the aforementioned literature on the human gene count, you might wonder how it compares with the gene counts of other organisms. One reported answer surprised me: humans (~22,300) fall between a chicken (~16,700) and a grape (~30,400), which seems counterintuitive to me based on the relative complexity of each organism. These findings, and those for a wide variety of other organisms, are depicted here as a bar graph.
Taken from Pertea and Salzberg Genome Biology 2010 11:206 © BioMed Central Ltd and free to use via flickr.com.
Viruses such as influenza are among the simplest living entities. They have only a handful of genes but are exquisitely well adapted to their environments. Bacteria such as Escherichia coli have a few thousand genes, and multicellular plants and animals have two to ten times more than that. According to Pertea and Salzberg, who reported these data, “[b]eyond these simple divisions, the number of genes in a species bears little relation to its size or to intuitive measures of complexity.” However, the chicken and grape gene counts shown here are “based on draft genomes and may be revised substantially in the future,” they cautioned.
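The chicken, human, and grape comparison is easy to reproduce with a few lines of Python. The sketch below uses only the approximate counts quoted above and ranks the three organisms, expressing each count as a multiple of the human figure. The dictionary and variable names are illustrative, and the chicken and grape numbers carry the draft-genome caveat the authors mention.

```python
# Approximate gene counts as quoted in the text above.
gene_counts = {"chicken": 16_700, "human": 22_300, "grape": 30_400}

# Rank organisms from fewest to most genes.
ranked = sorted(gene_counts.items(), key=lambda item: item[1])

human = gene_counts["human"]
for species, count in ranked:
    print(f"{species:<8} {count:>7,}  {count / human:.2f}x human")
```

On these figures the grape has roughly a third more genes than a human, while the chicken has about three quarters as many.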
Frankly, I found it quite remarkable that the CHESS compilation revealed that a very large fraction (95%) of transcribed locations in the human genome leads to RNA that arises randomly and apparently does nothing. It is regarded as non-functional “junk.” To me, this is akin to meaningless molecular background “noise” that accompanies otherwise highly regulated expression of functional non-junk RNAs, which collectively lead to controlled cellular growth and differentiation, harmonious molecular “music,” if you will.
This coexistence of molecular “noise” and molecular “music” is worth pondering. What do you think?
As usual, your comments are welcomed.