A Comprehensive Landscape of Transcription Errors in Cells


  • Circular Sequencing (CirSeq) Finds Transcription Errors in RNA
  • CirSeq Found >100-Fold More Errors in RNA Compared to Replication Errors in DNA
  • CirSeq Can Elucidate How “Molecular Noise” Affects Cellular Function

Biological reactions are remarkably precise. For example, enzymes not only selectively bind the correct substrates from among complex mixtures of countless molecules, but also do so at the right time and location. This precision is especially important in the context of DNA replication (DNA → DNA), transcription (DNA → RNA), and translation (RNA → protein). These fundamental processes involving nucleic acids are collectively referred to as the Central Dogma of Molecular Biology, a set of principles attributed to the early writings of Francis Crick in 1958.

Together, these three processes preserve the integrity of our genome and ensure the faithful expression of our genetic code. However, all chemical and biochemical reactions are imperfect, which is to say that untoward reactions necessarily occur, even if only very infrequently. As a result, numerous studies have investigated the mechanisms that control the fidelity of DNA replication and translation; however, technical limitations have greatly handicapped efforts to investigate the fidelity of transcription. Unlike genetic mutations in DNA, transcription errors in RNA are transient, and are not stably inherited from cell to cell as is DNA. This transient nature of RNA errors makes them difficult to detect.

Conceptually, single-cell single-molecule sequencing of fragmented mRNA could be employed to analyze transcription errors; however, this approach faces multiple technical barriers. Chief among these is that the HeliScope, which was the first commercially available instrument reported for direct RNA sequencing (DRS), is no longer provided by Helicos Biosciences, as the company went out of business in 2012. In the DRS method, as shown elsewhere, polyadenylated and 3′-blocked RNA is captured on surfaces containing covalently bound poly(dT) oligonucleotide with the 3′ end facing “up.” Subsequent cycles of unblocking and extension with reversible terminators are, however, plagued by very short reads (~25 bases) and high error rates (3–5%), based on reported cDNA data.

Nanopore-based DRS is now possible. However, based on comments reported by Garalde et al., this nanopore approach also exhibits unacceptably high error rates.

An alternative way to measure the fidelity of transcription involves reverse-transcribing RNA into complementary DNA (cDNA), followed by conventional sequencing of cDNA. However, a crucial drawback of this strategy is that reverse transcriptase enzymes, which are mainly derived from retroviruses (family Retroviridae), are expected to make one error every ~10,000 to 30,000 bases. Since RNA polymerases are expected to make one error every ~300,000 bases, a standard cDNA library will always be dominated by reverse transcription errors that mask the errors made by RNA polymerases.
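To put rough numbers on that masking effect, here is a minimal Python sketch using only the error rates quoted above. It assumes, purely for illustration, that the two error sources are independent and rare, so the share of observed mismatches that are genuine transcription errors is simply one rate divided by the sum of both. The function and variable names are hypothetical.

```python
# Error rates quoted in the text, expressed as errors per base.
rt_error_rate_low = 1 / 30_000    # reverse transcriptase, most accurate case
rt_error_rate_high = 1 / 10_000   # reverse transcriptase, least accurate case
rnap_error_rate = 1 / 300_000     # RNA polymerase


def fraction_from_transcription(rt_rate, rnap_rate):
    """Share of observed cDNA mismatches that are true transcription errors.

    Assumes independent, rare errors, so the combined rate is the sum of both.
    """
    return rnap_rate / (rnap_rate + rt_rate)


best_case = fraction_from_transcription(rt_error_rate_low, rnap_error_rate)
worst_case = fraction_from_transcription(rt_error_rate_high, rnap_error_rate)

# Under these assumptions, only about 3% to 9% of mismatches in a standard
# cDNA library would trace back to the RNA polymerase.
print(f"{worst_case:.1%} to {best_case:.1%}")
```

In other words, with these figures roughly nine out of ten mismatches (or more) would be reverse transcription artifacts, which is why a single cDNA copy per RNA molecule is not informative enough.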

One solution to this problem is to reverse-transcribe the same mRNA molecule multiple times. For example, if multiple cDNA copies were made of a single mRNA molecule, then a true transcription error would be present at the same location in every cDNA copy of this molecule, whereas a reverse transcriptase error would appear in only one of these copies. This is the basic idea behind the “circle-sequencing” (CirSeq) assay reported by Acevedo and Andino in 2014. The assay’s name derives from the key step in the CirSeq protocol: circularization of mRNA.

As depicted here, transcription errors are identified by producing mRNA fragments, circularizing these with a ligase, and reverse transcribing the RNA circles into cDNAs for a DNA polymerase-mediated rolling-circle reaction. The resultant linear cDNA molecules are comprised of tandem repeats of the original RNA fragments. During this step, artifactual mutations may arise in the cDNA. The cDNA is then processed to generate a library, amplified, and sequenced. During this process, further artifacts may arise. However, because these artifacts are only present in one copy of the tandem repeats, they can be distinguished from true transcription errors, which are present in all tandem repeats.

Adapted from Kuznetsova et al., Nucleic Acids Res. 45, 5487–5500 (2017). Open Access.
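The logic of separating true transcription errors from artifacts can be sketched in a few lines of Python. This is only a toy illustration, not the published bioinformatic pipeline: it assumes the tandem repeats have already been split into copies of equal length and aligned to a known reference with no insertions or deletions. The function name, the example sequences, and the labels are all hypothetical.

```python
from typing import Dict, List


def classify_positions(reference: str, repeats: List[str]) -> Dict[int, str]:
    """Label each mismatching position across the tandem-repeat copies.

    A position is called "transcription_error" when every copy carries the
    same non-reference base, and "artifact" when only some copies mismatch.
    """
    calls = {}
    for pos, ref_base in enumerate(reference):
        bases = [copy[pos] for copy in repeats]
        if all(base == ref_base for base in bases):
            continue  # every copy agrees with the reference
        if len(set(bases)) == 1:
            # One shared non-reference base in all copies: the change was
            # already present in the RNA template.
            calls[pos] = "transcription_error"
        else:
            # Copies disagree: introduced after circularization.
            calls[pos] = "artifact"
    return calls


reference = "ACGTACGTAC"
repeats = [
    "ACGTTCGTAC",  # copy 1: A->T at position 4
    "ACGTTCGTAC",  # copy 2: same A->T at position 4
    "ACGTTCGAAC",  # copy 3: same A->T, plus a lone T->A at position 7
]

calls = classify_positions(reference, repeats)
print(calls)
```

Here the change at position 4 appears in all three copies and is treated as a transcription error, while the change at position 7 appears in only one copy and is treated as an artifact of reverse transcription, amplification, or sequencing.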

Polio virus

In 2014, Acevedo et al. reported use of CirSeq to characterize mutations in proteins derived from transcription errors in poliovirus. These researchers stated that their study “provides the first single-nucleotide fitness landscape of an evolving RNA virus and establishes a general experimental platform for studying the genetic changes underlying the evolution of virus populations.” The importance of this 2014 publication in the prestigious journal Nature is evidenced by more than 200 citations in Google Scholar over a period of less than five years.

In 2017, Gout et al. reported numerous modifications to the original CirSeq assay that streamlined the protocol and increased its sensitivity, and they also designed a customized bioinformatic pipeline to identify transcription errors. Methodological details for these improvements go far beyond the scope of this blog, so interested readers will need to consult the full report by Gout et al., as this blog will only address key findings.

Key Findings Using CirSeq

S. cerevisiae

Gout et al. screened >8.5 billion bases of the entire transcriptome of Saccharomyces cerevisiae (S. cerevisiae) and found >200,000 transcription errors in eight unique cell lines. Previous efforts have detected only ~100 transcription errors in eukaryotic cells, i.e. organisms consisting of a cell or cells in which the genetic material is DNA in the form of chromosomes contained within a distinct nucleus. Consequently, these results reported by Gout et al. represent the first comprehensive analysis of the fidelity of transcription in a eukaryotic organism.

Importantly, the errors detected by Gout et al. were distributed across the entire transcriptome of S. cerevisiae, indicating that the CirSeq approach provides a genome-wide view of transcriptional mutagenesis in yeast.

Errors were found along the entire length of transcripts, indicating that they affect every aspect of RNA functionality, including the location of the start and stop codon, the stability of secondary structures, and the information that is encoded in the primary sequence. Thus, transcription errors affect every aspect of protein structure and function, including residues for post-translational modifications, catalysis, substrate binding, and structural integrity.

Gout et al. found that, on average, the yeast transcriptome contains ~4.0 errors per million base pairs, which demonstrated that transcription errors occur >100-fold more frequently than DNA replication errors. However, these errors are not distributed equally over the transcriptome. Molecules of mRNA contain the fewest errors (3.9 × 10⁻⁶ per base pair), and are synthesized by RNA polymerase II (RNAPII), which is a 550-kDa complex of 12 subunits required for binding to upstream gene promoters to start transcription.

Space-filling model of RNAPII. The structure of yeast RNAPII was solved by Stanford University Prof. Roger Kornberg, who was awarded the Nobel Prize in Chemistry in 2006 for his studies of the process by which genetic information from DNA is copied to RNA.

In terms of increasing error rate, ribosomal RNA molecules synthesized by RNAPI (4.3 × 10⁻⁶ per base pair), mitochondrial RNA (9.3 × 10⁻⁶ per base pair), and RNA molecules associated with “housekeeping” genes synthesized by RNAPIII (1.7 × 10⁻⁵ per base pair) closely follow RNAPII-derived RNA. Gout et al. said that these results suggest that each RNA polymerase has its own unique error rate, as has been observed for DNA polymerases.
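For a feel of what these per-base rates mean in practice, the short Python sketch below compares each class with RNAPII and estimates the chance that a single transcript carries at least one error. The rates are the ones quoted above; the 1,500-base transcript length is a hypothetical round number chosen only for illustration, and the calculation assumes errors occur independently at each base.

```python
# Per-base error rates quoted in the text for each class of RNA.
error_rates = {
    "RNAPII (mRNA)": 3.9e-6,
    "RNAPI (rRNA)": 4.3e-6,
    "mitochondrial RNA": 9.3e-6,
    "RNAPIII": 1.7e-5,
}

# Fold difference of each class relative to RNAPII-derived mRNA.
baseline = error_rates["RNAPII (mRNA)"]
fold_over_rnapii = {name: rate / baseline for name, rate in error_rates.items()}


def prob_at_least_one_error(rate_per_base, length):
    """Probability that a transcript of the given length has one or more errors,
    assuming independent errors at every base."""
    return 1.0 - (1.0 - rate_per_base) ** length


transcript_length = 1_500  # hypothetical mRNA length, in bases
p_error = prob_at_least_one_error(baseline, transcript_length)

for name, fold in sorted(fold_over_rnapii.items(), key=lambda item: item[1]):
    print(f"{name}: {fold:.1f}x the RNAPII rate")
print(f"Chance of >=1 error in a {transcript_length}-base mRNA: {p_error:.2%}")
```

Under these assumptions, a little over half a percent of such mRNA molecules would contain at least one error, and RNAPIII transcripts would be a bit more than four times as error-prone per base as RNAPII transcripts.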

On the other hand, within a class of transcripts, the error rate was remarkably constant. For example, the error rate of transcripts synthesized by RNAPII is independent of the expression level of a gene, its distance from an origin of replication, or the position of a base along the length of the gene. In addition, Gout et al. found that bases that are known to be subject to RNA modifications did not display an increased error rate, although they did detect a significant decrease in the coverage of these bases, indicating that they are not efficiently reverse-transcribed and are thus underrepresented.

Related Findings Using Single-Molecule Real-Time (SMRT) Sequencing

These latter observations on modified bases in RNA serve as my transition into discussing recent studies by Potapov et al., who employed Pacific Biosciences (PacBio) SMRT sequencing, which I have previously blogged about. Potapov et al. used SMRT to measure the fidelity of incorporation and replication of modified ribonucleotides such as N6-methyladenosine (m6A), 5-methylcytidine (m5C), 5-hydroxymethylcytidine (hm5C), pseudouridine (Ψ), and inosine (I).

Taken from Potapov et al., Nucleic Acids Res. 46, 5753–5763 (2018). Open Access.

As depicted here, T7 RNA polymerase was used to synthesize base-modified RNA from nucleotide triphosphate pools wherein A, G, C or U was replaced with a corresponding nucleotide analog selected from m6A, m5C, hm5C, Ψ, I, 5-methyluridine (m5U), or 5-hydroxymethyluridine (hm5U). I am pleased to say that all of these reagents were obtained as modified nucleotides from TriLink. After synthesis of modified RNA, first and second strand cDNA was synthesized by reverse transcription, and the resultant double-stranded DNA product was converted into a circular template for SMRT sequencing.

Interested readers should consult Potapov et al. for an explanation of how RNA polymerase and reverse transcriptase error rates were determined from bioinformatic analysis, and what those rates were. Qualitatively, for the C5 position of uracil, a relatively small methyl group had minimal effect on RNA polymerase incorporation and reverse transcriptase replication fidelity. Increasing the size of the methyl group by adding a hydroxyl group increased first strand errors. Pseudouridine, which contains a secondary amine at the equivalent C5 position in uracil, did not affect reverse transcriptase fidelity, but instead produced substitution errors more frequently during RNA synthesis by T7 RNA polymerase. Misincorporation errors can have implications for pseudouridine-modified RNA-based therapeutics.


Potapov et al. concluded that, with the methodology they describe, it will be possible to define the transcriptional component of nongenetic mutations for the first time and to understand how this “molecular noise” affects cellular function. These investigators believe that their “experiments open up a new field of mutagenesis to widespread experimentation,” and add that one of the most challenging aspects of this field will be to define the impact of transcription errors on cellular health.

According to Potapov et al., the data suggests that transcription errors are particularly detrimental to cellular proteostasis, which is a portmanteau of the words protein and homeostasis, and refers to the concept that there are competing and integrated biological pathways within cells that control the biogenesis, folding, trafficking and degradation of proteins present within and outside the cell.

For example, according to these researchers, in patients that suffer from nonfamilial cases of Alzheimer’s disease, transcription errors can generate toxic versions of the amyloid precursor protein, whereas similar errors generate mutated versions of the ubiquitin-B protein. In both cases, these errors occur on tracts of GA dinucleotide repeats that are present in the coding regions of the affected genes. These observations suggest that transcription errors can directly contribute to human pathology if they occur repeatedly at the same location.

In addition to these highly specific transcription errors, Potapov et al. note that it has long been suspected that a much larger population of errors may exist. These errors have evaded detection because they occur randomly throughout the genome. The investigators believe that their experiments now confirm this suspicion and describe the “landscape of these errors” in detail. To me, this landscape of errors is visually akin to a digital picture of a landscape with erroneous pixels.

Potapov et al. conclude the following:

“Because transcription errors are ubiquitous throughout the genome and can affect any gene at any location, we suspect that the molecular noise created by these errors could be substantial. An important challenge in the future will be to connect these errors directly to the changes in cellular function and monitor their effect on cellular health. We anticipate that these experiments will ultimately lead to the discovery of a wide range of unexpected phenomena, including new mutagens, new mutational mechanisms, and new disease processes that could help us understand how the environment and our lifestyle choices affect our overall health, as well as our predisposition to diseases that are caused by protein aggregation.”

I fully agree with this concluding perspective for the future, and look forward to learning much more about how this “molecular noise” affects cellular function.

As usual, I welcome your comments.
