- Enables Sequencing of Various Modified Bases in RNA
- Examples Are 7-Methylguanosine, Pseudouridine, N6-Methyladenosine, and 5-Methylcytosine in COVID-19
- Applications Include Transcriptomics and Epitranscriptomics of RNA Viruses
Historical Introduction
Although RNA sequencing is widely employed, it typically involves enzymatic conversion of the RNA of interest into complementary DNA (cDNA), which is then sequenced. By contrast, direct RNA sequencing (aka native RNA sequencing) is exactly what it sounds like: the direct sequencing of the RNA of interest, without first converting it to cDNA.
Historically, the Gilbert lab at Harvard first reported direct RNA sequencing in 1977. This was closely followed by a similar report from the MRC Laboratory of Molecular Biology in Cambridge. Both approaches used gel electrophoresis to separate radiolabeled RNA fragments, as illustrated here. Two decades later, following the advent of improved mass spectrometry (MS) technologies, Hahner et al. published MS-based sequencing of RNA digests, a method that was later improved upon by several other groups throughout the early 2000s, as discussed elsewhere.
In 2011, Ozsolak et al. at Helicos Biosciences reported massively parallelized single-molecule sequencing of RNA using fluorescently labeled terminators, a method that achieved much higher throughput. However, the availability of this breakthrough technology was short lived, as Helicos went out of business in 2012. Fortunately, single-molecule sequencing of DNA by use of a biological nanopore of the type shown here was commercialized by Oxford Nanopore Technologies (ONT) shortly afterwards, in 2015. The ONT platform has been recently extended to direct RNA sequencing, which is the subject of this blog (see Footnote).
Crystal structure of a biological nanopore (data credit: Song, et al. 1996. Science 274, 1859). Top view (L) and side view (R). Taken from commons.wikimedia.org and free to use.
Direct RNA Sequencing by Use of Nanopores
Nematode in water.
Roach et al. recently reported direct RNA sequencing of the transcriptome of a widely studied model organism, the nematode worm (Caenorhabditis elegans, shown here). By leveraging the long reads spanning the full length of mRNA transcripts, these researchers were able to provide support for 23,865 splice isoforms across 14,611 genes, without the need for computational reconstruction of gene models. Moreover, of the isoforms identified, 3452 are novel splice isoforms not present in the WormBase, and 16,342 isoforms are in the 3' untranslated region (3' UTR), of which 2640 3’ UTR isoforms are novel. It was also determined that poly(A) tail lengths of transcripts vary across development, and correlate with known expression levels.
An important feature of direct sequencing of RNA unique to nanopores is the applicability, in principle, to post-transcriptional RNA modifications (aka epitranscriptome), which are otherwise “lost” during conversion to cDNA for sequencing. Examples of this key attribute are given in the next section; however, let us now consider how this is achieved with reverse transcription (RT) and without.
Taken from Workman et al. in bioRxiv, posted online November 9, 2018, and free to use. The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
The strategy used by Workman et al. to isolate and sequence native poly(A) RNA is shown here. Briefly, total RNA was isolated from human B-lymphocyte cells using conventional extraction, followed by bead-based poly(A) selection. The poly(A) isolate was adapted for nanopore sequencing by: i) attaching proprietary ONT adapters to the poly(A) RNA (red) using T4 ligase; ii) generating poly(A) RNA/DNA duplexes by RT; iii) ligating the adapted poly(A) RNA strand to a second proprietary ONT adapter bearing an RNA motor protein (yellow); and iv) loading the adapted poly(A) RNA onto individual MinION flow cells for sequencing using a standard ONT protocol.
Conversion of the resultant ionic current (pA)-blockage vs. time trace (aka “squiggles”) into RNA sequence is discussed in the next example. According to Workman et al., generation of the cDNA complementary strand is not required; however, RT was performed to improve throughput. In this example, 9.9 million aligned sequence reads were generated. These native RNA reads had a high-quality (N50) aligned length of 1,294 bases, and a maximum aligned length of over 21,000 bases. A total of 78,199 high-confidence isoforms were identified by combining long nanopore reads with short higher accuracy Illumina reads.
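For readers unfamiliar with the N50 statistic quoted above, the following is a minimal Python sketch of how it is commonly computed: the read length at which reads of that length or longer contain at least half of all bases. The function name and the example read lengths are hypothetical and are not taken from Workman et al.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced (or aligned) bases.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(lengths, reverse=True)  # longest reads first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical aligned read lengths, for illustration only.
example_lengths = [500, 800, 1200, 1500, 3000]
example_n50 = n50(example_lengths)
```

Because N50 is weighted by bases rather than by read count, a few very long reads can raise it well above the median read length.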
In 2019, Smith et al. described direct MinION nanopore sequencing of individual, full-length 16S ribosomal RNA (rRNA) from E. coli without RT, and were able to identify 7-methylguanosine and pseudouridine modifications in rRNA. As depicted here in panel a, the E. coli rRNA library preparation for MinION sequencing began with isolation of 16S rRNA from total RNA by denaturing polyacrylamide gel electrophoresis. A 5’-phosphorylated (P) 16S rRNA-specific adapter was then hybridized and ligated to the 16S rRNA 3’-OH end. Next, a sequencing adapter bearing an RNA motor protein (green) was hybridized and ligated to the 3’ overhang of the 16S rRNA adapter. The sample was then loaded into the MinION flow cell for sequencing.
Taken from Smith et al. (2019) PLoS ONE 14 (5): e0216709. © 2019 Smith et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Panel b depicts a representative ionic current vs. time trace during translocation of an individual 16S rRNA strand from this E. coli library through a nanopore. Upon capture of the 3’ end of an adapted 16S rRNA, the ionic current transitions from an open channel (310 pA; gold arrow in upper right) to a series of discrete segments characteristic of the adapters (inset). This is followed by ionic current segments corresponding to base-by-base translocation of the 16S rRNA. According to Smith et al., this trace is representative of thousands of reads collected for individual 16S rRNA strands from E. coli.
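As a rough illustration of the transition described above, the Python sketch below looks for the first sustained drop of the ionic current below the open-channel level. It is only a toy threshold detector: the 310 pA default comes from the trace described by Smith et al., while the function name, the drop fraction, the minimum run length, and the example trace are assumptions made for illustration, and this is not how ONT's software segments or basecalls real data.

```python
def find_capture_index(current_pa, open_channel_pa=310.0,
                       drop_fraction=0.7, min_samples=3):
    """Return the index where a sustained current drop begins, or None.

    A capture is called when `min_samples` consecutive samples fall
    below `drop_fraction` * `open_channel_pa` (both are assumed values).
    """
    threshold = open_channel_pa * drop_fraction
    run = 0  # number of consecutive below-threshold samples seen so far
    for i, value in enumerate(current_pa):
        if value < threshold:
            run += 1
            if run == min_samples:
                return i - min_samples + 1  # first sample of the run
        else:
            run = 0  # a brief spike does not count as a capture
    return None


# Hypothetical trace in pA: open channel, one transient spike, then capture.
example_trace = [310, 308, 312, 150, 309, 120, 110, 115, 90]
capture_at = find_capture_index(example_trace)
```

Requiring several consecutive low samples keeps a single noisy reading from being mistaken for the start of a translocation event.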
The details on how the four canonical (A, G, C and U) and non-canonical (modified) bases of RNA are differentiated by nanopore sequencing are quite complex. This differentiation “remains a challenge,” according to a recent report by Lorenz et al., and is part of an Open Access review by Xu and Seki titled Recent advances in the detection of base modifications using the Nanopore sequencer, published in October 2019.
N6-methyladenosine. Taken from commons.wikimedia.org and free to use.
Despite these technical challenges, Lorenz et al. have been able to expand the list of identifiable modified bases to include N6-methyladenosine sites in RNA. These investigators at the University of California, San Diego note that “[o]ne of the most common modifications in the eukaryotic transcriptome is N6-methyladenosine (m6A, shown here), which is found in most classes of RNA, including mRNA, ncRNA, rRNA, and tRNAs.” They add that evidence to date has demonstrated that “m6A plays important roles in nearly every aspect of biology from yeast to mammals.” For these reasons, nanopore sequencing of m6A in RNA is highly desirable.
Using ONT’s direct RNA sequencing technology, Lorenz et al. developed a machine learning technique (Random Forest classifier) that was trained using experimentally detected m6A sites within mRNA DRACH consensus sequences (in which D = A, G or U; R = A or G; H = A, C or U). The resultant software, called MINES (m6A Identification using Nanopore Sequencing), was used to assign m6A methylation status to over 13,000 previously unannotated DRACH sites in endogenous human embryonic kidney cell transcripts, and identified over 40,000 sites with isoform-level resolution in a human mammary epithelial cell line. These sites displayed sensitivity to the m6A “writer” enzyme, METTL3, and “eraser” enzyme, ALKBH5. MINES thus enabled m6A annotation from long-read direct RNA-seq at single coordinate-level resolution.
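To make the DRACH consensus concrete, here is a small Python sketch that scans an RNA sequence for candidate DRACH sites using the base definitions given above. It only locates the sequence motif; it says nothing about whether a site is actually methylated and is not part of MINES. The function name and the example sequence are hypothetical.

```python
import re

# DRACH consensus: D = A, G or U; R = A or G; then A, C; H = A, C or U.
# A lookahead is used because two motifs can share one base
# (the H of one site can be the D of the next).
DRACH = re.compile(r"(?=([AGU][AG]AC[ACU]))")


def find_drach_sites(rna):
    """Return (position, motif) pairs for every DRACH match.

    `position` is the 0-based index of the central A, the candidate
    m6A site. DNA-style input (T instead of U) is accepted.
    """
    rna = rna.upper().replace("T", "U")
    return [(m.start() + 2, m.group(1)) for m in DRACH.finditer(rna)]


# Hypothetical sequence, for illustration only.
example_sites = find_drach_sites("CCGGACUAAGGACAUU")
```

In this example the scan reports two candidate sites, at the central A of GGACU and of GGACA.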
Applications of Nanopore-Based Direct Sequencing of RNA
This section provides examples of how nanopore sequencing of RNA has been recently applied to different areas of research. The first example involves a report by Semmouri et al., who investigated gene expression of microbes (aka metatranscriptomics) for possible seasonal differences in a marine zooplankton community. According to these researchers, the implementation of cost effective monitoring programs for zooplankton remains challenging due to the required taxonomical expertise, as well as the high costs of sampling and species identification. To reduce costs, Semmouri et al. evaluated the construction of a metatranscriptome dataset from a shrimp-like crustacean zooplankton community, similar to the one shown here.
Zooplankton were sampled in a specific marine station in the North Sea, both in the winter and summer, and the ONT platform was used to sequence RNA directly. This approach to metatranscriptomics is capable of species detection and can screen for the presence of endoparasites, thus competing with morphological identification. Taxonomic analysis based on ribosomal 18S transcripts identified calanoid copepods (pictured here) as the most abundant community members.
The most abundant mRNA transcripts with known function coded for essential metabolic processes. Genes involved in glycolytic and translation-related processes were most expressed in the community. Semmouri et al. concluded that, although small in scale, their study provides the basis for other types of metatranscriptomic biomonitoring programs.
Structure and genome of HIV. Taken from Wikipedia and free to use.
Another example of direct RNA sequencing by use of nanopores was reported by Gener and Kimata, who achieved full-coverage native RNA sequencing of human immunodeficiency virus type 1 (HIV-1) genomes (depicted here). HIV-1 is an RNA virus and the causative agent of AIDS, which is estimated to have afflicted 38 million people worldwide in 2018. HIV-1 strains are known to have high sequence variability due to error-prone replication. According to Gener and Kimata, “lack of full-length sequencing data has limited our understanding of HIV biology. The closest we have come to being able to observe the information contained in the HIV-1 viral genome directly has been to stitch together short read data into quasispecies, which are neither real nor direct observations of these viruses.”
Most approaches to sequencing the HIV-1 viral genome have used some variant of reverse transcription to make double-stranded cDNA, usually followed by PCR amplification, and finally by cDNA sequencing (classic RNA-seq). However, according to Gener and Kimata, DNA-based sequencing methods cannot differentiate between reads from infectious virion RNA, integrated proviral DNA, and non-integrated forms. To address this problem, the researchers applied ONT’s nanopore technology to investigate a novel approach – the possibility of direct sequencing of the entire HIV-1 viral RNA genome with one full-length read of native virion-associated RNA.
Fifteen HIV-1 strains were processed with Direct RNA Sequencing library kits and sequenced on MinION devices. Raw reads were converted to FASTQ-formatted files, aligned to reference sequences, and assembled into contigs. Multi-sequence alignments of the contigs were generated and used for cladistics analysis. For 3 out of 15 isolates, full-length HIV-1 was sequenced from the transcriptional start site to 3’ LTR, which is 100% of the virion genome depicted here. Despite the strong 3’ bias, read coverage was sufficient for the evaluation of single-nucleotide variants, insertions and deletions in 9 isolates, and for the assembly of HIV-1 genomes directly from viral RNA, with a maximum of 94% assembly coverage for one isolate.
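A coverage figure of this kind can be understood as the fraction of reference positions spanned by at least one aligned read or contig. The Python sketch below computes that fraction from a list of alignment intervals. It is a simplified illustration with hypothetical names and numbers, and it is not the pipeline used by Gener and Kimata.

```python
def coverage_fraction(intervals, reference_length):
    """Fraction of reference positions covered by at least one interval.

    `intervals` are half-open (start, end) pairs in 0-based reference
    coordinates; overlapping intervals are counted only once.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    covered = 0
    current_end = 0  # rightmost position already counted
    for start, end in sorted(intervals):
        start = max(start, current_end, 0)  # skip already-counted bases
        end = min(end, reference_length)    # clip to the reference
        if end > start:
            covered += end - start
            current_end = end
    return covered / reference_length


# Hypothetical alignments on a 1,000-base reference, for illustration only.
example_fraction = coverage_fraction([(0, 100), (50, 200), (300, 400)], 1000)
```

Sorting the intervals first lets overlaps be merged in a single pass, so reads piled up at the 3’ end are not double-counted.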
The HIV-1 genome has a size of ca. 10000 base pairs and consists of nine genes, some of which are overlapping. Taken from commons.wikimedia.org and free to use.
Human cytomegalovirus (HCMV), pictured here, has been investigated by direct RNA sequencing by Balázs et al. This virus is a ubiquitous betaherpesvirus that causes mononucleosis-like symptoms in adults and severe life-threatening infections in newborns. Although it is classified as a DNA herpesvirus, it has been shown that the HCMV virion contains not only DNA, but also four species of mRNA, indicating that this virus is more complex than previously believed.
Histopathology of HCMV infection of a lung pneumocyte. The central cell displays the dramatically enlarged nuclei characteristic of HCMV. Image credit: CDC/Dr. Edwin P. Ewing, Jr. (PHIL #958), 1982. Taken from commons.wikimedia.org and free to use.
The pandemic of coronavirus disease 2019 (COVID-19) is caused by a positive-sense single-stranded RNA (ssRNA) virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, depicted here), belonging to the Coronaviridae family. On April 3rd, 2020, Taiaroa et al. in Australia posted an online preprint reporting “the first native RNA sequence of SARS-CoV-2.” This publication details the coronaviral transcriptome and epitranscriptome, which include 42 positions with predicted 5-methylcytosine modifications appearing at consistent positions between sub-genomic mRNAs. In other positive ssRNA viruses, this RNA methylation can change dynamically during the course of infection, influencing host-pathogen interaction and viral replication, according to Taiaroa et al.
The researchers state that “other modifications may become apparent once training datasets are available for direct RNA sequence data,” which is important because “little [is] known of the epitranscriptomic landscape of coronaviruses.” They offer native RNA sequence-based “inference of viral genetic features and evolutionary rate,” and suggest that this type of rapid online sharing of sequence information throughout the SARS-CoV-2 pandemic “represents an inflection point for public health and genomic epidemiology, providing early insights into the biology and evolution of this emerging pathogen.” The Zone concurs and notes that, as of April 3rd, 2020, ONT tweeted that its nanopore systems were already being used to sequence SARS-CoV-2 in 30 countries. On June 24, 2020, Wang et al. published a promising report titled Nanopore Targeted Sequencing for the Accurate and Comprehensive Detection of SARS‐CoV‐2 and Other Respiratory Viruses.
In addition, Peter Thielen and Thomas Mehoke, molecular biologists at Johns Hopkins University, are nanopore sequencing the genomes of different variations of SARS-CoV-2 in order to track its mutations as it spreads. However, a publication posted on April 4th, 2020, by Lu et al. titled Genomic epidemiology of SARS-CoV-2 in Guangdong Province, China has cautioned that their sequencing results “suggest that early phylogenetic analyses of the pandemic should be interpreted carefully. The number of mutations that define phylogenetic lineages are small (often one), and may be similar to the number of sequence differences arising from errors...”
Concluding Comments
Although this blog has focused on the advent of direct sequencing of RNA by use of nanopores uniquely extending to RNA modifications in the epitranscriptome, there has been exponential adoption of nanopore sequencing for both DNA and cDNA. This is evident from the chart of publications found in PubMed that contain the phrase “nanopore sequencing” anywhere in the article.
PubMed search and chart by Jerry Zon.
Those interested in reading the first of these publications, which appeared in 1999, can use this link to Akeson et al., who showed that the nanopore current blockades caused by homopolymers of RNA are readily distinguished from one another based on blockade amplitude or blockade kinetics. In the concluding discussion section of this report, preceded by their similar studies of DNA several years earlier, the authors presciently state the following:
“[I]f membrane channels are to provide a direct, high-speed read-out of the sequence of bases in an RNA or DNA polymer, experimental conditions will need to be optimized to distinguish between the individual purine and pyrimidine nucleotides…” Furthermore, “it is apparent that in future nanopore sequencing applications the blockade characteristics produced by an individual nucleotide will have to be read with greater precision than can be achieved at polymer traversal rates exceeding 1 base per 10 ms.” Finally, “[w]e expect that with such refinements single nucleotide detection may be achievable, thus permitting direct nanopore sequencing of very long individual nucleic acid strands.”
After the combined contributions of many researchers over more than 20 years, this remarkable feat has indeed been achieved. In the future, I expect that the rate of technological progress for nanopore sequencing will accelerate and further enable elucidation of the role of epitranscriptomic RNA modifications in molecular biology and health.
Your comments are welcomed, as usual.
Footnote
Readers should be aware that the term “direct RNA sequencing” is often misused in reference to sequencing cDNA rather than the RNA from which the cDNA is derived by reverse transcription (for example, see Xu et al.). The descriptor “native RNA sequencing” (for example, see Workman et al.) would therefore seem to be preferred, as it avoids any ambiguity about the actual analyte being sequenced.
It is also noteworthy that there is a recent US patent by Pacific Biosciences (PacBio) describing the use of PacBio’s non-nanopore technology for native RNA sequencing. However, no exemplary data is provided, and I was unable to find any academic publication using a PacBio system for native RNA sequencing.