Third-generation sequencing (also known as long-read sequencing) is a class of DNA sequencing methods currently under active development.
Third-generation sequencing works by reading nucleotide sequences at the single-molecule level, in contrast to existing methods that require breaking long strands of DNA into small segments and then inferring nucleotide sequences by amplification and synthesis. Critical challenges remain in engineering the molecular instruments needed to make whole-genome sequencing with this technology commercially available.
Second-generation sequencing, often referred to as next-generation sequencing (NGS), has dominated the DNA sequencing space since its development. It has dramatically reduced the cost of DNA sequencing by enabling a massively parallel approach capable of producing large numbers of reads at exceptionally high coverage across the genome.
Since eukaryotic genomes contain many repetitive regions, a major limitation of this class of sequencing methods is the short length of the reads it produces. Briefly, second-generation sequencing works by first amplifying the DNA molecule and then conducting sequencing by synthesis. The collective fluorescent signal resulting from synthesizing a large number of amplified, identical DNA strands allows the inference of nucleotide identity. However, due to random errors, DNA synthesis across the amplified strands becomes progressively out of sync, and signal quality deteriorates quickly as read length grows. To preserve read quality, long DNA molecules must be broken up into small segments, which is a critical limitation of second-generation sequencing technologies. Computational efforts aimed at overcoming this challenge often rely on approximate heuristics that may not result in accurate assemblies.
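The dephasing problem can be made concrete with a toy simulation; this is a sketch only, and the strand count, stall probability, and cycle count below are arbitrary illustrative values, not parameters of any real platform.

```python
import random

def simulate_dephasing(n_strands=10_000, n_cycles=300, p_stall=0.01, seed=0):
    """Toy model of phasing error in sequencing by synthesis.

    Each amplified strand should incorporate one nucleotide per cycle, but
    with probability p_stall it fails and falls behind. The collective
    signal at a cycle is only as clean as the fraction of strands that are
    still in phase (i.e. have incorporated exactly `cycle` bases).
    """
    rng = random.Random(seed)
    positions = [0] * n_strands                 # bases incorporated per strand
    in_phase_fraction = []
    for cycle in range(1, n_cycles + 1):
        for i in range(n_strands):
            if rng.random() > p_stall:          # normal incorporation
                positions[i] += 1
        in_phase = sum(1 for p in positions if p == cycle)
        in_phase_fraction.append(in_phase / n_strands)
    return in_phase_fraction

if __name__ == "__main__":
    frac = simulate_dephasing()
    for cycle in (1, 50, 100, 200, 300):
        print(f"cycle {cycle:3d}: {frac[cycle - 1]:6.2%} of strands still in phase")
```

Even with a 1% per-cycle stall rate, only roughly (0.99)^300, about 5%, of strands remain in phase after 300 cycles, which is why reads must be kept short to preserve signal quality.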
By enabling direct sequencing of single DNA molecules, third-generation sequencing technologies can produce substantially longer reads than second-generation sequencing. Such an advantage has critical implications for both genome science and the study of biology in general. However, third-generation sequencing data have much higher error rates than previous technologies, which can complicate downstream genome assembly and analysis of the resulting data. These technologies are undergoing active development, and the high error rates are expected to improve. For applications that are more tolerant of errors, such as structural variant calling, third-generation sequencing has been found to outperform existing methods.
Current technologies
Sequencing technologies with an approach different from that of second-generation platforms were first described as "third-generation" in 2008–2009.
Several companies are currently at the heart of third-generation sequencing technology development, namely Pacific Biosciences, Oxford Nanopore Technologies, Quantapore (CA-USA), and Stratos (WA-USA). These companies are taking fundamentally different approaches to sequencing single DNA molecules.
PacBio developed the single molecule real time sequencing (SMRT) platform, based on the properties of zero-mode waveguides (ZMWs). Signals take the form of fluorescent light emitted as each nucleotide is incorporated by a DNA polymerase bound to the bottom of the ZMW well.
Oxford Nanopore's technology involves passing a DNA molecule through a nanoscale pore and measuring changes in the electrical field surrounding the pore, while Quantapore has a different, proprietary nanopore approach. Stratos Genomics spaces out the DNA bases with polymeric inserts, "Xpandomers", to circumvent the signal-to-noise challenge of nanopore ssDNA reading.
Also notable was Helicos's single-molecule fluorescence approach, but the company entered bankruptcy in the fall of 2015.
Advantages
Longer reads
In comparison to the current generation of sequencing technologies, third-generation sequencing has the obvious advantage of producing much longer reads. These longer read lengths are expected to alleviate numerous computational challenges surrounding genome assembly, transcript reconstruction, and metagenomics, among other important areas of modern biology and medicine.
It is well known that eukaryotic genomes, including those of primates and humans, are complex and contain large numbers of long repeated regions. Analyses based on the short reads of second-generation sequencing must resort to approximate strategies to infer sequences over long ranges for assembly and genetic variant calling. Paired-end reads have been leveraged by second-generation sequencing to combat these limitations; however, the exact fragment lengths of paired ends are often unknown and must also be approximated. By making long read lengths possible, third-generation sequencing technologies have clear advantages.
Epigenetics
Epigenetic markers are stable and potentially heritable modifications to the DNA molecule that do not change its sequence. An example is DNA methylation at CpG sites, which has been found to influence gene expression. Histone modifications are another example. The current generation of sequencing technologies relies on laboratory techniques such as ChIP-sequencing for the detection of epigenetic markers. These techniques involve tagging the DNA strand, breaking and filtering fragments that contain markers, followed by sequencing. Third-generation sequencing may enable direct detection of these markers because modified bases produce signals distinct from those of the four standard nucleotide bases.
Portability and speed
Other important advantages of third generation sequencing technologies include portability and sequencing speed.
Since minimal sample preprocessing is required in comparison to second-generation sequencing, smaller equipment can be designed. Oxford Nanopore Technologies has recently commercialized the MinION sequencer. This sequencing machine is roughly the size of a regular USB flash drive and can be used readily by connecting to a laptop. In addition, since the sequencing process is not parallelized across regions of the genome, data can be collected and analyzed in real time. These advantages make third-generation sequencing well suited to hospital settings, where quick, on-site data collection and analysis is needed.
Challenges
Third-generation sequencing, as it currently stands, faces important challenges mainly surrounding accurate identification of nucleotide bases; error rates are still much higher than those of second-generation sequencing. This is generally due to instability of the molecular machinery involved. For example, in PacBio's single-molecule real-time sequencing technology, the DNA polymerase molecule becomes increasingly damaged as the sequencing process proceeds. Additionally, since the process happens quickly, the signals given off by individual bases may be blurred by signals from neighbouring bases. This poses a new computational challenge for deciphering the signals and consequently inferring the sequence. Methods such as hidden Markov models have been leveraged for this purpose with some success.
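As a hedged illustration of how a hidden Markov model can be applied to this kind of signal decoding, the sketch below runs a standard Viterbi decoder over a toy model in which each base emits a noisy current level; the states, emission means, and transition probabilities are invented for illustration and do not correspond to any real basecaller.

```python
import math

# Toy HMM: hidden states are the four bases, observations are noisy signal levels.
# All numbers here are illustrative and do not come from any real basecaller.
STATES = "ACGT"
EMISSION_MEAN = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}
EMISSION_SD = 0.5
LOG_TRANS = math.log(0.25)   # uniform transition probability between bases
LOG_START = math.log(0.25)   # uniform start probability

def log_gauss(x, mean, sd):
    """Log density of a Gaussian emission."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(signal):
    """Most likely base sequence for a list of signal measurements."""
    score = {s: LOG_START + log_gauss(signal[0], EMISSION_MEAN[s], EMISSION_SD)
             for s in STATES}
    backpointers = []
    for x in signal[1:]:
        best_prev = max(STATES, key=lambda p: score[p])   # uniform transitions
        backpointers.append({s: best_prev for s in STATES})
        score = {s: score[best_prev] + LOG_TRANS +
                    log_gauss(x, EMISSION_MEAN[s], EMISSION_SD)
                 for s in STATES}
    # Trace back the highest-scoring path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi([1.1, 2.2, 1.9, 3.8, 0.9]))   # prints "ACCTA" for this toy signal
```

Real basecallers train far richer models (including context-dependent emissions and blur between neighbouring bases), but the decoding principle is the same: find the base sequence that best explains the observed signal.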
On average, different individuals of the human population share about 99.9% of their genome sequence; in other words, only about one in every thousand bases differs between any two people. The high error rates of third-generation sequencing are inevitably problematic for characterizing the individual differences that exist between members of the same species.
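A back-of-the-envelope calculation makes the scale of the problem concrete; the 15% per-base error rate below is an assumed, illustrative figure for an uncorrected long read, not a measured value for any specific platform.

```python
# Back-of-the-envelope comparison (illustrative numbers only).
read_length = 10_000         # bases in a single long read
variant_rate = 1 / 1000      # ~1 true difference per 1,000 bases between two people
raw_error_rate = 0.15        # assumed per-base error rate of an uncorrected long read

expected_variants = read_length * variant_rate    # ~10 real differences
expected_errors = read_length * raw_error_rate    # ~1,500 erroneous base calls

print(f"expected true variants per read:     {expected_variants:.0f}")
print(f"expected sequencing errors per read: {expected_errors:.0f}")
print(f"errors per true variant:             {expected_errors / expected_variants:.0f}")
```

With these assumptions, raw errors outnumber genuine differences by roughly 150 to 1, which is why consensus from multiple reads or error correction is needed before small variants can be called reliably.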
Genome assembly
Genome assembly is the reconstruction of whole genome DNA sequences. This is generally done with two fundamentally different approaches.
Reference alignment
When a reference genome is available, as in the case of humans, newly sequenced reads can simply be aligned to the reference genome in order to characterize their properties. Such reference-based assembly is quick and easy but has the disadvantage of "hiding" novel sequences and large copy number variants. In addition, reference genomes do not yet exist for most organisms.
De novo assembly
De novo assembly is the alternative genome assembly approach to reference alignment. It refers to the reconstruction of whole-genome sequences entirely from raw sequence reads. This method is chosen when there is no reference genome, when the species of the given organism is unknown, as in metagenomics, or when genetic variants of interest may not be detected by reference genome alignment.
Given the short reads produced by the current generation of sequencing technologies, de novo assembly is a major computational problem. It is normally approached by an iterative process of finding and connecting sequence reads with sensible overlaps. Various computational and statistical techniques, such as de Bruijn graphs and overlap-layout-consensus graphs, have been leveraged to solve this problem. Nonetheless, due to the highly repetitive nature of eukaryotic genomes, accurate and complete reconstruction of genome sequences in de novo assembly remains challenging. Paired-end reads have been posed as a possible solution, though exact fragment lengths are often unknown and must be approximated.
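The following minimal sketch illustrates the de Bruijn graph idea only; it is not a production assembler, and the reads and k-mer size are arbitrary. Nodes are (k-1)-mers, edges are k-mers, and a contig is read off by following an unambiguous path.

```python
from collections import defaultdict

def build_de_bruijn(reads, k=4):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges link each k-mer's prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk(graph, start):
    """Follow edges while the path is unambiguous (exactly one outgoing edge)."""
    contig, node = start, start
    while len(graph.get(node, ())) == 1:
        node = next(iter(graph[node]))
        contig += node[-1]
    return contig

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
print(walk(graph, "ATG"))   # reconstructs "ATGGCGTGCAAT" from the overlapping reads
```

A repeated (k-1)-mer would create a node with several outgoing edges, stopping the walk; this is precisely the ambiguity that repetitive genomes introduce and that longer reads help resolve.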
Hybrid assembly
The long read lengths offered by third-generation sequencing may alleviate many of the challenges currently faced by de novo genome assembly. For example, if an entire repetitive region can be sequenced unambiguously in a single read, no computational inference is required. Computational methods have also been proposed to alleviate the issue of high error rates. For example, one study demonstrated that de novo assembly of a microbial genome using PacBio sequencing alone outperformed assembly from second-generation sequencing.
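A repeat is resolved by a single read only if the read covers the whole repeat plus enough unique sequence on both flanks to anchor it; the sketch below expresses that condition directly (the coordinates and the 100 bp anchor requirement are illustrative assumptions).

```python
def read_spans_repeat(read_start, read_length, repeat_start, repeat_end, min_anchor=100):
    """True if a single read covers the whole repeat plus unique sequence on both sides.

    min_anchor is the number of unique (non-repetitive) bases required on each
    flank so the read can be placed unambiguously; 100 bp is an arbitrary choice.
    """
    read_end = read_start + read_length
    return (read_start <= repeat_start - min_anchor
            and read_end >= repeat_end + min_anchor)

# A 12 kb read starting 1 kb before a 9 kb repeat spans it with room to spare;
# a typical 300 bp short read starting at the same point cannot.
print(read_spans_repeat(9_000, 12_000, 10_000, 19_000))   # True
print(read_spans_repeat(9_000, 300, 10_000, 19_000))      # False
```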
Third generation sequencing may also be used in conjunction with
second generation sequencing. This approach is often referred to as
hybrid sequencing. For example, long reads from third generation
sequencing may be used to resolve ambiguities that exist in genomes
previously assembled using second generation sequencing. On the other
hand, short second generation reads have been used to correct errors that exist in the long third generation reads. In general, this hybrid
approach has been shown to improve de novo genome assemblies
significantly.
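The basic idea behind short-read correction of long reads can be sketched as a majority vote of aligned short reads at each position; real correction tools are far more sophisticated (handling alignment, indels, and quality scores), and the alignment offsets below are assumed to be known already.

```python
from collections import Counter

def correct_long_read(long_read, placed_short_reads):
    """Majority-vote correction of a long read.

    placed_short_reads: list of (offset, sequence) pairs giving where each
    short read aligns on the long read (alignment is assumed already done).
    """
    votes = [Counter() for _ in long_read]
    for offset, seq in placed_short_reads:
        for i, base in enumerate(seq):
            if 0 <= offset + i < len(long_read):
                votes[offset + i][base] += 1
    corrected = []
    for i, base in enumerate(long_read):
        if votes[i]:
            corrected.append(votes[i].most_common(1)[0][0])  # consensus of short reads
        else:
            corrected.append(base)                           # no coverage: keep original base
    return "".join(corrected)

# The long read has an error (X) at position 4 that three short reads out-vote.
long_read = "ACGTXCGTACGT"
shorts = [(2, "GTACGT"), (3, "TACGTA"), (4, "ACGTAC")]
print(correct_long_read(long_read, shorts))   # "ACGTACGTACGT"
```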
Epigenetic markers
DNA methylation (DNAm) – the covalent modification of DNA at CpG sites resulting in attached methyl groups – is the best understood component of the epigenetic machinery. DNA modifications and the resulting gene expression can vary across cell types and developmental stages, differ with genetic ancestry, change in response to environmental stimuli, and can be heritable. Since the discovery of DNAm, researchers have also found correlations between DNAm and diseases such as cancer and autism, making it an important avenue of research into disease etiology.
Advantages
The most common current methods for examining methylation state require an assay that fragments DNA before standard second-generation sequencing on the Illumina platform. As a result of the short read length, information regarding longer patterns of methylation is lost. Third-generation sequencing technologies offer the capability for single-molecule real-time sequencing of longer reads and detection of DNA modifications without the aforementioned assay.
Oxford Nanopore Technologies’ MinION
has been used to detect DNAm. As each DNA strand passes through a pore,
it produces electrical signals which have been found to be sensitive to
epigenetic changes in the nucleotides, and a hidden Markov model (HMM) was used to analyze MinION data to detect 5-methylcytosine (5mC) DNA modification. The model was trained using synthetically methylated E. coli
DNA and the resulting signals measured by the nanopore technology. Then
the trained model was used to detect 5mC in MinION genomic reads from a
human cell line that already had a reference methylome. The classifier achieved 82% accuracy at randomly sampled singleton sites, which increased to 95% when more stringent thresholds were applied.
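The accuracy gain from more stringent thresholds reflects a general trade-off: discarding low-confidence calls raises accuracy but reduces the number of sites reported. The sketch below applies such a threshold to hypothetical per-site calls (the sites, labels, and confidence scores are invented for illustration).

```python
def accuracy_and_yield(calls, truth, threshold):
    """Apply a confidence threshold to per-site calls and report the trade-off.

    calls: dict site -> (predicted_label, confidence); truth: dict site -> true label.
    Returns (fraction of sites retained, accuracy on the retained sites).
    """
    kept = {s: c for s, c in calls.items() if c[1] >= threshold}
    if not kept:
        return 0.0, float("nan")
    correct = sum(1 for s, (label, _) in kept.items() if truth[s] == label)
    return len(kept) / len(calls), correct / len(kept)

# Hypothetical calls and reference labels (invented for illustration).
calls = {101: ("5mC", 0.98), 257: ("C", 0.61), 402: ("5mC", 0.55), 733: ("C", 0.91)}
truth = {101: "5mC", 257: "5mC", 402: "C", 733: "C"}

for t in (0.5, 0.9):
    retained, acc = accuracy_and_yield(calls, truth, t)
    print(f"threshold {t}: {retained:.0%} of sites retained, accuracy {acc:.0%}")
```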
Other methods address different types of DNA modifications using the MinION platform. Stoiber et al. examined 4-methylcytosine (4mC) and 6-methyladenine (6mA), along with 5mC, and also created software to directly visualize the raw MinION data in a human-friendly way. They found that in E. coli, which has a known methylome, event windows 5 base pairs long can be used to divide and statistically analyze the raw MinION electrical signals. A straightforward Mann-Whitney U test can detect modified portions of the E. coli sequence, as well as further split the modifications into 4mC, 6mA, or 5mC regions.
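A hedged sketch of this kind of windowed comparison is shown below: raw signal values from a sample are compared against an unmodified control in windows of five measurements using a two-sided Mann-Whitney U test. The signal arrays are synthetic; a real analysis would operate on actual nanopore event data aligned to a reference.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def flag_modified_windows(sample_signal, control_signal, window=5, alpha=0.05):
    """Flag signal windows whose distribution differs from the unmodified control.

    Both inputs are 1-D arrays of raw current measurements aligned to the same
    reference positions; windows of `window` consecutive measurements are
    compared with a two-sided Mann-Whitney U test.
    """
    flagged = []
    for start in range(0, len(sample_signal) - window + 1, window):
        s = sample_signal[start:start + window]
        c = control_signal[start:start + window]
        _, p = mannwhitneyu(s, c, alternative="two-sided")
        if p < alpha:
            flagged.append((start, start + window, p))
    return flagged

# Synthetic example: the sample's current is shifted over one short region,
# mimicking the disturbance a modified base causes as it passes the pore.
rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, size=100)
sample = control + rng.normal(0.0, 0.1, size=100)
sample[40:45] += 3.0     # pretend a modification perturbs these five positions
print(flag_modified_windows(sample, control))   # expected to flag the window starting at 40
```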
It seems likely that in the future, MinION raw data will be used to detect many different epigenetic marks in DNA.
PacBio sequencing has also been used to detect DNA methylation. In this platform, the pulse width – the width of a fluorescent light pulse – corresponds to a specific base. In 2010 it was shown that the interpulse distance differs between control and methylated samples and that there is a "signature" pulse width for each methylation type. In 2012, the binding sites of DNA methyltransferases were characterized using the PacBio platform. Detection of N6-methylation in C. elegans was shown in 2015, and DNA methylation on N6-adenine in mouse embryonic stem cells was shown using the PacBio platform in 2016.
Other forms of DNA modifications – from heavy metals, oxidation,
or UV damage – are also possible avenues of research using Oxford
Nanopore and PacBio third generation sequencing.
Drawbacks
Processing of the raw MinION data – such as normalization to the median signal – was needed, reducing the real-time capability of the technology.
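A minimal sketch of median-based normalization, assuming the raw signal is available as an array of current measurements; centring on the median and scaling by the median absolute deviation is one common robust choice, though the exact scheme in any given pipeline may differ.

```python
import numpy as np

def median_normalize(raw_signal):
    """Centre a raw current trace on its median and scale by the MAD.

    Robust normalization of this kind makes signals from different reads and
    different pores comparable before downstream modelling.
    """
    signal = np.asarray(raw_signal, dtype=float)
    med = np.median(signal)
    mad = np.median(np.abs(signal - med))
    return (signal - med) / mad

raw = [480.0, 502.0, 495.0, 730.0, 489.0, 510.0]   # made-up current values
print(median_normalize(raw).round(2))
```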
Consistency of the electrical signals is still an issue, making it difficult to call a nucleotide accurately. MinION also has low throughput; since multiple overlapping reads are hard to obtain, this further leads to accuracy problems in downstream DNA modification detection. Both the hidden Markov model and the statistical methods used with MinION raw data require repeated observations of DNA modifications for detection, meaning that individual modified nucleotides need to be consistently present in multiple copies of the genome, e.g. in multiple cells or plasmids in the sample.
For the PacBio platform, too, coverage requirements vary depending on the type of methylation to be detected. As of March 2017, other epigenetic factors, such as histone modifications, could not be detected using third-generation technologies. Longer patterns of methylation are often lost because smaller contigs still need to be assembled.
Transcriptomics
Transcriptomics is the study of the transcriptome, usually by characterizing the relative abundances of messenger RNA (mRNA) molecules in the tissue under study. According to the central dogma of molecular biology, genetic information flows from double-stranded DNA molecules to single-stranded mRNA molecules, which can then be readily translated into functional protein molecules. By studying the transcriptome, one can gain valuable insight into the regulation of gene expression.
While expression levels at the gene level can be depicted more or less accurately by second-generation sequencing, transcript-level information remains an important challenge.
As a consequence, the role of alternative splicing in molecular biology
remains largely elusive. Third generation sequencing technologies hold
promising prospects in resolving this issue by enabling sequencing of
mRNA molecules at their full lengths.
Alternative splicing
Alternative splicing
(AS) is the process by which a single gene may give rise to multiple
distinct mRNA transcripts and consequently different protein
translations.
Some evidence suggests that AS is a ubiquitous phenomenon and may play a
key role in determining the phenotypes of organisms, especially in
complex eukaryotes; all eukaryotes contain genes with introns that may undergo AS. In particular, it has been estimated that AS occurs
in 95% of all human multi-exon genes.
AS has undeniable potential to influence myriad biological processes.
Advancing knowledge in this area has critical implications for the study
of biology in general.
Transcript reconstruction
The current generation of sequencing technologies produces only short reads, putting a tremendous limitation on the ability to detect distinct transcripts; short reads must be reverse-engineered into the original transcripts that could have given rise to the observed reads. This task is further complicated by the highly variable expression levels across transcripts, and consequently variable read coverage across the sequence of a gene. In addition, exons may be shared among individual transcripts, rendering unambiguous inference essentially impossible.
Existing computational methods make inferences based on the accumulation of short reads at various sequence locations, often by making simplifying assumptions. Cufflinks takes a parsimonious approach, seeking to explain all the reads with the fewest possible transcripts. StringTie, on the other hand, attempts to estimate transcript abundances while assembling the reads. These methods, while reasonable, may not always identify real transcripts.
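As an illustration of the parsimony principle only (not Cufflinks' actual algorithm), the sketch below greedily selects a small set of candidate transcripts that explains every observed splice junction; real tools solve a much richer optimization over read alignments and abundances.

```python
def parsimonious_transcripts(candidates, observed_junctions):
    """Greedy set cover: choose few transcripts that explain all observed junctions.

    candidates: dict mapping transcript name -> set of splice junctions it contains.
    observed_junctions: set of junctions supported by the read data.
    """
    uncovered = set(observed_junctions)
    chosen = []
    while uncovered:
        # Pick the transcript explaining the most still-unexplained junctions.
        best = max(candidates, key=lambda t: len(candidates[t] & uncovered))
        if not candidates[best] & uncovered:
            break                     # remaining junctions explained by no candidate
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

# Hypothetical gene with three exon-exon junctions and three candidate isoforms.
candidates = {
    "isoform_1": {"e1-e2", "e2-e3"},
    "isoform_2": {"e1-e3"},
    "isoform_3": {"e2-e3"},
}
observed = {"e1-e2", "e2-e3", "e1-e3"}
print(parsimonious_transcripts(candidates, observed))   # ['isoform_1', 'isoform_2']
```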
A study published in 2008 surveyed 25 different existing transcript reconstruction protocols.
Its evidence suggested that existing methods are generally weak in assembling transcripts, though the ability to detect individual exons is relatively intact. According to the estimates, average exon detection sensitivity across the 25 protocols is 80% for Caenorhabditis elegans genes, while transcript identification sensitivity decreases to 65%. For humans, the study reported an average exon detection sensitivity of 69% and an average transcript detection sensitivity of a mere 33%. In other words, for humans, existing methods are able to identify less than half of all existing transcripts.
Third generation sequencing technologies have demonstrated
promising prospects in solving the problem of transcript detection as
well as mRNA abundance estimation at the level of transcripts. While
error rates remain high, third generation sequencing technologies have
the capability to produce much longer read lengths. Pacific Biosciences has introduced the Iso-Seq method, proposing to sequence mRNA molecules at their full lengths. It is anticipated that Oxford Nanopore will put forth similar technologies. The trouble with higher error rates may be alleviated by supplementary high-quality short reads; this approach has been previously tested and reported to reduce the error rate by more than 3-fold.
Metagenomics
Metagenomics is the analysis of genetic material recovered directly from environmental samples.
Advantages
The main advantage of third-generation sequencing technologies in metagenomics is their speed of sequencing in comparison to second-generation techniques. Sequencing speed is important, for example, in the clinical setting (i.e. pathogen identification), to allow for efficient diagnosis and timely clinical action.
Oxford Nanopore's MinION was used in 2015 for real-time metagenomic detection of pathogens in complex, high-background clinical samples. The first Ebola virus (EBOV) read was sequenced 44 seconds after data acquisition. Mapping of reads to the genome was uniform, with at least one read mapping to more than 88% of the genome. The relatively long reads allowed sequencing of a near-complete viral genome to high accuracy (97–99% identity) directly from a primary clinical sample.
A common phylogenetic marker for microbial community diversity studies is the 16S ribosomal RNA gene. Both MinION and PacBio's SMRT platform have been used to sequence this gene. In this context the PacBio error rate was comparable to that of shorter reads from 454 and Illumina's MiSeq sequencing platforms.
Drawbacks
MinION's high error rate (~10–40%) prevented identification of antimicrobial resistance markers, for which single-nucleotide resolution is necessary. For the same reason, eukaryotic pathogens were not identified.
Ease of carryover contamination when re-using the same flow cell (standard wash protocols do not work) is also a concern. Unique barcodes may allow for more multiplexing. Furthermore, performing accurate species identification for bacteria, fungi and parasites is very difficult, as they share a larger portion of their genomes and some differ only by less than 5%.
The per-base sequencing cost is still significantly higher than that of MiSeq. However, the prospect of supplementing reference databases with full-length sequences from organisms below the limit of detection of the Sanger approach could greatly help the identification of organisms in metagenomics.
Before third generation sequencing can be used reliably in the
clinical context, there is a need for standardization of lab protocols.
These protocols are not yet as optimized as PCR methods.