The heritability of autism is the proportion of differences in expression of autism that can be explained by genetic variation. Autism has a strong genetic basis. Although the genetics of autism are complex, the disorder is explained more by multigene effects than by rare mutations with large effects.
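In the standard quantitative-genetics formulation (a general definition, not one specific to autism), heritability is the share of phenotypic variance attributable to genetic variance:

$$H^2 = \frac{\operatorname{Var}(G)}{\operatorname{Var}(P)} = \frac{\operatorname{Var}(G)}{\operatorname{Var}(G) + \operatorname{Var}(E)}$$

where Var(G) is the variance due to genetic differences and Var(E) the variance due to environmental differences; an estimate of 0.8 thus means that 80% of the variation in the trait across the population is associated with genetic variation.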
Genetic influence on autism is well documented, with studies consistently demonstrating higher prevalence among siblings and in families with a history of autism. This led researchers to investigate the extent to
which genetics contribute to the development of autism. Numerous
studies, including twin studies and family studies, have estimated the
heritability of autism to be around 80 to 90%, indicating that genetic factors play a substantial role in its etiology. Heritability
estimates do not imply that autism is solely determined by genetics, as
environmental factors also contribute to the development of the
disorder.
Studies of twins
from 1977 to 1995 estimated the heritability of autism to be more than
90%; in other words, that over 90% of the differences between autistic and
non-autistic individuals are due to genetic effects. When only one identical twin is autistic, the other often has learning or social disabilities. For adult siblings, the likelihood of having one or more features of the broad autism phenotype might be as high as 30%, much higher than the likelihood in controls.
A variety of genetic associations with autism spectrum disorder have been reported. This Manhattan plot shows the statistical significance
(but not necessarily the strength) of each variant in a scan across the
entire genome. The plot is similar to those in published articles.
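As a rough illustration of how such a plot is constructed (using simulated p-values rather than real autism GWAS data; only numpy and matplotlib are assumed):

```python
# Sketch of a Manhattan plot with simulated association p-values.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x_offset = 0
for i in range(22):                        # one block of variants per autosome
    n = int(rng.integers(300, 800))        # simulated variant count
    p = rng.uniform(size=n)                # simulated association p-values
    plt.scatter(x_offset + np.arange(n), -np.log10(p),
                s=2, color=["tab:blue", "tab:gray"][i % 2])
    x_offset += n

plt.axhline(-np.log10(5e-8), color="red", ls="--")  # genome-wide significance
plt.xlabel("position in scan across the genome")
plt.ylabel("-log10(p)")
plt.show()
```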
Though genetic linkage analyses have been inconclusive, many association analyses have discovered genetic variants associated with autism. For each autistic individual, mutations
in many genes are typically implicated. Mutations in different sets of
genes may be involved in different autistic individuals. There may be
significant interactions among mutations in several genes, or between
the environment and mutated genes. By identifying genetic markers inherited with autism in family studies, researchers have located numerous candidate genes, most of which encode proteins involved in neural development and function. However, for most of the candidate genes, the actual mutations that increase the likelihood of autism have not been identified. Typically, autism cannot be traced to a Mendelian (single-gene) mutation or to a single chromosome abnormality such as fragile X syndrome or 22q13 deletion syndrome.
10–15% of autism cases may result from single gene disorders or copy number variations (CNVs)—spontaneous alterations in the genetic material during meiosis that delete or duplicate genetic material. These sometimes result in syndromic autism, as opposed to the more common idiopathic autism. Sporadic (non-inherited) cases have been examined to identify candidate genetic loci
involved in autism. A substantial fraction of autism may be highly
heritable but not inherited: that is, the mutation that causes the
autism is not present in the parental genome.
Although the fraction of autism traceable to a genetic cause may
grow to 30–40% as the resolution of array comparative genomic
hybridization (CGH) improves, several results in this area have been described incautiously, possibly
misleading the public into thinking that a large proportion of autism
is caused by CNVs and is detectable via array CGH, or that detecting
CNVs is tantamount to a genetic diagnosis. The Autism Genome Project database contains genetic linkage and CNV data that connect autism to genetic loci and suggest that every human chromosome may be involved. Using autism-related sub-phenotypes instead of the diagnosis of autism per se may be more useful in identifying susceptible loci.
Twin studies
Twin
studies provide a unique opportunity to explore the genetic and
environmental influences on autism spectrum disorder (ASD). By studying
identical twins, who share identical DNA, and fraternal twins, who share
about half of their DNA, researchers can estimate the heritability of autism by comparing concordance rates (how often both twins are affected) in identical versus fraternal pairs. Twin studies are a helpful tool for determining the heritability of disorders and human traits in general. They involve determining concordance of characteristics between identical (monozygotic or MZ) twins and between fraternal (dizygotic or DZ) twins. Possible problems of twin studies are (1) errors in diagnosis of monozygosity and (2) the assumption that DZ twins share their social environment to the same degree as MZ twins.
A condition that is environmentally caused without genetic
involvement would yield a concordance for MZ twins equal to the
concordance found for DZ twins. In contrast, a condition that is
completely genetic in origin would theoretically yield a concordance of
100% for MZ pairs and usually much less for DZ pairs depending on
factors such as the number of genes involved and assortative mating.
An example of a condition that appears to have very little if any genetic influence is irritable bowel syndrome (IBS), with a concordance of 28% vs. 27% for MZ and DZ pairs respectively. An
example of a human characteristic that is extremely heritable is eye color, with a concordance of 98% for MZ pairs and 7–49% for DZ pairs depending on age.
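A simplified sketch of how such concordance figures feed into a heritability estimate, using Falconer's approximation H^2 ≈ 2(r_MZ - r_DZ); real analyses use tetrachoric correlations rather than raw concordance, so these numbers are indicative only:

```python
# Falconer's approximation, treating the concordance figures quoted above
# as stand-ins for twin correlations (an illustration, not a real analysis).

def falconer_h2(r_mz: float, r_dz: float) -> float:
    """Broad heritability estimate from MZ and DZ twin correlations."""
    return 2 * (r_mz - r_dz)

print(falconer_h2(0.28, 0.27))  # IBS: ~0.02, little or no genetic influence
print(falconer_h2(0.98, 0.49))  # eye color (older DZ pairs): ~0.98
```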
Identical twin studies put autism's heritability in a range between 36% and 95.7%, with concordance for a broader phenotype usually found at the higher end of the range. Autism concordance in siblings and fraternal twins is anywhere between 0
and 23.5%. This is more likely 2–4% for classic autism and 10–20% for a
broader spectrum. Assuming a general-population prevalence of 0.1%, the
risk of classic autism in siblings is 20- to 40-fold that of the
general population.
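The fold increase quoted here is simply the sibling recurrence risk divided by the population prevalence, often written as the sibling relative risk:

$$\lambda_s = \frac{P(\text{autism} \mid \text{affected sibling})}{P(\text{autism})} = \frac{0.02 \text{ to } 0.04}{0.001} = 20 \text{ to } 40.$$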
Notable twin studies have attempted to shed light on the heritability of autism.
A small-scale study in 1977 was the first of its kind to look
into the heritability of autism. It involved 10 DZ and 11 MZ twin pairs in which at least one twin in each pair showed infantile autism. It
found a concordance of 36% in MZ twins compared to 0% for DZ twins.
Concordance of "cognitive abnormalities" was 82% in MZ pairs and 10% for
DZ pairs. In 12 of the 17 pairs discordant for autism, a biological
hazard was believed to be associated with the condition.
A 1979 case report discussed a pair of identical twins concordant
for autism. The twins developed similarly until the age of 4, when one
of them spontaneously improved. The other twin, who had had infrequent
seizures, remained autistic. The report noted that genetic factors were
not "all-important" in the development of twins.
In 1985, a study of twins enrolled with the UCLA Registry for
Genetic Studies found a concordance of 95.7% for autism in 23 pairs of
MZ twins, and 23.5% in 17 pairs of DZ twins.
In a 1989 study, Nordic countries
were screened for cases of autism. Eleven pairs of MZ twins and 10 of
DZ twins were examined. Concordance of autism was found to be 91% in MZ
and 0% in DZ pairs. The concordances for "cognitive disorder" were 91%
and 30% respectively. In most of the pairs discordant for autism, the
autistic twin had more perinatal stress.
A British twin sample was reexamined in 1995 and a 60%
concordance was found for autism in MZ twins vs. 0% concordance for DZ.
It also found 92% concordance for a broader spectrum in MZ vs. 10% for
DZ. The study concluded that "obstetric hazards usually appear to be
consequences of genetically influenced abnormal development, rather than
independent aetiological factors."
A 1999 study looked at social cognitive skills in the
general population of children and adolescents. It found "poorer social
cognition in males", and a heritability of 0.68 with higher genetic
influence in younger twins.
In 2000, a study looked at reciprocal social behavior in
general-population identical twins. It found a concordance of 73% for
MZ, i.e. "highly heritable", and 37% for DZ pairs.
A 2004 study looked at 16 pairs of MZ twins and found a concordance of
43.75% for "strictly defined autism". Neuroanatomical differences
(discordant cerebellar white and grey matter volumes) between discordant
twins were found. The abstract notes that in previous studies 75% of
the non-autistic twins displayed the broader phenotype.
Another 2004 study examined whether the characteristic symptoms
of autism (impaired social interaction, communication deficits, and
repetitive behaviors) show decreased variance of symptoms among monozygotic
twins compared to siblings in a sample of 16 families. The study
demonstrated significant aggregation of symptoms in twins. It also
concluded that "the levels of clinical features seen in autism may be a
result of mainly independent genetic traits."
An English twin study in 2006 found high heritability for autistic traits in a large group of 3,400 pairs of twins.
One critic of the pre-2006 twin studies said that they were too
small and their results could plausibly be explained on non-genetic
grounds.
In a 2015 meta-analysis of previously conducted twin studies, the
authors found that genetics play a substantial role in the development
of autism, contributing between 64% and 91% to the chances of developing
autism.
In a 2024 study of twins conducted by Martini et al., the findings suggest that genetic factors have a greater influence on the stability of autistic traits than environmental factors do.
Sibling studies
A study of 99 autistic probands found a 2.9% concordance for autism in siblings, and between 12.4% and 20.4% concordance for a "lesser variant" of autism.
A study of 31 siblings of autistic children, 32 siblings of children with developmental delay, and 32 controls found that the siblings of autistic children, as a group, "showed superior spatial and verbal span, but a greater than expected number performed poorly on the set-shifting, planning, and verbal fluency tasks."
A 2005 Danish study looked at "data from the Danish Psychiatric
Central Register and the Danish Civil Registration System to study some
risk factors of autism, including place of birth, parental place of
birth, parental age, family history of psychiatric disorders, and
paternal identity." It found an overall prevalence rate of roughly
0.08%. Prevalence of autism in siblings of autistic children was found
to be 1.76%. Prevalence of autism among siblings of children with Asperger syndrome or PDD
was found to be 1.04%. The risk was twice as high if the mother had
been diagnosed with a psychiatric disorder. The study also found that
"the risk of autism was associated with increasing degree of
urbanisation of the child's place of birth and with increasing paternal,
but not maternal, age."
A study in 2007 looked at a database containing pedigrees of 86
families with two or more autistic children and found that 42 of the
third-born male children showed autistic symptoms, suggesting that
parents had a 50% chance of passing on a mutation to their offspring.
The mathematical models suggest that about 50% of autistic cases are
caused by spontaneous mutations. The simplest model was to divide
parents into two risk classes depending on whether the parent carries a
pre-existing mutation that causes autism; it suggested that about a
quarter of autistic children have inherited a copy number variation from their parents.
Other family studies
A 1994 study looked at the personalities of parents of autistic children, using parents of children with Down syndrome
as controls. Using standardized tests it was found that parents of
autistic children were "more aloof, untactful and unresponsive" compared
to parents whose children did not have autism.
A 1997 study found higher rates of social and communication
deficits and stereotyped behaviors in families with multiple-incidence
autism.
Autism was found to occur more often in families of physicists,
engineers and scientists. 12.5% of the fathers and 21.2% of the
grandfathers (both paternal and maternal) of children with autism were
engineers, compared to 5% of the fathers and 2.5% of the grandfathers of
children with other syndromes. Other studies have yielded similar results. Findings of this nature have led to the coinage of the term "geek syndrome".
A 2001 study of brothers and parents of autistic boys looked into the phenotype
in terms of one current cognitive theory of autism. The study raised
the possibility that the broader autism phenotype may include a
"cognitive style" (weak central coherence) that can confer
information-processing advantages.
A study in 2005 showed a positive correlation between repetitive
behaviors in autistic individuals and obsessive-compulsive behaviors in
parents. Another 2005 study focused on sub-threshold autistic traits in the
general population. It found that correlation for social impairment or
competence between parents and their children and between spouses is
about 0.4.
A 2005 report examined the family psychiatric history of 58 subjects with Asperger syndrome (AS) diagnosed according to DSM-IV criteria. Three (5%) had first-degree relatives with AS. Nine (19%) had a family history of schizophrenia. Thirty-five (60%) had a family history of depression. Out of 64 siblings, 4 (6.25%) were diagnosed with AS. According to a 2022 study of 86 mother-child dyads followed across 18 months, "prior maternal depression didn’t predict child behavior problems later."
Twinning risk
It has been suggested that the twinning process itself is a risk factor in the development of autism, presumably due to perinatal factors. However, three large-scale epidemiological studies, conducted in California, Sweden, and Australia, have refuted this idea. The Western Australian study used the Maternal and Child Health Research Database, which holds birth records for all infants born in the state, including children later diagnosed with autism spectrum disorder. The analysis was restricted to children born between 1980 and 1995, and compared the incidence of autism spectrum disorder in the twin population with that in the non-twin population. Together with the other two studies, it supported the conclusion that the twinning process alone is not a risk factor. In these studies, concordance was far higher in MZ pairs than in DZ pairs, on the order of 90% versus 0%, and this genetic similarity, rather than twinning itself, explains the similarity of outcomes in MZ twins compared with DZ twins and non-twin siblings.
Studies have shown that at least 30% of individuals with autism have spontaneous de novo mutations, arising in the father's sperm or the mother's egg, that disrupt genes important for brain development. These spontaneous mutations are likely to cause autism in families where there is no family history of the disorder.
The concordance between identical twins is not quite 100%, for two reasons. First, these mutations have variable 'expressivity': their effects manifest differently because of chance, epigenetic, and environmental factors. Second, spontaneous mutations can occur in one embryo and not the other after conception. The likelihood of developing intellectual disability depends on how important the gene is to brain development and how the mutation changes its function, as well as on the genetic and environmental background in which the mutation occurs. The recurrence of the same mutations in multiple individuals affected by autism has led Brandler and Sebat to suggest that the spectrum of autism is breaking up into quanta of many different genetic disorders.
Single genes
The
most parsimonious explanation for cases of autism where a single child
is affected and there is no family history or affected siblings is that a
single spontaneous mutation that impacts one or multiple genes is a
significant contributing factor. Tens of individual genes or mutations have been definitively identified
and are cataloged by the Simons Foundation Autism Research Initiative.
Examples of autism arising from a rare or de novo mutation in a single gene or locus include neurodevelopmental disorders like fragile X syndrome; metabolic conditions (for example, propionic acidemia); and chromosomal disorders like 22q13 deletion syndrome and 16p11.2 deletion syndrome.
Deletion (1), duplication (2) and inversion (3) are all chromosome abnormalities that have been implicated in autism.
These mutations themselves are characterized by considerable
variability in clinical outcome and typically only a subset of mutation
carriers meet criteria for autism. For example, carriers of the 16p11.2 deletion have a mean IQ 32 points lower than that of their first-degree relatives who do not carry the deletion; however, only 20% fall below the threshold IQ of 70 for intellectual disability, and only 20% have autism. Around 85% have a neurobehavioral diagnosis, including autism, ADHD, anxiety disorders, mood disorders, gross motor delay, and epilepsy, while 15% have no diagnosis. Alongside these neurobehavioral phenotypes, 16p11.2 deletions and duplications have been associated with macrocephaly and microcephaly, with body weight regulation, and, for the duplication in particular, with schizophrenia. Controls who carry mutations associated with autism or schizophrenia typically show cognitive phenotypes and fecundity intermediate between neurodevelopmental cases and population controls. A single mutation can therefore have multiple different effects depending on other genetic and environmental factors.
Multigene interactions
In
this model, autism often arises from a combination of common,
functional variants of genes. Each gene contributes a relatively small
effect in increasing the risk of autism. In this model, no single gene
directly regulates any core symptom of autism such as social behavior.
Instead, each gene encodes a protein that disrupts a cellular process,
and the combination of these disruptions, possibly together with
environmental influences, affects key developmental processes such as synapse formation. For example, one model is that many mutations affect MET and other receptor tyrosine kinases, which in turn converge on disruption of ERK and PI3K signaling.
Two family types
In
this model most families fall into two types: in the majority, sons
have a low risk of autism, but in a small minority their risk is near
50%. In the low-risk families, sporadic autism is mainly caused by
spontaneous mutation with poor penetrance
in daughters and high penetrance in sons. The high-risk families come
from (mostly female) children who carry a new causative mutation but are
unaffected and transmit the dominant mutation to grandchildren.
Epigenetics
Several epigenetic models of autism have been proposed. These are suggested by the occurrence of autism in individuals with
fragile X syndrome, which arises from epigenetic mutations, and with
Rett syndrome, which involves epigenetic regulation factors. An
epigenetic model would help explain why standard genetic screening strategies have so much difficulty with autism.
Genomic imprinting
Genomic imprinting models have been proposed; one of their strengths is explaining the high male-to-female ratio in ASD. One hypothesis is that autism is in some sense diametrically opposite to schizophrenia
and other psychotic-spectrum conditions, that alterations of genomic
imprinting help to mediate the development of these two sets of
conditions, and that ASD involves increased effects of paternally
expressed genes, which regulate overgrowth in the brain, whereas
schizophrenia involves maternally expressed genes and undergrowth.
Environmental interactions
Though
autism's genetic factors explain most of autism risk, they do not
explain all of it. A common hypothesis is that autism is caused by the
interaction of a genetic predisposition and an early environmental
insult. Several theories based on environmental factors have been proposed to
address the remaining risk. Some of these theories focus on prenatal
environmental factors, such as agents that cause birth defects; others
focus on the environment after birth, such as children's diets. All known teratogens related to the risk of autism appear to act during the first eight weeks from conception, which is strong evidence that autism arises very early in development. Although evidence for other environmental causes is anecdotal and has not been confirmed by reliable studies, extensive searches are underway. A 2015 study found evidence that non-shared environmental factors influence social impairments in ASD.
Sex bias
Autism spectrum disorder affects all races, ethnicities, and socioeconomic groups. Still, more males than females are affected across all cultures; the male-to-female ratio is approximately 3 to 1. A study analyzed the Autism Genetics Resource Exchange (AGRE) database,
which holds resources, research, and records of autism spectrum
disorder diagnosis. In this study, it was concluded that when a
spontaneous mutation causes autism spectrum disorder (ASD), there is
high penetrance in males and low penetrance in females. A study published in 2020 explored the reasons behind this sex bias further. The main genetic difference between males and females is that males have one X and one Y chromosome, whereas females have two X chromosomes. This suggests that a gene present on the X chromosome but not on the Y is involved in the sex bias of ASD.
In another study, it was found that the gene NLGN4, when mutated, can cause ASD. This gene and other NLGN genes are important for neuron communication. NLGN4 is found on both the X (NLGN4X) and the Y (NLGN4Y) chromosome, and the two copies are about 97% identical. Most of the mutations that occur are located in the NLGN4X gene. Research into the differences between NLGN4X and NLGN4Y found that the NLGN4Y protein has poor surface expression and poor synapse regulation, leading to poor neuron communication. Researchers concluded that males have a higher incidence of autism when the mechanism is NLGN4X-associated. This follows because females have two X chromosomes: if a gene on one X chromosome is mutated, the copy on the other X chromosome can compensate. Males have only one X chromosome, so if a gene on it is mutated, that mutated copy is the only one available. This genomic difference between males and females is one mechanism that leads to the higher incidence of ASD in males.
Candidate gene loci
Known genetic syndromes, mutations, and metabolic diseases account for up to 20% of autism cases. A number of alleles have been shown to have strong linkage to the autism phenotype.
In many cases the findings are inconclusive, with some studies showing
no linkage. Alleles linked so far strongly support the assertion that
there is a large number of genotypes that are
manifested as the autism phenotype.
At least some of the alleles associated with autism are fairly
prevalent in the general population, which indicates they are not rare
pathogenic mutations. This also presents some challenges in identifying
all the rare allele combinations involved in the etiology of autism.
A 2008 study compared genes linked with autism to those of other
neurological diseases, and found that more than half of known autism
genes are implicated in other disorders, suggesting that the other
disorders may share molecular mechanisms with autism.
DNA sequencing
DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, thymine, cytosine, and guanine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.
Knowledge of DNA sequences has become indispensable for basic biological research, DNA Genographic Projects and in numerous applied fields such as medical diagnosis, biotechnology, forensic biology, virology and biological systematics. Comparing healthy and mutated DNA sequences can diagnose different diseases including various cancers, characterize antibody repertoire, and can be used to guide patient treatment. Having a quick way to sequence DNA allows for faster and more
individualized medical care to be administered, and for more organisms
to be identified and cataloged.
The rapid advancements in DNA sequencing technology have played a
crucial role in sequencing complete genomes of various life forms,
including humans, as well as numerous animal, plant, and microbial
species.
An example of the results of automated chain-termination DNA sequencing
The first DNA sequences were obtained in the early 1970s by academic researchers using laborious methods based on two-dimensional chromatography. Following the development of fluorescence-based sequencing methods with a DNA sequencer, DNA sequencing has become easier and orders of magnitude faster.
Applications
DNA sequencing may be used to determine the sequence of individual genes, larger genetic regions (i.e. clusters of genes or operons), full chromosomes, or entire genomes of any organism. DNA sequencing is also the most efficient way to indirectly sequence RNA or proteins (via their open reading frames). In fact, DNA sequencing has become a key technology in many areas of biology and other sciences such as medicine, forensics, and anthropology.
Molecular biology
Sequencing is used in molecular biology
to study genomes and the proteins they encode. Information obtained using sequencing allows researchers to identify changes in genes and noncoding DNA (including regulatory sequences), associations with diseases and phenotypes, and potential drug targets.
Evolutionary biology
Since DNA is an informative macromolecule in terms of transmission from one generation to another, DNA sequencing is used in evolutionary biology
to study how different organisms are related and how they evolved. In
February 2021, scientists reported, for the first time, the sequencing
of DNA from animal remains, a mammoth in this instance, over a million years old, the oldest DNA sequenced to date.
Metagenomics
The field of metagenomics involves identification of organisms present in a body of water, sewage,
dirt, debris filtered from the air, or swab samples from organisms.
Knowing which organisms are present in a particular environment is
critical to research in ecology, epidemiology, microbiology, and other fields. Sequencing enables researchers to determine which types of microbes may be present in a microbiome, for example.
Virology
As most viruses are too small to be seen by a light microscope,
sequencing is one of the main tools in virology to identify and study
the virus. Viral genomes can be based in DNA or RNA. RNA viruses are more
time-sensitive for genome sequencing, as they degrade faster in clinical
samples. Traditional Sanger sequencing
and next-generation sequencing are used to sequence viruses in basic
and clinical research, as well as for the diagnosis of emerging viral
infections, molecular epidemiology of viral pathogens, and drug-resistance testing. There are more than 2.3 million unique viral sequences in GenBank. In 2019, NGS surpassed traditional Sanger sequencing as the most popular approach for generating viral genomes.
During the 1997 avian influenza outbreak, viral sequencing determined that the influenza sub-type originated through reassortment between quail and poultry. This led to legislation in Hong Kong
that prohibited selling live quail and poultry together at market.
Viral sequencing can also be used to estimate when a viral outbreak
began by using a molecular clock technique.
Medicine
Medical
technicians may sequence genes (or, theoretically, full genomes) from
patients to determine if there is risk of genetic diseases. This is a
form of genetic testing, though some genetic tests may not involve DNA sequencing.
As of 2013 DNA sequencing was increasingly used to diagnose and
treat rare diseases. As more and more genes are identified that cause
rare genetic diseases, molecular diagnoses for patients become more
mainstream. DNA sequencing allows clinicians to identify genetic diseases, improve disease management, provide reproductive counseling, and offer more effective therapies. Gene sequencing panels are used to identify multiple potential genetic causes of a suspected disorder.
Forensics
DNA sequencing may be used along with DNA profiling methods for forensic identification and paternity testing.
DNA testing has evolved tremendously in the last few decades, to the point where a DNA profile can be linked to the material under investigation. DNA patterns in fingerprints, saliva, hair follicles, and other samples uniquely distinguish each living organism from another. DNA testing can detect specific regions in a DNA strand and produce a unique, individualized pattern.
The four canonical bases
The canonical structure of DNA has four bases: thymine (T), adenine (A), cytosine (C), and guanine (G). DNA sequencing is the determination of the physical order of these bases in a molecule of DNA. However, there are many other bases that may be present in a molecule. In some viruses (specifically, bacteriophages), cytosine may be replaced by hydroxymethyl- or hydroxymethylglucose-cytosine. In mammalian DNA, variant bases with methyl groups or phosphosulfate may be found. Depending on the sequencing technique, a particular modification, e.g., the 5mC (5-methylcytosine) common in humans, may or may not be detected.
In almost all organisms, DNA is synthesized in vivo using only the four canonical bases; modification that occurs after replication creates other bases such as 5-methylcytosine. However, some bacteriophages can incorporate a nonstandard base directly.
In addition to modifications, DNA is under constant assault from environmental agents such as UV light and oxygen radicals. At present, such damaged bases are not detected by most DNA sequencing methods, although PacBio has published on this.
History
Discovery of DNA structure and function
Deoxyribonucleic acid (DNA) was first discovered and isolated by Friedrich Miescher in 1869, but it remained under-studied for many decades because proteins,
rather than DNA, were thought to hold the genetic blueprint to life.
This situation changed after 1944 as a result of some experiments by Oswald Avery, Colin MacLeod, and Maclyn McCarty
demonstrating that purified DNA could change one strain of bacteria
into another. This was the first time that DNA was shown capable of
transforming the properties of cells.
In 1953, James Watson and Francis Crick put forward their double-helix model of DNA, based on crystallized X-ray structures being studied by Rosalind Franklin.
According to the model, DNA is composed of two strands of nucleotides
coiled around each other, linked together by hydrogen bonds and running
in opposite directions. Each strand is composed of four complementary
nucleotides – adenine (A), cytosine (C), guanine (G) and thymine (T) –
with an A on one strand always paired with T on the other, and C always
paired with G. They proposed that such a structure allowed each strand
to be used to reconstruct the other, an idea central to the passing on
of hereditary information between generations.
Frederick Sanger, a pioneer of sequencing. Sanger is one of the few scientists who was awarded two Nobel prizes, one for the sequencing of proteins, and the other for the sequencing of DNA.
The foundation for sequencing proteins was first laid by the work of Frederick Sanger who by 1955 had completed the sequence of all the amino acids in insulin,
a small protein secreted by the pancreas. This provided the first
conclusive evidence that proteins were chemical entities with a specific
molecular pattern rather than a random mixture of material suspended in
fluid. Sanger's success in sequencing insulin spurred on x-ray
crystallographers, including Watson and Crick, who by now were trying to
understand how DNA directed the formation of proteins within a cell.
Soon after attending a series of lectures given by Frederick Sanger in
October 1954, Crick began developing a theory which argued that the
arrangement of nucleotides in DNA determined the sequence of amino acids
in proteins, which in turn helped determine the function of a protein.
He published this theory in 1958.
RNA sequencing
RNA sequencing
was one of the earliest forms of nucleotide sequencing. The major
landmark of RNA sequencing is the sequence of the first complete gene
and the complete genome of Bacteriophage MS2, identified and published by Walter Fiers and his coworkers at the University of Ghent (Ghent, Belgium), in 1972 and 1976. Traditional RNA sequencing methods require the creation of a cDNA molecule which must be sequenced.
Early DNA sequencing methods
The first method for determining DNA sequences involved a location-specific primer extension strategy established by Ray Wu, a geneticist, at Cornell University in 1970. DNA polymerase catalysis and specific nucleotide labeling, both of
which figure prominently in current sequencing schemes, were used to
sequence the cohesive ends of lambda phage DNA. Between 1970 and 1973, Wu, scientist Radha Padmanabhan and colleagues
demonstrated that this method can be employed to determine any DNA
sequence using synthetic location-specific primers.
Walter Gilbert, a biochemist, and Allan Maxam, a molecular geneticist, at Harvard also developed sequencing methods, including one for "DNA sequencing by chemical degradation". In 1973, Gilbert and Maxam reported the sequence of 24 basepairs using a method known as wandering-spot analysis. Advancements in sequencing were aided by the concurrent development of recombinant DNA technology, allowing DNA samples to be isolated from sources other than viruses.
Two years later, in 1975, Frederick Sanger, a biochemist, and Alan Coulson, a genome scientist, developed a method to sequence DNA. The technique, known as the "Plus and Minus" method, involved running polymerase reactions in which one of the four bases was either the only nucleotide supplied or the one omitted, so that the reaction products marked the positions of that base.
In 1976, Gilbert and Maxam invented a method for rapidly sequencing DNA while at Harvard, known as Maxam–Gilbert sequencing. The technique involved treating radiolabelled DNA with a chemical and using a polyacrylamide gel to determine the sequence.
In 1977, Sanger then adopted a primer-extension strategy to develop more rapid DNA sequencing methods at the MRC Centre, Cambridge,
UK. This technique was similar to his "Plus and Minus" strategy; however, it was based upon the selective incorporation of
chain-terminating dideoxynucleotides (ddNTPs) by DNA polymerase during in vitro DNA replication. Sanger published this method in the same year.
Sequencing of full genomes
The 5,386 bp genome of bacteriophage φX174. Each coloured block represents a gene.
The first full DNA genome to be sequenced was that of bacteriophage φX174 in 1977. Medical Research Council scientists deciphered the complete DNA sequence of the Epstein-Barr virus
in 1984, finding it contained 172,282 nucleotides. Completion of the
sequence marked a significant turning point in DNA sequencing because it
was achieved with no prior genetic profile knowledge of the virus.
A non-radioactive method for transferring the DNA molecules of sequencing reaction mixtures onto an immobilizing matrix during electrophoresis was developed by Herbert Pohl and co-workers in the early 1980s. This was followed by the commercialization of the DNA sequencer "Direct-Blotting-Electrophoresis-System GATC 1500" by GATC Biotech, which was intensively used in the framework of the EU genome-sequencing programme to determine the complete DNA sequence of the yeast Saccharomyces cerevisiae chromosome II. Leroy E. Hood's laboratory at the California Institute of Technology announced the first semi-automated DNA sequencing machine in 1986. This was followed by Applied Biosystems' marketing of the first fully automated sequencing machine, the ABI 370, in 1987, and by Dupont's Genesis 2000, which used a novel fluorescent labeling technique enabling all four dideoxynucleotides to be identified in a single lane.
By 1990, the U.S. National Institutes of Health (NIH) had begun large-scale sequencing trials on Mycoplasma capricolum, Escherichia coli, Caenorhabditis elegans, and Saccharomyces cerevisiae at a cost of US$0.75 per base. Meanwhile, sequencing of human cDNA sequences called expressed sequence tags began in Craig Venter's lab, an attempt to capture the coding fraction of the human genome. In 1995, Venter, Hamilton Smith, and colleagues at The Institute for Genomic Research (TIGR) published the first complete genome of a free-living organism, the bacterium Haemophilus influenzae. The circular chromosome contains 1,830,137 bases, and its publication in the journal Science marked the first published use of whole-genome shotgun sequencing, eliminating the need for initial mapping efforts.
By 2003, the Human Genome Project's shotgun sequencing methods
had been used to produce a draft sequence of the human genome; it had a
92% accuracy. In 2022, scientists successfully sequenced the last 8% of the human
genome. The fully sequenced standard reference genome is called
GRCh38.p14, and it contains 3.1 billion base pairs.
High-throughput sequencing (HTS) methods
History of sequencing technology
Several new methods for DNA sequencing were developed in the mid to late 1990s and were implemented in commercial DNA sequencers
by 2000. Together these were called the "next-generation" or
"second-generation" sequencing (NGS) methods, in order to distinguish
them from the earlier methods, including Sanger sequencing.
In contrast to the first generation of sequencing, NGS technology is
typically characterized by being highly scalable, allowing the entire
genome to be sequenced at once. Usually, this is accomplished by
fragmenting the genome into small pieces, randomly sampling for a
fragment, and sequencing it using one of a variety of technologies, such
as those described below. An entire genome is possible because multiple
fragments are sequenced at once (giving it the name "massively
parallel" sequencing) in an automated process.
NGS technology has tremendously empowered researchers to seek insights into health, has enabled anthropologists to investigate human origins, and is catalyzing the "Personalized Medicine" movement. However, it has also opened the door to more error.
There are many software tools to carry out the computational analysis of
NGS data, often compiled at online platforms such as CSI NGS Portal,
each with its own algorithm. Even the parameters within one software
package can change the outcome of the analysis. In addition, the large
quantities of data produced by DNA sequencing have also required
development of new methods and programs for sequence analysis. Several
efforts to develop standards in the NGS field have been attempted to
address these challenges, most of which have been small-scale efforts
arising from individual labs. Most recently, a large, organized,
FDA-funded effort has culminated in the BioCompute standard.
On 26 October 1990, Roger Tsien,
Pepi Ross, Margaret Fahnestock and Allan J Johnston filed a patent
describing stepwise ("base-by-base") sequencing with removable 3'
blockers on DNA arrays (blots and single DNA molecules). In 1996, Pål Nyrén and his student Mostafa Ronaghi at the Royal Institute of Technology in Stockholm published their method of pyrosequencing.
On 1 April 1997, Pascal Mayer and Laurent Farinelli submitted patents to the World Intellectual Property Organization describing DNA colony sequencing. The DNA sample preparation and random surface-polymerase chain reaction (PCR) arraying methods described in this patent, coupled to Roger Tsien et al.'s "base-by-base" sequencing method, are now implemented in Illumina's Hi-Seq genome sequencers.
In 1998, Phil Green and Brent Ewing of the University of Washington described their phred quality score for sequencer data analysis, a landmark analysis technique that gained widespread adoption, and
which is still the most common metric for assessing the accuracy of a
sequencing platform.
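The phred score is a simple log transform of the estimated probability that a base call is wrong, Q = -10 log10(P); a minimal sketch:

```python
import math

def phred_quality(p_error: float) -> float:
    """Phred quality score: Q = -10 * log10(P), where P is the estimated
    probability that a base call is wrong."""
    return -10 * math.log10(p_error)

print(phred_quality(0.001))   # Q30: one error expected in 1,000 calls
print(phred_quality(0.0001))  # Q40: one error expected in 10,000 calls
```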
Lynx Therapeutics published and marketed massively parallel signature sequencing
(MPSS), in 2000. This method incorporated a parallelized,
adapter/ligation-mediated, bead-based sequencing technology and served
as the first commercially available "next-generation" sequencing method,
though no DNA sequencers were sold to independent laboratories.
Maxam-Gilbert sequencing
Allan Maxam and Walter Gilbert published a DNA sequencing method in 1977 based on chemical modification of DNA and subsequent cleavage at specific bases. Also known as chemical sequencing, this method allowed purified samples
of double-stranded DNA to be used without further cloning. This
method's use of radioactive labeling and its technical complexity
discouraged extensive use after refinements in the Sanger methods had
been made.
Maxam-Gilbert sequencing requires radioactive labeling at one 5'
end of the DNA and purification of the DNA fragment to be sequenced.
Chemical treatment then generates breaks at a small proportion of one or
two of the four nucleotide bases in each of four reactions (G, A+G, C,
C+T). The concentration of the modifying chemicals is controlled to
introduce on average one modification per DNA molecule. Thus a series of
labeled fragments is generated, from the radiolabeled end to the first
"cut" site in each molecule. The fragments in the four reactions are
electrophoresed side by side in denaturing acrylamide
gels for size separation. To visualize the fragments, the gel is
exposed to X-ray film for autoradiography, yielding a series of dark
bands each corresponding to a radiolabeled DNA fragment, from which the
sequence may be inferred.
Chain-termination methods
The chain-termination method developed by Frederick Sanger and coworkers in 1977 soon became the method of choice, owing to its relative ease and reliability. When invented, the chain-terminator method used fewer toxic chemicals
and lower amounts of radioactivity than the Maxam and Gilbert method.
Because of its comparative ease, the Sanger method was soon automated
and was the method used in the first generation of DNA sequencers.
Sanger sequencing is the method which prevailed from the 1980s
until the mid-2000s. Over that period, great advances were made in the
technique, such as fluorescent labelling, capillary electrophoresis, and
general automation. These developments allowed much more efficient
sequencing, leading to lower costs. The Sanger method, in mass
production form, is the technology which produced the first human genome in 2001, ushering in the age of genomics.
However, later in the decade, radically different approaches reached
the market, bringing the cost per genome down from $100 million in 2001
to $10,000 in 2011.
Sequencing by synthesis
The objective for sequential sequencing by synthesis (SBS) is to determine the sequence of a DNA sample by detecting the incorporation of a nucleotide by a DNA polymerase.
An engineered polymerase is used to synthesize a copy of a single
strand of DNA and the incorporation of each nucleotide is monitored. The
principle of real-time sequencing by synthesis was first described in
1993, with improvements published some years later. The key steps are highly similar for all embodiments of SBS and include (1) amplification of the DNA (to enhance the subsequent signal) and attachment of the DNA to be sequenced to a solid support, (2) generation of single-stranded DNA on the solid support, (3) incorporation of nucleotides using an engineered polymerase, and (4) real-time detection of the incorporation of each nucleotide. Steps 3 and 4 are repeated, and the sequence is assembled from the signals obtained in step 4. This principle of real-time sequencing by synthesis has been used for almost all massively parallel sequencing instruments, including 454, PacBio, IonTorrent, Illumina and MGI.
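A toy sketch of the repeated read-out loop in steps 3 and 4 (names are illustrative; a real instrument records an optical or electrical signal for each incorporation event rather than calling a function per base):

```python
# Toy sketch of the SBS read-out loop (steps 3 and 4 above).
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}

def detect_incorporation(template_base: str) -> str:
    """Stand-in for real-time detection: report the complementary labelled
    nucleotide that the polymerase incorporates opposite the template."""
    return PAIRING[template_base]

template = "ACGTTGCA"   # single-stranded DNA attached to the solid support
signals = []
for base in template:                           # one SBS cycle per position
    signals.append(detect_incorporation(base))  # step 3 + step 4

print("".join(signals))  # TGCAACGT, the complement of the template
```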
Large-scale sequencing and de novo sequencing
Genomic
DNA is fragmented into random pieces and cloned as a bacterial library.
DNA from individual bacterial clones is sequenced and the sequence is
assembled by using overlapping DNA regions.
Large-scale sequencing often aims at sequencing very long DNA pieces, such as whole chromosomes, although large-scale sequencing can also be used to generate very large numbers of short sequences, such as found in phage display. For longer targets such as chromosomes, common approaches consist of cutting (with restriction enzymes) or shearing (with mechanical forces) large DNA fragments into shorter DNA fragments. The fragmented DNA may then be cloned into a DNA vector and amplified in a bacterial host such as Escherichia coli. Short DNA fragments purified from individual bacterial colonies are individually sequenced and assembled electronically
into one long, contiguous sequence. Studies have shown that adding a
size selection step to collect DNA fragments of uniform size can improve
sequencing efficiency and accuracy of the genome assembly. In these
studies, automated sizing has proven to be more reproducible and precise
than manual gel sizing.
The term "de novo sequencing" specifically refers to methods used to determine the sequence of DNA with no previously known sequence. De novo translates from Latin as "from the beginning". Gaps in the assembled sequence may be filled by primer walking. The different strategies have different tradeoffs in speed and accuracy; shotgun methods are often used for sequencing large genomes, but its assembly is complex and difficult, particularly with sequence repeats often causing gaps in genome assembly.
Most sequencing approaches use an in vitro cloning step to
amplify individual DNA molecules, because their molecular detection
methods are not sensitive enough for single molecule sequencing.
Emulsion PCR isolates individual DNA molecules along with primer-coated beads in aqueous droplets within an oil phase. A polymerase chain reaction
(PCR) then coats each bead with clonal copies of the DNA molecule
followed by immobilization for later sequencing. Emulsion PCR is used in
the methods developed by Margulies et al. (commercialized by 454 Life Sciences), Shendure and Porreca et al. (also known as "polony sequencing") and SOLiD sequencing (developed by Agencourt, later Applied Biosystems, now Life Technologies). Emulsion PCR is also used in the GemCode and Chromium platforms developed by 10x Genomics.
Shotgun sequencing is a sequencing method designed for analysis of
DNA sequences longer than 1000 base pairs, up to and including entire
chromosomes. This method requires the target DNA to be broken into
random fragments. After sequencing individual fragments using the chain termination method, the sequences can be reassembled on the basis of their overlapping regions.
High-throughput methods
Multiple, fragmented sequence reads must be assembled together on the basis of their overlapping areas.
High-throughput sequencing, which includes next-generation "short-read" and third-generation "long-read" sequencing methods, applies to exome sequencing, genome sequencing, genome resequencing, transcriptome profiling (RNA-Seq), DNA-protein interactions (ChIP-sequencing), and epigenome characterization.
The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that parallelize the sequencing process, producing thousands or millions of sequences concurrently. High-throughput sequencing technologies are intended to lower the cost
of DNA sequencing beyond what is possible with standard dye-terminator
methods. In ultra-high-throughput sequencing as many as 500,000 sequencing-by-synthesis operations may be run in parallel. Such technologies led to the ability to sequence an entire human genome in as little as one day. As of 2019, corporate leaders in the development of high-throughput sequencing products included Illumina, Qiagen and ThermoFisher Scientific.
Single molecule real time (SMRT) sequencing
SMRT sequencing is based on the sequencing by synthesis approach. The
DNA is synthesized in zero-mode wave-guides (ZMWs) – small well-like
containers with the capturing tools located at the bottom of the well.
The sequencing is performed with use of unmodified polymerase (attached
to the ZMW bottom) and fluorescently labelled nucleotides flowing freely
in the solution. The wells are constructed in a way that only the
fluorescence occurring by the bottom of the well is detected. The
fluorescent label is detached from the nucleotide upon its incorporation
into the DNA strand, leaving an unmodified DNA strand. According to Pacific Biosciences
(PacBio), the SMRT technology developer, this methodology allows
detection of nucleotide modifications (such as cytosine methylation).
This happens through the observation of polymerase kinetics. This
approach allows reads of 20,000 nucleotides or more, with average read
lengths of 5 kilobases. In 2015, Pacific Biosciences announced the launch of a new sequencing
instrument called the Sequel System, with 1 million ZMWs compared to
150,000 ZMWs in the PacBio RS II instrument. SMRT sequencing is referred to as "third-generation" or "long-read" sequencing.
Nanopore DNA sequencing
The DNA passing through the nanopore changes its ion current. This
change is dependent on the shape, size and length of the DNA sequence.
Each type of the nucleotide blocks the ion flow through the pore for a
different period of time. The method does not require modified
nucleotides and is performed in real time. Nanopore sequencing is
referred to as "third-generation" or "long-read" sequencing, along with SMRT sequencing.
Early industrial research into this method was based on a
technique called 'exonuclease sequencing', where the readout of
electrical signals occurred as nucleotides passed by alpha(α)-hemolysin pores covalently bound with cyclodextrin. However the subsequent commercial method, 'strand sequencing', sequenced DNA bases in an intact strand.
Two main areas of nanopore sequencing in development are solid
state nanopore sequencing, and protein based nanopore sequencing.
Protein nanopore sequencing utilizes membrane protein complexes such as
α-hemolysin, MspA (Mycobacterium smegmatis Porin A) or CssG, which show great promise given their ability to distinguish between individual and groups of nucleotides. In contrast, solid-state nanopore sequencing utilizes synthetic
materials such as silicon nitride and aluminum oxide and it is preferred
for its superior mechanical ability and thermal and chemical stability. The fabrication method is essential for this type of sequencing given
that the nanopore array can contain hundreds of pores with diameters
smaller than eight nanometers.
The concept originated from the idea that single stranded DNA or
RNA molecules can be electrophoretically driven in a strict linear
sequence through a biological pore that can be less than eight
nanometers, and can be detected given that the molecules modulate an ionic current while moving through the pore. The pore contains a
detection region capable of recognizing different bases, with each base
generating various time specific signals corresponding to the sequence
of bases as they cross the pore which are then evaluated. Precise control over the DNA transport through the pore is crucial for
success. Various enzymes such as exonucleases and polymerases have been
used to moderate this process by positioning them near the pore's
entrance.
The first of the high-throughput sequencing technologies, massively parallel signature sequencing
(or MPSS, also called next generation sequencing), was developed in the
1990s at Lynx Therapeutics, a company founded in 1992 by Sydney Brenner and Sam Eletr.
MPSS was a bead-based method that used a complex approach of adapter
ligation followed by adapter decoding, reading the sequence in
increments of four nucleotides. This method made it susceptible to
sequence-specific bias or loss of specific sequences. Because the
technology was so complex, MPSS was only performed 'in-house' by Lynx
Therapeutics and no DNA sequencing machines were sold to independent
laboratories. Lynx Therapeutics merged with Solexa (later acquired by Illumina) in 2004, leading to the development of sequencing-by-synthesis, a simpler approach acquired from Manteia Predictive Medicine,
which rendered MPSS obsolete. However, the essential properties of the
MPSS output were typical of later high-throughput data types, including
hundreds of thousands of short DNA sequences. In the case of MPSS, these
were typically used for sequencing cDNA for measurements of gene expression levels.
The polony sequencing method, developed in the laboratory of George M. Church at Harvard, was among the first high-throughput sequencing systems and was used to sequence a full E. coli genome in 2005. It combined an in vitro paired-tag library with emulsion PCR, an
automated microscope, and ligation-based sequencing chemistry to
sequence an E. coli genome at an accuracy of >99.9999% and a cost approximately 1/9 that of Sanger sequencing. The technology was licensed to Agencourt Biosciences, subsequently spun
out into Agencourt Personal Genomics, and eventually incorporated into
the Applied Biosystems SOLiD platform. Applied Biosystems was later acquired by Life Technologies, now part of Thermo Fisher Scientific.
A parallelized version of pyrosequencing was developed by 454 Life Sciences, which has since been acquired by Roche Diagnostics.
The method amplifies DNA inside water droplets in an oil solution
(emulsion PCR), with each droplet containing a single DNA template
attached to a single primer-coated bead that then forms a clonal colony.
The sequencing machine contains many picoliter-volume wells each containing a single bead and sequencing enzymes. Pyrosequencing uses luciferase
to generate light for detection of the individual nucleotides added to
the nascent DNA, and the combined data are used to generate sequence reads. This technology provides intermediate read length and price per base
compared to Sanger sequencing on one end and Solexa and SOLiD on the
other.
Solexa, now part of Illumina, was founded by Shankar Balasubramanian and David Klenerman in 1998, and developed a sequencing method based on reversible dye-terminator technology and engineered polymerases. The reversible terminated chemistry concept was invented by Bruno Canard and Simon Sarfati at the Pasteur Institute in Paris. It was developed internally at Solexa by those named on the relevant patents. In 2004, Solexa acquired the company Manteia Predictive Medicine in order to gain a massively parallel sequencing technology invented in 1997 by Pascal Mayer and Laurent Farinelli. It is based on "DNA clusters" or "DNA colonies", which involves the
clonal amplification of DNA on a surface. The cluster technology was
co-acquired with Lynx Therapeutics of California. Solexa Ltd. later
merged with Lynx to form Solexa Inc.
An Illumina HiSeq 2500 sequencer
Illumina NovaSeq 6000 flow cell
In this method, DNA molecules and primers are first attached on a slide or flow cell and amplified with polymerase
so that local clonal DNA colonies, later coined "DNA clusters", are
formed. To determine the sequence, four types of reversible terminator
bases (RT-bases) are added and non-incorporated nucleotides are washed
away. A camera takes images of the fluorescently labeled
nucleotides. Then the dye, along with the terminal 3' blocker, is
chemically removed from the DNA, allowing for the next cycle to begin.
Unlike pyrosequencing, the DNA chains are extended one nucleotide at a
time and image acquisition can be performed at a delayed moment,
allowing for very large arrays of DNA colonies to be captured by
sequential images taken from a single camera.
An Illumina MiSeq sequencer
Decoupling the enzymatic reaction and the image capture allows for
optimal throughput and theoretically unlimited sequencing capacity.
With an optimal configuration, the ultimately reachable instrument
throughput is thus dictated solely by the analog-to-digital conversion
rate of the camera, multiplied by the number of cameras and divided by
the number of pixels per DNA colony required for visualizing them
optimally (approximately 10 pixels/colony). As of 2012, with cameras operating at more than 10 MHz A/D conversion rates and available optics, fluidics and enzymatics, throughput can be multiples of 1 million
nucleotides/second, corresponding roughly to 1 human genome equivalent
at 1x coverage
per hour per instrument, and 1 human genome re-sequenced (at approx.
30x) per day per instrument (equipped with a single camera).
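These figures can be checked with a back-of-the-envelope calculation, assuming a human genome of roughly 3 Gb:

```python
# Sanity check of the throughput arithmetic quoted above.
adc_rate = 10e6         # camera A/D conversions per second (10 MHz)
cameras = 1
pixels_per_colony = 10  # ~10 pixels needed to resolve one DNA colony

nt_per_second = adc_rate * cameras / pixels_per_colony  # 1e6 nt/s
genome = 3e9                                            # ~3 Gb human genome

hours_for_1x = genome / nt_per_second / 3600        # ~0.8 h per 1x genome
coverage_per_day = nt_per_second * 86400 / genome   # ~29x per day
print(f"{nt_per_second:.0f} nt/s; 1x genome in {hours_for_1x:.1f} h; "
      f"{coverage_per_day:.0f}x coverage per day")
```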
Combinatorial probe anchor synthesis (cPAS)
This method is an upgraded modification of combinatorial probe anchor ligation (cPAL) technology described by Complete Genomics, which became part of the Chinese genomics company BGI in 2013. The two companies have refined the technology to allow for longer read
lengths, reaction time reductions and faster time to results. In
addition, data are now generated as contiguous full-length reads in the
standard FASTQ file format and can be used as-is in most
short-read-based bioinformatics analysis pipelines.
Two technologies form the basis of this high-throughput sequencing method: DNA nanoballs (DNBs) and patterned arrays for nanoball attachment to a solid surface. DNA nanoballs are formed by denaturing double-stranded, adapter-ligated libraries and ligating the forward strand only to a splint oligonucleotide to form a ssDNA circle. Faithful copies of the circles containing the DNA insert are produced by rolling circle amplification, which generates approximately 300–500 copies. The long strand of ssDNA folds upon itself to produce a three-dimensional nanoball structure that is approximately 220 nm in diameter. Making DNBs replaces the need to generate PCR copies of the library on the flow cell and as such can remove large proportions of duplicate reads, adapter-adapter ligations, and PCR-induced errors.
A BGI MGISEQ-2000RS sequencer
The patterned array of positively charged spots is fabricated through photolithography and etching techniques followed by chemical modification to generate a sequencing flow cell. Each spot on the flow cell is approximately 250 nm in diameter and separated from its neighbours by 700 nm (centre to centre); this spacing allows easy attachment of a single negatively charged DNB to each spot and thus reduces under- or over-clustering on the flow cell.
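As a quick illustration of what this geometry implies, a 700 nm centre-to-centre pitch corresponds to roughly two million DNBs per square millimetre; the square-grid layout assumed below is an assumption made for arithmetic only.

```python
# Illustrative DNB array density from the geometry quoted above.
pitch_nm = 700                          # centre-to-centre spot spacing
spots_per_mm = 1e6 / pitch_nm           # ~1,429 spots along each mm
density_per_mm2 = spots_per_mm ** 2     # assuming a square grid
print(f"~{density_per_mm2:.2e} DNBs per mm^2")  # ~2.04e+06
```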
Sequencing is then performed by addition of an oligonucleotide probe that attaches in combination to specific sites within the DNB. The probe acts as an anchor that then allows one of four single reversibly inactivated, labelled nucleotides to bind after flowing across the flow cell. Unbound nucleotides are washed away before laser excitation of the attached labels; the emitted fluorescence is captured by cameras and converted to a digital output for base calling. The attached base has its terminator and label chemically cleaved at completion of the cycle. The cycle is repeated with another flow of free, labelled nucleotides across the flow cell to allow the next nucleotide to bind and have its signal captured. This process is repeated a number of times (usually 50 to 300 times) to determine the sequence of the inserted piece of DNA at a rate of approximately 40 million nucleotides per second as of 2018.
Two-base
encoding scheme. In two-base encoding, each unique pair of bases on the
3' end of the probe is assigned one out of four possible colors. For
example, "AA" is assigned to blue, "AC" is assigned to green, and so on
for all 16 unique pairs. During sequencing, each base in the template is
sequenced twice, and the resulting data are decoded according to this
scheme.
Applied Biosystems' (now a Life Technologies brand) SOLiD technology employs sequencing by ligation. Here, a pool of all possible oligonucleotides of a fixed length is labeled according to the sequenced position. Oligonucleotides are annealed and ligated; the preferential ligation by DNA ligase for matching sequences results in a signal informative of the nucleotide at that position. Each base in the template is sequenced twice, and the resulting data are decoded according to the two-base encoding scheme used in this method. Before sequencing, the DNA is amplified by emulsion PCR. The resulting beads, each containing single copies of the same DNA molecule, are deposited on a glass slide. The result is sequences of quantities and lengths comparable to Illumina sequencing. This sequencing-by-ligation method has been reported to have some issues with sequencing palindromic sequences.
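The two-base encoding shown in the figure above can be made concrete in code. The sketch below uses color numbers 0 to 3 in place of the four dyes; the transition table is the commonly tabulated SOLiD color code (which conveniently equals the XOR of the base indices), and a known anchor base is required to decode, which is why SOLiD data are processed in "color space".

```python
# Minimal sketch of two-base (color-space) encoding and decoding.
BASES = "ACGT"
# Color of each ordered base pair; with A=0, C=1, G=2, T=3 the
# published SOLiD transition table equals the XOR of the indices.
CODE = {(a, b): BASES.index(a) ^ BASES.index(b) for a in BASES for b in BASES}

def encode(seq):
    """Encode a base sequence as an anchor base plus a color list."""
    colors = [CODE[(a, b)] for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode(first_base, colors):
    """Recover the base sequence from the anchor base and the colors."""
    seq = [first_base]
    for c in colors:
        seq.append(BASES[BASES.index(seq[-1]) ^ c])  # XOR inverts it
    return "".join(seq)

first, colors = encode("ATGGCA")
print(colors)                 # [3, 1, 0, 3, 1]
print(decode(first, colors))  # "ATGGCA"
```

Because every base influences two adjacent colors, a single miscalled color corrupts all downstream bases in naive decoding, while a true single-nucleotide change alters two adjacent colors; color-space aligners exploit this redundancy to separate sequencing errors from real variants.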
Ion Torrent Systems Inc. (now owned by Life Technologies)
developed a system based on using standard sequencing chemistry, but
with a novel, semiconductor-based detection system. This method of
sequencing is based on the detection of hydrogen ions that are released during the polymerisation of DNA,
as opposed to the optical methods used in other sequencing systems. A
microwell containing a template DNA strand to be sequenced is flooded
with a single type of nucleotide. If the introduced nucleotide is complementary
to the leading template nucleotide it is incorporated into the growing
complementary strand. This causes the release of a hydrogen ion that
triggers a hypersensitive ion sensor, which indicates that a reaction
has occurred. If homopolymer
repeats are present in the template sequence, multiple nucleotides will
be incorporated in a single cycle. This leads to a corresponding number
of released hydrogen ions and a proportionally higher electronic signal.
Sequencing of the TAGGCT template with IonTorrent, PacBioRS and GridION
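A toy flow-space simulation shows why homopolymer runs produce proportionally larger signals. The cyclic flow order and noise-free counts below are simplifying assumptions; the template is the TAGGCT example from the caption above.

```python
# Idealized Ion Torrent-style flows: each flow floods the well with
# one nucleotide type, and the signal (released H+ ions) is
# proportional to the number of bases incorporated in that flow.

FLOW_ORDER = "TACG"

def simulate_flows(template, n_flows):
    """Per-flow signal for `template`, the complementary strand to be
    synthesized; a flow matches when its base equals the next base(s)."""
    signals, pos = [], 0
    for i in range(n_flows):
        base = FLOW_ORDER[i % len(FLOW_ORDER)]
        count = 0
        while pos < len(template) and template[pos] == base:
            count += 1  # one H+ ion released per incorporation
            pos += 1
        signals.append(count)
    return signals

print(simulate_flows("TAGGCT", 12))
# -> [1, 1, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]  (the 2 is the GG homopolymer)
```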
DNA nanoball sequencing is a type of high-throughput sequencing technology used to determine the entire genomic sequence of an organism. The company Complete Genomics uses this technology to sequence samples submitted by independent researchers. The method uses rolling circle replication
to amplify small fragments of genomic DNA into DNA nanoballs.
Unchained sequencing by ligation is then used to determine the
nucleotide sequence. This method of DNA sequencing allows large numbers of DNA nanoballs to be sequenced per run at low reagent costs compared to other high-throughput sequencing platforms. However, only short sequences of DNA are determined from each DNA nanoball, which makes mapping the short reads to a reference genome difficult.
Heliscope single molecule sequencing
Heliscope sequencing is a method of single-molecule sequencing developed by Helicos Biosciences.
It uses DNA fragments with added poly-A tail adapters which are
attached to the flow cell surface. The next steps involve
extension-based sequencing with cyclic washes of the flow cell with
fluorescently labeled nucleotides (one nucleotide type at a time, as
with the Sanger method). The reads, performed by the Heliscope sequencer, are short, averaging 35 bp. What made this technology especially novel was that it was the first of its class to sequence non-amplified DNA, thus preventing any read errors associated with amplification steps. In 2009 a human genome was sequenced using the Heliscope; however, in 2012 the company went bankrupt.
Microfluidic systems
There are two main microfluidic systems used to sequence DNA: droplet-based microfluidics and digital microfluidics. Microfluidic devices address many of the limitations of current sequencing arrays.
Abate et al. studied the use of droplet-based microfluidic devices for DNA sequencing. These devices have the ability to form and process picoliter sized
droplets at the rate of thousands per second. The devices were created
from polydimethylsiloxane (PDMS) and used Förster resonance energy transfer (FRET) assays to read the sequences of DNA encompassed in the droplets. Each position on the array tested for a specific 15-base sequence.
Fair et al. used digital microfluidic devices to study DNA pyrosequencing. Significant advantages include the portability of the device, reduced reagent volumes, speed of analysis, mass-manufacturing abilities, and high throughput. This study provided a proof of concept showing that digital devices can be used for pyrosequencing; the study included sequencing by synthesis, which involves enzymatic extension of the strand and the addition of labeled nucleotides.
Boles et al. also studied pyrosequencing on digital microfluidic devices. They used an electro-wetting device to create, mix, and split droplets.
The sequencing uses a three-enzyme protocol and DNA templates anchored
with magnetic beads. The device was tested using two protocols and
resulted in 100% accuracy based on raw pyrogram levels. The advantages
of these digital microfluidic devices include size, cost, and achievable
levels of functional integration.
DNA sequencing research using microfluidics also has the ability to be applied to the sequencing of RNA, using similar droplet microfluidic techniques such as the inDrops method. This shows that many of these DNA sequencing techniques will be able to be applied further and used to understand more about genomes and transcriptomes.
Methods in development
DNA sequencing methods currently under development include reading the sequence as a DNA strand transits through nanopores (a method that is now commercial, although subsequent generations such as solid-state nanopores are still in development), and microscopy-based techniques, such as atomic force microscopy or transmission electron microscopy, which are used to identify the positions of individual nucleotides within long DNA fragments (>5,000 bp) by labeling nucleotides with heavier elements (e.g., halogens) for visual detection and recording. Third-generation technologies
aim to increase throughput and decrease the time to result and cost by
eliminating the need for excessive reagents and harnessing the
processivity of DNA polymerase.
Tunnelling currents DNA sequencing
Another
approach uses measurements of the electrical tunnelling currents across
single-strand DNA as it moves through a channel. Depending on its
electronic structure, each base affects the tunnelling current
differently, allowing differentiation between different bases.
The use of tunnelling currents has the potential to sequence
orders of magnitude faster than ionic current methods and the sequencing
of several DNA oligomers and micro-RNA has already been achieved.
Sequencing by hybridization
Sequencing by hybridization is a non-enzymatic method that uses a DNA microarray.
A single pool of DNA whose sequence is to be determined is
fluorescently labeled and hybridized to an array containing known
sequences. A strong hybridization signal from a given spot on the array identifies its sequence in the DNA being sequenced.
This method of sequencing utilizes binding characteristics of a
library of short single stranded DNA molecules (oligonucleotides), also
called DNA probes, to reconstruct a target DNA sequence. Non-specific
hybrids are removed by washing and the target DNA is eluted. Hybrids are re-arranged such that the DNA sequence can be
reconstructed. The benefit of this sequencing type is its ability to capture a large number of targets with homogeneous coverage. Large amounts of reagents and starting DNA are usually required.
However, with the advent of solution-based hybridization, much less
equipment and chemicals are necessary.
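Computationally, recovering a target from its hybridization spectrum is a k-mer assembly problem. The sketch below is a simplified de Bruijn-style walk that assumes an error-free, repeat-free spectrum; real spectra require far more robust algorithms.

```python
# Reconstruct a sequence from the set of k-mer probes that hybridized.
from collections import defaultdict

def reconstruct(kmers):
    k = len(kmers[0])
    graph = defaultdict(list)   # (k-1)-mer prefix -> k-mers starting with it
    indegree = defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer)
        indegree[kmer[1:]] += 1
    # start from a prefix no probe extends into (the 5' end)
    start = next(p for p in list(graph) if indegree[p] == 0)
    seq = start
    while graph[seq[-(k - 1):]]:
        seq += graph[seq[-(k - 1):]].pop()[-1]  # extend by one base
    return seq

# 4-mer spectrum of "ATGGCGT"
print(reconstruct(["ATGG", "TGGC", "GGCG", "GCGT"]))  # -> "ATGGCGT"
```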
Sequencing with mass spectrometry
Mass spectrometry may be used to determine DNA sequences. Matrix-assisted laser desorption ionization time-of-flight mass spectrometry, or MALDI-TOF MS,
has specifically been investigated as an alternative method to gel
electrophoresis for visualizing DNA fragments. With this method, DNA
fragments generated by chain-termination sequencing reactions are
compared by mass rather than by size. The mass of each nucleotide is
different from the others and this difference is detectable by mass
spectrometry. Single-nucleotide mutations in a fragment can be more
easily detected with MS than by gel electrophoresis alone. MALDI-TOF MS
can more easily detect differences between RNA fragments, so researchers
may indirectly sequence DNA with MS-based methods by converting it to
RNA first.
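The mass differences involved are straightforward to tabulate. The values below are approximate average masses of internal DNA nucleotide residues, given for illustration only:

```python
# Approximate average residue masses (Da) of DNA nucleotides within
# an oligonucleotide chain; illustrative values.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}

def mass_shift(ref_base, alt_base):
    """Mass change caused by a single-nucleotide substitution."""
    return RESIDUE_MASS[alt_base] - RESIDUE_MASS[ref_base]

print(f"C->T: {mass_shift('C', 'T'):+.2f} Da")  # about +15 Da
print(f"A->T: {mass_shift('A', 'T'):+.2f} Da")  # about -9 Da, the
# smallest difference between any two residues, still resolvable by MS
```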
The higher resolution of DNA fragments permitted by MS-based
methods is of special interest to researchers in forensic science, as
they may wish to find single-nucleotide polymorphisms in human DNA samples to identify individuals. These samples may be highly degraded so forensic researchers often prefer mitochondrial DNA
for its higher stability and applications for lineage studies. MS-based
sequencing methods have been used to compare the sequences of human
mitochondrial DNA from samples in a Federal Bureau of Investigation database and from bones found in mass graves of World War I soldiers.
Early chain-termination and TOF MS methods demonstrated read lengths of up to 100 base pairs. Researchers have been unable to exceed this average read size; like
chain-termination sequencing alone, MS-based DNA sequencing may not be
suitable for large de novo sequencing projects. Even so, a 2010
study did use the short sequence reads and mass spectrometry to compare
single-nucleotide polymorphisms in pathogenic Streptococcus strains.
Microfluidic Sanger sequencing
In microfluidic Sanger sequencing, the entire thermocycling amplification of DNA fragments as well as their separation by electrophoresis is done on a single glass wafer (approximately 10 cm in diameter), thus reducing reagent usage as well as cost. In some instances researchers have shown that they can increase the throughput of conventional sequencing through the use of microchips, but further research is needed to make this use of the technology effective.
Microscopy-based techniques
This approach directly visualizes the sequence of DNA molecules using electron microscopy. The first identification of DNA base pairs within intact DNA molecules, achieved by enzymatically incorporating modified bases containing atoms of increased atomic number, has been demonstrated through direct visualization and identification of individually labeled bases within a synthetic 3,272 base-pair DNA molecule and a 7,249 base-pair viral genome.
RNAP sequencing
This method is based on the use of RNA polymerase (RNAP), which is attached to a polystyrene bead. One end of the DNA to be sequenced is attached to another bead, with both beads being placed in optical traps. RNAP motion during transcription brings the beads closer together, and the change in their relative distance can be recorded at single-nucleotide resolution. The sequence is deduced from four readouts obtained with lowered concentrations of each of the four nucleotide types, similarly to the Sanger method. Sequence information is deduced by comparing the known sequence regions to the unknown sequence regions.
In vitro virus high-throughput sequencing
A method has been developed to analyze full sets of protein interactions using a combination of 454 pyrosequencing and an in vitro virus mRNA display
method. Specifically, this method covalently links proteins of interest
to the mRNAs encoding them, then detects the mRNA pieces using reverse
transcription PCR.
The mRNA may then be amplified and sequenced. The combined method was
titled IVV-HiTSeq and can be performed under cell-free conditions,
though its results may not be representative of in vivo conditions.
Market share
While
there are many different ways to sequence DNA, only a few dominate the
market. In 2022, Illumina had about 80% of the market; the rest was held by only a few players (PacBio, Oxford Nanopore, 454, MGI).
Sample preparation
The
success of any DNA sequencing protocol relies upon the DNA or RNA
sample extraction and preparation from the biological material of
interest.
A successful DNA extraction will yield a DNA sample with long, non-degraded strands.
A successful RNA extraction will yield an RNA sample that should be converted to complementary DNA (cDNA) using reverse transcriptase, a DNA polymerase that synthesizes complementary DNA based on existing strands of RNA in a PCR-like manner. Complementary DNA can then be processed the same way as genomic DNA.
After DNA or RNA extraction, samples may require further preparation
depending on the sequencing method. For Sanger sequencing, either
cloning procedures or PCR are required prior to sequencing. In the case
of next-generation sequencing methods, library preparation is required
before processing. Assessing the quality and quantity of nucleic acids both after
extraction and after library preparation identifies degraded,
fragmented, and low-purity samples and yields high-quality sequencing
data.
Development initiatives
Total cost of sequencing a human genome over time as calculated by the NHGRI
In October 2006, the X Prize Foundation established an initiative to promote the development of full genome sequencing technologies, called the Archon X Prize,
intending to award $10 million to "the first Team that can build a
device and use it to sequence 100 human genomes within 10 days or less,
with an accuracy of no more than one error in every 100,000 bases
sequenced, with sequences accurately covering at least 98% of the
genome, and at a recurring cost of no more than $10,000 (US) per
genome."
Each year the National Human Genome Research Institute, or NHGRI, awards grants for new research and development in genomics. 2010 grants and 2011 candidates include continuing work in microfluidic, polony and base-heavy sequencing methodologies.
Computational challenges
The
sequencing technologies described here produce raw data that needs to
be assembled into longer sequences such as complete genomes (sequence assembly).
There are many computational challenges to achieve this, such as the
evaluation of the raw sequence data which is done by programs and
algorithms such as Phred and Phrap. Other challenges involve repetitive sequences that often prevent complete genome assemblies because they occur in many places in the genome. As a consequence, many sequences may not be assigned to particular chromosomes. The production of raw sequence data is only the beginning of its detailed bioinformatic analysis, and new methods for sequencing and for correcting sequencing errors continue to be developed.
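Base callers such as Phred report confidence on the logarithmic Phred scale, Q = -10 log10(P), where P is the estimated probability that a base call is wrong; a short worked example:

```python
# Worked example of the Phred quality scale used in raw-data evaluation.
def q_to_p(q):
    """Error probability corresponding to Phred quality Q."""
    return 10 ** (-q / 10)

for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {q_to_p(q):g}, "
          f"{100 * (1 - q_to_p(q)):.2f}% accuracy")
# Q10 -> 0.1, Q20 -> 0.01, Q30 -> 0.001, Q40 -> 0.0001
```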
Read trimming
Sometimes,
the raw reads produced by the sequencer are correct and precise only in
a fraction of their length. Using the entire read may introduce
artifacts in the downstream analyses like genome assembly, SNP calling,
or gene expression estimation. Two classes of trimming programs have been introduced, based on window-based or running-sum algorithms; a sketch of the window-based approach follows.
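The sketch below assumes Phred+33-encoded qualities as in standard FASTQ: the read is cut at the first window whose mean quality drops below a threshold. Running-sum trimmers, such as those based on a modified Mott algorithm, differ in detail.

```python
# Window-based quality trimming of a single read (illustrative).
def phred(qual_char):
    """Decode one Phred+33 quality character."""
    return ord(qual_char) - 33

def trim(seq, quals, window=4, min_q=20):
    scores = [phred(q) for q in quals]
    for i in range(len(scores) - window + 1):
        if sum(scores[i:i + window]) / window < min_q:
            return seq[:i], quals[:i]  # cut before the first bad window
    return seq, quals

read, qual = trim("ACGTACGTACGT", "IIIIIIII###!")
print(read)  # "ACGTACG": quality collapses near the 3' end
```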
Ethical issues
Human genetics has been included within the field of bioethics since the early 1970s, and the growth in the use of DNA sequencing (particularly high-throughput sequencing) has introduced a number of ethical issues. One key issue is the ownership of an individual's DNA and the data produced when that DNA is sequenced. Regarding the DNA molecule itself, the leading legal case on this topic, Moore v. Regents of the University of California (1990), ruled that individuals have no property rights to discarded cells or any profits made using these cells (for instance, as a patented cell line). However, individuals have a right to informed consent regarding the removal and use of cells. Regarding the data produced through DNA sequencing, Moore gives the individual no rights to the information derived from their DNA.
As DNA sequencing becomes more widespread, the storage, security and sharing of genomic data has also become more important. For instance, one concern is that insurers may use an individual's
genomic data to modify their quote, depending on the perceived future
health of the individual based on their DNA. In May 2008, the Genetic Information Nondiscrimination Act
(GINA) was signed in the United States, prohibiting discrimination on
the basis of genetic information with respect to health insurance and
employment. In 2012, the US Presidential Commission for the Study of Bioethical Issues reported that existing privacy legislation for DNA sequencing data such as GINA and the Health Insurance Portability and Accountability Act
were insufficient, noting that whole-genome sequencing data was
particularly sensitive, as it could be used to identify not only the
individual from which the data was created, but also their relatives.
In most of the United States, DNA that is "abandoned", such as
that found on a licked stamp or envelope, coffee cup, cigarette, chewing
gum, household trash, or hair that has fallen on a public sidewalk, may
legally be collected and sequenced by anyone, including the police,
private investigators, political opponents, or people involved in
paternity disputes. As of 2013, eleven states have laws that can be
interpreted to prohibit "DNA theft".
Ethical issues have also been raised by the increasing use of
genetic variation screening, both in newborns, and in adults by
companies such as 23andMe. It has been asserted that screening for genetic variations can be harmful, increasing anxiety in individuals who have been found to have an increased risk of disease. For example, in one case noted in Time, doctors screening an ill baby for genetic variants chose not to inform the parents of an unrelated variant linked to dementia due to the harm it would cause. However, a 2011 study in The New England Journal of Medicine showed that individuals undergoing disease risk profiling did not show increased levels of anxiety. The development of next-generation sequencing technologies such as nanopore-based sequencing has also raised further ethical concerns.