Non-coding DNA sequences are components of an organism's DNA that do not encode protein sequences. Some non-coding DNA is transcribed into functional non-coding RNA molecules (e.g. transfer RNA, ribosomal RNA, and regulatory RNAs). Other functions of non-coding DNA include the transcriptional and translational regulation of protein-coding sequences, scaffold attachment regions, origins of DNA replication, centromeres and telomeres.
The amount of non-coding DNA varies greatly among species. Often, only a small percentage of the genome is responsible for coding proteins, but an increasing percentage is being shown to have regulatory functions. When there is much non-coding DNA, a large proportion appears to have no biological function, as predicted in the 1960s. Since that time, this non-functional portion has controversially been called "junk DNA".
The international Encyclopedia of DNA Elements (ENCODE) project uncovered, by direct biochemical approaches, that at least 80% of human genomic DNA has biochemical activity. Though this was not necessarily unexpected due to previous decades of research discovering many functional non-coding regions, some scientists criticized the conclusion for conflating biochemical activity with biological function. Estimates for the biologically functional fraction of the human genome based on comparative genomics range between 8 and 15%. However, others have argued against relying solely on estimates from comparative genomics due to its limited scope. Non-coding DNA has been found to be involved in epigenetic activity and complex networks of genetic interactions and is being explored in evolutionary developmental biology.
Fraction of non-coding genomic DNA
The amount of total genomic DNA varies widely between organisms, and
the proportion of coding and non-coding DNA within these genomes varies
greatly as well. For example, it was originally suggested that over 98%
of the human genome does not encode protein sequences, including most sequences within introns and most intergenic DNA, while 20% of a typical prokaryote genome is non-coding.
In eukaryotes, genome size, and by extension the amount of non-coding DNA, is not correlated to organism complexity, an observation known as the C-value enigma. For example, the genome of the unicellular Polychaos dubium (formerly known as Amoeba dubia) has been reported to contain more than 200 times the amount of DNA in humans. The pufferfish Takifugu rubripes
genome is only about one eighth the size of the human genome, yet seems
to have a comparable number of genes; approximately 90% of the Takifugu genome is non-coding DNA.
Therefore, most of the difference in genome size is due not to
variation in the amount of coding DNA but to a difference in
the amount of non-coding DNA.
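The arithmetic behind this conclusion can be sketched with the rounded figures quoted above: a human genome that is about 98% non-coding, and a Takifugu genome about one eighth its size that is about 90% non-coding. The function name and the use of relative rather than absolute genome sizes are illustrative assumptions, not part of any cited analysis.

```python
def coding_and_noncoding(relative_size, noncoding_fraction):
    """Split a genome of the given relative size into (coding, non-coding) amounts."""
    noncoding = relative_size * noncoding_fraction
    return relative_size - noncoding, noncoding


# Rounded figures from the text; sizes are relative to the human genome (= 1.0).
human_coding, human_noncoding = coding_and_noncoding(1.0, 0.98)
fugu_coding, fugu_noncoding = coding_and_noncoding(1.0 / 8, 0.90)

# Fraction of the human-Takifugu size difference that is non-coding DNA.
size_gap = 1.0 - 1.0 / 8
noncoding_share_of_gap = (human_noncoding - fugu_noncoding) / size_gap

print(f"coding DNA: human {human_coding:.4f}, Takifugu {fugu_coding:.4f}")
print(f"share of size difference due to non-coding DNA: {noncoding_share_of_gap:.1%}")
```

With these inputs the two genomes differ only slightly in their amount of coding DNA, while roughly 99% of the difference in size is non-coding DNA.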
In 2013, a new "record" for the most efficient eukaryotic genome was discovered with Utricularia gibba, a bladderwort
plant that has only 3% non-coding DNA and 97% coding DNA. Parts of
the non-coding DNA were being deleted by the plant, which suggested
that non-coding DNA may not be as critical for plants, even though
non-coding DNA is useful for humans.
Other studies on plants have discovered crucial functions in portions
of non-coding DNA that were previously thought to be negligible and have
added a new layer to the understanding of gene regulation.
Types of non-coding DNA sequences
Cis- and trans-regulatory elements
Cis-regulatory elements are sequences that control the transcription of a nearby gene. Many such elements are involved in the evolution and control of development. Cis-elements may be located in 5' or 3' untranslated regions or within introns. Trans-regulatory elements control the transcription of a distant gene.
Promoters facilitate the transcription of a particular gene and are typically upstream of the coding region. Enhancer sequences may also exert very distant effects on the transcription levels of genes.
Introns
Introns are non-coding sections of a gene, transcribed into the precursor mRNA sequence, but ultimately removed by RNA splicing during the processing to mature messenger RNA. Many introns appear to be mobile genetic elements.
Studies of group I introns from Tetrahymena protozoans
indicate that some introns appear to be selfish genetic elements,
neutral to the host because they remove themselves from flanking exons during RNA processing and do not produce an expression bias between alleles with and without the intron. Some introns appear to have significant biological function, possibly through ribozyme functionality that may regulate tRNA and rRNA
activity as well as protein-coding gene expression, evident in hosts
that have become dependent on such introns over long periods of time;
for example, the trnL-intron is found in all green plants and appears to have been vertically inherited for several billions of years, including more than a billion years within chloroplasts and an additional 2–3 billion years prior in the cyanobacterial ancestors of chloroplasts.
Pseudogenes
Pseudogenes are DNA sequences, related to known genes, that have lost their protein-coding ability or are otherwise no longer expressed
in the cell. Pseudogenes arise from retrotransposition or genomic
duplication of functional genes, and become "genomic fossils" that are
nonfunctional due to mutations that prevent the transcription of the gene, such as within the gene promoter region, or fatally alter the translation of the gene, such as premature stop codons or frameshifts.
Pseudogenes resulting from the retrotransposition of an RNA
intermediate are known as processed pseudogenes; pseudogenes that arise
from the genomic remains of duplicated genes or residues of inactivated genes are nonprocessed pseudogenes. Transpositions of once-functional mitochondrial genes from the cytoplasm to the nucleus, known as NUMTs, also qualify as one type of common pseudogene. NUMTs occur in many eukaryotic taxa.
While Dollo's Law
suggests that the loss of function in pseudogenes is likely permanent,
silenced genes may actually retain function for several million years
and can be "reactivated" into protein-coding sequences. A substantial number of pseudogenes are also actively transcribed.
Because pseudogenes are presumed to change without evolutionary
constraint, they can serve as a useful model of the type and frequencies
of various spontaneous genetic mutations.
Repeat sequences, transposons and viral elements
Transposons and retrotransposons are mobile genetic elements. Retrotransposon repeated sequences, which include long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs), account for a large proportion of the genomic sequences in many species. Alu sequences,
classified as a short interspersed nuclear element, are the most
abundant mobile elements in the human genome. Some examples have been
found of SINEs exerting transcriptional control of some protein-encoding
genes.
Endogenous retrovirus sequences are the product of reverse transcription of retrovirus genomes into the genomes of germ cells. Mutation within these retro-transcribed sequences can inactivate the viral genome.
Over 8% of the human genome is made up of (mostly decayed)
endogenous retrovirus sequences, as part of the over 42% fraction that
is recognizably derived from retrotransposons, while another 3% can be
identified as the remains of DNA transposons.
Much of the remaining half of the genome that is currently without an
explained origin is expected to have found its origin in transposable
elements that were active so long ago (> 200 million years) that
random mutations have rendered them unrecognizable. Genome size variation in at least two kinds of plants is mostly the result of retrotransposon sequences.
Telomeres
Telomeres are regions of repetitive DNA at the end of a chromosome, which provide protection from chromosomal deterioration during DNA replication.
Recent studies have shown that telomeres also help maintain their own
stability. Telomeric repeat-containing RNAs (TERRA) are transcripts
derived from telomeres. TERRA has been shown to maintain telomerase
activity and lengthen the ends of chromosomes.
Junk DNA
The term "junk DNA" became popular in the 1960s. According to T. Ryan Gregory,
the nature of junk DNA was first discussed explicitly in 1972 by a
genomic biologist, David Comings, who applied the term to all non-coding
DNA. The term was formalized that same year by Susumu Ohno, who noted that the mutational load
from deleterious mutations placed an upper limit on the number of
functional loci that could be expected given a typical mutation rate.
Ohno hypothesized that mammal genomes could not have more than 30,000
loci under selection before the "cost" from the mutational load would
cause an inescapable decline in fitness, and eventually extinction. This
prediction remains robust, with the human genome containing
approximately 20,000 genes. Another source for Ohno's theory was the
observation that even closely related species can have widely
(orders-of-magnitude) different genome sizes, which had been dubbed the C-value paradox in 1971.
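The intuition behind such an upper limit can be illustrated with a simple Poisson sketch. It is not Ohno's own calculation: the per-locus deleterious mutation rate of 1e-5 per generation, the independence of mutations across loci, and the function name are all assumptions chosen for illustration.

```python
import math


def prob_no_new_deleterious(n_loci, rate_per_locus=1e-5):
    """Poisson probability that a gamete carries no new deleterious mutation.

    rate_per_locus is an assumed, illustrative value, not a measured one.
    """
    expected_mutations = n_loci * rate_per_locus
    return math.exp(-expected_mutations)


# Compare Ohno's proposed ceiling with hypothetical larger numbers of loci.
for n_loci in (30_000, 300_000, 3_000_000):
    p = prob_no_new_deleterious(n_loci)
    print(f"{n_loci:>9} loci: P(no new deleterious mutation) = {p:.3g}")
```

Under these assumptions the mutation-free fraction falls off exponentially as the number of loci under selection grows, which is why a very large number of functional loci would impose a heavy cost.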
Though the fruitfulness of the term "junk DNA" has been questioned on the grounds that it provokes a strong a priori
assumption of total non-functionality, and though some have recommended
using more neutral terminology such as "non-coding DNA" instead, "junk DNA" remains a label for the portions of a genome sequence for which no discernible function has been identified and that, through comparative genomics analysis, appear to be under no functional constraint, suggesting that the sequence itself has provided no adaptive advantage. Since the late 1970s it has become apparent that the majority of non-coding DNA in large genomes finds its origin in the selfish amplification of transposable elements, of which W. Ford Doolittle and Carmen Sapienza in 1980 wrote in the journal Nature:
"When a given DNA, or class of DNAs, of unproven phenotypic function
can be shown to have evolved a strategy (such as transposition) which
ensures its genomic survival, then no other explanation for its
existence is necessary."
The amount of junk DNA can be expected to depend on the rate of
amplification of these elements and the rate at which non-functional DNA
is lost. In the same issue of Nature, Leslie Orgel and Francis Crick wrote that junk DNA has "little specificity and conveys little or no selective advantage to the organism". The term occurs mainly in popular science and in a colloquial
way in scientific publications, and it has been suggested that its
connotations may have delayed interest in the biological functions of
non-coding DNA.
Several lines of evidence indicate that some "junk DNA" sequences are
likely to have unidentified functional activity and that the process of exaptation of fragments of originally selfish or non-functional DNA has been commonplace throughout evolution.
ENCODE Project
In 2012, the ENCODE project, a research program supported by the National Human Genome Research Institute, reported that 76% of the human genome's non-coding DNA sequences were transcribed and that nearly half of the genome was in some way accessible to genetic regulatory proteins such as transcription factors.
However, the suggestion by ENCODE that over 80% of the human genome is
biochemically functional has been criticized by other scientists,
who argue that neither accessibility of segments of the genome to
transcription factors nor their transcription guarantees that those
segments have biochemical function and that their transcription is selectively advantageous.
Furthermore, the much lower estimates of functionality prior to ENCODE
were based on genomic conservation estimates across mammalian lineages.
In response to such views, other scientists argue that the widespread
transcription and splicing observed directly in the human genome
by biochemical testing is a more accurate indicator of genetic function
than genomic conservation, because conservation estimates are relative,
owing to the large variations in genome size among even closely related
species, are partially tautological, and are not based
on direct testing for functionality on the genome.
Conservation estimates may be used to provide clues to identify
possible functional elements in the genome, but they do not limit or cap
the total amount of functional elements that could possibly exist in
the genome, since elements that are active at the molecular level can be
missed by comparative genomics. Furthermore, much of the apparent junk DNA is involved in epigenetic regulation and appears to be necessary for the development of complex organisms.
In a 2014 paper, ENCODE researchers tried to address "the question of
whether nonconserved but biochemically active regions are truly
functional". They noted that in the literature, functional parts of the
genome have been identified differently in previous studies depending on
the approaches used. There have been three general approaches used to
identify functional parts of the human genome: genetic approaches (which
rely on changes in phenotype), evolutionary approaches (which rely on
conservation) and biochemical approaches (which rely on biochemical
testing and were used by ENCODE). All three have limitations: genetic
approaches may miss functional elements that do not manifest physically
on the organism; evolutionary approaches have difficulty obtaining
accurate multispecies sequence alignments, since genomes of even closely
related species vary considerably; and biochemical approaches,
though highly reproducible, yield signatures that do not
always signify a function.
They noted that 70% of the transcription coverage was less than 1
transcript per cell. They noted that this "larger proportion of genome
with reproducible but low biochemical signal strength and less
evolutionary conservation is challenging to parse between specific
functions and biological noise". Furthermore, assay resolution often is
much broader than the underlying functional sites so some of the
reproducibly "biochemically active but selectively neutral" sequences
are unlikely to serve critical functions, especially those with
lower-level biochemical signal. To this they added, "However, we also
acknowledge substantial limitations in our current detection of
constraint, given that some human-specific functions are essential but
not conserved and that disease-relevant regions need not be selectively
constrained to be functional." On the other hand, they argued that the
12–15% fraction of human DNA under functional constraint, as estimated
by a variety of extrapolative evolutionary methods, may still be an
underestimate. They concluded that in contrast to evolutionary and
genetic evidence, biochemical data offer clues about both the molecular
function served by underlying DNA elements and the cell types in which
they act. Ultimately genetic, evolutionary, and biochemical approaches
can all be used in a complementary way to identify regions that may be
functional in human biology and disease.
Some critics have argued that functionality can only be assessed in reference to an appropriate null hypothesis.
In this case, the null hypothesis would be that these parts of the
genome are non-functional and have properties, be it on the basis of
conservation or biochemical activity, that would be expected of such
regions based on our general understanding of molecular evolution and biochemistry.
According to these critics, until a region in question has been shown
to have additional features, beyond what is expected of the null
hypothesis, it should provisionally be labelled as non-functional.
Evidence of functionality
Many non-coding DNA sequences must have some important biological function. This is indicated by comparative genomics studies that report highly conserved regions of non-coding DNA, sometimes on time-scales of hundreds of millions of years. This implies that these non-coding regions are under strong evolutionary pressure and purifying selection. For example, in the genomes of humans and mice, which diverged from a common ancestor
65–75 million years ago, protein-coding DNA sequences account for only
about 20% of conserved DNA, with the remaining 80% of conserved DNA
represented in non-coding regions. Linkage mapping
often identifies chromosomal regions associated with a disease with no
evidence of functional coding variants of genes within the region,
suggesting that disease-causing genetic variants lie in the non-coding
DNA. The significance of non-coding DNA mutations in cancer was explored in April 2013.
Non-coding genetic polymorphisms play a role in infectious disease susceptibility, such as hepatitis C. Moreover, non-coding genetic polymorphisms contribute to susceptibility to Ewing sarcoma, an aggressive pediatric bone cancer.
Some specific sequences of non-coding DNA may be features essential to chromosome structure, centromere function and recognition of homologous chromosomes during meiosis.
According to a comparative study of over 300 prokaryotic and over 30 eukaryotic genomes,
eukaryotes appear to require a minimum amount of non-coding DNA. The
amount can be predicted using a growth model for regulatory genetic
networks, implying that it is required for regulatory purposes. In
humans the predicted minimum is about 5% of the total genome.
Over 10% of 32 mammalian genomes may function through the formation of specific RNA secondary structures. The study used comparative genomics to identify compensatory DNA mutations that maintain RNA base-pairings, a distinctive feature of RNA
molecules. Over 80% of the genomic regions presenting evolutionary
evidence of RNA structure conservation do not present strong DNA
sequence conservation.
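The idea of a compensatory mutation can be sketched in a few lines: at two positions that pair in an RNA structure, both bases differ between two aligned sequences, yet each sequence still forms a valid pair. The sketch below is a toy illustration, not the method of the study described above; the function names, the dot-bracket structure input, the ungapped alignment, and the choice to count G-U wobble pairs as valid are all assumptions.

```python
# Watson-Crick pairs plus the G-U wobble pair (an assumed definition of "valid").
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_columns(dot_bracket):
    """Return (i, j) index pairs for matching brackets in a dot-bracket string."""
    stack, pairs = [], []
    for index, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            pairs.append((stack.pop(), index))
    return pairs


def compensatory_pairs(seq_a, seq_b, dot_bracket):
    """List paired columns where both bases changed but pairing is preserved."""
    # Work in the RNA alphabet so DNA input (with T) is handled too.
    seq_a = seq_a.upper().replace("T", "U")
    seq_b = seq_b.upper().replace("T", "U")
    hits = []
    for i, j in paired_columns(dot_bracket):
        pair_a = (seq_a[i], seq_a[j])
        pair_b = (seq_b[i], seq_b[j])
        both_changed = pair_a[0] != pair_b[0] and pair_a[1] != pair_b[1]
        if both_changed and pair_a in VALID_PAIRS and pair_b in VALID_PAIRS:
            hits.append((i, j))
    return hits


# Hypothetical hairpin: the inner pair switches from C-G to G-C.
print(compensatory_pairs("GCAAGC", "GGAACC", "((..))"))
```

In this toy example the sequences differ at two of six positions, yet the hairpin is preserved, which mirrors how structure can be conserved without strong sequence conservation.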
Non-coding DNA separates genes from each other with long gaps, so a mutation in one gene or part of a chromosome, for example a deletion or insertion, does not have a frameshift effect
on the whole chromosome. When genome complexity is relatively high,
as in the human genome, gaps are found not only between different genes but
also inside many genes, in the form of introns,
which protect the entire coding segment and minimise the changes caused by
mutation. Non-coding DNA may also serve to decrease the probability
of gene disruption during chromosomal crossover.
Regulating gene expression
Some non-coding DNA sequences determine the expression levels of
various genes, both those that are transcribed to proteins and those
that themselves are involved in gene regulation.
Transcription factors
Some non-coding DNA sequences determine where transcription factors attach.
A transcription factor is a protein that binds to specific non-coding
DNA sequences, thereby controlling the flow (or transcription) of
genetic information from DNA to mRNA.
Operators
An operator is a segment of DNA to which a repressor
binds. A repressor is a DNA-binding protein that regulates the
expression of one or more genes by binding to the operator and blocking
the attachment of RNA polymerase to the promoter, thus preventing transcription of the genes. This blocking of expression is called repression.
Enhancers
An enhancer is a short region of DNA that can be bound by proteins (trans-acting factors), much like a set of transcription factors, to enhance transcription levels of genes in a gene cluster.
Silencers
A silencer is a region of DNA that inactivates gene expression when
bound by a regulatory protein. It functions in a way very similar to
an enhancer, differing only in that it inactivates genes.
Promoters
A promoter is a region of DNA that facilitates transcription of a
particular gene when a transcription factor binds to it. Promoters are
typically located near the genes they regulate and upstream of them.
Insulators
A genetic insulator is a boundary element that plays two distinct
roles in gene expression, either as an enhancer-blocking code, or rarely
as a barrier against condensed chromatin. An insulator in a DNA
sequence is comparable to a linguistic word divider such as a comma in a sentence, because the insulator indicates where an enhanced or repressed sequence ends.
Uses
Evolution
Shared sequences of apparently non-functional DNA are a major line of evidence of common descent.
Pseudogene sequences appear to accumulate mutations more rapidly than coding sequences due to a loss of selective pressure.
This allows for the creation of mutant alleles that incorporate new
functions that may be favored by natural selection; thus, pseudogenes
can serve as raw material for evolution and can be considered "protogenes".
A study published in 2019 shows that new genes (termed de novo gene birth) can be fashioned from non-coding regions. Some studies suggest at least one-tenth of genes could be made in this way.
Long range correlations
A statistical distinction between coding and non-coding DNA sequences has
been found. It has been observed that nucleotides in non-coding DNA
sequences display long-range power-law correlations, while coding
sequences do not.
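One approach described in the literature for quantifying such correlations is a "DNA walk": each base becomes a step up or down, and the fluctuation of the walk over windows of length l is fitted to a power law F(l) ~ l^alpha. An exponent near 0.5 is what an uncorrelated sequence would give, while a clearly larger value points to long-range correlation. The sketch below is a minimal version of that idea; the purine/pyrimidine step rule, the window lengths, and the function names are illustrative choices, not necessarily those of the studies referred to above.

```python
import math
import random


def dna_walk(seq):
    """Cumulative walk: -1 for a purine (A/G), +1 for a pyrimidine (C/T)."""
    walk, position = [0], 0
    for base in seq.upper():
        if base in "AG":
            position -= 1
        elif base in "CT":
            position += 1
        else:
            continue  # skip ambiguous symbols such as N
        walk.append(position)
    return walk


def fluctuation(walk, length):
    """Standard deviation of the walk's displacement over windows of `length`."""
    if len(walk) <= length:
        raise ValueError("sequence too short for this window length")
    deltas = [walk[i + length] - walk[i] for i in range(len(walk) - length)]
    mean = sum(deltas) / len(deltas)
    return math.sqrt(sum((d - mean) ** 2 for d in deltas) / len(deltas))


def scaling_exponent(seq, window_lengths=(4, 8, 16, 32, 64, 128)):
    """Least-squares slope of log F(l) against log l."""
    walk = dna_walk(seq)
    xs = [math.log(length) for length in window_lengths]
    ys = [math.log(fluctuation(walk, length)) for length in window_lengths]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return numerator / sum((x - mean_x) ** 2 for x in xs)


# A synthetic, uncorrelated sequence should give an exponent close to 0.5.
rng = random.Random(0)
random_seq = "".join("ACGT"[int(rng.random() * 4)] for _ in range(20000))
print(f"alpha for a random sequence: {scaling_exponent(random_seq):.2f}")
```

Applied to real sequences, the exponent would be compared between coding and non-coding regions; the synthetic input here only serves as the uncorrelated baseline.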
Forensic anthropology
Police sometimes gather DNA as evidence for purposes of forensic identification. As described in Maryland v. King, a 2013 U.S. Supreme Court decision:
The current standard for forensic DNA testing relies on an analysis of the chromosomes located within the nucleus of all human cells. 'The DNA material in chromosomes is composed of "coding" and "non-coding" regions. The coding regions are known as genes and contain the information necessary for a cell to make proteins. . . . Non-protein coding regions . . . are not related directly to making proteins, [and] have been referred to as "junk" DNA.' The adjective "junk" may mislead the lay person, for in fact this is the DNA region used with near certainty to identify a person.