Barcoding can be done from tissue from a target specimen, from a mixture of
organisms (bulk sample), or from DNA present in environmental samples
(e.g. water or soil). The methods for sampling, preservation and analysis
differ between these sample types.
To barcode a tissue sample from the target specimen, a small
piece of skin, a scale, a leg or antenna is likely to be sufficient
(depending on the size of the specimen). To avoid contamination, it is
necessary to sterilize used tools between samples. It is recommended to
collect two samples from one specimen, one to archive, and one for the
barcoding process. Sample preservation is crucial to overcome the issue
of DNA degradation.
Amplification of the extracted DNA is a required step in DNA barcoding. Typically, only a small fragment of the total DNA material is sequenced (typically 400–800 base pairs) to obtain the DNA barcode. Amplification of eDNA material is usually focused on smaller fragment sizes (<200 base pairs), as eDNA is more likely to be fragmented than DNA material from other sources. However, some studies argue that there is no relationship between amplicon size and detection rate of eDNA.
HiSeq sequencers at SciLifeLab in Uppsala, Sweden. The photo was taken during the excursion of SLU course PNS0169 in March 2019.
When the DNA barcode marker region has been amplified, the next step is to sequence the marker region using
DNA sequencing methods. Many different sequencing platforms are available, and technical development is proceeding rapidly.
Marker selection
A schematic view of primers and target region, demonstrated on 16S rRNA gene in Pseudomonas.
As primers, one typically selects short conserved sequences with low
variability, which can thus amplify most or all species in the chosen
target group. The primers are used to amplify a highly variable target
region in between the two primers, which is then used for species
discrimination. Modified from »Variable Copy Number, Intra-Genomic
Heterogeneities and Lateral Transfers of the 16S rRNA Gene in
Pseudomonas« by Bodilis, Josselin; Nsigue-Meilo, Sandrine; Besaury,
Ludovic; Quillet, Laurent, used under CC BY, available from:
https://www.researchgate.net/figure/Hypervariable-regions-within-the-16S-rRNA-gene-in-Pseudomonas-The-plotted-line-reflects_fig2_224832532.
Markers used for DNA barcoding are called barcodes. In order to successfully characterize species based on DNA barcodes, selection of informative DNA regions is crucial. A good DNA barcode should have low intra-specific and high inter-specific variability and possess conserved flanking sites for developing universal PCR primers for wide taxonomic application. The goal is to design primers that will detect and distinguish most or all the species in the studied group of organisms (high taxonomic resolution). The length of the barcode sequence should be short enough to be used with current sampling sources and with current DNA extraction, amplification and sequencing methods.
Ideally, one gene sequence would be used for all taxonomic groups, from viruses to plants and animals. However, no such gene region has been found yet, so different barcodes are used for different groups of organisms, or depending on the study question.
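The requirement of low intra-specific and high inter-specific variability can be checked on a candidate marker by comparing pairwise distances within and between species. The following is a minimal sketch, assuming sequences that are already aligned to equal length and using the simple proportion of differing sites (p-distance); the function names and the toy alignment are hypothetical and only serve as illustration.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def barcoding_gap(records):
    """records: list of (species_name, aligned_sequence) pairs.

    Returns (largest intra-specific distance, smallest inter-specific distance).
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    return max(intra, default=0.0), min(inter, default=0.0)

# Hypothetical toy alignment (10 bp), invented for illustration only.
records = [
    ("species_A", "ACGTACGTAC"),
    ("species_A", "ACGTACGTAT"),
    ("species_B", "ACGAACCTTC"),
    ("species_B", "ACGAACCTTC"),
]
max_intra, min_inter = barcoding_gap(records)
# The marker separates the two toy species if min_inter exceeds max_intra.
print(max_intra, min_inter, min_inter > max_intra)
```

A marker is more useful for species discrimination the larger the difference between the two returned values; real analyses would use proper alignments and evolutionary distance models rather than this simplified measure.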
In plants, mitochondrial genes are not appropriate for DNA barcoding because they exhibit low mutation rates. A few candidate genes have been found in the chloroplast genome, the most promising being the maturase K gene (matK), by itself or in association with other genes. Multi-locus markers such as ribosomal internal transcribed spacers (ITS DNA) along with matK, rbcL, trnH or other genes have also been used for species identification. The best discrimination between plant species has been achieved when using two or more chloroplast barcodes.
For bacteria, the small subunit ribosomal RNA (16S) gene can be used for different taxa, as it is highly conserved. Some studies suggest that COI, type II chaperonin (cpn60) or the β subunit of RNA polymerase (rpoB) could also serve as bacterial DNA barcodes.
Barcoding fungi is more challenging, and more than one primer combination might be required. The COI marker performs well in certain fungal groups, but not equally well in others. Therefore, additional markers are being used, such as ITS rDNA and the large subunit of nuclear ribosomal RNA (LSU).
Markers that have been used for DNA barcoding in different organism groups, modified from Purty and Chatterjee.
Organism group | Marker gene/locus
Animals | COI, Cytb, 12S, 16S
Plants | matK, rbcL, psbA-trnH, ITS
Bacteria | COI, rpoB, 16S, cpn60, tuf, RIF, gnd
Fungi | ITS, RPB1 (LSU), RPB2 (LSU), 18S (SSU)
Protists | ITS, COI, rbcL, 18S, 28S
Reference libraries and bioinformatics
Reference libraries are used for the taxonomic identification, also called
annotation, of sequences obtained from barcoding or metabarcoding. These
databases contain the DNA barcodes assigned to previously identified
taxa. Most reference libraries do not cover all species within an
organism group, and new entries are continually created. In the case of
macro- and many microorganisms (such as algae), these reference
libraries require detailed documentation (sampling location and date,
person who collected it, image, etc.) and authoritative taxonomic
identification of the voucher specimen, as well as submission of
sequences in a particular format. The process also requires the storage
of voucher specimens in museum collections and other collaborating
institutions. Both taxonomically comprehensive coverage and content
quality are important for identification accuracy.
Several reference databases exist depending on the organism group and
the genetic marker used. There are smaller, national databases (e.g.
FinBOL), and large consortia like the International Barcode of Life
Project (iBOL).
Launched in 2007, the Barcode of Life Data System (BOLD) is one of the biggest databases, containing more than 450 000 BINs
(Barcode Index Numbers) in 2019. It is a freely accessible repository
for the specimen and sequence records for barcode studies, and it is
also a workbench aiding the management, quality assurance and analysis
of barcode data. The database mainly contains BIN records for animals
based on the COI genetic marker.
The UNITE database
was launched in 2003 and is a reference database for the molecular
identification of fungal species with the internal transcribed spacer
(ITS) genetic marker region. This database is based on the concept of
species hypotheses: the user chooses the sequence similarity threshold (in %)
at which to work, and the sequences are sorted in comparison to sequences
obtained from voucher specimens identified by experts.
The Diat.barcode database was first published under the name R-syst::diatom
in 2016, starting with data from two sources: the Thonon culture
collection (TCC) in the hydrobiological station of the French National
Institute for Agricultural Research (INRA), and the NCBI (National
Center for Biotechnology Information) nucleotide database. Diat.barcode
provides data for two genetic markers, rbcL
(Ribulose-1,5-bisphosphate carboxylase/oxygenase) and 18S (18S ribosomal
RNA). The database also includes additional trait information on
species, such as morphological characteristics (biovolume, size dimensions,
etc.), life-forms (mobility, colony-type, etc.) or ecological features
(pollution sensitivity, etc.).
Bioinformatic analysis
In order to obtain well-structured, clean and interpretable data, raw sequencing data must be processed using bioinformatic analysis. The FASTQ file with the sequencing data contains two types of information: the sequences detected in the sample (the content of a FASTA file) and quality scores (PHRED scores) associated with each nucleotide of each DNA sequence. The PHRED scores indicate the probability that the associated nucleotide has been correctly called.
PHRED quality score and the associated certainty level
PHRED quality score | Certainty level
10 | 90%
20 | 99%
30 | 99.9%
40 | 99.99%
50 | 99.999%
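The values in the table follow from the standard definition of the PHRED score, Q = -10 · log10(P), where P is the probability that a base call is wrong. The sketch below reproduces the table and decodes a FASTQ quality string; it assumes the common Phred+33 ASCII encoding, which is not stated in the text above and may differ for older data.

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with PHRED score q is wrong."""
    return 10 ** (-q / 10)

def phred_to_accuracy(q: float) -> float:
    """Probability that a base call with PHRED score q is correct."""
    return 1 - phred_to_error_probability(q)

def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to PHRED scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]

for q in (10, 20, 30, 40, 50):
    print(q, f"{phred_to_accuracy(q):.3%}")

# Each character of the quality line encodes the score of one nucleotide.
print(decode_quality_string("I5+"))
```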
In general, the PHRED score decreases towards the end of each DNA
sequence. Thus some bioinformatics pipelines simply cut the end of the
sequences at a defined threshold.
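Such end trimming can be sketched as follows. This is a simplified illustration, assuming that the threshold is a minimum PHRED score and that bases are removed from the 3' end until the first base meeting it; real pipelines often use sliding windows or fixed cut positions instead, and the default of 20 is an arbitrary choice.

```python
def trim_3prime(seq: str, quals: list[int], min_q: int = 20) -> tuple[str, list[int]]:
    """Remove bases from the 3' end until a base with quality >= min_q is reached."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

# Hypothetical read whose quality drops towards the end.
read = "ACGTAC"
scores = [38, 37, 30, 25, 12, 8]
print(trim_3prime(read, scores))
```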
Some sequencing technologies, like MiSeq, use paired-end sequencing, in which each fragment is sequenced from both directions, producing better quality. The overlapping sequences are then aligned into contigs and merged. Usually, several samples are pooled in one run, and each sample is characterized by a short DNA fragment, the tag. In a demultiplexing step, sequences are sorted using these tags to reassemble the separate samples. Before further analysis, tags and other adapters are removed from the barcode sequence. During trimming, poor-quality sequences (low PHRED scores) and sequences that are much shorter or longer than the targeted DNA barcode are removed. The following dereplication step collapses all of the quality-filtered sequences into a set of unique reads (individual sequence units, ISUs), together with their abundance in the samples. After that, chimeras (i.e. compound sequences formed from pieces of mixed origin) are detected and removed. Finally, the sequences are clustered into OTUs (Operational Taxonomic Units), using one of many clustering strategies. The most frequently used bioinformatic software packages include Mothur, Uparse, Qiime, Galaxy, Obitools, JAMP, and DADA2.
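Three of the steps described above (demultiplexing, length filtering and dereplication) can be illustrated with a short sketch. It is not the implementation of any of the listed tools; it assumes exact tag matches, tags of equal length at the 5' end of each read, and invented toy reads and length limits. Merging, chimera removal and OTU clustering are left out.

```python
from collections import Counter, defaultdict

def demultiplex(reads, tag_to_sample):
    """Sort reads into samples by their leading tag and strip the tag."""
    tag_len = len(next(iter(tag_to_sample)))
    by_sample = defaultdict(list)
    for read in reads:
        sample = tag_to_sample.get(read[:tag_len])
        if sample is not None:  # reads with unknown tags are discarded
            by_sample[sample].append(read[tag_len:])
    return dict(by_sample)

def length_filter(seqs, min_len, max_len):
    """Keep sequences whose length is close to the expected barcode length."""
    return [s for s in seqs if min_len <= len(s) <= max_len]

def dereplicate(seqs):
    """Collapse identical sequences into unique reads with their abundance."""
    return Counter(seqs)

# Hypothetical tags and reads, far shorter than real barcodes.
tags = {"AAGG": "sample_1", "CCTT": "sample_2"}
reads = [
    "AAGGACGTACGT",
    "AAGGACGTACGT",
    "AAGGACGTAC",
    "CCTTTTGCAAGC",
    "CCTTTT",        # too short after tag removal
    "GGGGACGTACGT",  # unknown tag
]
samples = demultiplex(reads, tags)
uniques = {
    name: dereplicate(length_filter(seqs, min_len=6, max_len=10))
    for name, seqs in samples.items()
}
print(uniques)
```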
Comparing the abundance of reads, i.e. sequences, between
different samples is still a challenge because both the total number of
reads in a sample as well as the relative amount of reads for a species
can vary between samples, methods, or other variables. For comparison,
one may then reduce the number of reads of each sample to the minimal
number of reads of the samples to be compared – a process called
rarefaction. Another way is to use the relative abundance of reads.
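Both approaches can be sketched in a few lines. The example assumes a simple table of read counts per OTU and per sample; the sample names, counts and the fixed random seed are hypothetical choices made only to keep the illustration reproducible.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Randomly subsample `depth` reads without replacement from a count table."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    return Counter(random.Random(seed).sample(pool, depth))

def relative_abundance(counts):
    """Express each count as a fraction of the sample's total reads."""
    total = sum(counts.values())
    return {otu: n / total for otu, n in counts.items()}

samples = {
    "site_A": {"OTU_1": 600, "OTU_2": 300, "OTU_3": 100},
    "site_B": {"OTU_1": 50, "OTU_2": 150},
}
# Rarefy every sample to the size of the smallest sample.
depth = min(sum(c.values()) for c in samples.values())
rarefied = {name: rarefy(c, depth) for name, c in samples.items()}
relative = {name: relative_abundance(c) for name, c in samples.items()}
print(depth, rarefied["site_A"], relative["site_B"])
```

Rarefaction discards reads from the larger samples, whereas relative abundance keeps all reads but loses the information on sequencing depth.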
Species identification and taxonomic assignment
The taxonomic assignment of the OTUs to species is achieved by matching of sequences to reference libraries.
The Basic Local Alignment Search Tool (BLAST)
is commonly used to identify regions of similarity between sequences by
comparing sequence reads from the sample to sequences in reference
databases.
If the reference database contains sequences of the relevant species,
then the sample sequences can be identified to species level. If a
sequence cannot be matched to an existing reference library entry, DNA
barcoding can be used to create a new entry.
In some cases, due to the incompleteness of reference databases,
identification can only be achieved at higher taxonomic levels, such as
assignment to a family or class. In some organism groups such as
bacteria, taxonomic assignment to species level is often not possible.
In such cases, a sample may be assigned to a particular
operational taxonomic unit (OTU).
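The logic of assigning a sequence at species level, at a higher level, or not at all can be illustrated with a toy matcher. This is not BLAST: it compares equal-length, pre-aligned sequences position by position, and the identity thresholds (0.97 for species, 0.90 for family), the reference entries and the taxon names are assumptions chosen only for the example.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def assign(query, reference, species_min=0.97, family_min=0.90):
    """Return (rank, taxon, identity) for the best-matching reference entry."""
    best_id, best = max(
        ((identity(query, entry["seq"]), entry) for entry in reference),
        key=lambda pair: pair[0],
    )
    if best_id >= species_min:
        return "species", best["species"], best_id
    if best_id >= family_min:
        return "family", best["family"], best_id
    return "unassigned", None, best_id

# Hypothetical reference library with two entries (20 bp each).
reference = [
    {"seq": "ACGTACGTACGTACGTACGT", "species": "Genus_a species_x", "family": "Family_a"},
    {"seq": "TTGCAATTGCAATTGCAATT", "species": "Genus_b species_y", "family": "Family_b"},
]
print(assign("ACGTACGTACGTACGTACGT", reference))  # exact match
print(assign("ACGTACGTACGTACGTACGA", reference))  # one mismatch
print(assign("GGGGGGGGGGGGGGGGGGGG", reference))  # no close match
```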
Applications
Applications of DNA barcoding include identification of new species, safety assessment of food, identification and assessment of cryptic species, detection of alien species, identification of endangered and threatened species, linking egg and larval stages to adult species, securing intellectual property rights for bioresources, framing global management plans for conservation strategies, and elucidating feeding niches. DNA barcode markers can be applied to address basic questions in systematics, ecology, evolutionary biology and conservation, including community assembly, species interaction networks, taxonomic discovery, and assessing priority areas for environmental protection.
Identification of species
Specific short DNA sequences or markers from a standardized region of the genome
can provide a DNA barcode for identifying species.
Molecular methods are especially useful when traditional methods are
not applicable. DNA barcoding has great applicability in identification
of larvae for which there are generally few diagnostic characters
available, and in association of different life stages (e.g. larval and
adult) in many animals. Identification of species listed in the appendices of the Convention on International Trade in Endangered Species (CITES) using barcoding techniques is used in monitoring of illegal trade.
Detection of invasive species
Alien species can be detected via barcoding.
Barcoding can be suitable for detection of species in e.g. border
control, where rapid and accurate morphological identification is often
not possible due to similarities between different species, lack of
sufficient diagnostic characteristics and/or lack of taxonomic expertise. Barcoding and metabarcoding can also be used to screen ecosystems for invasive species, and to distinguish between an invasive species and native, morphologically similar species.
Delimiting cryptic species
DNA barcoding enables the identification and recognition of cryptic species. The results of DNA barcoding analyses depend, however, upon the choice of analytical methods, so the process of delimiting cryptic species using DNA barcodes can be as subjective as any other form of taxonomy. Hebert et al. (2004) concluded that the butterfly Astraptes fulgerator in north-western Costa Rica actually consists of 10 different species. These results, however, were subsequently challenged by Brower (2006), who pointed out numerous serious flaws in the analysis, and concluded that the original data could support no more than the possibility of three to seven cryptic taxa rather than ten cryptic species. Smith et al. (2007) used cytochrome c oxidase I DNA barcodes for species identification of the 20 morphospecies of Belvosia parasitoid flies (Diptera: Tachinidae) reared from caterpillars (Lepidoptera) in Area de Conservación Guanacaste (ACG), northwestern Costa Rica. These authors discovered that barcoding raises the species count to 32, by revealing that each of the three parasitoid species previously considered to be generalists is actually an array of highly host-specific cryptic species. For 15 morphospecies of polychaetes within the deep Antarctic benthos studied through DNA barcoding, cryptic diversity was found in 50% of the cases. Furthermore, 10 previously overlooked morphospecies were detected, increasing the total species richness in the sample by 233%.
Barcoding is a tool to vouch for food quality. Here, DNA from traditional
Norwegian Christmas food is extracted at the molecular systematics lab
at NTNU University Museum.
Diet analysis and food web application
DNA barcoding and metabarcoding can be useful in diet analysis studies, and are typically used if prey specimens cannot be identified based on morphological characters. There is a range of sampling approaches in diet analysis: DNA metabarcoding can be conducted on stomach contents, feces, saliva or whole body analysis.
In fecal samples or highly digested stomach contents, it is often not
possible to distinguish tissue from single species, and therefore
metabarcoding can be applied instead.
Feces or saliva represent non-invasive sampling approaches, while whole
body analysis often means that the individual needs to be killed first.
For smaller organisms, sequencing for stomach content is then often
done by sequencing the entire animal.
Barcoding for food safety
DNA barcoding represents an essential tool to evaluate the quality of food products. The purpose is to guarantee food traceability, to minimize food piracy, and to valorize local and typical agro-food production. Another purpose is to safeguard public health; for example, metabarcoding offers the possibility to identify groupers causing Ciguatera fish poisoning from meal remnants, or to separate poisonous mushrooms from edible ones (Ref).
Biomonitoring and ecological assessment
DNA barcoding can be used to assess the presence of endangered species for
conservation efforts (Ref), or the presence of indicator species
reflective of specific ecological conditions (Ref), for example excess
nutrients or low oxygen levels.
Potentials and shortcomings
Potentials
Traditional bioassessment methods are well established internationally, and serve biomonitoring well, for example for aquatic bioassessment within the EU Directives WFD and MSFD. However, DNA barcoding could improve traditional methods for the following reasons: DNA barcoding (i) can increase taxonomic resolution and harmonize the identification of taxa which are difficult to identify or lack experts, (ii) can more accurately/precisely relate environmental factors to specific taxa, (iii) can increase comparability among regions, (iv) allows for the inclusion of early life stages and fragmented specimens, (v) allows delimitation of cryptic/rare species, (vi) allows for development of new indices, e.g. using rare/cryptic species which may be sensitive/tolerant to stressors, (vii) increases the number of samples which can be processed and reduces processing time, resulting in increased knowledge of species ecology, and (viii) is a non-invasive way of monitoring when using eDNA methods.
Time and cost
DNA barcoding is faster than traditional morphological methods all the way
from training through to taxonomic assignment. It takes less time to
gain expertise in DNA methods than to become an expert in taxonomy. In
addition, the DNA barcoding workflow (i.e. from sample to result) is
generally quicker than the traditional morphological workflow and allows the
processing of more samples.
Taxonomic resolution
DNA barcoding allows the resolution of taxa from higher (e.g. family) to
lower (e.g. species) taxonomic levels that are otherwise too difficult
to identify using traditional morphological methods, such as
identification via microscopy. For example, Chironomidae
(the non-biting midges) are widely distributed in both terrestrial and
freshwater ecosystems. Their richness and abundance make them important
for ecological processes and networks, and they are one of many
invertebrate groups used in biomonitoring. Invertebrate samples can
contain as many as 100 species of chironomids which often make up as
much as 50% of a sample. Despite this, they are usually not identified
below the family level because of the taxonomic expertise and time
required.
This may result in different chironomid species with different
ecological preferences grouped together, resulting in inaccurate
assessment of water quality.
DNA barcoding provides the opportunity to resolve taxa, and directly relate stressor effects to specific taxa such as individual chironomid species. For example, Beermann et al. (2018) DNA barcoded Chironomidae to investigate their response to multiple stressors: reduced flow, increased fine sediment and increased salinity. After barcoding, it was found that the chironomid sample consisted of 183 Operational Taxonomic Units (OTUs), i.e. barcodes (sequences) that are often equivalent to morphological species. These 183 OTUs displayed 15 response types rather than the two response types previously reported when all chironomids were grouped together in the same multiple stressor study. A similar trend was found in a study by Macher et al. (2016), which discovered cryptic diversity within the New Zealand mayfly species Deleatidium sp. This study found different response patterns of 12 molecularly distinct OTUs to stressors, which may change the consensus that this mayfly is sensitive to pollution.
Shortcomings
Despite the advantages offered by DNA barcoding, it has also been suggested
that DNA barcoding is best used as a complement to traditional
morphological methods. This recommendation is based on multiple perceived challenges.
Physical parameters
It is not completely straightforward to connect DNA barcodes with
ecological preferences of the barcoded taxon in question, as is needed
if barcoding is to be used for biomonitoring. For example, detecting
target DNA in aquatic systems depends on the concentration of DNA
molecules at a site, which in turn can be affected by many factors. The
presence of DNA molecules also depends on dispersion at a site, e.g.
direction or strength of currents. It is not really known how DNA moves
around in streams and lakes, which makes sampling difficult. Another
factor might be the behavior of the target species, e.g. fish can have
seasonal changes of movements, crayfish or mussels will release DNA in
larger amounts just at certain times of their life (moulting, spawning).
For DNA in soil, even less is known about distribution, quantity or
quality.
Incomplete barcode reference libraries
The major limitation of the barcoding method is that it relies on barcode reference libraries for the taxonomic identification of the sequences. The taxonomic identification is accurate only if a reliable reference is available. However, most databases are still incomplete, especially for smaller organisms, e.g. fungi, phytoplankton and nematodes. In addition, current databases contain misidentifications, spelling mistakes and other errors. A massive curation and completion effort is necessary for the databases of all organisms, involving large barcoding projects (for example the iBOL project for the Barcode of Life Data Systems (BOLD) reference database).
However, completion and curation are difficult and time-consuming.
Without vouchered specimens, there can be no certainty about whether the
sequence used as a reference is correct. DNA sequence databases like
GenBank contain many sequences that are not tied to vouchered specimens
(for example, herbarium specimens, cultured cell lines, or sometimes
images). This is problematic in the face of taxonomic issues such as
whether several species should be split or combined, or whether past
identifications were sound. Therefore, best practice for DNA barcoding
is to sequence vouchered specimens.
For many taxa, however, it can be difficult to obtain reference
specimens, for example when specimens are difficult to catch,
available specimens are poorly conserved, or adequate taxonomic
expertise is lacking. Importantly, DNA barcodes can also be used to
create interim taxonomy, in which case OTUs can be used as substitutes
for traditional Latin binomials – thus significantly reducing dependency
on fully populated reference databases.
Technological bias
DNA barcoding also carries methodological bias, from sampling to bioinformatics data analysis. Besides the risk of contamination of the DNA sample by PCR inhibitors, primer bias is one of the major sources of error in DNA barcoding. The isolation of an efficient DNA marker and the design of primers is a complex process, and considerable effort has been made to develop primers for DNA barcoding in different taxonomic groups. However, primers will often bind preferentially to some sequences, leading to differential primer efficiency and specificity, unrepresentative community assessment and richness inflation. Thus, the composition of the sequences obtained from a sample's community is mainly altered at the PCR step. Besides, PCR replication is often required, but leads to an exponential increase in the risk of contamination. Several studies have highlighted the possibility of using mitochondria-enriched samples or PCR-free approaches to avoid these biases, but as of today, the DNA metabarcoding technique is still based on the sequencing of amplicons. Other biases enter the picture during the sequencing and during the bioinformatic processing of the sequences, such as the creation of chimeras.
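The effect of differential primer efficiency can be illustrated with the standard idealized model of PCR, in which the number of copies after c cycles is N_c = N_0 · (1 + E)^c for an amplification efficiency E between 0 and 1. The sketch below is a simplified illustration with hypothetical efficiencies and starting copy numbers; it ignores plateau effects and stochasticity.

```python
def amplify(start_copies, efficiency, cycles):
    """Copies per taxon after PCR, using N_c = N_0 * (1 + E) ** c."""
    return {t: n * (1 + efficiency[t]) ** cycles for t, n in start_copies.items()}

def proportions(counts):
    """Share of each taxon in the total number of copies."""
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

# Two taxa present in equal amounts, but the primers bind taxon_A slightly better.
start = {"taxon_A": 1000, "taxon_B": 1000}
efficiency = {"taxon_A": 0.95, "taxon_B": 0.80}  # hypothetical values

before = proportions(start)
after = proportions(amplify(start, efficiency, cycles=30))
print(before, after)
```

Because the difference compounds over every cycle, even a modest gap in efficiency leaves taxon_A with the large majority of the amplicons, although both taxa were equally abundant in the template.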
Lack of standardization
Even as DNA barcoding is more widely used and applied, there is no agreement concerning the methods for DNA preservation or extraction, the choice of DNA markers and primer sets, or PCR protocols. The parameters of bioinformatics pipelines (for example OTU clustering, taxonomic assignment algorithms or thresholds) are the subject of much debate among DNA barcoding users. Sequencing technologies are also rapidly evolving, together with the tools for the analysis of the massive amounts of DNA data generated, and standardization of the methods is urgently needed to enable collaboration and data sharing at greater spatial and temporal scales. This standardization of barcoding methods at the European scale is part of the objectives of the European COST Action DNAqua-net and is also addressed by CEN (the European Committee for Standardization).
Another criticism of DNA barcoding is its limited efficiency for
accurate discrimination below species level (for example, to distinguish
between varieties), for hybrid detection, and that it can be affected
by evolutionary rates (Ref needed).
Mismatches between conventional (morphological) and barcode based identification
It is important to know that taxa lists derived by conventional (morphological) identification are not, and maybe never will be, directly comparable to taxa lists derived from barcode-based identification, for several reasons. The most important cause is probably the incompleteness and lack of accuracy of the molecular reference databases, which prevent a correct taxonomic assignment of eDNA sequences. Taxa not present in reference databases will not be found by eDNA, and sequences linked to a wrong name will lead to incorrect identification. Other known causes are differences in sampling scale and size between a traditional and a molecular sample, the possible analysis of dead organisms, which can happen in different ways for both methods depending on the organism group, and method-specific selectivity in identification: varying taxonomic expertise or ability to identify certain organism groups in the morphological approach, and primer bias in the molecular approach, which can likewise lead to a biased analysis of taxa.
Estimates of richness/diversity
DNA barcoding can result in an over- or underestimate of species richness and diversity. Some studies suggest that artifacts (identification of species not present in a community) are a major cause of inflated biodiversity. The most problematic issue is taxa represented by low numbers of sequencing reads. These reads are usually removed during the data filtering process, since different studies suggest that most of these low-frequency reads may be artifacts. However, real rare taxa may exist among these low-abundance reads. Rare sequences can reflect unique lineages in communities, which makes them informative and valuable sequences. Thus, there is a strong need for more robust bioinformatics algorithms that allow the differentiation between informative reads and artifacts. Complete reference libraries would also allow better testing of bioinformatics algorithms, by permitting better filtering of artifacts (i.e. the removal of sequences lacking a counterpart among extant species), and would therefore make it possible to obtain a more accurate species assignment.
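The trade-off in removing low-abundance reads can be made concrete with a small sketch. It assumes a simple minimum read count as the filtering criterion, which is only one of several possible rules; the OTU counts and cut-off values are hypothetical.

```python
def split_by_abundance(otu_counts, min_reads):
    """Separate OTUs that reach `min_reads` from those that would be filtered out."""
    kept = {otu: n for otu, n in otu_counts.items() if n >= min_reads}
    removed = {otu: n for otu, n in otu_counts.items() if n < min_reads}
    return kept, removed

# Hypothetical sample: two dominant OTUs and three rare ones.
otu_counts = {"OTU_1": 5200, "OTU_2": 830, "OTU_3": 12, "OTU_4": 2, "OTU_5": 1}

for min_reads in (1, 2, 10):
    kept, removed = split_by_abundance(otu_counts, min_reads)
    # Estimated richness is the number of OTUs that survive the filter.
    print(min_reads, len(kept), sorted(removed))
```

The estimated richness depends directly on the chosen cut-off, and the filter alone cannot tell whether a removed OTU was an artifact or a genuinely rare taxon.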
Cryptic diversity can also result in inflated biodiversity as one
morphological species may actually split into many distinct molecular
sequences.
DNA metabarcoding
Differences in the standard methods for DNA barcoding & metabarcoding. While DNA barcoding aims to find a specific species, metabarcoding looks for the whole community.
DNA metabarcoding is defined as the barcoding of DNA or eDNA (environmental DNA) that allows for simultaneous identification of many taxa within the same (environmental) sample, though often within the same organism group. The main difference between the approaches is that metabarcoding, in contrast to barcoding, does not focus on one specific organism, but instead aims to determine species composition within a sample.
Methodology
The metabarcoding procedure, like general barcoding, covers the steps of DNA extraction, PCR amplification, sequencing and data analysis. A barcode consists of a short variable gene region (for example, see different markers/barcodes), which is useful for taxonomic assignment, flanked by highly conserved gene regions, which can be used for primer design. Different genes are used depending on whether the aim is to barcode single species or to metabarcode several species. In the latter case, a more universal gene is used. Metabarcoding does not use single-species DNA/RNA as a starting point, but DNA/RNA from several different organisms derived from one environmental or bulk sample.
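The structure of a barcode, a variable region flanked by conserved primer sites, can be illustrated with a simple in silico extraction. The sketch assumes exact primer matches on the template strand and uses invented primers and templates; real primers tolerate mismatches and often contain degenerate bases, which this example does not handle.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written with A, C, G and T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def extract_barcode(template, forward_primer, reverse_primer):
    """Return the region between the two primer sites, or None if a site is missing.

    The reverse primer is given 5'->3' on the opposite strand, so its binding
    site on the template is its reverse complement.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    start += len(forward_primer)
    end = template.find(reverse_complement(reverse_primer), start)
    if end == -1:
        return None
    return template[start:end]

# Hypothetical primers and templates: same conserved flanks, different variable regions.
forward, reverse = "ACGTAC", "TTGGCA"
templates = {
    "species_1": "GGACGTACTTTAAAGGTGCCAACC",
    "species_2": "ACGTACTCTGAAGCTGCCAA",
    "species_3": "GGGGCCCCGGGGCCCC",  # lacks the primer sites, so it is not amplified
}
for name, seq in templates.items():
    print(name, extract_barcode(seq, forward, reverse))
```

In a bulk or environmental sample, the same primer pair would recover the variable region from every organism that carries both conserved sites, while organisms lacking them go undetected.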
Applications
Metabarcoding
has the potential to complement biodiversity measures, and even replace
them in some instances, especially as the technology advances and
procedures gradually become cheaper, more optimized and widespread.
DNA metabarcoding applications include:
Advantages and challenges
The general advantages and shortcomings for barcoding reviewed above are valid also for metabarcoding. One particular drawback for metabarcoding studies is that there is no consensus yet regarding the optimal experimental design and bioinformatics criteria to be applied in eDNA metabarcoding. However, there are current joint efforts, such as the EU COST network DNAqua-Net, to move forward by exchanging experience and knowledge to establish best-practice standards for biomonitoring.