Sunday, March 20, 2022

DNA barcoding

From Wikipedia, the free encyclopedia

DNA barcoding scheme

DNA barcoding is a method of species identification using a short section of DNA from a specific gene or genes. The premise of DNA barcoding is that, by comparison with a reference library of such DNA sections (also called "sequences"), an individual sequence can be used to uniquely identify an organism to species, in the same way that a supermarket scanner uses the familiar black stripes of the UPC barcode to identify an item in its stock against its reference database. These "barcodes" are sometimes used to identify unknown species or parts of an organism, to catalog as many taxa as possible, or to compare with traditional taxonomy in an effort to determine species boundaries.

Different gene regions are used to identify the different organismal groups using barcoding. The most commonly used barcode region for animals and some protists is a portion of the cytochrome c oxidase I (COI or COX1) gene, found in mitochondrial DNA. Other genes suitable for DNA barcoding are the internal transcribed spacer (ITS) rRNA often used for fungi and RuBisCO used for plants. Microorganisms are detected using different gene regions. The 16S rRNA gene, for example, is widely used in identification of prokaryotes, whereas the 18S rRNA gene is mostly used for detecting microbial eukaryotes. These gene regions are chosen because they show less intraspecific (within species) variation than interspecific (between species) variation; the difference between the two is known as the "barcoding gap".

Some applications of DNA barcoding include: identifying plant leaves even when flowers or fruits are not available; identifying pollen collected on the bodies of pollinating animals; identifying insect larvae which may have fewer diagnostic characters than adults; or investigating the diet of an animal based on its stomach content, saliva or feces. When barcoding is used to identify organisms from a sample containing DNA from more than one organism, the term DNA metabarcoding is used, e.g. DNA metabarcoding of diatom communities in rivers and streams, which is used to assess water quality.

Background

DNA barcoding techniques were developed from early DNA sequencing work on microbial communities using the 5S rRNA gene. In 2003, specific methods and terminology of modern DNA barcoding were proposed as a standardized method for identifying species, as well as potentially allocating unknown sequences to higher taxa such as orders and phyla, in a paper by Paul D.N. Hebert et al. from the University of Guelph, Ontario, Canada. Hebert and his colleagues demonstrated the utility of the cytochrome c oxidase I (COI) gene, first utilized by Folmer et al. in 1994, using their published DNA primers as a tool for phylogenetic analyses at the species level and as a suitable discriminatory marker among metazoan invertebrates. The "Folmer region" of the COI gene is commonly used for distinction between taxa based on its patterns of variation at the DNA level. The relative ease of retrieving the sequence, and its variability combined with conservation between species, are some of the benefits of COI. Calling the profiles "barcodes", Hebert et al. envisaged the development of a COI database that could serve as the basis for a "global bioidentification system".

Methodology

Sampling and preservation

Barcoding can be done from tissue from a target specimen, from a mixture of organisms (bulk sample), or from DNA present in environmental samples (e.g. water or soil). The methods for sampling, preservation and analysis differ between these sample types.

Tissue samples

To barcode a tissue sample from the target specimen, a small piece of skin, a scale, a leg or antenna is likely to be sufficient (depending on the size of the specimen). To avoid contamination, it is necessary to sterilize used tools between samples. It is recommended to collect two samples from one specimen, one to archive, and one for the barcoding process. Sample preservation is crucial to overcome the issue of DNA degradation.

Bulk samples

A bulk sample is a type of environmental sample containing several organisms from the taxonomic group under study. The difference between bulk samples (in the sense used here) and other environmental samples is that the bulk sample usually provides a large quantity of good-quality DNA. Examples of bulk samples include aquatic macroinvertebrate samples collected by kick-net, or insect samples collected with a Malaise trap. Filtered or size-fractionated water samples containing whole organisms like unicellular eukaryotes are also sometimes defined as bulk samples. Such samples can be collected by the same techniques used to obtain traditional samples for morphology-based identification.

eDNA samples

The environmental DNA (eDNA) method is a non-invasive approach to detect and identify species from cellular debris or extracellular DNA present in environmental samples (e.g. water or soil) through barcoding or metabarcoding. The approach is based on the fact that every living organism leaves DNA in the environment, and this environmental DNA can be detected even for organisms that are at very low abundance. Thus, for field sampling, the most crucial part is to use DNA-free material and tools at each sampling site or for each sample to avoid contamination, particularly if the DNA of the target organism(s) is likely to be present in low quantities. On the other hand, an eDNA sample always includes the DNA of whole-cell, living microorganisms, which are often present in large quantities. Therefore, microorganism samples taken in the natural environment are also called eDNA samples, but contamination is less problematic in this context due to the large quantity of target organisms. The eDNA method is applied to most sample types, like water, sediment, soil, animal feces, stomach content or blood from e.g. leeches.

DNA extraction, amplification and sequencing

DNA barcoding requires that DNA in the sample is extracted. Several different DNA extraction methods exist, and factors like cost, time, sample type and yield affect the selection of the optimal method.

When DNA from organismal or eDNA samples is amplified using the polymerase chain reaction (PCR), the reaction can be affected negatively by inhibitor molecules contained in the sample. Removal of these inhibitors is crucial to ensure that high-quality DNA is available for subsequent analysis.

Amplification of the extracted DNA is a required step in DNA barcoding. Typically, only a small fragment of the total DNA material is sequenced (typically 400–800 base pairs) to obtain the DNA barcode. Amplification of eDNA material is usually focused on smaller fragment sizes (<200 base pairs), as eDNA is more likely to be fragmented than DNA material from other sources. However, some studies argue that there is no relationship between amplicon size and detection rate of eDNA.

HiSeq sequencers at SciLifeLab in Uppsala, Sweden. The photo was taken during the excursion of SLU course PNS0169 in March 2019.

When the DNA barcode marker region has been amplified, the next step is to sequence the marker region using DNA sequencing methods. Many different sequencing platforms are available, and technical development is proceeding rapidly.

Marker selection

A schematic view of primers and target region, demonstrated on 16S rRNA gene in Pseudomonas. As primers, one typically selects short conserved sequences with low variability, which can thus amplify most or all species in the chosen target group. The primers are used to amplify a highly variable target region in between the two primers, which is then used for species discrimination. Modified from »Variable Copy Number, Intra-Genomic Heterogeneities and Lateral Transfers of the 16S rRNA Gene in Pseudomonas« by Bodilis, Josselin; Nsigue-Meilo, Sandrine; Besaury, Ludovic; Quillet, Laurent, used under CC BY, available from: https://www.researchgate.net/figure/Hypervariable-regions-within-the-16S-rRNA-gene-in-Pseudomonas-The-plotted-line-reflects_fig2_224832532.

Markers used for DNA barcoding are called barcodes. In order to successfully characterize species based on DNA barcodes, selection of informative DNA regions is crucial. A good DNA barcode should have low intra-specific and high inter-specific variability and possess conserved flanking sites for developing universal PCR primers for wide taxonomic application. The goal is to design primers that will detect and distinguish most or all the species in the studied group of organisms (high taxonomic resolution). The length of the barcode sequence should be short enough to be used with current sampling source, DNA extraction, amplification and sequencing methods.
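As a rough illustration of the "barcoding gap" idea described above, the short Python sketch below compares intraspecific and interspecific p-distances for a toy set of aligned barcode sequences. The species names, sequences and the simple p-distance measure are invented for illustration only and are not taken from any real reference library.

```python
# Minimal sketch (illustrative data): compare intraspecific vs. interspecific
# p-distances to look for a "barcoding gap". Sequences are toy, aligned strings.
from itertools import combinations

sequences = {
    ("Species_A", "A1"): "ATGCGTACGTTAGC",
    ("Species_A", "A2"): "ATGCGTACGTTAGC",
    ("Species_B", "B1"): "ATGAGTACCTTGGC",
    ("Species_B", "B2"): "ATGAGTACCTTAGC",
}

def p_distance(seq1, seq2):
    """Proportion of differing sites between two equal-length aligned sequences."""
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)

intra, inter = [], []
for (sp1, id1), (sp2, id2) in combinations(sequences, 2):
    d = p_distance(sequences[(sp1, id1)], sequences[(sp2, id2)])
    (intra if sp1 == sp2 else inter).append(d)

# A usable barcode shows the largest intraspecific distance clearly below
# the smallest interspecific distance.
print("max intraspecific distance:", max(intra))
print("min interspecific distance:", min(inter))
```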

Ideally, one gene sequence would be used for all taxonomic groups, from viruses to plants and animals. However, no such gene region has been found yet, so different barcodes are used for different groups of organisms, or depending on the study question.

For animals, the most widely used barcode is the mitochondrial cytochrome c oxidase I (COI) locus. Other mitochondrial genes, such as Cytb, 12S or 16S, are also used. Mitochondrial genes are preferred over nuclear genes because of their lack of introns, their haploid mode of inheritance and their limited recombination. Moreover, each cell contains numerous mitochondria (up to several thousand) and each of them contains several circular DNA molecules. Mitochondria can therefore offer an abundant source of DNA even when sample tissue is limited.

In plants, however, mitochondrial genes are not appropriate for DNA barcoding because they exhibit low mutation rates. A few candidate genes have been found in the chloroplast genome, the most promising being maturase K gene (matK) by itself or in association with other genes. Multi-locus markers such as ribosomal internal transcribed spacers (ITS DNA) along with matK, rbcL, trnH or other genes have also been used for species identification. The best discrimination between plant species has been achieved when using two or more chloroplast barcodes.

For bacteria, the small subunit of ribosomal RNA (16S) gene can be used for different taxa, as it is highly conserved. Some studies suggest COI, type II chaperonin (cpn60) or β subunit of RNA polymerase (rpoB) also could serve as bacterial DNA barcodes.

Barcoding fungi is more challenging, and more than one primer combination might be required. The COI marker performs well in certain fungi groups, but not equally well in others. Therefore, additional markers are being used, such as ITS rDNA and the large subunit of nuclear ribosomal RNA (28S LSU rRNA).

Within the group of protists, various barcodes have been proposed, such as the D1–D2 or D2–D3 regions of 28S rDNA, V4 subregion of 18S rRNA gene, ITS rDNA and COI. Additionally, some specific barcodes can be used for photosynthetic protists, for example the large subunit of ribulose-1,5-bisphosphate carboxylase-oxygenase gene (rbcL) and the chloroplastic 23S rRNA gene.

Markers that have been used for DNA barcoding in different organism groups, modified from Purty and Chatterjee.
Organism group Marker gene/locus
Animals COI, Cytb, 12S, 16S
Plants matK, rbcL, psbA-trnH, ITS
Bacteria COI, rpoB, 16S, cpn60, tuf, RIF, gnd
Fungi ITS, TEF1α, RPB1, RPB2, 18S, 28S
Protists ITS, COI, rbcL, 18S, 28S, 23S

Reference libraries and bioinformatics

Reference libraries are used for the taxonomic identification, also called annotation, of sequences obtained from barcoding or metabarcoding. These databases contain the DNA barcodes assigned to previously identified taxa. Most reference libraries do not cover all species within an organism group, and new entries are continually created. In the case of macro- and many microorganisms (such as algae), these reference libraries require detailed documentation (sampling location and date, person who collected it, image, etc.) and authoritative taxonomic identification of the voucher specimen, as well as submission of sequences in a particular format. However, such standards are fulfilled for only a small number of species. The process also requires the storage of voucher specimens in museum collections, herbaria and other collaborating institutions. Both taxonomically comprehensive coverage and content quality are important for identification accuracy. In the microbial world, there is no DNA information for most species names, and many DNA sequences cannot be assigned to any Linnaean binomial. Several reference databases exist depending on the organism group and the genetic marker used. There are smaller, national databases (e.g. FinBOL), and large consortia like the International Barcode of Life Project (iBOL).

BOLD

Launched in 2007, the Barcode of Life Data System (BOLD) is one of the biggest databases, containing more than 450 000 BINs (Barcode Index Numbers) in 2019. It is a freely accessible repository for the specimen and sequence records for barcode studies, and it is also a workbench aiding the management, quality assurance and analysis of barcode data. The database mainly contains BIN records for animals based on the COI genetic marker.

UNITE

The UNITE database was launched in 2003 and is a reference database for the molecular identification of fungal species with the internal transcribed spacer (ITS) genetic marker region. This database is based on the concept of species hypotheses: users choose the sequence-similarity threshold at which they want to work, and the sequences are sorted by comparison with sequences obtained from voucher specimens identified by experts.

Diat.barcode

Diat.barcode database was first published under the name R-syst::diatom in 2016, starting with data from two sources: the Thonon culture collection (TCC) in the hydrobiological station of the French National Institute for Agricultural Research (INRA), and the NCBI (National Center for Biotechnology Information) nucleotide database. Diat.barcode provides data for two genetic markers, rbcL (ribulose-1,5-bisphosphate carboxylase/oxygenase) and 18S (18S ribosomal RNA). The database also includes additional trait information for species, such as morphological characteristics (biovolume, size dimensions, etc.), life-forms (mobility, colony-type, etc.) or ecological features (pollution sensitivity, etc.).

Bioinformatic analysis

In order to obtain well structured, clean and interpretable data, raw sequencing data must be processed using bioinformatic analysis. The FASTQ file with the sequencing data contains two types of information: the sequences detected in the sample (FASTA file) and a quality file with quality scores (PHRED scores) associated with each nucleotide of each DNA sequence. The PHRED scores indicate the probability with which the associated nucleotide has been correctly scored.

PHRED quality score and the associated certainty level
10 90%
20 99%
30 99.9%
40 99.99%
50 99.999%

In general, the PHRED score decreases towards the end of each DNA sequence. Thus some bioinformatics pipelines simply cut the end of the sequences at a defined threshold.
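As a rough sketch of these two points, the Python example below decodes the quality string of a single, made-up FASTQ record (assuming the common Phred+33 ASCII encoding), converts PHRED scores into error probabilities via 10^(-Q/10), and trims the read at the first base below an assumed Q20 threshold. Real pipelines use more sophisticated trimming rules.

```python
# Minimal sketch (assumed Phred+33 encoding, as in modern Illumina FASTQ files):
# decode per-base quality scores from one FASTQ record, convert them to error
# probabilities, and apply a naive end-trim at a chosen quality threshold.
def phred_to_error_prob(q):
    """A PHRED score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

def trim_read(seq, quals, threshold=20):
    """Truncate the read before the first base whose quality drops below the threshold."""
    for i, q in enumerate(quals):
        if q < threshold:
            return seq[:i], quals[:i]
    return seq, quals

# One FASTQ record: header, sequence, separator, ASCII-encoded qualities.
record = ["@read_1", "ACGTACGT", "+", "IIIHF:51"]
header, sequence, _, quality_string = record

quals = [ord(ch) - 33 for ch in quality_string]   # Phred+33 decoding
print(quals)                                      # [40, 40, 40, 39, 37, 25, 20, 16]
print(phred_to_error_prob(30))                    # 0.001, i.e. 99.9% certainty
print(trim_read(sequence, quals))                 # cut at the first base below Q20
```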

Some sequencing technologies, like MiSeq, use paired-end sequencing during which sequencing is performed from both directions producing better quality. The overlapping sequences are then aligned into contigs and merged. Usually, several samples are pooled in one run, and each sample is characterized by a short DNA fragment, the tag. In a demultiplexing step, sequences are sorted using these tags to reassemble the separate samples. Before further analysis, tags and other adapters are removed from the barcoding sequence DNA fragment. During trimming, the bad quality sequences (low PHRED scores), or sequences that are much shorter or longer than the targeted DNA barcode, are removed. The following dereplication step is the process where all of the quality-filtered sequences are collapsed into a set of unique reads (individual sequence units ISUs) with the information of their abundance in the samples. After that, chimeras (i.e. compound sequences formed from pieces of mixed origin) are detected and removed. Finally, the sequences are clustered into OTUs (Operational Taxonomic Units), using one of many clustering strategies. The most frequently used bioinformatic software include Mothur, Uparse, Qiime, Galaxy, Obitools, JAMP, Barque, and DADA2.
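The dereplication step mentioned above can be illustrated with a few lines of Python. The reads are invented, and the dedicated tools listed above handle this at much larger scale and with additional bookkeeping.

```python
# Minimal sketch of dereplication: collapse quality-filtered reads into unique
# sequences (ISUs) together with their abundance in the sample.
from collections import Counter

reads = [
    "ATGCGTACGTTAGC",
    "ATGCGTACGTTAGC",
    "ATGAGTACCTTGGC",
    "ATGCGTACGTTAGC",
]

unique_reads = Counter(reads)             # sequence -> read count
for seq, count in unique_reads.most_common():
    print(f"{seq}\tsize={count}")         # "size=" is a common abundance annotation
```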

Comparing the abundance of reads, i.e. sequences, between different samples is still a challenge because both the total number of reads in a sample as well as the relative amount of reads for a species can vary between samples, methods, or other variables. For comparison, one may then reduce the number of reads of each sample to the minimal number of reads of the samples to be compared – a process called rarefaction. Another way is to use the relative abundance of reads.
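A minimal sketch of rarefaction follows, assuming each sample is simply a list of assigned reads; the sample names, species labels and read counts are invented for illustration.

```python
# Minimal sketch of rarefaction: randomly subsample every sample down to the
# read depth of the smallest sample so that read counts become comparable.
import random

samples = {
    "site_1": ["sp_A"] * 60 + ["sp_B"] * 30 + ["sp_C"] * 10,   # 100 reads
    "site_2": ["sp_A"] * 20 + ["sp_B"] * 20,                   # 40 reads
}

min_depth = min(len(reads) for reads in samples.values())

rarefied = {
    name: random.sample(reads, min_depth)   # subsample without replacement
    for name, reads in samples.items()
}

for name, reads in rarefied.items():
    print(name, len(reads))                 # every sample now has 40 reads
```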

Species identification and taxonomic assignment

The taxonomic assignment of the OTUs to species is achieved by matching of sequences to reference libraries. The Basic Local Alignment Search Tool (BLAST) is commonly used to identify regions of similarity between sequences by comparing sequence reads from the sample to sequences in reference databases. If the reference database contains sequences of the relevant species, then the sample sequences can be identified to species level. If a sequence cannot be matched to an existing reference library entry, DNA barcoding can be used to create a new entry.
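To illustrate only the matching idea, the toy Python sketch below assigns a query sequence to the reference barcode with the highest percent identity above a chosen threshold. Real pipelines use BLAST or dedicated classifiers with proper alignment; the species names, sequences and the 97% threshold here are purely illustrative.

```python
# Toy sketch of identity-based taxonomic assignment: pick the reference with
# the highest percent identity and accept it only above a chosen threshold.
reference_library = {
    "Gammarus fossarum": "ATGCGTACGTTAGC",   # illustrative, not real COI data
    "Gammarus pulex":    "ATGAGTACCTTGGC",
}

def percent_identity(a, b):
    """Identity between two equal-length aligned sequences (no gaps handled)."""
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)

def assign(query, threshold=97.0):
    best_species, best_id = max(
        ((sp, percent_identity(query, ref)) for sp, ref in reference_library.items()),
        key=lambda pair: pair[1],
    )
    return best_species if best_id >= threshold else "unassigned"

print(assign("ATGCGTACGTTAGC"))   # matches the first reference at 100% identity
```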

In some cases, due to the incompleteness of reference databases, identification can only be achieved at higher taxonomic levels, such as assignment to a family or class. In some organism groups such as bacteria, taxonomic assignment to species level is often not possible. In such cases, a sample may be assigned to a particular operational taxonomic unit (OTU).

Applications

Applications of DNA barcoding include identification of new species, safety assessment of food, identification and assessment of cryptic species, detection of alien species, identification of endangered and threatened species, linking egg and larval stages to adult species, securing intellectual property rights for bioresources, framing global management plans for conservation strategies, and elucidating feeding niches. DNA barcode markers can be applied to address basic questions in systematics, ecology, evolutionary biology and conservation, including community assembly, species interaction networks, taxonomic discovery, and assessing priority areas for environmental protection.

Identification of species

Specific short DNA sequences or markers from a standardized region of the genome can provide a DNA barcode for identifying species. Molecular methods are especially useful when traditional methods are not applicable. DNA barcoding has great applicability in identification of larvae, for which there are generally few diagnostic characters available, and in association of different life stages (e.g. larval and adult) in many animals. Identification of species listed in the appendices of the Convention on International Trade in Endangered Species (CITES) using barcoding techniques is used in monitoring of illegal trade.

Detection of invasive species

Alien species can be detected via barcoding. Barcoding can be suitable for detection of species in e.g. border control, where rapid and accurate morphological identification is often not possible due to similarities between different species, lack of sufficient diagnostic characteristics and/or lack of taxonomic expertise. Barcoding and metabarcoding can also be used to screen ecosystems for invasive species, and to distinguish between an invasive species and native, morphologically similar, species.

Delimiting cryptic species

DNA barcoding enables the identification and recognition of cryptic species. The results of DNA barcoding analyses depend, however, upon the choice of analytical methods, so the process of delimiting cryptic species using DNA barcodes can be as subjective as any other form of taxonomy. Hebert et al. (2004) concluded that the butterfly Astraptes fulgerator in north-western Costa Rica actually consists of 10 different species. These results, however, were subsequently challenged by Brower (2006), who pointed out numerous serious flaws in the analysis, and concluded that the original data could support no more than the possibility of three to seven cryptic taxa rather than ten cryptic species. Smith et al. (2007) used cytochrome c oxidase I DNA barcodes for species identification of the 20 morphospecies of Belvosia parasitoid flies (Diptera: Tachinidae) reared from caterpillars (Lepidoptera) in Area de Conservación Guanacaste (ACG), northwestern Costa Rica. These authors discovered that barcoding raises the species count to 32, by revealing that each of the three parasitoid species previously considered as generalists is actually an array of highly host-specific cryptic species. For 15 morphospecies of polychaetes within the deep Antarctic benthos studied through DNA barcoding, cryptic diversity was found in 50% of the cases. Furthermore, 10 previously overlooked morphospecies were detected, increasing the total species richness in the sample by 233%.

Barcoding is a tool to vouch for food quality. Here, DNA from traditional Norwegian Christmas food is extracted at the molecular systematic lab at NTNU University Museum.

Diet analysis and food web application

DNA barcoding and metabarcoding can be useful in diet analysis studies, and are typically used when prey specimens cannot be identified based on morphological characters. There is a range of sampling approaches in diet analysis: DNA metabarcoding can be conducted on stomach contents, feces, saliva or whole-body samples. In fecal samples or highly digested stomach contents, it is often not possible to distinguish tissue from single species, and therefore metabarcoding can be applied instead. Feces or saliva represent non-invasive sampling approaches, while whole-body analysis often means that the individual needs to be killed first. For smaller organisms, stomach contents are then often analysed by sequencing the entire animal.

Barcoding for food safety

DNA barcoding represents an essential tool to evaluate the quality of food products. The purpose is to guarantee food traceability, to minimize food piracy, and to promote the value of local and traditional agro-food production. Another purpose is to safeguard public health; for example, metabarcoding offers the possibility to identify groupers causing Ciguatera fish poisoning from meal remnants, or to separate poisonous mushrooms from edible ones (Ref).

Biomonitoring and ecological assessment

DNA barcoding can be used to assess the presence of endangered species for conservation efforts (Ref), or the presence of indicator species reflective of specific ecological conditions (Ref), for example excess nutrients or low oxygen levels.

Potentials and shortcomings

Potentials

Traditional bioassessment methods are well established internationally, and serve biomonitoring well, as for example for aquatic bioassessment within the EU Directives WFD and MSFD. However, DNA barcoding could improve traditional methods for the following reasons: DNA barcoding (i) can increase taxonomic resolution and harmonize the identification of taxa which are difficult to identify or lack experts, (ii) can more accurately/precisely relate environmental factors to specific taxa, (iii) can increase comparability among regions, (iv) allows for the inclusion of early life stages and fragmented specimens, (v) allows delimitation of cryptic/rare species, (vi) allows for development of new indices, e.g. for rare/cryptic species which may be sensitive/tolerant to stressors, (vii) increases the number of samples which can be processed and reduces processing time, resulting in increased knowledge of species ecology, and (viii) is a non-invasive way of monitoring when using eDNA methods.

Time and cost

DNA barcoding is faster than traditional morphological methods all the way from training through to taxonomic assignment. It takes less time to gain expertise in DNA methods than becoming an expert in taxonomy. In addition, the DNA barcoding workflow (i.e. from sample to result) is generally quicker than traditional morphological workflow and allows the processing of more samples.

Taxonomic resolution

DNA barcoding allows the resolution of taxa from higher (e.g. family) to lower (e.g. species) taxonomic levels that are otherwise too difficult to identify using traditional morphological methods, such as identification via microscopy. For example, Chironomidae (the non-biting midges) are widely distributed in both terrestrial and freshwater ecosystems. Their richness and abundance make them important for ecological processes and networks, and they are one of many invertebrate groups used in biomonitoring. Invertebrate samples can contain as many as 100 species of chironomids, which often make up as much as 50% of a sample. Despite this, they are usually not identified below the family level because of the taxonomic expertise and time required. This may result in different chironomid species with different ecological preferences being grouped together, resulting in inaccurate assessment of water quality.

DNA barcoding provides the opportunity to resolve taxa, and to directly relate stressor effects to specific taxa such as individual chironomid species. For example, Beermann et al. (2018) DNA barcoded Chironomidae to investigate their response to multiple stressors: reduced flow, increased fine sediment and increased salinity. After barcoding, it was found that the chironomid sample consisted of 183 Operational Taxonomic Units (OTUs), i.e. barcodes (sequences) that are often equivalent to morphological species. These 183 OTUs displayed 15 response types rather than the two response types previously reported when all chironomids were grouped together in the same multiple-stressor study. A similar trend was discovered in a study by Macher et al. (2016), which discovered cryptic diversity within the New Zealand mayfly species Deleatidium sp. This study found different response patterns of 12 molecularly distinct OTUs to stressors, which may change the consensus that this mayfly is sensitive to pollution.

Shortcomings

Despite the advantages offered by DNA barcoding, it has also been suggested that DNA barcoding is best used as a complement to traditional morphological methods. This recommendation is based on multiple perceived challenges.

Physical parameters

It is not completely straightforward to connect DNA barcodes with ecological preferences of the barcoded taxon in question, as is needed if barcoding is to be used for biomonitoring. For example, detecting target DNA in aquatic systems depends on the concentration of DNA molecules at a site, which in turn can be affected by many factors. The presence of DNA molecules also depends on dispersion at a site, e.g. direction or strength of currents. It is not really known how DNA moves around in streams and lakes, which makes sampling difficult. Another factor might be the behavior of the target species, e.g. fish can have seasonal changes of movements, crayfish or mussels will release DNA in larger amounts just at certain times of their life (moulting, spawning). For DNA in soil, even less is known about distribution, quantity or quality.

The major limitation of the barcoding method is that it relies on barcode reference libraries for the taxonomic identification of the sequences. The taxonomic identification is accurate only if a reliable reference is available. However, most databases are still incomplete, especially for smaller organisms, e.g. fungi, phytoplankton, Nematoda, etc. In addition, current databases contain misidentifications, spelling mistakes and other errors. Massive curation and completion efforts are necessary for the databases covering all organism groups, involving large barcoding projects (for example the iBOL project for the Barcode of Life Data Systems (BOLD) reference database). However, completion and curation are difficult and time-consuming. Without vouchered specimens, there can be no certainty about whether the sequence used as a reference is correct.

DNA sequence databases like GenBank contain many sequences that are not tied to vouchered specimens (for example, herbarium specimens, cultured cell lines, or sometimes images). This is problematic in the face of taxonomic issues such as whether several species should be split or combined, or whether past identifications were sound. Reusing sequences from initially misidentified organisms that are not tied to vouchered specimens may support incorrect conclusions and must be avoided. Therefore, best practice for DNA barcoding is to sequence vouchered specimens. For many taxa, however, it can be difficult to obtain reference specimens, for example because specimens are difficult to catch, available specimens are poorly conserved, or adequate taxonomic expertise is lacking.

Importantly, DNA barcodes can also be used to create interim taxonomy, in which case OTUs can be used as substitutes for traditional Latin binomials – thus significantly reducing dependency on fully populated reference databases.

Technological bias

DNA barcoding also carries methodological bias, from sampling to bioinformatic data analysis. Besides the risk of contamination of the DNA sample by PCR inhibitors, primer bias is one of the major sources of error in DNA barcoding. The isolation of an efficient DNA marker and the design of primers is a complex process, and considerable effort has been made to develop primers for DNA barcoding in different taxonomic groups. However, primers will often bind preferentially to some sequences, leading to differential primer efficiency and specificity, unrepresentative assessments of communities, and inflated richness estimates. Thus, the apparent community composition of a sample is mainly altered at the PCR step. In addition, PCR replication is often required, but leads to an exponential increase in the risk of contamination. Several studies have highlighted the possibility of using mitochondria-enriched samples or PCR-free approaches to avoid these biases, but as of today, the DNA metabarcoding technique is still based on the sequencing of amplicons. Other biases enter the picture during sequencing and during the bioinformatic processing of the sequences, such as the creation of chimeras.

Lack of standardization

Even as DNA barcoding is more widely used and applied, there is no agreement concerning the methods for DNA preservation or extraction, the choice of DNA markers and primer sets, or PCR protocols. The parameters of bioinformatics pipelines (for example OTU clustering, taxonomic assignment algorithms or thresholds, etc.) are at the origin of much debate among DNA barcoding users. Sequencing technologies are also rapidly evolving, together with the tools for the analysis of the massive amounts of DNA data generated, and standardization of methods is urgently needed to enable collaboration and data sharing at greater spatial and temporal scales. This standardization of barcoding methods at the European scale is part of the objectives of the European COST Action DNAqua-net and is also addressed by CEN (the European Committee for Standardization).

Another criticism of DNA barcoding is its limited efficiency for accurate discrimination below species level (for example, to distinguish between varieties), for hybrid detection, and that it can be affected by evolutionary rates (Ref needed).

Mismatches between conventional (morphological) and barcode-based identification

It is important to know that taxa lists derived by conventional (morphological) identification are not, and maybe never will be, directly comparable to taxa lists derived from barcode-based identification, for several reasons. The most important cause is probably the incompleteness and lack of accuracy of the molecular reference databases, preventing a correct taxonomic assignment of eDNA sequences. Taxa not present in reference databases will not be found by eDNA, and sequences linked to a wrong name will lead to incorrect identification. Other known causes are differences in sampling scale and size between a traditional and a molecular sample; the possible analysis of dead organisms, which affects the two methods in different ways depending on the organism group; and method-specific selectivity in identification, i.e. varying taxonomic expertise or ability to identify certain organism groups on the one hand, and primer bias leading to a potentially biased analysis of taxa on the other.

Estimates of richness/diversity

DNA barcoding can result in an over- or underestimate of species richness and diversity. Some studies suggest that artifacts (identification of species not present in a community) are a major cause of inflated biodiversity. The most problematic issue is taxa represented by low numbers of sequencing reads. These reads are usually removed during the data filtering process, since different studies suggest that most of these low-frequency reads may be artifacts. However, real rare taxa may exist among these low-abundance reads. Rare sequences can reflect unique lineages in communities, which makes them informative and valuable sequences. Thus, there is a strong need for more robust bioinformatics algorithms that allow the differentiation between informative reads and artifacts. Complete reference libraries would also allow better testing of bioinformatics algorithms, by permitting better filtering of artifacts (i.e. the removal of sequences lacking a counterpart among extant species), and would therefore make more accurate species assignment possible. Cryptic diversity can also result in inflated biodiversity, as one morphological species may actually split into many distinct molecular sequences.
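As a minimal illustration of the read-count filtering discussed above, the sketch below drops OTUs whose read count falls below an assumed cut-off. The OTU names, counts and the cut-off value are invented; as noted, such a filter can discard genuinely rare taxa together with artifacts.

```python
# Minimal sketch of an abundance filter: drop OTUs below a chosen read-count
# threshold. The threshold is a judgment call, since rare taxa can be lost.
otu_table = {
    "OTU_1": 5230,
    "OTU_2": 812,
    "OTU_3": 3,      # possibly an artifact, possibly a genuinely rare taxon
    "OTU_4": 1,
}

MIN_READS = 10       # assumed cut-off for illustration

filtered = {otu: n for otu, n in otu_table.items() if n >= MIN_READS}
print(filtered)      # {'OTU_1': 5230, 'OTU_2': 812}
```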

Metabarcoding

Differences in the standard methods for DNA barcoding and metabarcoding. While DNA barcoding aims to identify a specific species, metabarcoding characterizes the whole community.
 

Metabarcoding is defined as the barcoding of DNA or eDNA (environmental DNA) that allows for simultaneous identification of many taxa within the same (environmental) sample, although often within the same organism group. The main difference between the approaches is that metabarcoding, in contrast to barcoding, does not focus on one specific organism, but instead aims to determine species composition within a sample.

Methodology

The metabarcoding procedure, like general barcoding, covers the steps of DNA extraction, PCR amplification, sequencing and data analysis. A barcode consists of a short variable gene region (for example, see different markers/barcodes) which is useful for taxonomic assignment, flanked by highly conserved gene regions which can be used for primer design. Different genes are used depending on whether the aim is to barcode a single species or to metabarcode several species. In the latter case, a more universal gene is used. Metabarcoding does not use single-species DNA/RNA as a starting point, but DNA/RNA from several different organisms derived from one environmental or bulk sample.

Applications

Metabarcoding has the potential to complement biodiversity measures, and even replace them in some instances, especially as the technology advances and procedures gradually become cheaper, more optimized and widespread.

DNA metabarcoding applications include:

Advantages and challenges

The general advantages and shortcomings of barcoding reviewed above are also valid for metabarcoding. One particular drawback for metabarcoding studies is that there is no consensus yet regarding the optimal experimental design and bioinformatics criteria to be applied in eDNA metabarcoding. However, there are current joint efforts, such as the EU COST network DNAqua-Net, to move forward by exchanging experience and knowledge to establish best-practice standards for biomonitoring.

Protein (nutrient)

From Wikipedia, the free encyclopedia

Amino acids are the building blocks of protein.
 
Amino acids are necessary nutrients. Present in every cell, they are also precursors to nucleic acids, co-enzymes, hormones, molecules of the immune response and tissue repair, and other molecules essential for life.

Proteins are essential nutrients for the human body. They are one of the building blocks of body tissue and can also serve as a fuel source. As a fuel, proteins provide as much energy density as carbohydrates: 4 kcal (17 kJ) per gram; in contrast, lipids provide 9 kcal (37 kJ) per gram. The most important aspect and defining characteristic of protein from a nutritional standpoint is its amino acid composition.

Proteins are polymer chains made of amino acids linked together by peptide bonds. During human digestion, proteins are broken down in the stomach to smaller polypeptide chains via hydrochloric acid and protease actions. This is crucial for the absorption of the essential amino acids that cannot be biosynthesized by the body.

There are nine essential amino acids which humans must obtain from their diet in order to prevent protein–energy malnutrition and resulting death. They are phenylalanine, valine, threonine, tryptophan, methionine, leucine, isoleucine, lysine, and histidine. There has been debate as to whether there are 8 or 9 essential amino acids. The consensus seems to lean towards 9 since histidine is not synthesized in adults. There are five amino acids which humans are able to synthesize in the body. These five are alanine, aspartic acid, asparagine, glutamic acid and serine. There are six conditionally essential amino acids whose synthesis can be limited under special pathophysiological conditions, such as prematurity in the infant or individuals in severe catabolic distress. These six are arginine, cysteine, glycine, glutamine, proline and tyrosine. Dietary sources of protein include meats, dairy products, fish, eggs, grains, legumes, nuts and edible insects.

Protein functions in human body

Protein is a nutrient needed by the human body for growth and maintenance. Aside from water, proteins are the most abundant kind of molecules in the body. Protein can be found in all cells of the body and is their major structural component, especially in muscle. This also includes body organs, hair and skin. Proteins are also used in membranes, such as glycoproteins. When broken down into amino acids, they are used as precursors to nucleic acids, co-enzymes, hormones, immune response, cellular repair, and other molecules essential for life. Additionally, protein is needed to form blood cells.

Sources

Some sources of animal-based protein

 
Nutritional value and environmental impact of animal products, compared to agriculture overall
Category                  Contribution of farmed animal products [%]
Calories                  18
Proteins                  37
Land use                  83
Greenhouse gases          58
Water pollution           57
Air pollution             56
Freshwater withdrawals    33

Protein occurs in a wide range of food. On a worldwide basis, plant protein foods contribute over 60% of the per capita supply of protein. In North America, animal-derived foods contribute about 70% of protein sources. Insects are a source of protein in many parts of the world. In parts of Africa, up to 50% of dietary protein derives from insects. It is estimated that more than 2 billion people eat insects daily.

Meat, dairy, eggs, soy, fish, whole grains, and cereals are sources of protein. Examples of food staples and cereal sources of protein, each with a concentration greater than 7%, are (in no particular order) buckwheat, oats, rye, millet, maize (corn), rice, wheat, sorghum, amaranth, and quinoa. Some research highlights game meat as a protein source.

Vegan sources of proteins include legumes, nuts, seeds and fruits. Vegan foods with protein concentrations greater than 7% include soybeans, lentils, kidney beans, white beans, mung beans, chickpeas, cowpeas, lima beans, pigeon peas, lupines, wing beans, almonds, Brazil nuts, cashews, pecans, walnuts, cotton seeds, pumpkin seeds, hemp seeds, sesame seeds, and sunflower seeds.

Photovoltaic-driven microbial protein production uses electricity from solar panels and carbon dioxide from the air to create fuel for microbes, which are grown in bioreactor vats and then processed into dry protein powders. The process makes highly efficient use of land, water and fertiliser.

Plant sources of protein.

People eating a balanced diet do not need protein supplements.

The table below presents food groups as protein sources.

Food source                Lysine   Threonine   Tryptophan   Sulfur-containing amino acids
Legumes                    64       38          12           25
Cereals and whole grains   31       32          12           37
Nuts and seeds             45       36          17           46
Fruits                     45       29          11           27
Animal                     85       44          12           38

Protein milkshakes, made from protein powder (center) and milk (left), are a common bodybuilding supplement

Protein powders – such as casein, whey, egg, rice, soy and cricket flour – are processed and manufactured sources of protein.

Testing in foods

The classic assays for protein concentration in food are the Kjeldahl method and the Dumas method. These tests determine the total nitrogen in a sample. The only major component of most food which contains nitrogen is protein (fat, carbohydrate and dietary fiber do not contain nitrogen). If the amount of nitrogen is multiplied by a factor depending on the kinds of protein expected in the food, the total protein can be determined. This value is known as the "crude protein" content. On food labels the protein is given by the nitrogen multiplied by 6.25, because the average nitrogen content of proteins is about 16%. The Kjeldahl test is typically used because it is the method the AOAC International has adopted and is therefore used by many food standards agencies around the world, though the Dumas method is also approved by some standards organizations.
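A minimal sketch of the "crude protein" arithmetic described above; the nitrogen value is invented for illustration.

```python
# Crude protein estimate: total nitrogen (from a Kjeldahl or Dumas test)
# multiplied by a conversion factor, 6.25 by default because proteins
# average roughly 16% nitrogen by mass.
def crude_protein(nitrogen_grams, factor=6.25):
    return nitrogen_grams * factor

# e.g. 2.0 g nitrogen per 100 g of food -> 12.5 g crude protein per 100 g
print(crude_protein(2.0))
```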

Accidental contamination and intentional adulteration of protein meals with non-protein nitrogen sources that inflate crude protein content measurements have been known to occur in the food industry for decades. To ensure food quality, purchasers of protein meals routinely conduct quality control tests designed to detect the most common non-protein nitrogen contaminants, such as urea and ammonium nitrate.

In at least one segment of the food industry, the dairy industry, some countries (at least the U.S., Australia, France and Hungary) have adopted "true protein" measurement, as opposed to crude protein measurement, as the standard for payment and testing: "True protein is a measure of only the proteins in milk, whereas crude protein is a measure of all sources of nitrogen and includes nonprotein nitrogen, such as urea, which has no food value to humans. ... Current milk-testing equipment measures peptide bonds, a direct measure of true protein." Measuring peptide bonds in grains has also been put into practice in several countries including Canada, the UK, Australia, Russia and Argentina where near-infrared reflectance (NIR) technology, a type of infrared spectroscopy is used. The Food and Agriculture Organization of the United Nations (FAO) recommends that only amino acid analysis be used to determine protein in, inter alia, foods used as the sole source of nourishment, such as infant formula, but also provides: "When data on amino acids analyses are not available, determination of protein based on total N content by Kjeldahl (AOAC, 2000) or similar method ... is considered acceptable."

The testing method for protein in beef cattle feed has grown into a science over the post-war years. The standard text in the United States, Nutrient Requirements of Beef Cattle, has been through eight editions over at least seventy years. The 1996 sixth edition substituted for the fifth edition's crude protein the concept of "metabolizable protein", which was defined around the year 2000 as "the true protein absorbed by the intestine, supplied by microbial protein and undegraded intake protein".

The limitations of the Kjeldahl method were at the heart of the Chinese protein export contamination in 2007 and the 2008 China milk scandal in which the industrial chemical melamine was added to the milk or glutens to increase the measured "protein".

Protein quality

The most important aspect and defining characteristic of protein from a nutritional standpoint is its amino acid composition. There are multiple systems which rate proteins by their usefulness to an organism based on their relative percentage of amino acids and, in some systems, the digestibility of the protein source. They include biological value, net protein utilization, and PDCAAS (Protein Digestibility Corrected Amino Acid Score), which was developed by the FDA as a modification of the protein efficiency ratio (PER) method. The PDCAAS rating was adopted by the US Food and Drug Administration (FDA) and the Food and Agriculture Organization of the United Nations/World Health Organization (FAO/WHO) in 1993 as "the preferred 'best'" method to determine protein quality. These organizations have suggested that other methods for evaluating the quality of protein are inferior. In 2013 FAO proposed changing to the Digestible Indispensable Amino Acid Score.

Digestion

Most proteins are decomposed to single amino acids by digestion in the gastro-intestinal tract.

Digestion typically begins in the stomach when pepsinogen is converted to pepsin by the action of hydrochloric acid, and is continued by trypsin and chymotrypsin in the small intestine. Before absorption in the small intestine, most proteins are already reduced to single amino acids or peptides of several amino acids. Most peptides longer than four amino acids are not absorbed. Absorption into the intestinal absorptive cells is not the end. There, most of the peptides are broken into single amino acids.

Absorption of the amino acids and their derivatives into which dietary protein is degraded is done by the gastrointestinal tract. The absorption rates of individual amino acids are highly dependent on the protein source; for example, the digestibilities of many amino acids differ between soy and milk proteins and between individual milk proteins such as beta-lactoglobulin and casein. For milk proteins, about 50% of the ingested protein is absorbed between the stomach and the jejunum and 90% is absorbed by the time the digested food reaches the ileum. Biological value (BV) is a measure of the proportion of absorbed protein from a food which becomes incorporated into the proteins of the organism's body.

Newborn

Newborns of mammals are exceptional in protein digestion and assimilation in that they can absorb intact proteins at the small intestine. This enables passive immunity, i.e., transfer of immunoglobulins from the mother to the newborn, via milk.

Dietary requirements

An education campaign launched by the United States Department of Agriculture about 100 years ago, on cottage cheese as a lower-cost protein substitute for meat.

Considerable debate has taken place regarding issues surrounding protein intake requirements. The amount of protein required in a person's diet is determined in large part by overall energy intake, the body's need for nitrogen and essential amino acids, body weight and composition, rate of growth in the individual, physical activity level, the individual's energy and carbohydrate intake, and the presence of illness or injury. Physical activity and exertion as well as enhanced muscular mass increase the need for protein. Requirements are also greater during childhood for growth and development, during pregnancy, or when breastfeeding in order to nourish a baby or when the body needs to recover from malnutrition or trauma or after an operation.

Dietary recommendations

According to US & Canadian Dietary Reference Intake guidelines, women aged 19–70 need to consume 46 grams of protein per day while men aged 19–70 need to consume 56 grams of protein per day to minimize risk of deficiency. These Recommended Dietary Allowances (RDAs) were calculated based on 0.8 grams protein per kilogram body weight and average body weights of 57 kg (126 pounds) and 70 kg (154 pounds), respectively. However, this recommendation is based on structural requirements but disregards use of protein for energy metabolism. This requirement is for a normal sedentary person. In the United States, average protein consumption is higher than the RDA. According to results of the National Health and Nutrition Examination Survey (NHANES 2013-2014), average protein consumption for women ages 20 and older was 69.8 grams and for men 98.3 grams/day.
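A minimal sketch of the RDA arithmetic described above (0.8 grams of protein per kilogram of body weight per day); the two reference body weights reproduce the 46 g and 56 g figures.

```python
# Protein RDA for a sedentary adult: 0.8 g per kg of body weight per day.
def protein_rda_grams(body_weight_kg, grams_per_kg=0.8):
    return body_weight_kg * grams_per_kg

print(protein_rda_grams(57))   # ~46 g/day, the reference value for women
print(protein_rda_grams(70))   # 56 g/day, the reference value for men
```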

Active people

Several studies have concluded that active people and athletes may require elevated protein intake (compared to 0.8 g/kg) due to increases in muscle mass and sweat losses, as well as the need for body repair and an additional energy source. Suggested amounts vary from 1.2 to 1.4 g/kg for those doing endurance exercise to as much as 1.6–1.8 g/kg for strength exercise, while a proposed maximum daily protein intake would be approximately 25% of energy requirements, i.e. approximately 2 to 2.5 g/kg. However, many questions still remain to be resolved.

In addition, some have suggested that athletes using restricted-calorie diets for weight loss should further increase their protein consumption, possibly to 1.8–2.0 g/kg, in order to avoid loss of lean muscle mass.

Aerobic exercise protein needs

Endurance athletes differ from strength-building athletes in that endurance athletes do not build as much muscle mass from training as strength-building athletes do. Research suggests that individuals performing endurance activity require more protein intake than sedentary individuals so that muscles broken down during endurance workouts can be repaired. Although the protein requirement for athletes still remains controversial (for instance see Lamont, Nutrition Research Reviews, pages 142 - 149, 2012), research does show that endurance athletes can benefit from increasing protein intake because the type of exercise endurance athletes participate in still alters the protein metabolism pathway. The overall protein requirement increases because of amino acid oxidation in endurance-trained athletes. Endurance athletes who exercise over a long period (2–5 hours per training session) use protein as a source of 5–10% of their total energy expended. Therefore, a slight increase in protein intake may be beneficial to endurance athletes by replacing the protein lost in energy expenditure and protein lost in repairing muscles. One review concluded that endurance athletes may increase daily protein intake to a maximum of 1.2–1.4 g per kg body weight.

Anaerobic exercise protein needs

Research also indicates that individuals performing strength training activity require more protein than sedentary individuals. Strength-training athletes may increase their daily protein intake to a maximum of 1.4–1.8 g per kg body weight to enhance muscle protein synthesis, or to make up for the loss of amino acid oxidation during exercise. Many athletes maintain a high-protein diet as part of their training. In fact, some athletes who specialize in anaerobic sports (e.g., weightlifting) believe a very high level of protein intake is necessary, and so consume high protein meals and also protein supplements.

Special populations

Protein allergies

A food allergy is an abnormal immune response to proteins in food. The signs and symptoms may range from mild to severe. They may include itchiness, swelling of the tongue, vomiting, diarrhea, hives, trouble breathing, or low blood pressure. These symptoms typically occur within minutes to one hour after exposure. When the symptoms are severe, it is known as anaphylaxis. The following eight foods are responsible for about 90% of allergic reactions: cow's milk, eggs, wheat, shellfish, fish, peanuts, tree nuts and soy.

Chronic kidney disease

While there is no conclusive evidence that a high protein diet can cause chronic kidney disease, there is a consensus that people with this disease should decrease consumption of protein. According to one 2009 review updated in 2018, people with chronic kidney disease who reduce protein consumption have less likelihood of progressing to end stage kidney disease. Moreover, people with this disease while using a low protein diet (0.6 g/kg/d - 0.8 g/kg/d) may develop metabolic compensations that preserve kidney function, although in some people, malnutrition may occur.

Phenylketonuria

Individuals with phenylketonuria (PKU) must keep their intake of phenylalanine - an essential amino acid - extremely low to prevent a mental disability and other metabolic complications. Phenylalanine is a component of the artificial sweetener aspartame, so people with PKU need to avoid low calorie beverages and foods with this ingredient.

Maple syrup urine disease

Maple syrup urine disease is associated with genetic anomalies in the metabolism of branched-chain amino acids (BCAAs). Affected individuals have high blood levels of BCAAs and must severely restrict their intake of BCAAs in order to prevent intellectual disability and death. The amino acids in question are leucine, isoleucine and valine. The condition gets its name from the distinctive sweet odor of affected infants' urine. Children of Amish, Mennonite, and Ashkenazi Jewish descent have a high prevalence of this disease compared to other populations.

Excess consumption

The U.S. and Canadian Dietary Reference Intake review for protein concluded that there was not sufficient evidence to establish a Tolerable upper intake level, i.e., an upper limit for how much protein can be safely consumed.

When amino acids are in excess of needs, the liver takes up the amino acids and deaminates them, a process converting the nitrogen from the amino acids into ammonia, further processed in the liver into urea via the urea cycle. Excretion of urea occurs via the kidneys. Other parts of the amino acid molecules can be converted into glucose and used for fuel. When food protein intake is periodically high or low, the body tries to keep protein levels at an equilibrium by using the "labile protein reserve" to compensate for daily variations in protein intake. However, unlike body fat as a reserve for future caloric needs, there is no protein storage for future needs.

Excessive protein intake may increase calcium excretion in urine, occurring to compensate for the pH imbalance from oxidation of sulfur amino acids. This may lead to a higher risk of kidney stone formation from calcium in the renal circulatory system. One meta-analysis reported no adverse effects of higher protein intakes on bone density. Another meta-analysis reported a small decrease in systolic and diastolic blood pressure with diets higher in protein, with no differences between animal and plant protein.

High-protein diets have been shown to lead to an additional 1.21 kg of weight loss over a period of 3 months versus a baseline protein diet in a meta-analysis. Benefits of decreased body mass index as well as HDL cholesterol were more strongly observed in studies with only a slight increase in protein intake than in those where high protein intake was classified as 45% of total energy intake. Detrimental effects on cardiovascular activity were not observed in short-term diets of 6 months or less. There is little consensus on the potentially detrimental effects on healthy individuals of a long-term high-protein diet, leading to caution advisories about using high protein intake as a form of weight loss.

The 2015–2020 Dietary Guidelines for Americans (DGA) recommends that men and teenage boys increase their consumption of fruits, vegetables and other under-consumed foods, and that a means of accomplishing this would be to reduce overall intake of protein foods. The 2015 - 2020 DGA report does not set a recommended limit for the intake of red and processed meat. While the report acknowledges research showing that lower intake of red and processed meat is correlated with reduced risk of cardiovascular diseases in adults, it also notes the value of nutrients provided from these meats. The recommendation is not to limit intake of meats or protein, but rather to monitor and keep within daily limits the sodium (< 2300 mg), saturated fats (less than 10% of total calories per day), and added sugars (less than 10% of total calories per day) that may be increased as a result of consumption of certain meats and proteins. While the 2015 DGA report does advise for a reduced level of consumption of red and processed meats, the 2015-2020 DGA key recommendations recommend that a variety of protein foods be consumed, including both vegetarian and non-vegetarian sources of protein.

Protein deficiency

A child in Nigeria during the Biafra War suffering from kwashiorkor – one of the three protein energy malnutrition ailments afflicting over 10 million children in developing countries.
 

Protein deficiency and malnutrition (PEM) can lead to a variety of ailments including intellectual disability and kwashiorkor. Symptoms of kwashiorkor include apathy, diarrhea, inactivity, failure to grow, flaky skin, fatty liver, and edema of the belly and legs. This edema is explained by the action of lipoxygenase on arachidonic acid to form leukotrienes and by the normal functioning of proteins in fluid balance and lipoprotein transport.

PEM is fairly common worldwide in both children and adults and accounts for 6 million deaths annually. In the industrialized world, PEM is predominantly seen in hospitals, is associated with disease, or is often found in the elderly.
