
Monday, February 24, 2025

Polymerase chain reaction

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Polymerase_chain_reaction
A strip of eight PCR tubes, each containing a 100 μL reaction mixture
Placing a strip of eight PCR tubes into a thermal cycler

The polymerase chain reaction (PCR) is a method widely used to make millions to billions of copies of a specific DNA sample rapidly, allowing scientists to amplify a very small sample of DNA (or a part of it) sufficiently to enable detailed study. PCR was invented in 1983 by American biochemist Kary Mullis at Cetus Corporation. Mullis and biochemist Michael Smith, who had developed other essential ways of manipulating DNA, were jointly awarded the Nobel Prize in Chemistry in 1993.

PCR is fundamental to many of the procedures used in genetic testing and research, including analysis of ancient samples of DNA and identification of infectious agents. Using PCR, copies of very small amounts of DNA sequences are exponentially amplified in a series of cycles of temperature changes. PCR is now a common and often indispensable technique used in medical laboratory research for a broad variety of applications including biomedical research and forensic science.

The majority of PCR methods rely on thermal cycling. Thermal cycling exposes reagents to repeated cycles of heating and cooling to permit different temperature-dependent reactions—specifically, DNA melting and enzyme-driven DNA replication. PCR employs two main reagents—primers (which are short single strand DNA fragments known as oligonucleotides that are a complementary sequence to the target DNA region) and a thermostable DNA polymerase. In the first step of PCR, the two strands of the DNA double helix are physically separated at a high temperature in a process called nucleic acid denaturation. In the second step, the temperature is lowered and the primers bind to the complementary sequences of DNA. The two DNA strands then become templates for DNA polymerase to enzymatically assemble a new DNA strand from free nucleotides, the building blocks of DNA. As PCR progresses, the DNA generated is itself used as a template for replication, setting in motion a chain reaction in which the original DNA template is exponentially amplified.

Almost all PCR applications employ a heat-stable DNA polymerase, such as Taq polymerase, an enzyme originally isolated from the thermophilic bacterium Thermus aquaticus. If the polymerase used was heat-susceptible, it would denature under the high temperatures of the denaturation step. Before the use of Taq polymerase, DNA polymerase had to be manually added every cycle, which was a tedious and costly process.

Applications of the technique include DNA cloning for sequencing, gene cloning and manipulation, gene mutagenesis; construction of DNA-based phylogenies, or functional analysis of genes; diagnosis and monitoring of genetic disorders; amplification of ancient DNA; analysis of genetic fingerprints for DNA profiling (for example, in forensic science and parentage testing); and detection of pathogens in nucleic acid tests for the diagnosis of infectious diseases.

Principles

An older, three-temperature thermal cycler for PCR

PCR amplifies a specific region of a DNA strand (the DNA target). Most PCR methods amplify DNA fragments of between 0.1 and 10 kilo base pairs (kbp) in length, although some techniques allow for amplification of fragments up to 40 kbp. The amount of amplified product is determined by the available substrates in the reaction, which becomes limiting as the reaction progresses.

A basic PCR set-up requires several components and reagents, including:

  • a DNA template that contains the DNA target region to amplify
  • a DNA polymerase; an enzyme that polymerizes new DNA strands; heat-resistant Taq polymerase is especially common, as it is more likely to remain intact during the high-temperature DNA denaturation process
  • two DNA primers that are complementary to the 3' (three prime) ends of each of the sense and anti-sense strands of the DNA target (DNA polymerase can only bind to and elongate from a double-stranded region of DNA; without primers, there is no double-stranded initiation site at which the polymerase can bind); specific primers that are complementary to the DNA target region are selected beforehand, and are often custom-made in a laboratory or purchased from commercial biochemical suppliers
  • deoxynucleoside triphosphates, or dNTPs (sometimes called "deoxynucleotide triphosphates"; nucleotides containing triphosphate groups), the building blocks from which the DNA polymerase synthesizes a new DNA strand
  • a buffer solution providing a suitable chemical environment for optimum activity and stability of the DNA polymerase
  • bivalent cations, typically magnesium (Mg) or manganese (Mn) ions; Mg2+ is the most common, but Mn2+ can be used for PCR-mediated DNA mutagenesis, as a higher Mn2+ concentration increases the error rate during DNA synthesis; and monovalent cations, typically potassium (K) ions.

The reaction is commonly carried out in a volume of 10–200 μL in small reaction tubes (0.2–0.5 mL volumes) in a thermal cycler. The thermal cycler heats and cools the reaction tubes to achieve the temperatures required at each step of the reaction (see below). Many modern thermal cyclers make use of a Peltier device, which permits both heating and cooling of the block holding the PCR tubes simply by reversing the device's electric current. Thin-walled reaction tubes permit favorable thermal conductivity to allow for rapid thermal equilibrium. Most thermal cyclers have heated lids to prevent condensation at the top of the reaction tube. Older thermal cyclers lacking a heated lid require a layer of oil on top of the reaction mixture or a ball of wax inside the tube.

Procedure

Typically, PCR consists of a series of 20–40 repeated temperature changes, called thermal cycles, with each cycle commonly consisting of two or three discrete temperature steps (see figure below). The cycling is often preceded by a single temperature step at a very high temperature (>90 °C (194 °F)), and followed by one hold at the end for final product extension or brief storage. The temperatures used and the length of time they are applied in each cycle depend on a variety of parameters, including the enzyme used for DNA synthesis, the concentration of bivalent ions and dNTPs in the reaction, and the melting temperature (Tm) of the primers. The individual steps common to most PCR methods are as follows:

  • Initialization: This step is only required for DNA polymerases that require heat activation by hot-start PCR. It consists of heating the reaction chamber to a temperature of 94–96 °C (201–205 °F), or 98 °C (208 °F) if extremely thermostable polymerases are used, which is then held for 1–10 minutes.
  • Denaturation: This step is the first regular cycling event and consists of heating the reaction chamber to 94–98 °C (201–208 °F) for 20–30 seconds. This causes DNA melting, or denaturation, of the double-stranded DNA template by breaking the hydrogen bonds between complementary bases, yielding two single-stranded DNA molecules.
  • Annealing: In the next step, the reaction temperature is lowered to 50–65 °C (122–149 °F) for 20–40 seconds, allowing annealing of the primers to each of the single-stranded DNA templates. Two different primers are typically included in the reaction mixture: one for each of the two single-stranded complements containing the target region. The primers are single-stranded sequences themselves, but are much shorter than the length of the target region, complementing only very short sequences at the 3' end of each strand.
It is critical to determine a proper temperature for the annealing step because efficiency and specificity are strongly affected by the annealing temperature. This temperature must be low enough to allow for hybridization of the primer to the strand, but high enough for the hybridization to be specific, i.e., the primer should bind only to a perfectly complementary part of the strand, and nowhere else. If the temperature is too low, the primer may bind imperfectly. If it is too high, the primer may not bind at all. A typical annealing temperature is about 3–5 °C below the Tm of the primers used. Stable hydrogen bonds between complementary bases are formed only when the primer sequence very closely matches the template sequence. During this step, the polymerase binds to the primer-template hybrid and begins DNA formation.
  • Extension/elongation: The temperature at this step depends on the DNA polymerase used; the optimum activity temperature for the thermostable Taq polymerase is approximately 75–80 °C (167–176 °F), though a temperature of 72 °C (162 °F) is commonly used with this enzyme. In this step, the DNA polymerase synthesizes a new DNA strand complementary to the DNA template strand by adding free dNTPs from the reaction mixture that are complementary to the template, working in the 5'-to-3' direction and condensing the 5'-phosphate group of each dNTP with the 3'-hydroxy group at the end of the nascent (elongating) DNA strand. The precise time required for elongation depends both on the DNA polymerase used and on the length of the DNA target region to amplify. As a rule of thumb, at their optimal temperature, most DNA polymerases polymerize a thousand bases per minute. Under optimal conditions (i.e., if there are no limitations due to limiting substrates or reagents), at each extension/elongation step, the number of DNA target sequences is doubled. With each successive cycle, the original template strands plus all newly generated strands become template strands for the next round of elongation, leading to exponential (geometric) amplification of the specific DNA target region.
The processes of denaturation, annealing and elongation constitute a single cycle. Multiple cycles are required to amplify the DNA target to millions of copies. The formula used to calculate the number of DNA copies formed after a given number of cycles is 2^n, where n is the number of cycles. Thus, a reaction set for 30 cycles results in 2^30, or 1,073,741,824, copies of the original double-stranded DNA target region (see the sketch after this list).
  • Final elongation: This single step is optional, but is performed at a temperature of 70–74 °C (158–165 °F) (the temperature range required for optimal activity of most polymerases used in PCR) for 5–15 minutes after the last PCR cycle to ensure that any remaining single-stranded DNA is fully elongated.
  • Final hold: The final step cools the reaction chamber to 4–15 °C (39–59 °F) for an indefinite time, and may be employed for short-term storage of the PCR products.
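To make these cycling parameters concrete, here is a minimal Python sketch (not part of the original article) that encodes a generic three-step protocol, estimates a primer melting temperature with the simple Wallace rule, and computes the ideal 2^n copy number. The primer sequences, temperatures and times are illustrative assumptions, not a validated protocol.

```python
def wallace_tm(primer: str) -> int:
    """Rough primer melting temperature by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))

# Hypothetical primer pair (placeholders, not real assay primers)
forward_primer = "AGCGGATAACAATTTCACACAGGA"
reverse_primer = "GTAAAACGACGGCCAGT"

tm = min(wallace_tm(forward_primer), wallace_tm(reverse_primer))
annealing_temp = tm - 5  # typical choice: 3-5 degC below the lower primer Tm

# A generic three-step protocol, mirroring the steps above (degC, seconds)
protocol = {
    "initial_denaturation": (95, 300),
    "cycles": 30,
    "denaturation": (95, 30),
    "annealing": (annealing_temp, 30),
    "extension": (72, 60),   # ~1 minute per kilobase is a common rule of thumb
    "final_extension": (72, 300),
    "final_hold": (4, None),
}

# Ideal yield assuming perfect doubling every cycle: 2**n copies
n = protocol["cycles"]
print(f"anneal at {annealing_temp} degC; ideal yield after {n} cycles: {2**n:,} copies")
```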
Schematic drawing of a complete PCR cycle
Ethidium bromide-stained PCR products after gel electrophoresis. Two sets of primers were used to amplify a target sequence from three different tissue samples. No amplification is present in sample #1; DNA bands in sample #2 and #3 indicate successful amplification of the target sequence. The gel also shows a positive control, and a DNA ladder containing DNA fragments of defined length for sizing the bands in the experimental PCRs.

To check whether the PCR successfully generated the anticipated DNA target region (also sometimes referred to as the amplimer or amplicon), agarose gel electrophoresis may be employed for size separation of the PCR products. The size of the PCR products is determined by comparison with a DNA ladder, a molecular weight marker which contains DNA fragments of known sizes, which runs on the gel alongside the PCR products.


Stages

Exponential amplification

As with other chemical reactions, the reaction rate and efficiency of PCR are affected by limiting factors. Thus, the entire PCR process can further be divided into three stages based on reaction progress:

  • Exponential amplification: At every cycle, the amount of product is doubled (assuming 100% reaction efficiency). After 30 cycles, a single copy of DNA can be increased up to 1,000,000,000 (one billion) copies. In a sense, then, the replication of a discrete strand of DNA is being manipulated in a tube under controlled conditions.[16] The reaction is very sensitive: only minute quantities of DNA need to be present.
  • Leveling off stage: The reaction slows as the DNA polymerase loses activity and as reagents such as dNTPs and primers are consumed and become limiting.
  • Plateau: No more product accumulates due to exhaustion of reagents and enzyme.
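A toy model makes these three stages visible. In the sketch below (an illustrative assumption of this post, not a method from the article), the per-cycle efficiency falls as the product approaches a reagent-limited ceiling, which reproduces the exponential, leveling-off and plateau phases in turn.

```python
def simulate_pcr(start_copies: float = 1.0, cycles: int = 40, capacity: float = 1e12):
    """Toy amplification model: doubling limited by a finite reagent capacity."""
    copies = start_copies
    history = []
    for _ in range(cycles):
        # Efficiency drops from ~1 toward 0 as the product approaches the ceiling
        efficiency = max(0.0, 1.0 - copies / capacity)
        copies += copies * efficiency
        history.append(copies)
    return history

h = simulate_pcr()
# Early cycles roughly double (exponential); later cycles flatten out (plateau)
print(f"cycle 10: {h[9]:.3g}, cycle 30: {h[29]:.3g}, cycle 40: {h[39]:.3g}")
```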

Optimization

In practice, PCR can fail for various reasons, in part because of its sensitivity to contamination. Contamination with extraneous DNA can lead to spurious products and is addressed with lab protocols and procedures that separate pre-PCR mixtures from potential DNA contaminants. For instance, if DNA from a crime scene is analyzed, a single DNA molecule from lab personnel could be amplified and misguide the investigation. Hence, PCR-setup areas are kept separate from areas used for the analysis or purification of PCR products, only disposable plasticware is used, and work surfaces are thoroughly cleaned between reaction setups.

Specificity can be adjusted by experimental conditions so that no spurious products are generated. Primer-design techniques are important in improving PCR product yield and in avoiding the formation of unspecific products. The use of alternative buffer components or polymerase enzymes can help with amplification of long or otherwise problematic regions of DNA. For instance, Q5 polymerase is said to be ≈280 times less error-prone than Taq polymerase. Adjusting the running parameters (e.g., the temperature and duration of cycles) or adding reagents such as formamide may increase the specificity and yield of PCR. Computer simulations of theoretical PCR results (electronic PCR) may be performed to assist in primer design.
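As a rough illustration of what such an electronic PCR does, the sketch below scans a template for the forward primer and for the reverse complement of the reverse primer and reports the predicted amplicon length. The template and primers are hypothetical and only exact matches are considered; real primer-design tools also score mismatches, melting temperatures and secondary structure.

```python
def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def predict_amplicon(template: str, fwd: str, rev: str):
    """Return the predicted product length if both primers find exact binding sites."""
    start = template.find(fwd)
    end = template.find(reverse_complement(rev))  # the reverse primer binds the opposite strand
    if start == -1 or end == -1 or end + len(rev) <= start:
        return None  # no product predicted
    return end + len(rev) - start

# Toy template and primers (placeholders)
template = "TTGACGGCTAGCTCAGTCCTAGGTACAGTGCTAGC" * 3
print(predict_amplicon(template, "TTGACGGCTAGC", "GCTAGCACTGTA"))  # predicted size in bp
```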

Applications

Selective DNA isolation

PCR allows isolation of DNA fragments from genomic DNA by selective amplification of a specific region of DNA. This use of PCR augments many methods, such as generating hybridization probes for Southern or northern hybridization and DNA cloning, which require larger amounts of DNA representing a specific DNA region. PCR supplies these techniques with high amounts of pure DNA, enabling analysis of DNA samples even from very small amounts of starting material.

Other applications of PCR include DNA sequencing, in which one of the amplification primers may be used in Sanger sequencing to determine an unknown PCR-amplified sequence, and the isolation of a DNA sequence to expedite recombinant DNA technologies involving the insertion of a DNA sequence into a plasmid, phage, or cosmid (depending on size) or into the genetic material of another organism. Bacterial colonies (such as E. coli) can be rapidly screened by PCR for correct DNA vector constructs. PCR may also be used for genetic fingerprinting, a forensic technique used to identify a person or organism by comparing experimental DNAs through different PCR-based methods.

Electrophoresis of PCR-amplified DNA fragments:
  1. Father
  2. Child
  3. Mother

The child has inherited some, but not all, of the fingerprints of each of its parents, giving it a new, unique fingerprint.

Some PCR fingerprint methods have high discriminative power and can be used to identify genetic relationships between individuals, such as parent–child or sibling relationships, and are used in paternity testing (see the electrophoresis figure above). This technique may also be used to determine evolutionary relationships among organisms when certain molecular clocks are used (i.e., the 16S rRNA and recA genes of microorganisms).

Amplification and quantification of DNA

Because PCR amplifies the regions of DNA that it targets, PCR can be used to analyze extremely small amounts of sample. This is often critical for forensic analysis, when only a trace amount of DNA is available as evidence. PCR may also be used in the analysis of ancient DNA that is tens of thousands of years old. These PCR-based techniques have been successfully used on animals, such as a forty-thousand-year-old mammoth, and also on human DNA, in applications ranging from the analysis of Egyptian mummies to the identification of a Russian tsar and the body of English king Richard III.

Quantitative PCR or Real Time PCR (qPCR, not to be confused with RT-PCR) methods allow the estimation of the amount of a given sequence present in a sample—a technique often applied to quantitatively determine levels of gene expression. Quantitative PCR is an established tool for DNA quantification that measures the accumulation of DNA product after each round of PCR amplification.

qPCR allows the quantification and detection of a specific DNA sequence in real time, since it measures concentration while the synthesis process is taking place. There are two methods for simultaneous detection and quantification. The first method uses fluorescent dyes that are retained nonspecifically in between the double strands. The second method involves probes that are specific for certain sequences and are fluorescently labeled; detection of DNA with these probes occurs only after the probes hybridize with their complementary DNA. An interesting technique combination is real-time PCR and reverse transcription. This sophisticated technique, called RT-qPCR, allows for the quantification of a small quantity of RNA. Through this combined technique, mRNA is converted to cDNA, which is further quantified using qPCR. This technique lowers the possibility of error at the end point of PCR, increasing chances for detection of genes associated with genetic diseases such as cancer. Laboratories use RT-qPCR for the purpose of sensitively measuring gene regulation. The mathematical foundations for the reliable quantification of PCR and RT-qPCR facilitate the implementation of accurate fitting procedures of experimental data in research, medical, diagnostic and infectious disease applications.
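As a concrete example of how qPCR threshold-cycle (Ct) values are turned into a relative expression level, the sketch below applies the widely used 2^-ΔΔCt calculation. The Ct values are invented for illustration, and the method assumes roughly 100% amplification efficiency and a stable reference gene.

```python
def fold_change(ct_target_sample, ct_ref_sample, ct_target_control, ct_ref_control):
    """Relative expression by the 2^-ddCt method."""
    d_ct_sample = ct_target_sample - ct_ref_sample      # normalize to the reference gene
    d_ct_control = ct_target_control - ct_ref_control
    dd_ct = d_ct_sample - d_ct_control
    return 2 ** (-dd_ct)                                # one cycle ~ one doubling

# Hypothetical Ct values: target gene vs. a housekeeping gene, sample vs. control
print(fold_change(24.1, 18.0, 27.3, 18.2))  # ~8-fold higher expression in the sample
```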

Medical and diagnostic applications

Prospective parents can be tested for being genetic carriers, or their children might be tested for actually being affected by a disease. DNA samples for prenatal testing can be obtained by amniocentesis, chorionic villus sampling, or even by the analysis of rare fetal cells circulating in the mother's bloodstream. PCR analysis is also essential to preimplantation genetic diagnosis, where individual cells of a developing embryo are tested for mutations.

  • PCR can also be used as part of a sensitive test for tissue typing, vital to organ transplantation. As of 2008, there is even a proposal to replace the traditional antibody-based tests for blood type with PCR-based tests.
  • Many forms of cancer involve alterations to oncogenes. By using PCR-based tests to study these mutations, therapy regimens can sometimes be individually customized to a patient. PCR permits early diagnosis of malignant diseases such as leukemia and lymphomas, which is currently the most highly developed use of PCR in cancer research and is already being used routinely. PCR assays can be performed directly on genomic DNA samples to detect translocation-specific malignant cells at a sensitivity that is at least 10,000-fold higher than that of other methods. PCR is very useful in the medical field since it allows for the isolation and amplification of tumor suppressors. Quantitative PCR, for example, can be used to quantify and analyze single cells, as well as recognize DNA, mRNA and protein conformations and combinations.

Infectious disease applications

PCR allows for rapid and highly specific diagnosis of infectious diseases, including those caused by bacteria or viruses.[36] PCR also permits identification of non-cultivatable or slow-growing microorganisms such as mycobacteria, anaerobic bacteria, or viruses from tissue culture assays and animal models. The basis for PCR diagnostic applications in microbiology is the detection of infectious agents and the discrimination of non-pathogenic from pathogenic strains by virtue of specific genes.

Characterization and detection of infectious disease organisms have been revolutionized by PCR in the following ways:

  • The human immunodeficiency virus (or HIV), is a difficult target to find and eradicate. The earliest tests for infection relied on the presence of antibodies to the virus circulating in the bloodstream. However, antibodies don't appear until many weeks after infection, maternal antibodies mask the infection of a newborn, and therapeutic agents to fight the infection don't affect the antibodies. PCR tests have been developed that can detect as little as one viral genome among the DNA of over 50,000 host cells. Infections can be detected earlier, donated blood can be screened directly for the virus, newborns can be immediately tested for infection, and the effects of antiviral treatments can be quantified.
  • Some disease organisms, such as that for tuberculosis, are difficult to sample from patients and slow to grow in the laboratory. PCR-based tests have allowed detection of small numbers of disease organisms (both live and dead) in convenient samples. Detailed genetic analysis can also be used to detect antibiotic resistance, allowing immediate and effective therapy. The effects of therapy can also be immediately evaluated.
  • The spread of a disease organism through populations of domestic or wild animals can be monitored by PCR testing. In many cases, the appearance of new virulent sub-types can be detected and monitored. The sub-types of an organism that were responsible for earlier epidemics can also be determined by PCR analysis.
  • Viral DNA can be detected by PCR. The primers used must be specific to the targeted sequences in the DNA of a virus, and PCR can be used for diagnostic analyses or DNA sequencing of the viral genome. The high sensitivity of PCR permits virus detection soon after infection and even before the onset of disease. Such early detection may give physicians a significant lead time in treatment. The amount of virus ("viral load") in a patient can also be quantified by PCR-based DNA quantitation techniques (see below). A variant of PCR (RT-PCR) is used for detecting viral RNA rather than DNA: in this test the enzyme reverse transcriptase is used to generate a DNA sequence which matches the viral RNA; this DNA is then amplified as per the usual PCR method. RT-PCR is widely used to detect the SARS-CoV-2 viral genome.
  • Pertussis (or whooping cough) is caused by the bacterium Bordetella pertussis. The disease is a serious acute respiratory infection that affects various animals and humans and has led to the deaths of many young children. The pertussis toxin is a protein exotoxin that binds to cell receptors by two dimers and reacts with different cell types such as T lymphocytes, which play a role in cell immunity. PCR is an important testing tool that can detect sequences within the gene for the pertussis toxin. Because PCR has a high sensitivity for the toxin gene and a rapid turnaround time, it is very efficient for diagnosing pertussis when compared to culture.

Forensic applications

The development of PCR-based genetic (or DNA) fingerprinting protocols has seen widespread application in forensics:

  • DNA samples are often taken at crime scenes and analyzed by PCR.
    In its most discriminating form, genetic fingerprinting can uniquely discriminate any one person from the entire population of the world. Minute samples of DNA can be isolated from a crime scene, and compared to that from suspects, or from a DNA database of earlier evidence or convicts. Simpler versions of these tests are often used to rapidly rule out suspects during a criminal investigation. Evidence from decades-old crimes can be tested, confirming or exonerating the people originally convicted.
  • Forensic DNA typing has been an effective way of identifying or exonerating criminal suspects due to analysis of evidence discovered at a crime scene. The human genome has many repetitive regions that can be found within gene sequences or in non-coding regions of the genome. Specifically, up to 40% of human DNA is repetitive. There are two distinct categories for these repetitive, non-coding regions in the genome. The first category is called variable number tandem repeats (VNTR), which are 10–100 base pairs long, and the second category is called short tandem repeats (STR), which consist of repeated 2–10 base pair sections. PCR is used to amplify several well-known VNTRs and STRs using primers that flank each of the repetitive regions. The sizes of the fragments obtained from any individual for each of the STRs will indicate which alleles are present. By analyzing several STRs for an individual, a set of alleles is obtained for each person that is statistically likely to be unique. Researchers have identified the complete sequence of the human genome. This sequence can be easily accessed through the NCBI website and is used in many real-life applications. For example, the FBI has compiled a set of DNA marker sites used for identification, and these are called the Combined DNA Index System (CODIS) DNA database. This database enables statistical analysis to be used to determine the probability that a DNA sample will match (a match-probability sketch follows this list). PCR is a very powerful and significant analytical tool to use for forensic DNA typing because researchers only need a very small amount of the target DNA to be used for analysis. For example, a single human hair with attached hair follicle has enough DNA to conduct the analysis. Similarly, a few sperm, skin samples from under the fingernails, or a small amount of blood can provide enough DNA for conclusive analysis.
  • Less discriminating forms of DNA fingerprinting can help in DNA paternity testing, where an individual is matched with their close relatives. DNA from unidentified human remains can be tested, and compared with that from possible parents, siblings, or children. Similar testing can be used to confirm the biological parents of an adopted (or kidnapped) child. The actual biological father of a newborn can also be confirmed (or ruled out).
  • The PCR AMGX/AMGY design has been shown not only to facilitate amplification of DNA sequences from a very minuscule amount of genomic material, but also to allow real-time sex determination from forensic bone samples. This provides a powerful and effective way to determine sex in forensic cases and ancient specimens.
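The statistics behind a CODIS-style match report mentioned above reduce to the product rule: multiply the genotype frequency at each independent locus to estimate how rare the full profile is. The locus names and allele frequencies in the sketch below are invented placeholders, not real population data.

```python
def genotype_frequency(p, q=None):
    """Hardy-Weinberg genotype frequency: p^2 for a homozygote, 2pq for a heterozygote."""
    return p * p if q is None else 2 * p * q

# Hypothetical STR profile: genotype frequency at each (assumed independent) locus
profile = {
    "locus_A": genotype_frequency(0.10, 0.20),  # heterozygote, alleles at 10% and 20%
    "locus_B": genotype_frequency(0.05),        # homozygote, allele at 5%
    "locus_C": genotype_frequency(0.15, 0.08),
}

match_probability = 1.0
for freq in profile.values():
    match_probability *= freq  # product rule across independent loci

print(f"random match probability: about 1 in {1 / match_probability:,.0f}")
```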

Research applications

PCR has been applied to many areas of research in molecular genetics:

  • PCR allows rapid production of short pieces of DNA, even when not more than the sequence of the two primers is known. This ability of PCR augments many methods, such as generating hybridization probes for Southern or northern blot hybridization. PCR supplies these techniques with large amounts of pure DNA, sometimes as a single strand, enabling analysis even from very small amounts of starting material.
  • The task of DNA sequencing can also be assisted by PCR. Known segments of DNA can easily be produced from a patient with a genetic disease mutation. Modifications to the amplification technique can extract segments from a completely unknown genome, or can generate just a single strand of an area of interest.
  • PCR has numerous applications to the more traditional process of DNA cloning. It can extract segments for insertion into a vector from a larger genome, which may be only available in small quantities. Using a single set of 'vector primers', it can also analyze or extract fragments that have already been inserted into vectors. Some alterations to the PCR protocol can generate mutations (general or site-directed) of an inserted fragment.
  • The sequence-tagged site (STS) technique uses PCR as an indicator that a particular segment of a genome is present in a particular clone. The Human Genome Project found this application vital to mapping the cosmid clones they were sequencing, and to coordinating the results from different laboratories.
  • An application of PCR is the phylogenetic analysis of DNA from ancient sources, such as that found in the recovered bones of Neanderthals, from frozen tissues of mammoths, or from the brains of Egyptian mummies. In some cases the highly degraded DNA from these sources might be reassembled during the early stages of amplification.
  • A common application of PCR is the study of patterns of gene expression. Tissues (or even individual cells) can be analyzed at different stages to see which genes have become active, or which have been switched off. This application can also use quantitative PCR to quantitate the actual levels of expression.
  • The ability of PCR to simultaneously amplify several loci from individual sperm has greatly enhanced the more traditional task of genetic mapping by studying chromosomal crossovers after meiosis. Rare crossover events between very close loci have been directly observed by analyzing thousands of individual sperm. Similarly, unusual deletions, insertions, translocations, or inversions can be analyzed, all without having to wait (or pay) for the long and laborious processes of fertilization, embryogenesis, etc.
  • Site-directed mutagenesis: PCR can be used to create mutant genes with mutations chosen by scientists at will. These mutations can be chosen in order to understand how proteins accomplish their functions, and to change or improve protein function.

Advantages

PCR has a number of advantages. It is fairly simple to understand and to use, and produces results rapidly. The technique is highly sensitive, with the potential to produce millions to billions of copies of a specific product for sequencing, cloning, and analysis. qRT-PCR shares the same advantages as PCR, with the added advantage of quantification of the synthesized product. Therefore, it is useful for analyzing alterations of gene expression levels in tumors, microbes, or other disease states.

PCR is a very powerful and practical research tool. It is helping to determine the sequences of the previously unknown etiological agents of many diseases. The technique can help identify the sequences of previously unknown viruses related to those already known, and thus give us a better understanding of the disease itself. If the procedure can be further simplified and sensitive non-radiometric detection systems can be developed, PCR will assume a prominent place in the clinical laboratory for years to come.

Limitations

One major limitation of PCR is that prior information about the target sequence is necessary in order to generate the primers that will allow its selective amplification. This means that, typically, PCR users must know the precise sequence(s) upstream of the target region on each of the two single-stranded templates in order to ensure that the DNA polymerase properly binds to the primer-template hybrids and subsequently generates the entire target region during DNA synthesis.

Like all enzymes, DNA polymerases are also prone to error, which in turn causes mutations in the PCR fragments that are generated.

Another limitation of PCR is that even the smallest amount of contaminating DNA can be amplified, resulting in misleading or ambiguous results. To minimize the chance of contamination, investigators should reserve separate rooms for reagent preparation, the PCR, and analysis of product. Reagents should be dispensed into single-use aliquots. Pipettors with disposable plungers and extra-long pipette tips should be routinely used. It is moreover recommended to ensure that the lab set-up follows a unidirectional workflow. No materials or reagents used in the PCR and analysis rooms should ever be taken into the PCR preparation room without thorough decontamination.

Environmental samples that contain humic acids may inhibit PCR amplification and lead to inaccurate results.

Variations

  • Allele-specific PCR or amplification-refractory mutation system (ARMS): a diagnostic or cloning technique based on single-nucleotide variations (SNVs, not to be confused with SNPs; single-base differences in a patient). Any mutation involving a single base change can be detected by this system. It requires prior knowledge of a DNA sequence, including differences between alleles, and uses primers whose 3' ends encompass the SNV (a base-pair buffer around the SNV is usually incorporated). PCR amplification under stringent conditions is much less efficient in the presence of a mismatch between template and primer, so successful amplification with an SNV-specific primer signals presence of the specific SNV or small deletions in a sequence. See SNP genotyping for more information.
  • Assembly PCR or Polymerase Cycling Assembly (PCA): artificial synthesis of long DNA sequences by performing PCR on a pool of long oligonucleotides with short overlapping segments. The oligonucleotides alternate between sense and antisense directions, and the overlapping segments determine the order of the PCR fragments, thereby selectively producing the final long DNA product.
  • Asymmetric PCR: preferentially amplifies one DNA strand in a double-stranded DNA template. It is used in sequencing and hybridization probing where amplification of only one of the two complementary strands is required. PCR is carried out as usual, but with a great excess of the primer for the strand targeted for amplification. Because of the slow (arithmetic) amplification later in the reaction after the limiting primer has been used up, extra cycles of PCR are required.[49] A recent modification on this process, known as Linear-After-The-Exponential-PCR (LATE-PCR), uses a limiting primer with a higher melting temperature (Tm) than the excess primer to maintain reaction efficiency as the limiting primer concentration decreases mid-reaction.
  • Convective PCR: a pseudo-isothermal way of performing PCR. Instead of repeatedly heating and cooling the PCR mixture, the solution is subjected to a thermal gradient. The resulting thermal instability driven convective flow automatically shuffles the PCR reagents from the hot and cold regions repeatedly enabling PCR. Parameters such as thermal boundary conditions and geometry of the PCR enclosure can be optimized to yield robust and rapid PCR by harnessing the emergence of chaotic flow fields. Such convective flow PCR setup significantly reduces device power requirement and operation time.
  • Dial-out PCR: a highly parallel method for retrieving accurate DNA molecules for gene synthesis. A complex library of DNA molecules is modified with unique flanking tags before massively parallel sequencing. Tag-directed primers then enable the retrieval of molecules with desired sequences by PCR.
  • Digital PCR (dPCR): used to measure the quantity of a target DNA sequence in a DNA sample. The DNA sample is highly diluted so that after running many PCRs in parallel, some of them do not receive a single molecule of the target DNA. The target DNA concentration is calculated using the proportion of negative outcomes (a Poisson-correction sketch follows this list). Hence the name 'digital PCR'.
  • Helicase-dependent amplification: similar to traditional PCR, but uses a constant temperature rather than cycling through denaturation and annealing/extension cycles. DNA helicase, an enzyme that unwinds DNA, is used in place of thermal denaturation.
  • Hot start PCR: a technique that reduces non-specific amplification during the initial set up stages of the PCR. It may be performed manually by heating the reaction components to the denaturation temperature (e.g., 95 °C) before adding the polymerase. Specialized enzyme systems have been developed that inhibit the polymerase's activity at ambient temperature, either by the binding of an antibody or by the presence of covalently bound inhibitors that dissociate only after a high-temperature activation step. Hot-start/cold-finish PCR is achieved with new hybrid polymerases that are inactive at ambient temperature and are instantly activated at elongation temperature.
  • In silico PCR (digital PCR, virtual PCR, electronic PCR, e-PCR) refers to computational tools used to calculate theoretical polymerase chain reaction results using a given set of primers (probes) to amplify DNA sequences from a sequenced genome or transcriptome. In silico PCR was proposed as an educational tool for molecular biology.
  • Intersequence-specific PCR (ISSR): a PCR method for DNA fingerprinting that amplifies regions between simple sequence repeats to produce a unique fingerprint of amplified fragment lengths.
  • Inverse PCR: is commonly used to identify the flanking sequences around genomic inserts. It involves a series of DNA digestions and self ligation, resulting in known sequences at either end of the unknown sequence.
  • Ligation-mediated PCR: uses small DNA linkers ligated to the DNA of interest and multiple primers annealing to the DNA linkers; it has been used for DNA sequencing, genome walking, and DNA footprinting.
  • Methylation-specific PCR (MSP): developed by Stephen Baylin and James G. Herman at the Johns Hopkins School of Medicine, and is used to detect methylation of CpG islands in genomic DNA. DNA is first treated with sodium bisulfite, which converts unmethylated cytosine bases to uracil, which is recognized by PCR primers as thymine. Two PCRs are then carried out on the modified DNA, using primer sets identical except at any CpG islands within the primer sequences. At these points, one primer set recognizes DNA with cytosines to amplify methylated DNA, and one set recognizes DNA with uracil or thymine to amplify unmethylated DNA. MSP using qPCR can also be performed to obtain quantitative rather than qualitative information about methylation.
  • Miniprimer PCR: uses a thermostable polymerase (S-Tbr) that can extend from short primers ("smalligos") as short as 9 or 10 nucleotides. This method permits PCR targeting to smaller primer binding regions, and is used to amplify conserved DNA sequences, such as the 16S (or eukaryotic 18S) rRNA gene.
  • Multiplex ligation-dependent probe amplification (MLPA): permits amplifying multiple targets with a single primer pair, thus avoiding the resolution limitations of multiplex PCR (see below).
  • Multiplex-PCR: consists of multiple primer sets within a single PCR mixture to produce amplicons of varying sizes that are specific to different DNA sequences. By targeting multiple genes at once, additional information may be gained from a single test-run that otherwise would require several times the reagents and more time to perform. Annealing temperatures for each of the primer sets must be optimized to work correctly within a single reaction, and amplicon sizes must differ sufficiently; that is, their base-pair lengths should be different enough to form distinct bands when visualized by gel electrophoresis.
  • Nanoparticle-assisted PCR (nanoPCR): some nanoparticles (NPs) can enhance the efficiency of PCR (thus being called nanoPCR), and some can even outperform the original PCR enhancers. It was reported that quantum dots (QDs) can improve PCR specificity and efficiency. Single-walled carbon nanotubes (SWCNTs) and multi-walled carbon nanotubes (MWCNTs) are efficient in enhancing the amplification of long PCR. Carbon nanopowder (CNP) can improve the efficiency of repeated PCR and long PCR, while zinc oxide, titanium dioxide and Ag NPs were found to increase the PCR yield. Previous data indicated that non-metallic NPs retained acceptable amplification fidelity. Given that many NPs are capable of enhancing PCR efficiency, there is great potential for nanoPCR technology improvements and product development.
  • Nested PCR: increases the specificity of DNA amplification, by reducing background due to non-specific amplification of DNA. Two sets of primers are used in two successive PCRs. In the first reaction, one pair of primers is used to generate DNA products, which besides the intended target, may still consist of non-specifically amplified DNA fragments. The product(s) are then used in a second PCR with a set of primers whose binding sites are completely or partially different from and located 3' of each of the primers used in the first reaction. Nested PCR is often more successful in specifically amplifying long DNA fragments than conventional PCR, but it requires more detailed knowledge of the target sequences.
  • Overlap-extension PCR or Splicing by overlap extension (SOEing) : a genetic engineering technique that is used to splice together two or more DNA fragments that contain complementary sequences. It is used to join DNA pieces containing genes, regulatory sequences, or mutations; the technique enables creation of specific and long DNA constructs. It can also introduce deletions, insertions or point mutations into a DNA sequence.
  • PAN-AC: uses isothermal conditions for amplification, and may be used in living cells.
  • PAN-PCR: A computational method for designing bacterium typing assays based on whole genome sequence data.
  • Quantitative PCR (qPCR): used to measure the quantity of a target sequence (commonly in real-time). It quantitatively measures starting amounts of DNA, cDNA, or RNA. Quantitative PCR is commonly used to determine whether a DNA sequence is present in a sample and the number of its copies in the sample. Quantitative PCR has a very high degree of precision. Quantitative PCR methods use fluorescent dyes, such as SYBR Green and EvaGreen, or fluorophore-containing DNA probes, such as TaqMan, to measure the amount of amplified product in real time. It is also sometimes abbreviated to RT-PCR (real-time PCR), but this abbreviation should be used only for reverse transcription PCR. qPCR is the appropriate contraction for quantitative PCR (real-time PCR).
  • Reverse Complement PCR (RC-PCR): Allows the addition of functional domains or sequences of choice to be appended independently to either end of the generated amplicon in a single closed tube reaction. This method generates target specific primers within the reaction by the interaction of universal primers (which contain the desired sequences or domains to be appended) and RC probes.
  • Reverse Transcription PCR (RT-PCR): for amplifying DNA from RNA. Reverse transcriptase reverse transcribes RNA into cDNA, which is then amplified by PCR. RT-PCR is widely used in expression profiling, to determine the expression of a gene or to identify the sequence of an RNA transcript, including transcription start and termination sites. If the genomic DNA sequence of a gene is known, RT-PCR can be used to map the location of exons and introns in the gene. The 5' end of a gene (corresponding to the transcription start site) is typically identified by RACE-PCR (Rapid Amplification of cDNA Ends).
  • RNase H-dependent PCR (rhPCR): a modification of PCR that utilizes primers with a 3' extension block that can be removed by a thermostable RNase HII enzyme. This system reduces primer-dimers and allows for multiplexed reactions to be performed with higher numbers of primers.
  • Single specific primer-PCR (SSP-PCR): allows the amplification of double-stranded DNA even when the sequence information is available at one end only. This method permits amplification of genes for which only a partial sequence information is available, and allows unidirectional genome walking from known into unknown regions of the chromosome.
  • Solid Phase PCR: encompasses multiple meanings, including Polony Amplification (where PCR colonies are derived in a gel matrix, for example), Bridge PCR (primers are covalently linked to a solid-support surface), conventional Solid Phase PCR (where Asymmetric PCR is applied in the presence of solid support bearing primer with sequence matching one of the aqueous primers) and Enhanced Solid Phase PCR (where conventional Solid Phase PCR can be improved by employing high Tm and nested solid support primer with optional application of a thermal 'step' to favour solid support priming).
  • Suicide PCR: typically used in paleogenetics or other studies where avoiding false positives and ensuring the specificity of the amplified fragment is the highest priority. It was originally described in a study to verify the presence of the microbe Yersinia pestis in dental samples obtained from 14th Century graves of people supposedly killed by the plague during the medieval Black Death epidemic. The method prescribes the use of any primer combination only once in a PCR (hence the term "suicide"), which should never have been used in any positive control PCR reaction, and the primers should always target a genomic region never amplified before in the lab using this or any other set of primers. This ensures that no contaminating DNA from previous PCR reactions is present in the lab, which could otherwise generate false positives.
  • Thermal asymmetric interlaced PCR (TAIL-PCR): for isolation of an unknown sequence flanking a known sequence. Within the known sequence, TAIL-PCR uses a nested pair of primers with differing annealing temperatures; a degenerate primer is used to amplify in the other direction from the unknown sequence.
  • Touchdown PCR (Step-down PCR): a variant of PCR that aims to reduce nonspecific background by gradually lowering the annealing temperature as PCR cycling progresses. The annealing temperature at the initial cycles is usually a few degrees (3–5 °C) above the Tm of the primers used, while at the later cycles, it is a few degrees (3–5 °C) below the primer Tm. The higher temperatures give greater specificity for primer binding, and the lower temperatures permit more efficient amplification from the specific products formed during the initial cycles.
  • Universal Fast Walking: for genome walking and genetic fingerprinting using a more specific 'two-sided' PCR than conventional 'one-sided' approaches (using only one gene-specific primer and one general primer—which can lead to artefactual 'noise') by virtue of a mechanism involving lariat structure formation. Streamlined derivatives of UFW are LaNe RAGE (lariat-dependent nested PCR for rapid amplification of genomic DNA ends), 5'RACE LaNe and 3'RACE LaNe.
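For the digital PCR entry above, the Poisson correction that converts the fraction of negative partitions into a concentration looks roughly like the sketch below; the partition counts and the per-partition volume are illustrative assumptions.

```python
import math

def dpcr_concentration(negative: int, total: int, partition_volume_nl: float = 0.85) -> float:
    """Estimate target copies per microlitre from digital PCR partition counts."""
    p_negative = negative / total
    mean_copies_per_partition = -math.log(p_negative)  # Poisson: P(0 copies) = e^(-lambda)
    return mean_copies_per_partition / (partition_volume_nl * 1e-3)  # nanolitres -> microlitres

# e.g. 12,000 of 20,000 partitions showed no amplification
print(f"{dpcr_concentration(12000, 20000):.0f} copies per microlitre")
```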

History

Diagrammatic representation of an example primer pair. The use of primers in an in vitro assay to allow DNA synthesis was a major innovation that allowed the development of PCR.

The heat-resistant enzymes that are a key component in polymerase chain reaction were discovered in the 1960s as a product of a microbial life form that lived in the superheated waters of Yellowstone's Mushroom Spring.

A 1971 paper in the Journal of Molecular Biology by Kjell Kleppe and co-workers in the laboratory of H. Gobind Khorana first described a method of using an enzymatic assay to replicate a short DNA template with primers in vitro. However, this early manifestation of the basic PCR principle did not receive much attention at the time and the invention of the polymerase chain reaction in 1983 is generally credited to Kary Mullis.

"Baby Blue", a 1986 prototype machine for doing PCR

When Mullis developed the PCR in 1983, he was working in Emeryville, California for Cetus Corporation, one of the first biotechnology companies, where he was responsible for synthesizing short chains of DNA. Mullis has written that he conceived the idea for PCR while cruising along the Pacific Coast Highway one night in his car. He was playing in his mind with a new way of analyzing changes (mutations) in DNA when he realized that he had instead invented a method of amplifying any DNA region through repeated cycles of duplication driven by DNA polymerase. In Scientific American, Mullis summarized the procedure: "Beginning with a single molecule of the genetic material DNA, the PCR can generate 100 billion similar molecules in an afternoon. The reaction is easy to execute. It requires no more than a test tube, a few simple reagents, and a source of heat." DNA fingerprinting was first used for paternity testing in 1988.

Mullis has credited his use of LSD as integral to his development of PCR: "Would I have invented PCR if I hadn't taken LSD? I seriously doubt it. I could sit on a DNA molecule and watch the polymers go by. I learnt that partly on psychedelic drugs."

Mullis and biochemist Michael Smith, who had developed other essential ways of manipulating DNA, were jointly awarded the Nobel Prize in Chemistry in 1993, seven years after Mullis and his colleagues at Cetus first put his proposal to practice. Mullis's 1985 paper with R. K. Saiki and H. A. Erlich, "Enzymatic Amplification of β-globin Genomic Sequences and Restriction Site Analysis for Diagnosis of Sickle Cell Anemia"—the polymerase chain reaction invention (PCR)—was honored by a Citation for Chemical Breakthrough Award from the Division of History of Chemistry of the American Chemical Society in 2017.

At the core of the PCR method is the use of a suitable DNA polymerase able to withstand the high temperatures of >90 °C (194 °F) required for separation of the two DNA strands in the DNA double helix after each replication cycle. The DNA polymerases initially employed for in vitro experiments presaging PCR were unable to withstand these high temperatures. So the early procedures for DNA replication were very inefficient and time-consuming, and required large amounts of DNA polymerase and continuous handling throughout the process.

The discovery in 1976 of Taq polymerase—a DNA polymerase purified from the thermophilic bacterium Thermus aquaticus, which naturally lives in hot (50 to 80 °C (122 to 176 °F)) environments such as hot springs—paved the way for dramatic improvements of the PCR method. The DNA polymerase isolated from T. aquaticus is stable at high temperatures, remaining active even after DNA denaturation, thus obviating the need to add new DNA polymerase after each cycle. This allowed an automated thermocycler-based process for DNA amplification.

Patent disputes

The PCR technique was patented by Kary Mullis and assigned to Cetus Corporation, where Mullis worked when he invented the technique in 1983. The Taq polymerase enzyme was also covered by patents. There have been several high-profile lawsuits related to the technique, including an unsuccessful lawsuit brought by DuPont. The Swiss pharmaceutical company Hoffmann-La Roche purchased the rights to the patents in 1992. The last of the commercial PCR patents expired in 2017.

A related patent battle over the Taq polymerase enzyme is still ongoing in several jurisdictions around the world between Roche and Promega. The legal arguments have extended beyond the lives of the original PCR and Taq polymerase patents, which expired on 28 March 2005.

Saturday, February 22, 2025

RAID

From Wikipedia, the free encyclopedia

RAID (/reɪd/; redundant array of inexpensive disks or redundant array of independent disks) is a data storage virtualization technology that combines multiple physical data storage components into one or more logical units for the purposes of data redundancy, performance improvement, or both. This is in contrast to the previous concept of highly reliable mainframe disk drives known as single large expensive disk (SLED).

Data is distributed across the drives in one of several ways, referred to as RAID levels, depending on the required level of redundancy and performance. The different schemes, or data distribution layouts, are named by the word "RAID" followed by a number, for example RAID 0 or RAID 1. Each scheme, or RAID level, provides a different balance among the key goals: reliability, availability, performance, and capacity. RAID levels greater than RAID 0 provide protection against unrecoverable sector read errors, as well as against failures of whole physical drives.

History

The term "RAID" was invented by David Patterson, Garth Gibson, and Randy Katz at the University of California, Berkeley in 1987. In their June 1988 paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)", presented at the SIGMOD Conference, they argued that the top-performing mainframe disk drives of the time could be beaten on performance by an array of the inexpensive drives that had been developed for the growing personal computer market. Although failures would rise in proportion to the number of drives, by configuring for redundancy, the reliability of an array could far exceed that of any large single drive.

Although not yet using that terminology, the technologies of the five levels of RAID named in the June 1988 paper were used in various products prior to the paper's publication, including the following:

  • Mirroring (RAID 1) was well established in the 1970s including, for example, Tandem NonStop Systems.
  • In 1977, Norman Ken Ouchi at IBM filed a patent disclosing what was subsequently named RAID 4.
  • Around 1983, DEC began shipping subsystem mirrored RA8X disk drives (now known as RAID 1) as part of its HSC50 subsystem.
  • In 1986, Clark et al. at IBM filed a patent disclosing what was subsequently named RAID 5.
  • Around 1988, the Thinking Machines' DataVault used error correction codes (now known as RAID 2) in an array of disk drives. A similar approach was used in the early 1960s on the IBM 353.

Industry manufacturers later redefined the RAID acronym to stand for "redundant array of independent disks".

Overview

Many RAID levels employ an error protection scheme called "parity", a widely used method in information technology to provide fault tolerance in a given set of data. Most use simple XOR, but RAID 6 uses two separate parities based respectively on addition and multiplication in a particular Galois field or Reed–Solomon error correction.
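As a concrete illustration of the single-parity scheme described above, the sketch below (plain Python, not tied to any particular controller or library) computes an XOR parity block from a set of data blocks and then rebuilds a lost block from the survivors. RAID 6's second parity, based on Reed–Solomon coding over a Galois field, is not shown.

```python
def xor_blocks(blocks):
    """XOR equal-sized blocks together; used both to compute parity and to rebuild."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

data = [b"AAAA", b"BBBB", b"CCCC"]   # toy 4-byte blocks on three data drives
parity = xor_blocks(data)            # stored on a dedicated drive (RAID 4) or rotated (RAID 5)

# Simulate losing the second data drive and rebuilding its block
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("rebuilt block:", rebuilt)
```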

RAID can also provide data security with solid-state drives (SSDs) without the expense of an all-SSD system. For example, a fast SSD can be mirrored with a mechanical drive. For this configuration to provide a significant speed advantage, an appropriate controller is needed that uses the fast SSD for all read operations. Adaptec calls this "hybrid RAID".

Standard levels

Storage servers with 24 hard disk drives each and built-in hardware RAID controllers supporting various RAID levels

Originally, there were five standard levels of RAID, but many variations have evolved, including several nested levels and many non-standard levels (mostly proprietary). RAID levels and their associated data formats are standardized by the Storage Networking Industry Association (SNIA) in the Common RAID Disk Drive Format (DDF) standard:

  • RAID 0 consists of block-level striping, but no mirroring or parity. Assuming n fully-used drives of equal capacity, the capacity of a RAID 0 volume matches that of a spanned volume: the total of the n drives' capacities. However, because striping distributes the contents of each file across all drives, the failure of any drive renders the entire RAID 0 volume inaccessible. Typically, all data is lost, and files cannot be recovered without a backup copy.
By contrast, a spanned volume, which stores files sequentially, loses data stored on the failed drive but preserves data stored on the remaining drives. However, recovering the files after drive failure can be challenging and often depends on the specifics of the filesystem. Regardless, files that span onto or off a failed drive will be permanently lost.
On the other hand, the benefit of RAID 0 is that the throughput of read and write operations to any file is multiplied by the number of drives because, unlike spanned volumes, reads and writes are performed concurrently. The cost is increased vulnerability to drive failures—since any drive in a RAID 0 setup failing causes the entire volume to be lost, the average failure rate of the volume rises with the number of attached drives. This makes RAID 0 a poor choice for scenarios requiring data reliability or fault tolerance.
  • RAID 1 consists of data mirroring, without parity or striping. Data is written identically to two or more drives, thereby producing a "mirrored set" of drives. Thus, any read request can be serviced by any drive in the set. If a request is broadcast to every drive in the set, it can be serviced by the drive that accesses the data first (depending on its seek time and rotational latency), improving performance. Sustained read throughput, if the controller or software is optimized for it, approaches the sum of throughputs of every drive in the set, just as for RAID 0. Actual read throughput of most RAID 1 implementations is slower than the fastest drive. Write throughput is always slower because every drive must be updated, and the slowest drive limits the write performance. The array continues to operate as long as at least one drive is functioning.
  • RAID 2 consists of bit-level striping with dedicated Hamming-code parity. All disk spindle rotation is synchronized and data is striped such that each sequential bit is on a different drive. Hamming-code parity is calculated across corresponding bits and stored on at least one parity drive. This level is of historical significance only; although it was used on some early machines (for example, the Thinking Machines CM-2), as of 2014 it is not used by any commercially available system.
  • RAID 3 consists of byte-level striping with dedicated parity. All disk spindle rotation is synchronized and data is striped such that each sequential byte is on a different drive. Parity is calculated across corresponding bytes and stored on a dedicated parity drive. Although implementations exist, RAID 3 is not commonly used in practice.
  • RAID 4 consists of block-level striping with dedicated parity. This level was previously used by NetApp, but has now been largely replaced by a proprietary implementation of RAID 4 with two parity disks, called RAID-DP. The main advantage of RAID 4 over RAID 2 and 3 is I/O parallelism: in RAID 2 and 3, a single read I/O operation requires reading the whole group of data drives, while in RAID 4 one I/O read operation does not have to spread across all data drives. As a result, more I/O operations can be executed in parallel, improving the performance of small transfers.
  • RAID 5 consists of block-level striping with distributed parity. Unlike RAID 4, parity information is distributed among the drives, requiring all drives but one to be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 requires at least three disks. Like all single-parity concepts, large RAID 5 implementations are susceptible to system failures because of trends regarding array rebuild time and the chance of drive failure during rebuild (see "Increasing rebuild time and failure probability" section, below). Rebuilding an array requires reading all data from all disks, creating a window for a second drive failure and the loss of the entire array. (Block placement for RAID 0 and RAID 5 is sketched after this list.)
  • RAID 6 consists of block-level striping with double distributed parity. Double parity provides fault tolerance up to two failed drives. This makes larger RAID groups more practical, especially for high-availability systems, as large-capacity drives take longer to restore. RAID 6 requires a minimum of four disks. As with RAID 5, a single drive failure results in reduced performance of the entire array until the failed drive has been replaced. With a RAID 6 array, using drives from multiple sources and manufacturers, it is possible to mitigate most of the problems associated with RAID 5. The larger the drive capacities and the larger the array size, the more important it becomes to choose RAID 6 instead of RAID 5. RAID 10 also minimizes these problems.
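
To make the striping and parity-rotation descriptions above concrete, the following minimal Python sketch maps a logical block number to a drive and offset for RAID 0, and to a data drive plus the parity drive of its stripe for RAID 5. The rotation shown is one simple convention chosen for illustration; real implementations use several different layouts.

    def raid0_location(block: int, drives: int) -> tuple[int, int]:
        """Map a logical block number to (drive index, block offset on that drive)."""
        return block % drives, block // drives

    def raid5_location(block: int, drives: int) -> tuple[int, int, int]:
        """Map a logical block number to (data drive, stripe number, parity drive)."""
        data_per_stripe = drives - 1
        stripe, pos = divmod(block, data_per_stripe)
        parity_drive = stripe % drives                  # parity rotates each stripe
        data_drive = (parity_drive + 1 + pos) % drives  # fill the non-parity drives
        return data_drive, stripe, parity_drive

    for block in range(6):
        print(block, raid0_location(block, 4), raid5_location(block, 4))

Running the loop shows consecutive logical blocks landing on different drives, and the parity drive shifting by one position per stripe, which is what spreads the parity-update load across a RAID 5 set.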

Nested (hybrid) RAID

In what was originally termed hybrid RAID, many storage controllers allow RAID levels to be nested. The elements of a RAID may be either individual drives or arrays themselves. Arrays are rarely nested more than one level deep.

The final array is known as the top array. When the top array is RAID 0 (such as in RAID 1+0 and RAID 5+0), most vendors omit the "+" (yielding RAID 10 and RAID 50, respectively).

  • RAID 0+1: creates two stripes and mirrors them. If a single drive fails, the stripe containing it fails with it, leaving the array running effectively as RAID 0 with no redundancy. A rebuild carries significantly higher risk than in RAID 1+0, because all the data from all the drives in the remaining stripe has to be read rather than just from one drive, increasing the chance of an unrecoverable read error (URE) and significantly extending the rebuild window.
  • RAID 1+0: (see: RAID 10) creates a striped set from a series of mirrored drives. The array can sustain multiple drive losses so long as no mirror loses all its drives.
  • JBOD RAID N+N: With JBOD (just a bunch of disks), it is possible to concatenate not only individual disks but also volumes such as RAID sets. With larger drive capacities, write delay and rebuild time increase dramatically (especially, as described above, with RAID 5 and RAID 6). Splitting a larger RAID N set into smaller subsets and concatenating them with linear JBOD reduces write and rebuild time. If a hardware RAID controller is not capable of nesting linear JBOD with RAID N, linear JBOD can be achieved with OS-level software RAID combined with separate RAID N subset volumes created within one or more hardware RAID controllers. Besides a drastic speed increase, this approach has two further advantages: a linear JBOD can be started with a small set of disks and expanded later with disks of a different (typically larger) size as they become available, and it aids disaster recovery, since the failure of one RAID N subset does not affect the data on the other subsets, reducing restore time.

Non-standard levels

Many configurations other than the basic numbered RAID levels are possible, and many companies, organizations, and groups have created their own non-standard configurations, in many cases designed to meet the specialized needs of a small niche group. Such configurations include the following:

  • Linux MD RAID 10 provides a general RAID driver that in its "near" layout defaults to a standard RAID 1 with two drives, and a standard RAID 1+0 with four drives; however, it can include any number of drives, including odd numbers. With its "far" layout, MD RAID 10 can run both striped and mirrored, even with only two drives in f2 layout; this runs mirroring with striped reads, giving the read performance of RAID 0. Regular RAID 1, as provided by Linux software RAID, does not stripe reads, but can perform reads in parallel.
  • Hadoop has a RAID system that generates a parity file by xor-ing a stripe of blocks in a single HDFS file.
  • BeeGFS, the parallel file system, has internal striping (comparable to file-based RAID 0) and replication (comparable to file-based RAID 10) options to aggregate throughput and capacity of multiple servers and is typically based on top of an underlying RAID to make disk failures transparent.
  • Declustered RAID scatters dual (or more) copies of the data across all disks (possibly hundreds) in a storage subsystem, while holding back enough spare capacity to allow for a few disks to fail. The scattering is based on algorithms which give the appearance of arbitrariness. When one or more disks fail the missing copies are rebuilt into that spare capacity, again arbitrarily. Because the rebuild is done from and to all the remaining disks, it operates much faster than with traditional RAID, reducing the overall impact on clients of the storage system.

Implementations

The distribution of data across multiple drives can be managed either by dedicated computer hardware or by software. A software solution may be part of the operating system, or part of the firmware and drivers supplied with a standard drive controller (so-called "hardware-assisted software RAID"); alternatively, the RAID logic may reside entirely within a hardware RAID controller.

Hardware-based

Hardware RAID controllers can be configured through card BIOS or Option ROM before an operating system is booted; after the operating system is booted, proprietary configuration utilities are available from the manufacturer of each controller. Unlike Ethernet network interface controllers, which can usually be configured and serviced entirely through common operating-system facilities such as ifconfig in Unix without any third-party tools, each RAID controller manufacturer usually provides its own proprietary software tooling for each operating system it chooses to support, ensuring vendor lock-in and contributing to reliability issues.

For example, in FreeBSD, in order to access the configuration of Adaptec RAID controllers, users are required to enable the Linux compatibility layer and use the Linux tooling from Adaptec, potentially compromising the stability, reliability and security of their setup, especially when taking the long-term view.

Some other operating systems have implemented their own generic frameworks for interfacing with any RAID controller. These provide tools for monitoring RAID volume status, identifying drives through LED blinking, managing alarms, and designating hot spare disks from within the operating system, without having to reboot into the card BIOS. For example, this was the approach taken by OpenBSD in 2005 with its bio(4) pseudo-device and the bioctl utility, which provide volume status and allow LED/alarm/hot-spare control, as well as sensors (including the drive sensor) for health monitoring; this approach was later adopted and extended by NetBSD in 2007.

Software-based

Software RAID implementations are provided by many modern operating systems. Software RAID can be implemented as:

  • A layer that abstracts multiple devices, thereby providing a single virtual device (such as Linux kernel's md and OpenBSD's softraid)
  • A more generic logical volume manager (provided with most server-class operating systems, such as Veritas Volume Manager or Linux LVM)
  • A component of the file system (such as ZFS, Spectrum Scale or Btrfs)
  • A layer that sits above any file system and provides parity protection to user data (such as RAID-F)

Some advanced file systems are designed to organize data across multiple storage devices directly, without needing the help of a third-party logical volume manager:

  • ZFS supports the equivalents of RAID 0, RAID 1, RAID 5 (RAID-Z1) single-parity, RAID 6 (RAID-Z2) double-parity, and a triple-parity version (RAID-Z3) also referred to as RAID 7. As it always stripes over top-level vdevs, it supports equivalents of the 1+0, 5+0, and 6+0 nested RAID levels (as well as striped triple-parity sets) but not other nested combinations. ZFS is the native file system on Solaris and illumos, and is also available on FreeBSD and Linux. Open-source ZFS implementations are actively developed under the OpenZFS umbrella project.
  • Spectrum Scale, initially developed by IBM for media streaming and scalable analytics, supports declustered RAID protection schemes up to n+3. A distinctive feature is its dynamic rebuild priority, which runs with low impact in the background until a data chunk hits n+0 redundancy, at which point the chunk is quickly rebuilt to at least n+1. In addition, Spectrum Scale supports metro-distance RAID 1.
  • Btrfs supports RAID 0, RAID 1 and RAID 10 (RAID 5 and 6 are under development).
  • XFS was originally designed to provide an integrated volume manager that supports concatenating, mirroring and striping of multiple physical storage devices. However, the implementation of XFS in the Linux kernel lacks the integrated volume manager.

Many operating systems provide RAID implementations, including the following:

  • Hewlett-Packard's OpenVMS operating system supports RAID 1. The mirrored disks, called a "shadow set", can be in different locations to assist in disaster recovery.
  • Apple's macOS and macOS Server natively support RAID 0, RAID 1, and RAID 1+0, which can be created with Disk Utility or its command-line interface. RAID 4 and RAID 5 can only be created using the third-party SoftRAID software by OWC, whose driver has been included natively since macOS 13.3.
  • FreeBSD supports RAID 0, RAID 1, RAID 3, and RAID 5, and all nestings via GEOM modules and ccd.
  • Linux's md supports RAID 0, RAID 1, RAID 4, RAID 5, RAID 6, and all nestings. Certain reshaping/resizing/expanding operations are also supported.
  • Microsoft Windows supports RAID 0, RAID 1, and RAID 5 using various software implementations. Logical Disk Manager, introduced with Windows 2000, allows for the creation of RAID 0, RAID 1, and RAID 5 volumes by using dynamic disks, but this was limited only to professional and server editions of Windows until the release of Windows 8. Windows XP can be modified to unlock support for RAID 0, 1, and 5. Windows 8 and Windows Server 2012 introduced a RAID-like feature known as Storage Spaces, which also allows users to specify mirroring, parity, or no redundancy on a folder-by-folder basis. These options are similar to RAID 1 and RAID 5, but are implemented at a higher abstraction level.
  • NetBSD supports RAID 0, 1, 4, and 5 via its software implementation, named RAIDframe.
  • OpenBSD supports RAID 0, 1 and 5 via its software implementation, named softraid.

If a boot drive fails, the system has to be sophisticated enough to be able to boot from the remaining drive or drives. For instance, consider a computer whose disk is configured as RAID 1 (mirrored drives); if the first drive in the array fails, then a first-stage boot loader might not be sophisticated enough to attempt loading the second-stage boot loader from the second drive as a fallback. The second-stage boot loader for FreeBSD is capable of loading a kernel from such an array.

Firmware- and driver-based

A SATA 3.0 controller that provides RAID functionality through proprietary firmware and drivers

Software-implemented RAID is not always compatible with the system's boot process, and it is generally impractical for desktop versions of Windows. However, hardware RAID controllers are expensive and proprietary. To fill this gap, inexpensive "RAID controllers" were introduced that do not contain a dedicated RAID controller chip, but simply a standard drive controller chip, or the chipset built-in RAID function, with proprietary firmware and drivers. During early bootup, the RAID is implemented by the firmware and, once the operating system has been more completely loaded, the drivers take over control. Consequently, such controllers may not work when driver support is not available for the host operating system. An example is Intel Rapid Storage Technology, implemented on many consumer-level motherboards.

Because some minimal hardware support is involved, this implementation is also called "hardware-assisted software RAID", "hybrid model" RAID, or even "fake RAID". If RAID 5 is supported, the hardware may provide a hardware XOR accelerator. An advantage of this model over the pure software RAID is that—if using a redundancy mode—the boot drive is protected from failure (due to the firmware) during the boot process even before the operating system's drivers take over.

Integrity

Data scrubbing (referred to in some environments as patrol read) involves periodic reading and checking by the RAID controller of all the blocks in an array, including those not otherwise accessed. This detects bad blocks before use. Data scrubbing checks for bad blocks on each storage device in an array, but also uses the redundancy of the array to recover bad blocks on a single drive and to reassign the recovered data to spare blocks elsewhere on the drive.
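
A minimal sketch of such a scrubbing pass over a single-parity array, assuming hypothetical read_block()/write_block() helpers and the xor_blocks() function from the parity example earlier (real controllers do this in firmware with far more bookkeeping):

    def scrub(array, stripes: int, drives: int) -> None:
        """One scrubbing pass: read every block, rebuild and rewrite unreadable ones."""
        for stripe in range(stripes):
            readable, bad = {}, []
            for d in range(drives):
                try:
                    readable[d] = array.read_block(d, stripe)   # hypothetical helper
                except IOError:
                    bad.append(d)
            if len(bad) == 1:
                # Single-parity redundancy: rebuild the missing block from the rest
                # and rewrite it, letting the drive remap the failing sector.
                array.write_block(bad[0], stripe, xor_blocks(*readable.values()))
            elif len(bad) > 1:
                raise RuntimeError(f"stripe {stripe}: more bad blocks than the parity can recover")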

Frequently, a RAID controller is configured to "drop" a component drive (that is, to assume a component drive has failed) if the drive has been unresponsive for eight seconds or so; this might cause the array controller to drop a good drive simply because it has not been given enough time to complete its internal error recovery procedure. Consequently, using consumer-marketed drives with RAID can be risky, and so-called "enterprise class" drives limit this error recovery time to reduce risk. Western Digital's desktop drives used to have a specific fix: a utility called WDTLER.exe limited a drive's error recovery time by enabling TLER (time limited error recovery), which caps recovery at seven seconds. Around September 2009, Western Digital disabled this feature in their desktop drives (such as the Caviar Black line), making such drives unsuitable for use in RAID configurations. However, Western Digital enterprise class drives are shipped from the factory with TLER enabled. Similar technologies are used by Seagate, Samsung, and Hitachi. For non-RAID usage, an enterprise class drive with a short error recovery timeout that cannot be changed is therefore less suitable than a desktop drive. In late 2010, the Smartmontools program began supporting the configuration of ATA Error Recovery Control, allowing the tool to configure many desktop-class hard drives for use in RAID setups.

While RAID may protect against physical drive failure, the data is still exposed to operator, software, hardware, and virus destruction. Many studies cite operator fault as a common source of malfunction, such as a server operator replacing the incorrect drive in a faulty RAID, and disabling the system (even temporarily) in the process.

An array can be overwhelmed by a catastrophic failure that exceeds its recovery capacity, and the entire array is at risk of physical damage from fire, natural disaster, and human forces, whereas backups can be stored off site. An array is also vulnerable to controller failure, because it is not always possible to migrate it to a new, different controller without data loss.

Weaknesses

Correlated failures

In practice, the drives are often the same age (with similar wear) and subject to the same environment. Since many drive failures are due to mechanical issues (which are more likely on older drives), this violates the assumption of independent, identical failure rates amongst drives; failures are in fact statistically correlated. As a result, the chances of a second failure before the first has been recovered (causing data loss) are higher than the chances of random failures. In a study of about 100,000 drives, the probability of two drives in the same cluster failing within one hour was four times larger than predicted by the exponential statistical distribution—which characterizes processes in which events occur continuously and independently at a constant average rate. The probability of two failures in the same 10-hour period was twice as large as predicted by an exponential distribution.

Unrecoverable read errors during rebuild

Unrecoverable read errors (URE) present as sector read failures, also known as latent sector errors (LSE). The associated media assessment measure, unrecoverable bit error (UBE) rate, is typically guaranteed to be less than one bit in 10^15 for enterprise-class drives (SCSI, FC, SAS or SATA), and less than one bit in 10^14 for desktop-class drives (IDE/ATA/PATA or SATA). Increasing drive capacities and large RAID 5 instances have led to the maximum error rates being insufficient to guarantee a successful recovery, due to the high likelihood of such an error occurring on one or more remaining drives during a RAID set rebuild. When rebuilding, parity-based schemes such as RAID 5 are particularly prone to the effects of UREs as they affect not only the sector where they occur, but also reconstructed blocks using that sector for parity computation.
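
A back-of-the-envelope calculation shows why this matters; the sketch below assumes independent bit errors at the quoted UBE rates, which is itself an idealization, and uses illustrative drive counts and capacities.

    def p_ure_during_rebuild(drives: int, capacity_tb: float, ube_rate: float) -> float:
        # All surviving drives must be read in full to rebuild the failed one.
        bits_to_read = (drives - 1) * capacity_tb * 1e12 * 8
        return 1 - (1 - ube_rate) ** bits_to_read

    # Rebuilding a 4-drive RAID 5 set of 4 TB drives:
    print(p_ure_during_rebuild(4, 4.0, 1e-14))  # desktop-class rate: roughly 0.62
    print(p_ure_during_rebuild(4, 4.0, 1e-15))  # enterprise-class rate: roughly 0.09

Under these assumptions, a rebuild of even a modest desktop-class RAID 5 set is more likely than not to encounter at least one URE, which is the situation RAID 6's second parity is intended to survive.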

Double-protection parity-based schemes, such as RAID 6, attempt to address this issue by providing redundancy that allows double-drive failures; as a downside, such schemes suffer from elevated write penalty—the number of times the storage medium must be accessed during a single write operation. Schemes that duplicate (mirror) data in a drive-to-drive manner, such as RAID 1 and RAID 10, have a lower risk from UREs than those using parity computation or mirroring between striped sets. Data scrubbing, as a background process, can be used to detect and recover from UREs, effectively reducing the risk of them happening during RAID rebuilds and causing double-drive failures. The recovery of UREs involves remapping of affected underlying disk sectors, utilizing the drive's sector remapping pool; in case of UREs detected during background scrubbing, data redundancy provided by a fully operational RAID set allows the missing data to be reconstructed and rewritten to a remapped sector.

Increasing rebuild time and failure probability

Drive capacity has grown at a much faster rate than transfer speed, and error rates have only fallen a little in comparison. Therefore, larger-capacity drives may take hours if not days to rebuild, during which time other drives may fail or as-yet-undetected read errors may surface. Rebuild speed is also limited if the array must remain in operation, at reduced capacity, during the rebuild. Given an array with only one redundant drive (which applies to RAID levels 3, 4 and 5, and to "classic" two-drive RAID 1), a second drive failure would cause complete failure of the array. Even though individual drives' mean time between failures (MTBF) has increased over time, this increase has not kept pace with the increased storage capacity of the drives. The time to rebuild the array after a single drive failure, as well as the chance of a second failure during a rebuild, have increased over time.
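
A rough sketch of the second-failure risk, under the idealized assumption of independent, exponentially distributed failures at a constant rate set by the MTBF (real failures are correlated, as noted above, so this should be read as a lower bound), with illustrative numbers:

    import math

    def p_second_failure(surviving_drives: int, mtbf_hours: float, rebuild_hours: float) -> float:
        """Probability that at least one surviving drive fails during the rebuild window."""
        failure_rate = surviving_drives / mtbf_hours          # combined constant hazard rate
        return 1 - math.exp(-failure_rate * rebuild_hours)

    # e.g. 7 surviving drives, a quoted 1,000,000-hour MTBF, and a 24-hour rebuild
    print(p_second_failure(7, 1_000_000, 24))   # about 1.7e-4 per rebuild

Longer rebuild windows (from larger drives) and larger drive counts both increase this figure directly, which is the trend described above.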

Some commentators have declared that RAID 6 is only a "band aid" in this respect, because it only kicks the problem a little further down the road. However, according to the 2006 NetApp study of Berriman et al., the chance of failure decreases by a factor of about 3,800 (relative to RAID 5) for a proper implementation of RAID 6, even when using commodity drives. Nevertheless, if the currently observed technology trends remain unchanged, in 2019 a RAID 6 array will have the same chance of failure as its RAID 5 counterpart had in 2010.

Mirroring schemes such as RAID 10 have a bounded recovery time as they require the copy of a single failed drive, compared with parity schemes such as RAID 6, which require the copy of all blocks of the drives in an array set. Triple parity schemes, or triple mirroring, have been suggested as one approach to improve resilience to an additional drive failure during this large rebuild time.

Atomicity

A system crash or other interruption of a write operation can result in states where the parity is inconsistent with the data due to the non-atomicity of the write process, such that the parity cannot be used for recovery in the case of a disk failure. This is commonly termed the write hole, a known data corruption issue in older and low-end RAIDs, caused by interrupted destaging of writes to disk (a small sketch after the following list illustrates the hazard). The write hole can be addressed in a few ways:

  • Write-ahead logging.
    • Hardware RAID systems use an onboard nonvolatile cache for this purpose.
    • mdadm can use a dedicated journaling device for this purpose (to avoid the performance penalty, SSDs or other non-volatile memory devices are typically preferred).
  • Write intent logging. mdadm uses a "write-intent bitmap". If it finds any locations marked as incompletely written at startup, it resynchronizes them. This closes the write hole but, unlike a full WAL, does not protect against loss of in-transit data.
  • Partial parity. mdadm can save a "partial parity" that, when combined with modified chunks, recovers the original parity. This closes the write hole, but again does not protect against loss of in-transit data.
  • Dynamic stripe size. RAID-Z ensures that each block is its own stripe, so every block is complete. Copy-on-write (COW) transactional semantics guard metadata associated with stripes. The downside is IO fragmentation.
  • Avoiding overwriting used stripes. bcachefs, which uses a copying garbage collector, chooses this option. COW again protects references to striped data.
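
The toy sketch below (illustrative only, not any real controller's behavior) shows the hazard referred to above: a data block and its parity are updated in two separate steps, and a crash between them leaves stale parity that silently corrupts an unrelated block during a later reconstruction.

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    d0, d1 = b"AAAA", b"BBBB"
    parity = xor(d0, d1)          # consistent stripe: parity = d0 XOR d1

    d0 = b"CCCC"                  # step 1: new data reaches disk 0
    # -- crash here, before the matching parity update reaches the parity disk --

    # Later, disk 1 fails and is reconstructed from d0 and the stale parity:
    reconstructed_d1 = xor(d0, parity)
    print(reconstructed_d1 == b"BBBB")   # False: an untouched, unrelated block is corrupted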

The write hole is a little-understood and rarely mentioned failure mode of redundant storage systems that do not use transactional features. Database researcher Jim Gray wrote "Update in Place is a Poison Apple" during the early days of relational database commercialization.

Write-cache reliability

There are concerns about write-cache reliability, specifically regarding devices equipped with a write-back cache, a caching system that reports data as written as soon as it is written to the cache, as opposed to when it is written to the non-volatile medium. If the system experiences a power loss or other major failure, the data may be irrevocably lost from the cache before reaching the non-volatile storage. For this reason, good write-back cache implementations include mechanisms, such as redundant battery power, to preserve cache contents across system failures (including power failures) and to flush the cache at system restart time.
