
Thursday, December 20, 2018

Proteomics

From Wikipedia, the free encyclopedia

Robotic preparation of MALDI mass spectrometry samples on a sample carrier

Proteomics is the large-scale study of proteins. Proteins are vital parts of living organisms, with many functions. The term proteomics was coined in 1997, in analogy to genomics, the study of the genome. The word proteome is a portmanteau of protein and genome, and was coined by Marc Wilkins in 1994 while he was a Ph.D. student at Macquarie University. Macquarie University also founded the first dedicated proteomics laboratory in 1995.

The proteome is the entire set of proteins that is produced or modified by an organism or system. Proteomics has enabled the identification of ever-increasing numbers of proteins, which vary with time and with the distinct requirements, or stresses, that a cell or organism undergoes. Proteomics is an interdisciplinary domain that has benefitted greatly from the genetic information of various genome projects, including the Human Genome Project. It covers the exploration of proteomes from the overall level of protein composition, structure, and activity, and it is an important component of functional genomics.

Proteomics generally refers to the large-scale experimental analysis of proteins and proteomes, but often is used specifically to refer to protein purification and mass spectrometry.

Complexity of the problem

After genomics and transcriptomics, proteomics is the next step in the study of biological systems. It is more complicated than genomics because an organism's genome is more or less constant, whereas proteomes differ from cell to cell and from time to time. Distinct genes are expressed in different cell types, which means that even the basic set of proteins that are produced in a cell needs to be identified. 

In the past this phenomenon was assessed by RNA analysis, but it was found to lack correlation with protein content. Now it is known that mRNA is not always translated into protein, and the amount of protein produced for a given amount of mRNA depends on the gene it is transcribed from and on the current physiological state of the cell. Proteomics confirms the presence of the protein and provides a direct measure of the quantity present.

Post-translational modifications

Not only does the translation from mRNA cause differences, but many proteins also are subjected to a wide variety of chemical modifications after translation. Many of these post-translational modifications are critical to the protein's function.

Phosphorylation

One such modification is phosphorylation, which happens to many enzymes and structural proteins in the process of cell signaling. The addition of a phosphate to particular amino acids—most commonly serine and threonine mediated by serine-threonine kinases, or more rarely tyrosine mediated by tyrosine kinases—causes a protein to become a target for binding or interacting with a distinct set of other proteins that recognize the phosphorylated domain. 

Because protein phosphorylation is one of the most-studied protein modifications, many "proteomic" efforts are geared to determining the set of phosphorylated proteins in a particular cell or tissue-type under particular circumstances. This alerts the scientist to the signaling pathways that may be active in that instance.

Ubiquitination

Ubiquitin is a small protein that may be affixed to certain protein substrates by enzymes called E3 ubiquitin ligases. Determining which proteins are poly-ubiquitinated helps understand how protein pathways are regulated. This is, therefore, an additional legitimate "proteomic" study. Similarly, once a researcher determines which substrates are ubiquitinated by each ligase, determining the set of ligases expressed in a particular cell type is helpful.

Additional modifications

In addition to phosphorylation and ubiquitination, proteins may be subjected to (among others) methylation, acetylation, glycosylation, oxidation, and nitrosylation. Some proteins undergo all these modifications, often in time-dependent combinations. This illustrates the potential complexity of studying protein structure and function.

Distinct proteins are made under distinct settings

A cell may make different sets of proteins at different times or under different conditions, for example during development, cellular differentiation, cell cycle, or carcinogenesis. Further increasing proteome complexity, as mentioned, most proteins are able to undergo a wide range of post-translational modifications. 

Therefore, a "proteomics" study may become complex very quickly, even if the topic of study is restricted. In more ambitious settings, such as when a biomarker for a specific cancer subtype is sought, the proteomics scientist might elect to study multiple blood serum samples from multiple cancer patients to minimise confounding factors and account for experimental noise. Thus, complicated experimental designs are sometimes necessary to account for the dynamic complexity of the proteome.

Limitations of genomics and proteomics studies

Proteomics gives a different level of understanding than genomics for many reasons:
  • the level of transcription of a gene gives only a rough estimate of its level of translation into a protein. An mRNA produced in abundance may be degraded rapidly or translated inefficiently, resulting in a small amount of protein.
  • as mentioned above many proteins experience post-translational modifications that profoundly affect their activities; for example, some proteins are not active until they become phosphorylated. Methods such as phosphoproteomics and glycoproteomics are used to study post-translational modifications.
  • many transcripts give rise to more than one protein, through alternative splicing or alternative post-translational modifications.
  • many proteins form complexes with other proteins or RNA molecules, and only function in the presence of these other molecules.
  • protein degradation rate plays an important role in protein content.
Reproducibility

One major factor affecting reproducibility in proteomics experiments is the simultaneous elution of many more peptides than mass spectrometers can measure. This causes stochastic differences between experiments due to data-dependent acquisition of tryptic peptides. Although early large-scale shotgun proteomics analyses showed considerable variability between laboratories, presumably due in part to technical and experimental differences between laboratories, reproducibility has been improved in more recent mass spectrometry analysis, particularly at the protein level and using Orbitrap mass spectrometers. Notably, targeted proteomics shows increased reproducibility and repeatability compared with shotgun methods, although at the expense of data density and effectiveness.

Methods of studying proteins

In proteomics, there are multiple methods to study proteins. Generally, proteins may be detected by using either antibodies (immunoassays) or mass spectrometry. If a complex biological sample is analyzed, either a very specific antibody needs to be used in quantitative dot blot analysis (QDB), or biochemical separation needs to be used before the detection step, as there are too many analytes in the sample to perform accurate detection and quantification.

Protein detection with antibodies (immunoassays)

Antibodies to particular proteins, or to their modified forms, have been used in biochemistry and cell biology studies. These are among the most common tools used by molecular biologists today. There are several specific techniques and protocols that use antibodies for protein detection. The enzyme-linked immunosorbent assay (ELISA) has been used for decades to detect and quantitatively measure proteins in samples. The Western blot may be used for detection and quantification of individual proteins, where in an initial step, a complex protein mixture is separated using SDS-PAGE and then the protein of interest is identified using an antibody. 

Modified proteins may be studied by developing an antibody specific to that modification. For example, there are antibodies that only recognize certain proteins when they are tyrosine-phosphorylated; these are known as phospho-specific antibodies. Also, there are antibodies specific to other modifications. These may be used to determine the set of proteins that have undergone the modification of interest.

Disease detection at the molecular level is driving the emerging revolution of early diagnosis and treatment. A challenge facing the field is that protein biomarkers for early diagnosis may be present in very low abundance. The lower limit of detection with conventional immunoassay technology is the upper femtomolar range (10⁻¹³ M). Digital immunoassay technology has improved detection sensitivity by three logs, to the attomolar range (10⁻¹⁶ M). This capability has the potential to open new advances in diagnostics and therapeutics, but such technologies have been relegated to manual procedures that are not well suited for efficient routine use.

Antibody-free protein detection

While protein detection with antibodies is still very common in molecular biology, other methods that do not rely on an antibody have also been developed. These methods offer various advantages: for instance, they often are able to determine the sequence of a protein or peptide, they may have higher throughput than antibody-based methods, and they sometimes can identify and quantify proteins for which no antibody exists.

Detection methods

One of the earliest methods for protein analysis was Edman degradation (introduced in 1967), in which a single peptide is subjected to multiple steps of chemical degradation to resolve its sequence. These early methods have mostly been supplanted by technologies that offer higher throughput.

More recently implemented methods use mass spectrometry-based techniques, a development that was made possible by the discovery of "soft ionization" methods developed in the 1980s, such as matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI). These methods gave rise to the top-down and the bottom-up proteomics workflows where often additional separation is performed before analysis.

Separation methods

For the analysis of complex biological samples, a reduction of sample complexity is required. This may be performed off-line by one-dimensional or two-dimensional separation. More recently, on-line methods have been developed where individual peptides (in bottom-up proteomics approaches) are separated using reversed-phase chromatography and then directly ionized using ESI; the direct coupling of separation and analysis explains the term "on-line" analysis.

Hybrid technologies

There are several hybrid technologies that use antibody-based purification of individual analytes and then perform mass spectrometric analysis for identification and quantification. Examples of these methods are the MSIA (mass spectrometric immunoassay), developed by Randall Nelson in 1995, and the SISCAPA (Stable Isotope Standard Capture with Anti-Peptide Antibodies) method, introduced by Leigh Anderson in 2004.

Current research methodologies

Fluorescence two-dimensional differential gel electrophoresis (2-D DIGE) may be used to quantify variation in the 2-D DIGE process and establish statistically valid thresholds for assigning quantitative changes between samples.

Comparative proteomic analysis may reveal the role of proteins in complex biological systems, including reproduction. For example, treatment with the insecticide triazophos causes an increase in the content of brown planthopper (Nilaparvata lugens (Stål)) male accessory gland proteins (Acps) that may be transferred to females via mating, causing an increase in fecundity (i.e. birth rate) of females. To identify changes in the types of accessory gland proteins (Acps) and reproductive proteins that mated female planthoppers received from male planthoppers, researchers conducted a comparative proteomic analysis of mated N. lugens females. The results indicated that these proteins participate in the reproductive process of N. lugens adult females and males.

Proteome analysis of Arabidopsis peroxisomes has been established as the major unbiased approach for identifying new peroxisomal proteins on a large scale.

There are many approaches to characterizing the human proteome, which is estimated to contain between 20,000 and 25,000 non-redundant proteins. The number of unique protein species likely will increase by between 50,000 and 500,000 due to RNA splicing and proteolysis events, and when post-translational modifications are also considered, the total number of unique human proteins is estimated to range in the low millions.

In addition, the first promising attempts to decipher the proteome of animal tumors have recently been reported. Proteomic profiling has also been used as a functional method in Macrobrachium rosenbergii.

High-throughput proteomic technologies

Proteomics has steadily gained momentum over the past decade with the evolution of several approaches. A few of these are new, and others build on traditional methods. Mass spectrometry-based methods and microarrays are the most common technologies for large-scale study of proteins.

Mass spectrometry and protein profiling

There are two mass spectrometry-based methods currently used for protein profiling. The more established and widespread method uses high resolution, two-dimensional electrophoresis to separate proteins from different samples in parallel, followed by selection and staining of differentially expressed proteins to be identified by mass spectrometry. Despite the advances in 2-DE and its maturity, it has its limits as well. The central concern is the inability to resolve all the proteins within a sample, given their dramatic range in expression level and differing properties.

The second quantitative approach uses stable isotope tags to differentially label proteins from two different complex mixtures. Here, the proteins within a complex mixture are labeled isotopically first, and then digested to yield labeled peptides. The labeled mixtures are then combined, the peptides separated by multidimensional liquid chromatography and analyzed by tandem mass spectrometry. Isotope-coded affinity tag (ICAT) reagents are widely used isotope tags. In this method, the cysteine residues of proteins are covalently attached to the ICAT reagent, thereby reducing the complexity of the mixtures by omitting the non-cysteine-containing peptides.

Quantitative proteomics using stable isotopic tagging is an increasingly useful tool in modern development. Firstly, chemical reactions have been used to introduce tags into specific sites or proteins for the purpose of probing specific protein functionalities. The isolation of phosphorylated peptides has been achieved using isotopic labeling and selective chemistries to capture this fraction of proteins from the complex mixture. Secondly, the ICAT technology was used to differentiate between partially purified or purified macromolecular complexes, such as the large RNA polymerase II pre-initiation complex and the proteins complexed with a yeast transcription factor. Thirdly, ICAT labeling was recently combined with chromatin isolation to identify and quantify chromatin-associated proteins. Finally, ICAT reagents are useful for proteomic profiling of cellular organelles and specific cellular fractions.

Another quantitative approach is the Accurate Mass and Time (AMT) tag approach developed by Richard D. Smith and coworkers at Pacific Northwest National Laboratory. In this approach, increased throughput and sensitivity is achieved by avoiding the need for tandem mass spectrometry, and making use of precisely determined separation time information and highly accurate mass determinations for peptide and protein identifications.

Protein chips

Balancing the use of mass spectrometers in proteomics and in medicine is the use of protein microarrays. The aim behind protein microarrays is to print thousands of protein-detecting features for the interrogation of biological samples. Antibody arrays are an example in which a host of different antibodies are arrayed to detect their respective antigens from a sample of human blood. Another approach is the arraying of multiple protein types for the study of properties like protein-DNA, protein-protein and protein-ligand interactions. Ideally, the functional proteomic arrays would contain the entire complement of the proteins of a given organism. The first version of such arrays consisted of 5000 purified proteins from yeast deposited onto glass microscope slides. Despite the success of this first chip, it was a greater challenge for protein arrays to be implemented. Proteins are inherently much more difficult to work with than DNA. They have a broad dynamic range, are less stable than DNA and their structure is difficult to preserve on glass slides, though they are essential for most assays. The global ICAT technology has striking advantages over protein chip technologies.

Reverse-phased protein microarrays

This is a promising and newer microarray application for the diagnosis, study and treatment of complex diseases such as cancer. The technology merges laser capture microdissection (LCM) with microarray technology to produce reverse-phase protein microarrays. In this type of microarray, the whole collection of proteins themselves is immobilized, with the intent of capturing various stages of disease within an individual patient. When used with LCM, reverse-phase arrays can monitor the fluctuating state of the proteome among different cell populations within a small area of human tissue. This is useful for profiling the status of cellular signaling molecules among a cross section of tissue that includes both normal and cancerous cells. This approach is useful in monitoring the status of key factors in normal prostate epithelium and invasive prostate cancer tissues. LCM is used to dissect these tissues, and protein lysates are arrayed onto nitrocellulose slides, which are probed with specific antibodies. This method can track all kinds of molecular events and can compare diseased and healthy tissues within the same patient, enabling the development of treatment strategies and diagnostics. The ability to acquire proteomic snapshots of neighboring cell populations, using reverse-phase microarrays in conjunction with LCM, has a number of applications beyond the study of tumors. The approach can provide insights into the normal physiology and pathology of all tissues and is invaluable for characterizing developmental processes and anomalies.

Practical applications of proteomics

One major development to come from the study of human genes and proteins has been the identification of potential new drugs for the treatment of disease. This relies on genome and proteome information to identify proteins associated with a disease, which computer software can then use as targets for new drugs. For example, if a certain protein is implicated in a disease, its 3D structure provides the information to design drugs to interfere with the action of the protein. A molecule that fits the active site of an enzyme, but cannot be released by the enzyme, inactivates the enzyme. This is the basis of new drug-discovery tools, which aim to find new drugs to inactivate proteins involved in disease. As genetic differences among individuals are found, researchers expect to use these techniques to develop personalized drugs that are more effective for the individual.

Proteomics is also used to reveal complex plant-insect interactions that help identify candidate genes involved in the defensive response of plants to herbivory.

Interaction proteomics and protein networks

Interaction proteomics is the analysis of protein interactions from scales of binary interactions to proteome- or network-wide. Most proteins function via protein–protein interactions, and one goal of interaction proteomics is to identify binary protein interactions, protein complexes, and interactomes.
Several methods are available to probe protein–protein interactions. While the most traditional method is yeast two-hybrid analysis, a powerful emerging method is affinity purification followed by protein mass spectrometry using tagged protein baits. Other methods include surface plasmon resonance (SPR), protein microarrays, dual polarisation interferometry, microscale thermophoresis and experimental methods such as phage display and in silico computational methods. 

Knowledge of protein-protein interactions is especially useful in regard to biological networks and systems biology, for example in cell signaling cascades and gene regulatory networks (GRNs, where knowledge of protein-DNA interactions is also informative). Proteome-wide analysis of protein interactions, and integration of these interaction patterns into larger biological networks, is crucial towards understanding systems-level biology.

Expression proteomics

Expression proteomics includes the analysis of protein expression at a larger scale. It helps identify the main proteins in a particular sample, and those proteins differentially expressed in related samples, such as diseased vs. healthy tissue. If a protein is found only in a diseased sample, then it can be a useful drug target or diagnostic marker. Proteins with the same or similar expression profiles may also be functionally related. Technologies such as 2D-PAGE and mass spectrometry are used in expression proteomics.

Biomarkers

The National Institutes of Health has defined a biomarker as “a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.”

Understanding the proteome, the structure and function of each protein and the complexities of protein–protein interactions is critical for developing the most effective diagnostic techniques and disease treatments in the future. For example, proteomics is highly useful in identification of candidate biomarkers (proteins in body fluids that are of value for diagnosis), identification of the bacterial antigens that are targeted by the immune response, and identification of possible immunohistochemistry markers of infectious or neoplastic diseases.

An interesting use of proteomics is using specific protein biomarkers to diagnose disease. A number of techniques allow testing for proteins produced during a particular disease, which helps to diagnose the disease quickly. Techniques include western blot, immunohistochemical staining, enzyme-linked immunosorbent assay (ELISA) and mass spectrometry. Secretomics, a subfield of proteomics that studies secreted proteins and secretion pathways using proteomic approaches, has recently emerged as an important tool for the discovery of biomarkers of disease.

Proteogenomics

In proteogenomics, proteomic technologies such as mass spectrometry are used for improving gene annotations. Parallel analysis of the genome and the proteome facilitates discovery of post-translational modifications and proteolytic events, especially when comparing multiple species (comparative proteogenomics).

Structural proteomics

Structural proteomics includes the analysis of protein structures at large scale. It compares protein structures and helps identify functions of newly discovered genes. Structural analysis also helps to show where drugs bind to proteins and where proteins interact with each other. This understanding is achieved using different technologies such as X-ray crystallography and NMR spectroscopy.

Bioinformatics for proteomics (proteome informatics)

Much proteomics data is collected with the help of high-throughput technologies such as mass spectrometry and microarrays. It would often take weeks or months to analyze the data and perform comparisons by hand. For this reason, biologists and chemists are collaborating with computer scientists and mathematicians to create programs and pipelines to computationally analyze the protein data. Using bioinformatics techniques, researchers are capable of faster analysis and data storage. A good place to find lists of current programs and databases is the ExPASy bioinformatics resource portal. The applications of bioinformatics-based proteomics include medicine, disease diagnosis, biomarker identification, and many more.

Protein identification

Mass spectrometry and microarray analyses produce peptide fragmentation information but do not give identification of specific proteins present in the original sample. Due to the lack of specific protein identification, past researchers were forced to decipher the peptide fragments themselves. However, there are currently programs available for protein identification. These programs take the peptide sequences output from mass spectrometry and microarray analyses and return information about matching or similar proteins. This is done through algorithms implemented by the program that perform alignments with proteins from known databases such as UniProt and PROSITE to predict what proteins are in the sample with a degree of certainty.
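
As a rough illustration of the matching idea only (not any particular search engine), the sketch below performs an in-silico tryptic digest of two made-up protein sequences and looks up observed peptide sequences against the resulting index. Real identification tools match fragmentation spectra against large databases and assign statistical confidence, which this toy example does not attempt.

```python
# Toy illustration: match observed peptide sequences against an
# in-silico tryptic digest of candidate proteins. The protein
# sequences and peptides below are placeholders for demonstration.
import re

def tryptic_digest(sequence):
    """Cleave after K or R, except when followed by P (simplified rule)."""
    return [p for p in re.split(r'(?<=[KR])(?!P)', sequence) if p]

database = {
    "PROT_A": "MKWVTFISLLFLFSSAYSRGVFRR",
    "PROT_B": "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPK",
}
observed_peptides = ["GVFR", "GFFYTPK", "NOTINDB"]

# Build a peptide -> proteins lookup from the in-silico digest.
peptide_index = {}
for protein, seq in database.items():
    for peptide in tryptic_digest(seq):
        peptide_index.setdefault(peptide, set()).add(protein)

for pep in observed_peptides:
    matches = peptide_index.get(pep, set())
    print(pep, "->", sorted(matches) if matches else "no match")
```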

Protein structure

The biomolecular structure forms the 3D configuration of the protein. Understanding the protein's structure aids in identification of the protein's interactions and function. It used to be that the 3D structure of proteins could only be determined using X-ray crystallography and NMR spectroscopy. As of 2017, Cryo-electron microscopy is a leading technique, solving difficulties with crystallization (in X-ray crystallography) and conformational ambiguity (in NMR); resolution was 2.2Å as of 2015. Now, through bioinformatics, there are computer programs that can in some cases predict and model the structure of proteins. These programs use the chemical properties of amino acids and structural properties of known proteins to predict the 3D model of sample proteins. This also allows scientists to model protein interactions on a larger scale. In addition, biomedical engineers are developing methods to factor in the flexibility of protein structures to make comparisons and predictions.

Post-translational modifications

Most programs available for protein analysis are not written for proteins that have undergone post-translational modifications. Some programs will accept post-translational modifications to aid in protein identification but then ignore the modification during further protein analysis. It is important to account for these modifications since they can affect the protein's structure. In turn, computational analysis of post-translational modifications has gained the attention of the scientific community. The current post-translational modification programs are only predictive. Chemists, biologists and computer scientists are working together to create and introduce new pipelines that allow for analysis of post-translational modifications that have been experimentally identified for their effect on the protein's structure and function.

Computational methods in studying protein biomarkers

One example of the use of bioinformatics and of computational methods is the study of protein biomarkers. Computational predictive models have shown that extensive and diverse feto-maternal protein trafficking occurs during pregnancy and can be readily detected non-invasively in maternal whole blood. This computational approach circumvented a major limitation to fetal proteomic analysis of maternal blood: the abundance of maternal proteins interfering with the detection of fetal proteins. Computational models can use fetal gene transcripts previously identified in maternal whole blood to create a comprehensive proteomic network of the term neonate. Such work shows that the fetal proteins detected in a pregnant woman's blood originate from a diverse group of tissues and organs from the developing fetus. The proteomic networks contain many biomarkers that are proxies for development and illustrate the potential clinical application of this technology as a way to monitor normal and abnormal fetal development.

An information theoretic framework has also been introduced for biomarker discovery, integrating biofluid and tissue information. This new approach takes advantage of functional synergy between certain biofluids and tissues with the potential for clinically significant findings not possible if tissues and biofluids were considered individually. By conceptualizing tissue-biofluid as information channels, significant biofluid proxies can be identified and then used for guided development of clinical diagnostics. Candidate biomarkers are then predicted based on information transfer criteria across the tissue-biofluid channels. Significant biofluid-tissue relationships can be used to prioritize clinical validation of biomarkers.

Emerging trends in proteomics

A number of emerging concepts have the potential to improve current features of proteomics. Obtaining absolute quantification of proteins and monitoring post-translational modifications are the two tasks that impact the understanding of protein function in healthy and diseased cells. For many cellular events, the protein concentrations do not change; rather, their function is modulated by post-translational modifications (PTM). Methods of monitoring PTM are an underdeveloped area in proteomics. Selecting a particular subset of protein for analysis substantially reduces protein complexity, making it advantageous for diagnostic purposes where blood is the starting material. Another important aspect of proteomics, yet not addressed, is that proteomics methods should focus on studying proteins in the context of the environment. The increasing use of chemical cross linkers, introduced into living cells to fix protein-protein, protein-DNA and other interactions, may ameliorate this problem partially. The challenge is to identify suitable methods of preserving relevant interactions. Another goal for studying protein is to develop more sophisticated methods to image proteins and other molecules in living cells and real time.

Proteomics for systems biology

Advances in quantitative proteomics would clearly enable more in-depth analysis of cellular systems. Biological systems are subject to a variety of perturbations (cell cycle, cellular differentiation, carcinogenesis, environment (biophysical), etc.). Transcriptional and translational responses to these perturbations result in functional changes to the proteome implicated in the response to the stimulus. Therefore, describing and quantifying proteome-wide changes in protein abundance is crucial towards understanding biological phenomena more holistically, on the level of the entire system. In this way, proteomics can be seen as complementary to genomics, transcriptomics, epigenomics, metabolomics, and other -omics approaches in integrative analyses attempting to define biological phenotypes more comprehensively. As an example, The Cancer Proteome Atlas provides quantitative protein expression data for ~200 proteins in over 4,000 tumor samples with matched transcriptomic and genomic data from The Cancer Genome Atlas. Similar datasets in other cell types, tissue types, and species, particularly using deep shotgun mass spectrometry, will be an immensely important resource for research in fields like cancer biology, developmental and stem cell biology, medicine, and evolutionary biology.

Human plasma proteome

Characterizing the human plasma proteome has become a major goal in the proteomics arena, but it is also the most challenging proteome of all human tissues. It contains immunoglobulins, cytokines, protein hormones, and secreted proteins indicative of infection, on top of resident hemostatic proteins. It also contains tissue leakage proteins due to the blood circulation through different tissues in the body. The blood thus contains information on the physiological state of all tissues and, combined with its accessibility, this makes the blood proteome invaluable for medical purposes. Characterizing the proteome of blood plasma is nonetheless a daunting challenge.

The depth of the plasma proteome, encompassing a dynamic range of more than 10¹⁰ between the most abundant protein (albumin) and the least abundant (some cytokines), is thought to be one of the main challenges for proteomics. Temporal and spatial dynamics further complicate the study of the human plasma proteome. The turnover of some proteins is much faster than that of others, and the protein content of an artery may substantially vary from that of a vein. All these differences make even the simplest proteomic task of cataloging the proteome seem out of reach. To tackle this problem, priorities need to be established. Capturing the most meaningful subset of proteins among the entire proteome to generate a diagnostic tool is one such priority. Secondly, since cancer is associated with enhanced glycosylation of proteins, methods that focus on this part of proteins will also be useful. Again, multiparameter analysis best reveals a pathological state. As these technologies improve, the disease profiles should be continually related to respective gene expression changes. Due to the above-mentioned problems, plasma proteomics has remained challenging. However, technological advancements and continuous developments seem to be resulting in a revival of plasma proteomics, as was shown recently by a technology called plasma proteome profiling. Using such technologies, researchers were able to investigate inflammation processes in mice and the heritability of plasma proteomes, as well as to show the effect of a common lifestyle change such as weight loss on the plasma proteome.

Biostatistics

From Wikipedia, the free encyclopedia

Biostatistics is the application of statistics to a wide range of topics in biology. It encompasses the design of biological experiments, especially in medicine, pharmacy, agriculture and fishery; the collection, summarization, and analysis of data from those experiments; and the interpretation of, and inference from, the results. A major branch is medical biostatistics, which is exclusively concerned with medicine and health.

History

Biostatistics and Genetics

Biostatistical modeling forms an important part of numerous modern biological theories. Genetics studies have, since their beginning, used statistical concepts to understand observed experimental results. Some geneticists even contributed statistical advances with the development of methods and tools. Gregor Mendel started the genetics studies investigating segregation patterns in families of peas and used statistics to explain the collected data. In the early 1900s, after the rediscovery of Mendel's work on inheritance, there were gaps in understanding between genetics and evolutionary Darwinism. Francis Galton tried to expand Mendel's discoveries with human data and proposed a different model, with fractions of the heredity coming from each ancestor and composing an infinite series. He called this the theory of the "Law of Ancestral Heredity". His ideas were strongly disputed by William Bateson, who followed Mendel's conclusion that genetic inheritance comes exclusively from the parents, half from each of them. This led to a vigorous debate between the biometricians, who supported Galton's ideas, such as Walter Weldon, Arthur Dukinfield Darbishire and Karl Pearson, and the Mendelians, who supported Bateson's (and Mendel's) ideas, such as Charles Davenport and Wilhelm Johannsen. Later, biometricians could not reproduce Galton's conclusions in different experiments, and Mendel's ideas prevailed. By the 1930s, models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian modern evolutionary synthesis.

Resolving these differences also made it possible to define the concept of population genetics and brought together genetics and evolution. The three leading figures in the establishment of population genetics and this synthesis all relied on statistics and developed its use in biology.
  1. Ronald Fisher developed several basic statistical methods in support of his work studying the crop experiments at Rothamsted Research, including in his books Statistical Methods for Research Workers (1925) and The Genetical Theory of Natural Selection (1930). He made many contributions to genetics and statistics, including ANOVA, the p-value concept, Fisher's exact test and Fisher's equation for population dynamics. He is credited with the sentence "Natural selection is a mechanism for generating an exceedingly high degree of improbability".
  2. Sewall G. Wright developed F-statistics and methods of computing them, and defined the inbreeding coefficient.
  3. J. B. S. Haldane's book, The Causes of Evolution, reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics. He also developed the theory of the primordial soup.
These and other biostatisticians, mathematical biologists, and statistically inclined geneticists helped bring together evolutionary biology and genetics into a consistent, coherent whole that could begin to be quantitatively modeled. 

In parallel to this overall development, the pioneering work of D'Arcy Thompson in On Growth and Form also helped to add quantitative discipline to biological study. 

Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not qualitatively apparent. One anecdote describes Thomas Hunt Morgan banning the Friden calculator from his department at Caltech, saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in placer mining."

Biostatistics and Medicine

Statistical concepts also are present in clinical trials and epidemiological studies. These fields have their own history of biostatistics developments. In the 18th century, statistical methods were used to decide whether the application of certain treatments were effective, such as the insertion of smallpox pustules under an individual’s skin in the hope of creating a mild case of the disease that would induce later immunity. Since this actually put patients at risk of contracting a potentially fatal form of the disease, this treatment became the subject of much controversy. John Arbuthnot in 1722 studied the chances of people dying by naturally-occurring smallpox as compared to inoculation-induced smallpox. On the basis of the statistical studies, it was concluded that inoculation was preferred. Later, Daniel Bernoulli and Jean d’Alembert developed more robust statistical methods for the same problem, both based on the proportion of people who died.

James Lind was the first to propose testing hypotheses on groups of patients; he applied this method to an outbreak of scurvy, and for this he is considered the "father" of the clinical trial. In 1835, Pierre-Charles-Alexandre Louis proposed the "numerical method" to argue that the practice of bloodletting was actually doing more harm than good for the patients. In 1840, Louis Denis Jules Gavarret published the Principes Généraux de Statistique Médicale, in which he pointed out that Louis's averages could vary between what he called "limits of oscillation" (or confidence interval) if multiple samples were taken from the same population. Karl Pearson also extended his methods to medicine, although his main goal was to make explicit the statistical implications of Darwin's theory of natural selection.

Research planning

Any research in the life sciences is proposed to answer a scientific question we might have. To answer this question with high certainty, we need accurate results. The correct definition of the main hypothesis and of the research plan will reduce errors in decision making when trying to understand a phenomenon. The research plan might include the research question, the hypothesis to be tested, the experimental design, data collection methods, data analysis perspectives and the costs involved. It is essential to carry out the study based on the three basic principles of experimental statistics: randomization, replication, and local control.

Research question

The research question will define the objective of a study. The research will be guided by the question, so it needs to be concise and at the same time focused on interesting and novel topics that may improve science and knowledge in that field. To define the way to ask the scientific question, an exhaustive literature review might be necessary. In this way, the research can add value to the scientific community.

Hypothesis definition

Once the aim of the study is defined, the possible answers to the research question can be proposed, transforming this question into a hypothesis. The main proposal is called the null hypothesis (H0) and is usually based on permanent knowledge about the topic or on an obvious occurrence of the phenomenon, sustained by a deep literature review. We can say it is the standard expected answer for the data under the situation being tested. In general, H0 assumes no association between treatments. On the other hand, the alternative hypothesis is the denial of H0. It assumes some degree of association between the treatment and the outcome. In any case, the hypothesis is guided by the research question and its expected and unexpected answers.

As an example, consider groups of similar animals (mice, for example) under two different diet systems. The research question would be: what is the best diet? In this case, H0 would be that there is no difference between the two diets in mouse metabolism (H0: μ1 = μ2) and the alternative hypothesis would be that the diets have different effects on the animals' metabolism (H1: μ1 ≠ μ2).
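
As a minimal sketch of how such a comparison might be tested in practice (assuming the NumPy and SciPy libraries are available, and using simulated measurements rather than real data), a two-sample t-test compares the two group means:

```python
# Minimal sketch of testing H0: mu1 == mu2 for the two-diet example,
# using simulated (made-up) metabolic measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
diet_a = rng.normal(loc=25.0, scale=3.0, size=20)  # hypothetical measurements
diet_b = rng.normal(loc=27.5, scale=3.0, size=20)

t_stat, p_value = stats.ttest_ind(diet_a, diet_b)
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```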

The hypothesis is defined by the researcher, according to his or her interests in answering the main question. Besides that, the alternative hypothesis can be more than one hypothesis. It can assume not only differences across observed parameters, but also their degree of difference (i.e. higher or lower).

Sampling

Usually, a study aims to understand the effect of a phenomenon over a population. In biology, a population is defined as all the individuals of a given species, in a specific area at a given time. In biostatistics, this concept is extended to a variety of possible collections of study. A population in biostatistics is not only the individuals, but also the total of one specific component of their organisms, such as the whole genome, all the sperm cells for animals, or the total leaf area for a plant, for example.

It is not possible to take measures from all the elements of a population. Because of that, the sampling process is very important for statistical inference. Sampling is defined as randomly obtaining a representative part of the entire population in order to make posterior inferences about that population, so the sample should capture most of the variability across the population. The sample size is determined by several things, ranging from the scope of the research to the resources available. In clinical research, the trial type, such as non-inferiority, equivalence, or superiority, is a key factor in determining sample size.
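
One common rule of thumb for choosing a sample size when comparing two group means uses the normal-approximation formula n = 2(z(1-α/2) + z(1-β))²σ²/Δ² per group. The sketch below computes this (assuming SciPy is available; the effect size and standard deviation are purely illustrative, not taken from the text):

```python
# Rough sample-size sketch for comparing two means (normal approximation,
# equal variances). The assumed effect size and variability are made up.
import math
from scipy.stats import norm

alpha = 0.05      # significance level (two-sided)
power = 0.80      # desired power (1 - beta)
sigma = 3.0       # assumed standard deviation of the outcome
delta = 2.0       # smallest difference between groups worth detecting

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n_per_group = 2 * ((z_alpha + z_beta) ** 2) * sigma ** 2 / delta ** 2
print("Participants needed per group:", math.ceil(n_per_group))
```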

Experimental design

Experimental designs sustain those basic principles of experimental statistics. There are three basic experimental designs for randomly allocating treatments to all plots of the experiment: the completely randomized design, the randomized block design, and factorial designs. Treatments can be arranged in many ways inside the experiment. In agriculture, the correct experimental design is the root of a good study, and the arrangement of treatments within the study is essential because the environment largely affects the plots (plants, livestock, microorganisms). These main arrangements can be found in the literature under the names of "lattices", "incomplete blocks", "split plot", "augmented blocks", and many others. All of the designs might include control plots, determined by the researcher, to provide an error estimation during inference.

In clinical studies, the samples are usually smaller than in other biological studies, and in most cases, the environment effect can be controlled or measured. It is common to use randomized controlled clinical trials, where results are usually compared with observational study designs such as case–control or cohort.

Data collection

Data collection methods must be considered in research planning, because they highly influence the sample size and experimental design.

Data collection varies according to type of data. For qualitative data, collection can be done with structured questionnaires or by observation, considering presence or intensity of disease, using score criterion to categorize levels of occurrence. For quantitative data, collection is done by measuring numerical information using instruments. 

In agriculture and biology studies, yield data and its components can be obtained by metric measures. However, pest and disease injuries in plants are obtained by observation, considering score scales for levels of damage. Especially in genetic studies, modern methods for data collection in the field and laboratory should be considered, such as high-throughput platforms for phenotyping and genotyping. These tools allow bigger experiments, while making it possible to evaluate many plots in less time than a human-based method of data collection alone. Finally, all collected data of interest must be stored in an organized data frame for further analysis.

Analysis and data interpretation

Descriptive Tools

Data can be represented through tables or graphical representations, such as line charts, bar charts, histograms, and scatter plots. Also, measures of central tendency and variability can be very useful to describe an overview of the data. Some examples follow; a short computational sketch of these descriptive measures appears after this list.
  • Frequency tables
One type of table is the frequency table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of data. Frequency can be:

Absolute: represents the number of times that a determined value appears;
Relative: obtained by dividing the absolute frequency by the total number of observations.
In the next example, we have the number of genes in ten operons of the same organism.


Number of genes Absolute frequency Relative frequency
1 0 0
2 1 0.1
3 6 0.6
4 2 0.2
5 1 0.1
  • Line graph
Figure A: Line graph example. The birth rate in Brazil (2010–2016); Figure B: Bar chart example. The birth rate in Brazil for the December months from 2010 to 2016; Figure C: Example of Box Plot: number of glycines in the proteome of eight different organisms (A-H); Figure D: Example of a scatter plot.

Line graphs represent the variation of a value over another metric, such as time. In general, values are represented in the vertical axis, while the time variation is represented in the horizontal axis.
  • Bar chart
A bar chart is a graph that shows categorical data as bars with heights (vertical bars) or widths (horizontal bars) proportional to the values they represent. Bar charts provide an image that could also be represented in a tabular format.
 
In the bar chart example, we have the birth rate in Brazil for the December months from 2010 to 2016. The sharp fall in December 2016 reflects the effect of the Zika virus outbreak on the birth rate in Brazil.
  • Histograms
Example of a histogram.

The histogram (or frequency distribution) is a graphical representation of a dataset tabulated and divided into uniform or non-uniform classes. It was first introduced by Karl Pearson.
  • Scatter Plot
A scatter plot is a mathematical diagram that uses Cartesian coordinates to display values of a dataset. A scatter plot shows the data as a set of points, each one presenting the value of one variable determining the position on the horizontal axis and the value of another variable on the vertical axis. They are also called scatter graphs, scatter charts, scattergrams, or scatter diagrams.
  • Mean
The arithmetic mean is the sum of a collection of values (x₁ + x₂ + ... + xₙ) divided by the number of items of this collection (n):

x̄ = (x₁ + x₂ + ... + xₙ) / n

  • Median
The median is the value in the middle of a dataset.
  • Mode
The mode is the value of a set of data that appears most often.
 
Comparison among mean, median and mode for the values {2, 3, 3, 3, 3, 3, 4, 4, 11}:

Type    Example                                     Result
Mean    (2 + 3 + 3 + 3 + 3 + 3 + 4 + 4 + 11) / 9    4
Median  2, 3, 3, 3, 3, 3, 4, 4, 11                  3
Mode    2, 3, 3, 3, 3, 3, 4, 4, 11                  3
  • Box Plot
The box plot is a method for graphically depicting groups of numerical data. The maximum and minimum values are represented by the lines, and the interquartile range (IQR) represents the middle 25–75% of the data. Outliers may be plotted as circles.
  • Correlation Coefficients
Although correlations between two different kinds of data can be suggested by graphs such as scatter plots, it is necessary to validate this through numerical information. For this reason, correlation coefficients are required. They provide a numerical value that reflects the strength of an association.
  • Pearson Correlation Coefficient
Scatter diagram that demonstrates the Pearson correlation for different values of ρ.

The Pearson correlation coefficient is a measure of association between two variables, X and Y. This coefficient, usually represented by ρ (rho) for the population and r for the sample, assumes values between −1 and 1, where ρ = 1 represents a perfect positive correlation, ρ = −1 represents a perfect negative correlation, and ρ = 0 indicates no linear correlation.
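
The computational sketch referred to above reproduces these descriptive measures with the Python standard library (statistics.correlation requires Python 3.10 or later; the x/y pairs used for the correlation are made up for illustration):

```python
# Minimal sketch of the descriptive tools above, using the standard
# library and the small example datasets from this section.
from collections import Counter
from statistics import mean, median, mode, correlation  # correlation: Python 3.10+

# Frequency table: number of genes in ten operons.
genes = [2, 3, 3, 3, 3, 3, 3, 4, 4, 5]
counts = Counter(genes)
total = len(genes)
for value in sorted(counts):
    print(value, counts[value], counts[value] / total)  # absolute and relative

# Mean, median and mode for the comparison example.
values = [2, 3, 3, 3, 3, 3, 4, 4, 11]
print(mean(values), median(values), mode(values))  # 4, 3, 3

# Pearson correlation between two made-up variables.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(correlation(x, y), 3))
```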

Inferential Statistics

Inferential statistics is used to make inferences about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters to describe the population of interest, but since the data are limited, it is necessary to make use of a representative sample in order to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The standard error of the mean is a measure of variability that is crucial for making inferences.

Hypothesis testing

Hypothesis testing is essential for making inferences about populations, aiming to answer research questions, as set out in the "Research planning" section. Authors have defined four steps to be followed:
  1. The hypothesis to be tested: as stated earlier, we have to work with the definition of a null hypothesis (H0), which is going to be tested, and an alternative hypothesis. Both must be defined before the experiment is implemented.
  2. Significance level and decision rule: A decision rule depends on the level of significance, or in other words, the acceptable error rate (α). It is easier to think of it as a critical value against which the test statistic is compared to determine statistical significance. So, α also has to be predefined before the experiment.
  3. Experiment and statistical analysis: This is when the experiment is actually implemented following the appropriate experimental design, the data are collected and the most suitable statistical tests are applied.
  4. Inference: This is made when the null hypothesis is rejected or not rejected, based on the evidence brought by the comparison of the p-value and α. Note that failure to reject H0 just means that there is not enough evidence to support its rejection, not that this hypothesis is true.

Confidence intervals

A confidence interval is a range of values that can contain the true parameter value at a given level of confidence. The first step is to compute the best unbiased estimate of the population parameter. The upper limit of the interval is obtained by adding to this estimate the product of the standard error of the mean and the critical value associated with the confidence level. The calculation of the lower limit is similar, but instead of a sum, a subtraction must be applied.
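
As a small sketch of this calculation (assuming NumPy and SciPy are available; the measurements are invented), a 95% confidence interval for a mean can be built from the sample mean, the standard error, and the critical value of the t distribution:

```python
# Sketch of a 95% confidence interval for a mean, using the t distribution
# (appropriate when the population standard deviation is unknown).
import numpy as np
from scipy import stats

data = np.array([4.1, 3.8, 4.4, 4.0, 3.9, 4.3, 4.2, 3.7])  # made-up measurements
confidence = 0.95

estimate = data.mean()
sem = stats.sem(data)                                    # standard error of the mean
t_crit = stats.t.ppf((1 + confidence) / 2, df=len(data) - 1)
lower, upper = estimate - t_crit * sem, estimate + t_crit * sem
print(f"mean = {estimate:.2f}, {confidence:.0%} CI = ({lower:.2f}, {upper:.2f})")
```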

Statistical considerations

Power and statistical error

When testing a hypothesis, there are two possible types of statistical error: type I error and type II error. The type I error, or false positive, is the incorrect rejection of a true null hypothesis, and the type II error, or false negative, is the failure to reject a false null hypothesis. The significance level denoted by α is the type I error rate and should be chosen before performing the test. The type II error rate is denoted by β, and the statistical power of the test is 1 − β.

p-value

The p-value is the probability of obtaining results as extreme as or more extreme than those observed, assuming the null hypothesis (H0) is true. It is also called the calculated probability. It is common to confuse the p-value with the significance level (α), but α is a predefined threshold for calling results significant. If p is less than α, the null hypothesis (H0) is rejected.

Multiple testing

In multiple tests of the same hypothesis, the probability of false positives (the familywise error rate) increases, and strategies are used to control this. This is commonly achieved by using a more stringent threshold to reject null hypotheses. The Bonferroni correction defines an acceptable global significance level, denoted by α*, and each test is individually compared with a value of α = α*/m. This ensures that the familywise error rate over all m tests is less than or equal to α*. When m is large, the Bonferroni correction may be overly conservative. An alternative is to control the false discovery rate (FDR). The FDR controls the expected proportion of the rejected null hypotheses (the so-called discoveries) that are false (incorrect rejections). This procedure ensures that, for independent tests, the false discovery rate is at most q*. Thus, the FDR is less conservative than the Bonferroni correction and has more power, at the cost of more false positives.
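
The sketch below applies both ideas to a made-up list of p-values: the Bonferroni threshold α*/m and the Benjamini-Hochberg step-up procedure for controlling the FDR at level q* (both levels set to 0.05 here purely for illustration):

```python
# Sketch of two multiple-testing corrections on made-up p-values.
p_values = [0.001, 0.008, 0.020, 0.041, 0.30, 0.62, 0.75, 0.90]
m = len(p_values)

# Bonferroni: reject only tests with p < alpha*/m (familywise control).
alpha_star = 0.05
bonferroni_hits = [p for p in p_values if p < alpha_star / m]
print("Bonferroni rejections:", bonferroni_hits)

# Benjamini-Hochberg: find the largest k with p_(k) <= (k/m) * q* and
# reject the k smallest p-values (false discovery rate control).
q_star = 0.05
sorted_p = sorted(p_values)
k_max = 0
for k, p in enumerate(sorted_p, start=1):
    if p <= k / m * q_star:
        k_max = k
print("Benjamini-Hochberg rejections:", sorted_p[:k_max])
```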

Mis-specification and robustness checks

The main hypothesis being tested (e.g., no association between treatments and outcomes) is often accompanied by other technical assumptions (e.g., about the form of the probability distribution of the outcomes) that are also part of the null hypothesis. When the technical assumptions are violated in practice, then the null may be frequently rejected even if the main hypothesis is true. Such rejections are said to be due to model mis-specification. Verifying whether the outcome of a statistical test does not change when the technical assumptions are slightly altered (so-called robustness checks) is the main way of combating mis-specification.

Model selection criteria

Model selection criteria select the model that best approximates the true model. Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are examples of asymptotically efficient criteria.
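
For reference, the usual forms are AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters, n the number of observations, and L the maximized likelihood; lower values are preferred. A small sketch with invented log-likelihoods:

```python
# Sketch of AIC and BIC for two hypothetical fitted models; the
# log-likelihood values below are invented for illustration.
import math

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood

n = 100  # number of observations
models = {
    "simple model (k=2)":  {"logL": -150.0, "k": 2},
    "complex model (k=6)": {"logL": -145.0, "k": 6},
}
for name, m in models.items():
    print(name,
          "AIC =", round(aic(m["logL"], m["k"]), 1),
          "BIC =", round(bic(m["logL"], m["k"], n), 1))
# Note how BIC penalizes the extra parameters more heavily than AIC.
```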

Developments and Big Data

Recent developments have made a large impact on biostatistics. Two important changes have been the ability to collect data on a high-throughput scale, and the ability to perform much more complex analysis using computational techniques. This comes from developments in areas such as sequencing technologies, bioinformatics and machine learning.

Use in high-throughput data

New biomedical technologies like microarrays, next-generation sequencers (for genomics) and mass spectrometry (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously. Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray could be used to measure many thousands of genes simultaneously, determining which of them have different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed.

Multicollinearity often occurs in high-throughput biostatistical settings. Due to high intercorrelation between the predictors (such as gene expression levels), the information of one predictor might be contained in another one. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply the biostatistical technique of dimension reduction (for example via principal component analysis). Classical statistical techniques like linear or logistic regression and linear discriminant analysis do not work well for high-dimensional data (i.e. when the number of observations n is smaller than the number of features or predictors p: n < p). As a matter of fact, one can get quite high R² values despite very low predictive power of the statistical model. These classical statistical techniques (especially least squares linear regression) were developed for low-dimensional data (i.e. where the number of observations n is much larger than the number of predictors p: n >> p). In cases of high dimensionality, one should always consider an independent validation test set and the corresponding residual sum of squares (RSS) and R² of the validation test set, not those of the training set.
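
A simulated sketch of this situation (assuming NumPy and scikit-learn; the data come from a low-rank factor model, so the 200 predictors are highly intercorrelated): with n < p, an ordinary least-squares fit interpolates the training data, so only the validation R² is informative, while regressing on a few principal components is one way to reduce the dimensionality first. The exact numbers vary with the simulation.

```python
# Simulated illustration of high-dimensional (n < p), intercorrelated data.
# Compare training vs. validation R^2 for a full least-squares fit and for
# a fit on a few principal components. All data here are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p, latent = 80, 200, 5
Z = rng.normal(size=(n, latent))              # a few underlying factors
X = Z @ rng.normal(size=(latent, p)) + 0.1 * rng.normal(size=(n, p))
y = Z[:, 0] + 0.3 * rng.normal(size=n)        # response driven by one factor

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

full = LinearRegression().fit(X_tr, y_tr)
print("all 200 predictors:     train R^2 =", round(full.score(X_tr, y_tr), 2),
      " validation R^2 =", round(full.score(X_te, y_te), 2))

pcs = PCA(n_components=5).fit(X_tr)
low = LinearRegression().fit(pcs.transform(X_tr), y_tr)
print("5 principal components: train R^2 =", round(low.score(pcs.transform(X_tr), y_tr), 2),
      " validation R^2 =", round(low.score(pcs.transform(X_te), y_te), 2))
```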

Often, it is useful to pool information from multiple predictors together. For example, Gene Set Enrichment Analysis (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes. These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: It is more likely that a single gene is found to be falsely perturbed than it is that a whole pathway is falsely perturbed. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the JAK-STAT signaling pathway) using this approach.
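
GSEA proper is based on a running enrichment score computed over a ranked gene list; as a simpler stand-in, the sketch below (SciPy, invented numbers) performs a hypergeometric over-representation test for a hypothetical gene set:

    from scipy.stats import hypergeom

    # Invented numbers: 20,000 genes measured, 150 belong to a hypothetical pathway,
    # 400 genes were called differentially expressed, 12 of which fall in the pathway.
    total_genes, pathway_genes, de_genes, overlap = 20000, 150, 400, 12

    # Probability of observing `overlap` or more pathway genes among the DE genes by chance.
    p_value = hypergeom.sf(overlap - 1, total_genes, pathway_genes, de_genes)
    print(f"Over-representation p-value: {p_value:.3g}")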

Bioinformatics advances in databases, data mining, and biological interpretation

The development of biological databases enables storage and management of biological data, with the possibility of ensuring access for users around the world. They are useful for researchers depositing data, retrieving information and files (raw or processed) originating from other experiments, or indexing scientific articles, as in PubMed. Another possibility is to search for a term of interest (a gene, a protein, a disease, an organism, and so on) and check all results related to this search. There are databases dedicated to SNPs (dbSNP), to knowledge on gene characterization and pathways (KEGG), and to the description of gene function classified by cellular component, molecular function and biological process (Gene Ontology). In addition to databases that contain specific molecular information, there are others that are broader in the sense that they store information about an organism or group of organisms. An example of a database directed towards just one organism, but that contains much data about it, is the Arabidopsis thaliana genetic and molecular database, TAIR. Phytozome, in turn, stores the assemblies and annotation files of dozens of plant genomes, also providing visualization and analysis tools. Moreover, some databases are interconnected for information exchange and sharing; a major initiative was the International Nucleotide Sequence Database Collaboration (INSDC), which relates data from DDBJ, EMBL-EBI, and NCBI.
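
Many of these databases can also be queried programmatically. A minimal sketch, assuming the Biopython package and using a placeholder e-mail address and an illustrative query term, searches dbSNP through NCBI's Entrez interface:

    from Bio import Entrez

    # NCBI asks for a contact e-mail; replace the placeholder with your own.
    Entrez.email = "your.name@example.org"

    # Search dbSNP for variants annotated to a gene of interest (the query term is illustrative).
    handle = Entrez.esearch(db="snp", term="TP53[Gene Name] AND human[Organism]", retmax=5)
    record = Entrez.read(handle)
    handle.close()

    print("Matching record IDs:", record["IdList"])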

Nowadays, the increase in size and complexity of molecular datasets leads to the use of powerful statistical methods provided by computer-science algorithms developed in the machine learning area. Data mining and machine learning therefore allow detection of patterns in data with a complex structure, such as biological data, using methods of supervised and unsupervised learning, regression, cluster detection, and association rule mining, among others. To indicate some of them, self-organizing maps and k-means are examples of clustering algorithms; neural networks and support vector machine models are examples of common machine learning algorithms.
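
A minimal clustering sketch, assuming scikit-learn and entirely simulated expression profiles, applies k-means to recover three groups of co-expressed genes:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    # Simulated "expression profiles": 300 genes measured under 6 conditions,
    # drawn from three underlying patterns plus noise.
    patterns = np.array([[5, 5, 5, 0, 0, 0],
                         [0, 0, 0, 5, 5, 5],
                         [2, 2, 2, 2, 2, 2]], dtype=float)
    labels_true = rng.integers(0, 3, size=300)
    X = patterns[labels_true] + rng.normal(scale=0.8, size=(300, 6))

    # Unsupervised detection of clusters of co-expressed genes.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("Cluster sizes:", np.bincount(kmeans.labels_))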

Collaborative work among molecular biologists, bioinformaticians, statisticians and computer scientists is important to perform an experiment correctly, going from planning, passing through data generation and analysis, and ending with biological interpretation of the results.

Use of computationally intensive methods

On the other hand, the advent of modern computer technology and relatively cheap computing resources have enabled computer-intensive biostatistical methods like bootstrapping and other re-sampling techniques.
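
A minimal bootstrap sketch in Python with NumPy (the sample itself is simulated) computes a percentile confidence interval for a mean by resampling with replacement:

    import numpy as np

    rng = np.random.default_rng(4)
    # Illustrative sample, e.g. a biomarker measured in 50 patients.
    sample = rng.gamma(shape=2.0, scale=1.5, size=50)

    # Non-parametric bootstrap: resample with replacement and recompute the statistic.
    n_boot = 5000
    boot_means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(n_boot)
    ])

    # Percentile 95% confidence interval for the mean.
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    print(f"Sample mean: {sample.mean():.2f}  95% CI: ({lower:.2f}, {upper:.2f})")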

In recent times, random forests have gained popularity as a method for performing statistical classification. Random forest techniques generate a panel of decision trees. Decision trees have the advantage that they can be drawn and interpreted even with a basic understanding of mathematics and statistics. Random forests have thus been used for clinical decision support systems.[citation needed]
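
A minimal sketch, assuming scikit-learn and simulated clinical-style data, fits a random forest classifier and reports cross-validated accuracy:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(5)
    # Simulated clinical-style data: 200 patients, 12 features, binary outcome.
    X = rng.normal(size=(200, 12))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=200) > 0).astype(int)

    # A panel of decision trees; individual trees remain inspectable.
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"Cross-validated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")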

Applications

Public health

Public health applications include epidemiology, health services research, nutrition, environmental health, and health care policy & management. In these medical contexts, it is important to consider the design and analysis of clinical trials. One example is the assessment of the severity state of a patient with a prognosis of the outcome of a disease.

With new technologies and genetics knowledge, biostatistics is now also used for systems medicine, which consists of a more personalized medicine. For this, data from different sources are integrated, including conventional patient data, clinico-pathological parameters, and molecular and genetic data, as well as data generated by additional new omics technologies.

Quantitative genetics

Quantitative genetics draws on population genetics and statistical genetics in order to link variation in genotype with variation in phenotype. In other words, it is desirable to discover the genetic basis of a measurable trait, a quantitative trait, that is under polygenic control. A genome region that is responsible for a continuous trait is called a quantitative trait locus (QTL). The study of QTLs became feasible by using molecular markers and measuring traits in populations, but their mapping requires obtaining a population from an experimental cross, such as an F2 or recombinant inbred strains/lines (RILs). To scan for QTL regions in a genome, a gene map based on linkage has to be built. Some of the best-known QTL mapping algorithms are Interval Mapping, Composite Interval Mapping, and Multiple Interval Mapping.

However, QTL mapping resolution is impaired by the amount of recombination assayed, a problem for species in which it is difficult to obtain large offspring. Furthermore, allele diversity is restricted to individuals originating from contrasting parents, which limits studies of allele diversity in a panel of individuals representing a natural population. For this reason, genome-wide association studies (GWAS) were proposed in order to identify QTLs based on linkage disequilibrium, that is, the non-random association between traits and molecular markers. They were leveraged by the development of high-throughput SNP genotyping.
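
A minimal sketch of a single-marker association scan, assuming SciPy and entirely simulated genotypes and phenotypes; real GWAS analyses additionally adjust for population structure and other covariates:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n_individuals, n_snps = 500, 1000
    # Genotypes coded as 0/1/2 copies of the minor allele; one causal SNP (index 0).
    genotypes = rng.binomial(2, 0.3, size=(n_individuals, n_snps)).astype(float)
    phenotype = 0.5 * genotypes[:, 0] + rng.normal(size=n_individuals)

    # Single-marker association scan: regress the trait on each SNP separately.
    p_values = np.array([
        stats.linregress(genotypes[:, j], phenotype).pvalue
        for j in range(n_snps)
    ])

    # A stringent threshold (Bonferroni over all tested SNPs).
    threshold = 0.05 / n_snps
    print("SNPs passing the threshold:", np.flatnonzero(p_values < threshold))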

In animal and plant breeding, the use of markers in selection, mainly molecular markers, contributed to the development of marker-assisted selection. While QTL mapping is limited by resolution, GWAS does not have enough power for rare variants of small effect that are also influenced by the environment. So, the concept of Genomic Selection (GS) arose, in order to use all molecular markers in the selection and to allow the prediction of the performance of candidates in this selection. The proposal is to genotype and phenotype a training population and develop a model that can obtain the genomic estimated breeding values (GEBVs) of individuals belonging to a genotyped but not phenotyped population, called the testing population. This kind of study can also include a validation population, following the concept of cross-validation, in which the real phenotypes measured in this population are compared with the predicted phenotypes, which is used to check the accuracy of the model.
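
A minimal genomic selection sketch, assuming scikit-learn and simulated marker data; ridge regression over all markers (closely related to GBLUP) is used here as one common shrinkage approach for obtaining GEBVs of a testing population:

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(7)
    n_train, n_test, n_markers = 300, 100, 2000
    # Marker matrices for a phenotyped training population and a genotyped-only testing population.
    M_train = rng.binomial(2, 0.4, size=(n_train, n_markers)).astype(float)
    M_test = rng.binomial(2, 0.4, size=(n_test, n_markers)).astype(float)

    # Simulated true marker effects (only a small fraction of markers affect the trait).
    true_effects = np.zeros(n_markers)
    true_effects[rng.choice(n_markers, size=50, replace=False)] = rng.normal(scale=0.3, size=50)
    y_train = M_train @ true_effects + rng.normal(size=n_train)

    # Ridge regression over all markers.
    model = Ridge(alpha=100.0).fit(M_train, y_train)
    gebv_test = model.predict(M_test)   # genomic estimated breeding values (GEBVs)

    # With a validation population, accuracy is the correlation of GEBVs with observed phenotypes.
    y_test_true = M_test @ true_effects + rng.normal(size=n_test)
    print("Prediction accuracy:", np.corrcoef(gebv_test, y_test_true)[0, 1])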

In summary, applications of quantitative genetics range from QTL mapping and genome-wide association studies to genomic selection in breeding programs.

Expression data

Studies of differential gene expression from RNA-Seq data, as with RT-qPCR and microarrays, demand comparison of conditions. The goal is to identify genes that show a significant change in abundance between different conditions. Experiments are then designed appropriately, with replicates for each condition/treatment, randomization, and blocking when necessary. In RNA-Seq, the quantification of expression uses the information of mapped reads summarized in some genetic unit, such as exons that are part of a gene sequence. While microarray results can be approximated by a normal distribution, RNA-Seq count data are better explained by other distributions. The first distribution used was the Poisson one, but it underestimates the sample error, leading to false positives. Currently, biological variation is accounted for by methods that estimate a dispersion parameter of a negative binomial distribution. Generalized linear models are used to perform the tests of statistical significance, and since the number of genes is high, multiple testing correction has to be considered. Other examples of analysis on genomics data come from microarray or proteomics experiments, often concerning diseases or disease stages.
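
A minimal sketch, assuming statsmodels and invented read counts for a single gene, fits a negative binomial generalized linear model with a fixed dispersion; dedicated RNA-Seq packages instead estimate the dispersion by sharing information across all genes:

    import numpy as np
    import statsmodels.api as sm

    # Invented read counts for one gene: 4 control and 4 treated replicates.
    counts = np.array([110, 95, 130, 120, 210, 260, 240, 230])
    condition = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = control, 1 = treated

    # Design matrix: intercept plus a treatment indicator.
    design = sm.add_constant(condition.astype(float))

    # Negative binomial GLM with a fixed dispersion parameter (alpha), purely for illustration.
    model = sm.GLM(counts, design, family=sm.families.NegativeBinomial(alpha=0.05))
    result = model.fit()

    print("Estimated coefficients:", result.params)
    print("Treatment effect p-value:", result.pvalues[1])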


Tools

There are many tools that can be used for statistical analysis of biological data. Most of them are also useful in other areas of knowledge, covering a large number of applications. Here are brief descriptions of some of them:
  • CycDesigN: A computer package developed by VSNi that helps researchers create experimental designs and analyze data coming from a design present in one of the classes handled by CycDesigN. These classes include resolvable, non-resolvable, partially replicated, and crossover designs, as well as less-used designs such as Latinized ones (e.g., the t-Latinized design).
  • SAS: Widely used data analysis software, employed in universities, services, and industry. Developed by the company of the same name (SAS Institute), it uses the SAS language for programming.
  • R: An open-source environment and programming language dedicated to statistical computing and graphics. It is an implementation of the S language maintained by CRAN. In addition to its functions to read data tables, compute descriptive statistics, and develop and evaluate models, its repositories contain packages developed by researchers around the world. This allows the development of functions written to deal with the statistical analysis of data from specific applications. In the case of bioinformatics, for example, there are packages located in the main repository (CRAN) and in others, such as Bioconductor. It is also possible to use packages under development that are shared on hosting services such as GitHub.
  • ASReml: Another software package developed by VSNi that can also be used in the R environment as a package. It is developed to estimate variance components under a general linear mixed model using restricted maximum likelihood (REML). Models with fixed and random effects, and nested or crossed ones, are allowed. It also makes it possible to investigate different variance-covariance matrix structures.
  • Weka: A Java software package for machine learning and data mining, including tools and methods for visualization, clustering, regression, association rules, and classification. There are tools for cross-validation and bootstrapping, and a module for algorithm comparison. Weka can also be run from other programming languages such as Perl or R.
  • Orange: A programming interface for high-level data processing, data mining, and data visualization. It includes tools for gene expression and genomics.

Scope and training programs

Almost all educational programmes in biostatistics are at postgraduate level. They are most often found in schools of public health, affiliated with schools of medicine, forestry, or agriculture, or as a focus of application in departments of statistics. 

In the United States, where several universities have dedicated biostatistics departments, many other top-tier universities integrate biostatistics faculty into statistics or other departments, such as epidemiology. Thus, departments carrying the name "biostatistics" may exist under quite different structures. For instance, relatively new biostatistics departments have been founded with a focus on bioinformatics and computational biology, whereas older departments, typically affiliated with schools of public health, will have more traditional lines of research involving epidemiological studies and clinical trials as well as bioinformatics. In larger universities around the world, where both a statistics and a biostatistics department exist, the degree of integration between the two departments may range from the bare minimum to very close collaboration. In general, the difference between a statistics program and a biostatistics program is twofold: (i) statistics departments will often host theoretical/methodological research which is less common in biostatistics programs, and (ii) statistics departments have lines of research that may include biomedical applications but also other areas such as industry (quality control), business and economics, and biological areas other than medicine.
