A Medley of Potpourri

Thursday, October 25, 2018

Computational phylogenetics

From Wikipedia, the free encyclopedia

Computational phylogenetics is the application of computational algorithms, methods, and programs to phylogenetic analyses. The goal is to assemble a phylogenetic tree representing a hypothesis about the evolutionary ancestry of a set of genes, species, or other taxa. For example, these techniques have been used to explore the family tree of hominid species and the relationships between specific genes shared by many types of organisms. Traditional phylogenetics relies on morphological data obtained by measuring and quantifying the phenotypic properties of representative organisms, while the more recent field of molecular phylogenetics uses nucleotide sequences encoding genes or amino acid sequences encoding proteins as the basis for classification. Many forms of molecular phylogenetics are closely related to and make extensive use of sequence alignment in constructing and refining phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The phylogenetic trees constructed by computational methods are unlikely to perfectly reproduce the evolutionary tree that represents the historical relationships between the species being analyzed. The historical species tree may also differ from the historical tree of an individual homologous gene shared by those species.

Types of phylogenetic trees and networks

Phylogenetic trees generated by computational phylogenetics can be either rooted or unrooted depending on the input data and the algorithm used. A rooted tree is a directed graph that explicitly identifies a most recent common ancestor (MRCA), usually an imputed sequence that is not represented in the input. Genetic distance measures can be used to plot a tree with the input sequences as leaf nodes and their distances from the root proportional to their genetic distance from the hypothesized MRCA. Identification of a root usually requires the inclusion in the input data of at least one "outgroup" known to be only distantly related to the sequences of interest.

By contrast, unrooted trees plot the distances and relationships between input sequences without making assumptions regarding their descent. An unrooted tree can always be produced from a rooted tree, but a root cannot usually be placed on an unrooted tree without additional data on divergence rates, such as the assumption of the molecular clock hypothesis.

The set of all possible phylogenetic trees for a given group of input sequences can be conceptualized as a discretely defined multidimensional "tree space" through which search paths can be traced by optimization algorithms. Although counting the total number of trees for a nontrivial number of input sequences can be complicated by variations in the definition of a tree topology, it is always true that there are more rooted than unrooted trees for a given number of inputs and choice of parameters.

Both rooted and unrooted phylogenetic trees can be further generalized to rooted or unrooted phylogenetic networks, which allow for the modeling of evolutionary phenomena such as hybridization or horizontal gene transfer.

Coding characters and defining homology

Morphological analysis

The basic problem in morphological phylogenetics is the assembly of a matrix representing a mapping from each of the taxa being compared to representative measurements for each of the phenotypic characteristics being used as a classifier. The types of phenotypic data used to construct this matrix depend on the taxa being compared; for individual species, they may involve measurements of average body size, lengths or sizes of particular bones or other physical features, or even behavioral manifestations. Of course, since not every possible phenotypic characteristic could be measured and encoded for analysis, the selection of which features to measure is a major inherent obstacle to the method. The decision of which traits to use as a basis for the matrix necessarily represents a hypothesis about which traits of a species or higher taxon are evolutionarily relevant. Morphological studies can be confounded by examples of convergent evolution of phenotypes. A major challenge in constructing useful classes is the high likelihood of inter-taxon overlap in the distribution of the phenotype's variation. The inclusion of extinct taxa in morphological analysis is often difficult due to absence of or incomplete fossil records, but has been shown to have a significant effect on the trees produced; in one study only the inclusion of extinct species of apes produced a morphologically derived tree that was consistent with that produced from molecular data.

Some phenotypic classifications, particularly those used when analyzing very diverse groups of taxa, are discrete and unambiguous; classifying organisms as possessing or lacking a tail, for example, is straightforward in the majority of cases, as is counting features such as eyes or vertebrae. However, the most appropriate representation of continuously varying phenotypic measurements is a controversial problem without a general solution. A common method is simply to sort the measurements of interest into two or more classes, rendering continuous observed variation as discretely classifiable (e.g., all examples with humerus bones longer than a given cutoff are scored as members of one state, and all members whose humerus bones are shorter than the cutoff are scored as members of a second state). This results in an easily manipulated data set but has been criticized for poor reporting of the basis for the class definitions and for sacrificing information compared to methods that use a continuous weighted distribution of measurements.

Because morphological data is extremely labor-intensive to collect, whether from literature sources or from field observations, reuse of previously compiled data matrices is not uncommon, although this may propagate flaws in the original matrix into multiple derivative analyses.

Molecular analysis

The problem of character coding is very different in molecular analyses, as the characters in biological sequence data are immediate and discretely defined - distinct nucleotides in DNA or RNA sequences and distinct amino acids in protein sequences. However, defining homology can be challenging due to the inherent difficulties of multiple sequence alignment. For a given gapped MSA, several rooted phylogenetic trees can be constructed that vary in their interpretations of which changes are "mutations" versus ancestral characters, and which events are insertion mutations or deletion mutations. For example, given only a pairwise alignment with a gap region, it is impossible to determine whether one sequence bears an insertion mutation or the other carries a deletion. The problem is magnified in MSAs with unaligned and nonoverlapping gaps. In practice, sizable regions of a calculated alignment may be discounted in phylogenetic tree construction to avoid integrating noisy data into the tree calculation.

Distance-matrix methods

Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore they require an MSA as an input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches. Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of multiple sequence alignments. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.

UPGMA and WPGMA

The UPGMA (Unweighted Pair Group Method with Arithmetic mean) and WPGMA (Weighted Pair Group Method with Arithmetic mean) methods produce rooted trees and require a constant-rate assumption - that is, it assumes an ultrametric tree in which the distances from the root to every branch tip are equal.

Neighbor-joining

Neighbor-joining methods apply general cluster analysis techniques to sequence analysis using genetic distance as a clustering metric. The simple neighbor-joining method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a molecular clock) across lineages.

Fitch-Margoliash method

The Fitch-Margoliash method uses a weighted least squares method for clustering based on genetic distance. Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The distances calculated by this method must be linear; the linearity criterion for distances requires that the expected values of the branch lengths for two individual branches must equal the expected value of the sum of the two branch distances - a property that applies to biological sequences only when they have been corrected for the possibility of back mutations at individual sites. This correction is done through the use of a substitution matrix such as that derived from the Jukes-Cantor model of DNA evolution. The distance correction is only necessary in practice when the evolution rates differ among branches. Another modification of the algorithm can be helpful, especially in case of concentrated distances (please report to concentration of measure phenomenon and curse of dimensionality): that modification, described in, has been shown to improve the efficiency of the algorithm and its robustness.

The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is NP-complete, so heuristic search methods like those used in maximum-parsimony analysis are applied to the search through tree space.

Using outgroups

Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one outgroup sequence known to be only distantly related to the sequences of interest in the query set. This usage can be seen as a type of experimental control. If the outgroup has been appropriately chosen, it will have a much greater genetic distance and thus a longer branch length than any other sequence, and it will appear near the root of a rooted tree. Choosing an appropriate outgroup requires the selection of a sequence that is moderately related to the sequences of interest; too close a relationship defeats the purpose of the outgroup and too distant adds noise to the analysis. Care should also be taken to avoid situations in which the species from which the sequences were taken are distantly related, but the gene encoded by the sequences is highly conserved across lineages. Horizontal gene transfer, especially between otherwise divergent bacteria, can also confound outgroup usage.

Maximum parsimony

Maximum parsimony (MP) is a method of identifying the potential phylogenetic tree that requires the smallest total number of evolutionary events to explain the observed sequence data. Some ways of scoring trees also include a "cost" associated with particular types of evolutionary events and attempt to locate the tree with the smallest total cost. This is a useful approach in cases where not every possible type of event is equally likely - for example, when particular nucleotides or amino acids are known to be more mutable than others.

The most naive way of identifying the most parsimonious tree is simple enumeration - considering each possible tree in succession and searching for the tree with the smallest score. However, this is only possible for a relatively small number of sequences or species because the problem of identifying the most parsimonious tree is known to be NP-hard; consequently a number of heuristic search methods for optimization have been developed to locate a highly parsimonious tree, if not the best in the set. Most such methods involve a steepest descent-style minimization mechanism operating on a tree rearrangement criterion.

Branch and bound

The branch and bound algorithm is a general method used to increase the efficiency of searches for near-optimal solutions of NP-hard problems first applied to phylogenetics in the early 1980s. Branch and bound is particularly well suited to phylogenetic tree construction because it inherently requires dividing a problem into a tree structure as it subdivides the problem space into smaller regions. As its name implies, it requires as input both a branching rule (in the case of phylogenetics, the addition of the next species or sequence to the tree) and a bound (a rule that excludes certain regions of the search space from consideration, thereby assuming that the optimal solution cannot occupy that region). Identifying a good bound is the most challenging aspect of the algorithm's application to phylogenetics. A simple way of defining the bound is a maximum number of assumed evolutionary changes allowed per tree. A set of criteria known as Zharkikh's rules severely limit the search space by defining characteristics shared by all candidate "most parsimonious" trees. The two most basic rules require the elimination of all but one redundant sequence (for cases where multiple observations have produced identical data) and the elimination of character sites at which two or more states do not occur in at least two species. Under ideal conditions these rules and their associated algorithm would completely define a tree.

Sankoff-Morel-Cedergren algorithm

The Sankoff-Morel-Cedergren algorithm was among the first published methods to simultaneously produce an MSA and a phylogenetic tree for nucleotide sequences. The method uses a maximum parsimony calculation in conjunction with a scoring function that penalizes gaps and mismatches, thereby favoring the tree that introduces a minimal number of such events (an alternative view holds that the trees to be favored are those that maximize the amount of sequence similarity that can be interpreted as homology, a point of view that may lead to different optimal trees). The imputed sequences at the interior nodes of the tree are scored and summed over all the nodes in each possible tree. The lowest-scoring tree sum provides both an optimal tree and an optimal MSA given the scoring function. Because the method is highly computationally intensive, an approximate method in which initial guesses for the interior alignments are refined one node at a time. Both the full and the approximate version are in practice calculated by dynamic programming.

MALIGN and POY

More recent phylogenetic tree/MSA methods use heuristics to isolate high-scoring, but not necessarily optimal, trees. The MALIGN method uses a maximum-parsimony technique to compute a multiple alignment by maximizing a cladogram score, and its companion POY uses an iterative method that couples the optimization of the phylogenetic tree with improvements in the corresponding MSA. However, the use of these methods in constructing evolutionary hypotheses has been criticized as biased due to the deliberate construction of trees reflecting minimal evolutionary events. This, in turn, has been countered by the view that such methods should be seen as heuristic approaches to find the trees that maximize the amount of sequence similarity that can be interpreted as homology.

Maximum likelihood

The maximum likelihood method uses standard statistical techniques for inferring probability distributions to assign probabilities to particular possible phylogenetic trees. The method requires a substitution model to assess the probability of particular mutations; roughly, a tree that requires more mutations at interior nodes to explain the observed phylogeny will be assessed as having a lower probability. This is broadly similar to the maximum-parsimony method, but maximum likelihood allows additional statistical flexibility by permitting varying rates of evolution across both lineages and sites. In fact, the method requires that evolution at different sites and along different lineages must be statistically independent. Maximum likelihood is thus well suited to the analysis of distantly related sequences, but it is believed to be computationally intractable to compute due to its NP-hardness.

The "pruning" algorithm, a variant of dynamic programming, is often used to reduce the search space by efficiently calculating the likelihood of subtrees. The method calculates the likelihood for each site in a "linear" manner, starting at a node whose only descendants are leaves (that is, the tips of the tree) and working backwards toward the "bottom" node in nested sets. However, the trees produced by the method are only rooted if the substitution model is irreversible, which is not generally true of biological systems. The search for the maximum-likelihood tree also includes a branch length optimization component that is difficult to improve upon algorithmically; general global optimization tools such as the Newton-Raphson method are often used.

Bayesian inference

Bayesian inference can be used to produce phylogenetic trees in a manner closely related to the maximum likelihood methods. Bayesian methods assume a prior probability distribution of the possible trees, which may simply be the probability of any one tree among all the possible trees that could be generated from the data, or may be a more sophisticated estimate derived from the assumption that divergence events such as speciation occur as stochastic processes. The choice of prior distribution is a point of contention among users of Bayesian-inference phylogenetics methods.

Implementations of Bayesian methods generally use Markov chain Monte Carlo sampling algorithms, although the choice of move set varies; selections used in Bayesian phylogenetics include circularly permuting leaf nodes of a proposed tree at each step and swapping descendant subtrees of a random internal node between two related trees. The use of Bayesian methods in phylogenetics has been controversial, largely due to incomplete specification of the choice of move set, acceptance criterion, and prior distribution in published work. Bayesian methods are generally held to be superior to parsimony-based methods; they can be more prone to long-branch attraction than maximum likelihood techniques, although they are better able to accommodate missing data.

Whereas likelihood methods find the tree that maximizes the probability of the data, a Bayesian approach recovers a tree that represents the most likely clades, by drawing on the posterior distribution. However, estimates of the posterior probability of clades (measuring their 'support') can be quite wide of the mark, especially in clades that aren't overwhelmingly likely. As such, other methods have been put forwards to estimate posterior probability.

Model selection

Molecular phylogenetics methods rely on a defined substitution model that encodes a hypothesis about the relative rates of mutation at various sites along the gene or amino acid sequences being studied. At their simplest, substitution models aim to correct for differences in the rates of transitions and transversions in nucleotide sequences. The use of substitution models is necessitated by the fact that the genetic distance between two sequences increases linearly only for a short time after the two sequences diverge from each other (alternatively, the distance is linear only shortly before coalescence). The longer the amount of time after divergence, the more likely it becomes that two mutations occur at the same nucleotide site. Simple genetic distance calculations will thus undercount the number of mutation events that have occurred in evolutionary history. The extent of this undercount increases with increasing time since divergence, which can lead to the phenomenon of long branch attraction, or the misassignment of two distantly related but convergently evolving sequences as closely related. The maximum parsimony method is particularly susceptible to this problem due to its explicit search for a tree representing a minimum number of distinct evolutionary events.

Types of models

All substitution models assign a set of weights to each possible change of state represented in the sequence. The most common model types are implicitly reversible because they assign the same weight to, for example, a G>C nucleotide mutation as to a C>G mutation. The simplest possible model, the Jukes-Cantor model, assigns an equal probability to every possible change of state for a given nucleotide base. The rate of change between any two distinct nucleotides will be one-third of the overall substitution rate. More advanced models distinguish between transitions and transversions. The most general possible time-reversible model, called the GTR model, has six mutation rate parameters. An even more generalized model known as the general 12-parameter model breaks time-reversibility, at the cost of much additional complexity in calculating genetic distances that are consistent among multiple lineages. One possible variation on this theme adjusts the rates so that overall GC content - an important measure of DNA double helix stability - varies over time.

Models may also allow for the variation of rates with positions in the input sequence. The most obvious example of such variation follows from the arrangement of nucleotides in protein-coding genes into three-base codons. If the location of the open reading frame (ORF) is known, rates of mutation can be adjusted for position of a given site within a codon, since it is known that wobble base pairing can allow for higher mutation rates in the third nucleotide of a given codon without affecting the codon's meaning in the genetic code. A less hypothesis-driven example that does not rely on ORF identification simply assigns to each site a rate randomly drawn from a predetermined distribution, often the gamma distribution or log-normal distribution. Finally, a more conservative estimate of rate variations known as the covarion method allows autocorrelated variations in rates, so that the mutation rate of a given site is correlated across sites and lineages.

Choosing the best model

The selection of an appropriate model is critical for the production of good phylogenetic analyses, both because underparameterized or overly restrictive models may produce aberrant behavior when their underlying assumptions are violated, and because overly complex or overparameterized models are computationally expensive and the parameters may be overfit. The most common method of model selection is the likelihood ratio test (LRT), which produces a likelihood estimate that can be interpreted as a measure of "goodness of fit" between the model and the input data. However, care must be taken in using these results, since a more complex model with more parameters will always have a higher likelihood than a simplified version of the same model, which can lead to the naive selection of models that are overly complex. For this reason model selection computer programs will choose the simplest model that is not significantly worse than more complex substitution models. A significant disadvantage of the LRT is the necessity of making a series of pairwise comparisons between models; it has been shown that the order in which the models are compared has a major effect on the one that is eventually selected.

An alternative model selection method is the Akaike information criterion (AIC), formally an estimate of the Kullback–Leibler divergence between the true model and the model being tested. It can be interpreted as a likelihood estimate with a correction factor to penalize overparameterized models. The AIC is calculated on an individual model rather than a pair, so it is independent of the order in which models are assessed. A related alternative, the Bayesian information criterion (BIC), has a similar basic interpretation but penalizes complex models more heavily.

A comprehensive step-by-step protocol on constructing phylogenetic tree, including DNA/Amino Acid contiguous sequence assembly, multiple sequence alignment, model-test (testing best-fitting substitution models) and phylogeny reconstruction using Maximum Likelihood and Bayesian Inference, is available at Nature Protocol

A non traditional way of evaluating the phylogenetic tree is to compare it with clustering result. One can use a Multidimensional Scaling technique, so called Interpolative Joining to do dimensionality reduction to visualize the clustering result for the sequences in 3D, and then map the phylogenetic tree onto the clustering result. A better tree usually has a higher correlation with the clustering result.

Evaluating tree support

As with all statistical analysis, the estimation of phylogenies from character data requires an evaluation of confidence. A number of methods exist to test the amount of support for a phylogenetic tree, either by evaluating the support for each sub-tree in the phylogeny (nodal support) or evaluating whether the phylogeny is significantly different from other possible trees (alternative tree hypothesis tests).

Nodal support

The most common method for assessing tree support is to evaluate the statistical support for each node on the tree. Typically, a node with very low support is not considered valid in further analysis, and visually may be collapsed into a polytomy to indicate that relationships within a clade are unresolved.

Consensus tree

Many methods for assessing nodal support involve consideration of multiple phylogenies. The consensus tree summarizes the nodes that are shared among a set of trees. In a *strict consensus,* only nodes found in every tree are shown, and the rest are collapsed into an unresolved polytomy. Less conservative methods, such as the *majority-rule consensus* tree, consider nodes that are supported by a given percentage of trees under consideration (such as at least 50%).

For example, in maximum parsimony analysis, there may be many trees with the same parsimony score. A strict consensus tree would show which nodes are found in all equally parsimonious trees, and which nodes differ. Consensus trees are also used to evaluate support on phylogenies reconstructed with Bayesian inference (see below).

Bootstrapping and jackknifing

In statistics, the bootstrap is a method for inferring the variability of data that has an unknown distribution using pseudoreplications of the original data. For example, given a set of 100 data points, a pseudoreplicate is a data set of the same size (100 points) randomly sampled from the original data, with replacement. That is, each original data point may be represented more than once in the pseudoreplicate, or not at all. Statistical support involves evaluation of whether the original data has similar properties to a large set of pseudoreplicates.

In phylogenetics, bootstrapping is conducted using the columns of the character matrix. Each pseudoreplicate contains the same number of species (rows) and characters (columns) randomly sampled from the original matrix, with replacement. A phylogeny is reconstructed from each pseudoreplicate, with the same methods used to reconstruct the phylogeny from the original data. For each node on the phylogeny, the nodal support is the percentage of pseudoreplicates containing that node.

The statistical rigor of the bootstrap test has been empirically evaluated using viral populations with known evolutionary histories, finding that 70% bootstrap support corresponds to a 95% probability that the clade exists. However, this was tested under ideal conditions (e.g. no change in evolutionary rates, symmetric phylogenies). In practice, values above 70% are generally supported and left to the researcher or reader to evaluate confidence. Nodes with support lower than 70% are typically considered unresolved.

Jackknifing in phylogenetics is a similar procedure, except the columns of the matrix are sampled without replacement. Pseudoreplicates are generated by randomly subsampling the data—for example, a "10% jackknife" would involve randomly sampling 10% of the matrix many times to evaluate nodal support.

Posterior probability

Reconstruction of phylogenies using Bayesian inference generates a posterior distribution of highly probable trees given the data and evolutionary model, rather than a single "best" tree. The trees in the posterior distribution generally have many different topologies. Most Bayesian inference methods utilize a Markov-chain Monte Carlo iteration, and the initial steps of this chain are not considered reliable reconstructions of the phylogeny. Trees generated early in the chain are usually discarded as burn-in. The most common method of evaluating nodal support in a Bayesian phylogenetic analysis is to calculate the percentage of trees in the posterior distribution (post-burn-in) which contain the node.

The statistical support for a node in Bayesian inference is expected to reflect the probability that a clade really exists given the data and evolutionary model. Therefore, the threshold for accepting a node as supported is generally higher than for bootstrapping.

Step counting methods

Bremer support counts the number of extra steps needed to contradict a clade.

Shortcomings

These measures each have their weaknesses. For example, smaller or larger clades tend to attract larger support values than mid-sized clades, simply as a result of the number of taxa in them.

Bootstrap support can provide high estimates of node support as a result of noise in the data rather than the true existence of a clade.

Limitations and workarounds

Ultimately, there is no way to measure whether a particular phylogenetic hypothesis is accurate or not, unless the true relationships among the taxa being examined are already known (which may happen with bacteria or viruses under laboratory conditions). The best result an empirical phylogeneticist can hope to attain is a tree with branches that are well supported by the available evidence. Several potential pitfalls have been identified:

Homoplasy

Certain characters are more likely to evolve convergently than others; logically, such characters should be given less weight in the reconstruction of a tree. Weights in the form of a model of evolution can be inferred from sets of molecular data, so that maximum likelihood or Bayesian methods can be used to analyze them. For molecular sequences, this problem is exacerbated when the taxa under study have diverged substantially. As time since the divergence of two taxa increase, so does the probability of multiple substitutions on the same site, or back mutations, all of which result in homoplasies. For morphological data, unfortunately, the only objective way to determine convergence is by the construction of a tree – a somewhat circular method. Even so, weighting homoplasious characters does indeed lead to better-supported trees. Further refinement can be brought by weighting changes in one direction higher than changes in another; for instance, the presence of thoracic wings almost guarantees placement among the pterygote insects because, although wings are often lost secondarily, there is no evidence that they have been gained more than once.

Horizontal gene transfer

In general, organisms can inherit genes in two ways: vertical gene transfer and horizontal gene transfer. Vertical gene transfer is the passage of genes from parent to offspring, and horizontal (also called lateral) gene transfer occurs when genes jump between unrelated organisms, a common phenomenon especially in prokaryotes; a good example of this is the acquired antibiotic resistance as a result of gene exchange between various bacteria leading to multi-drug-resistant bacterial species. There have also been well-documented cases of horizontal gene transfer between eukaryotes.
Horizontal gene transfer has complicated the determination of phylogenies of organisms, and inconsistencies in phylogeny have been reported among specific groups of organisms depending on the genes used to construct evolutionary trees. The only way to determine which genes have been acquired vertically and which horizontally is to parsimoniously assume that the largest set of genes that have been inherited together have been inherited vertically; this requires analyzing a large number of genes.

Hybrids, speciation, introgressions and incomplete lineage sorting

The basic assumption underlying the mathematical model of cladistics is a situation where species split neatly in bifurcating fashion. While such an assumption may hold on a larger scale (bar horizontal gene transfer, see above), speciation is often much less orderly. Research since the cladistic method was introduced has shown that hybrid speciation, once thought rare, is in fact quite common, particularly in plants. Also paraphyletic speciation is common, making the assumption of a bifurcating pattern unsuitable, leading to phylogenetic networks rather than trees. Introgression can also move genes between otherwise distinct species and sometimes even genera, complicating phylogenetic analysis based on genes. This phenomenon can contribute to "incomplete lineage sorting" and is thought to be a common phenomenon across a number of groups. In species level analysis this can be dealt with by larger sampling or better whole genome analysis. Often the problem is avoided by restricting the analysis to fewer, not closely related specimens.

Taxon sampling

Owing to the development of advanced sequencing techniques in molecular biology, it has become feasible to gather large amounts of data (DNA or amino acid sequences) to infer phylogenetic hypotheses. For example, it is not rare to find studies with character matrices based on whole mitochondrial genomes (~16,000 nucleotides, in many animals). However, simulations have shown that it is more important to increase the number of taxa in the matrix than to increase the number of characters, because the more taxa there are, the more accurate and more robust is the resulting phylogenetic tree. This may be partly due to the breaking up of long branches.

Phylogenetic signal

Another important factor that affects the accuracy of tree reconstruction is whether the data analyzed actually contain a useful phylogenetic signal, a term that is used generally to denote whether a character evolves slowly enough to have the same state in closely related taxa as opposed to varying randomly. Tests for phylogenetic signal exist.

Continuous characters

Morphological characters that sample a continuum may contain phylogenetic signal, but are hard to code as discrete characters. Several methods have been used, one of which is gap coding, and there are variations on gap coding. In the original form of gap coding:

group means for a character are first ordered by size. The pooled within-group standard deviation is calculated ... and differences between adjacent means ... are compared relative to this standard deviation. Any pair of adjacent means is considered different and given different integer scores ... if the means are separated by a "gap" greater than the within-group standard deviation ... times some arbitrary constant.

If more taxa are added to the analysis, the gaps between taxa may become so small that all information is lost. Generalized gap coding works around that problem by comparing individual pairs of taxa rather than considering one set that contains all of the taxa.

Missing data

In general, the more data that are available when constructing a tree, the more accurate and reliable the resulting tree will be. Missing data are no more detrimental than simply having fewer data, although the impact is greatest when most of the missing data are in a small number of taxa. Concentrating the missing data across a small number of characters produces a more robust tree.

The role of fossils

Because many characters involve embryological, or soft-tissue or molecular characters that (at best) hardly ever fossilize, and the interpretation of fossils is more ambiguous than that of living taxa, extinct taxa almost invariably have higher proportions of missing data than living ones. However, despite these limitations, the inclusion of fossils is invaluable, as they can provide information in sparse areas of trees, breaking up long branches and constraining intermediate character states; thus, fossil taxa contribute as much to tree resolution as modern taxa. Fossils can also constrain the age of lineages and thus demonstrate how consistent a tree is with the stratigraphic record; stratocladistics incorporates age information into data matrices for phylogenetic analyses.

Computational biology

From Wikipedia, the free encyclopedia

Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, ecological, behavioral, and social systems. The field is broadly defined and includes foundations in biology, applied mathematics, statistics, biochemistry, chemistry, biophysics, molecular biology, genetics, genomics, computer science and evolution.

Computational biology is different from biological computing, which is a subfield of computer science and computer engineering using bioengineering and biology to build computers, but is similar to bioinformatics, which is an interdisciplinary science using computers to store and process biological data.

Introduction

Computational Biology, which includes many aspects of bioinformatics, is the science of using biological data to develop algorithms or models to understand biological systems and relationships. Until recently, biologists did not have access to very large amounts of data. This data has now become commonplace, particularly in molecular biology and genomics. Researchers were able to develop analytical methods for interpreting biological information, but were unable to share them quickly among colleagues.

Bioinformatics began to develop in the early 1970s. It was considered the science of analyzing informatics processes of various biological systems. At this time, research in artificial intelligence was using network models of the human brain in order to generate new algorithms. This use of biological data to develop other fields pushed biological researchers to revisit the idea of using computers to evaluate and compare large data sets. By 1982, information was being shared among researchers through the use of punch cards. The amount of data being shared began to grow exponentially by the end of the 1980s. This required the development of new computational methods in order to quickly analyze and interpret relevant information.

Since the late 1990s, computational biology has become an important part of developing emerging technologies for the field of biology. The terms computational biology and evolutionary computation have a similar name, but are not to be confused. Unlike computational biology, evolutionary computation is not concerned with modeling and analyzing biological data. It instead creates algorithms based on the ideas of evolution across species. Sometimes referred to as genetic algorithms, the research of this field can be applied to computational biology. While evolutionary computation is not inherently a part of computational biology, Computational evolutionary biology is a subfield of it.

Computational biology has been used to help sequence the human genome, create accurate models of the human brain, and assist in modeling biological systems.

Subfields

Computational anatomy

Computational anatomy is a discipline focusing on the study of anatomical shape and form at the visible or gross anatomical

50-100\mu

scale of morphology. It involves the development and application of computational, mathematical and data-analytical methods for modeling and simulation of biological structures. It focuses on the anatomical structures being imaged, rather than the medical imaging devices. Due to the availability of dense 3D measurements via technologies such as magnetic resonance imaging (MRI), computational anatomy has emerged as a subfield of medical imaging and bioengineering for extracting anatomical coordinate systems at the morphome scale in 3D.

The original formulation of computational anatomy is as a generative model of shape and form from exemplars acted upon via transformations. The diffeomorphism group is used to study different coordinate systems via coordinate transformations as generated via the Lagrangian and Eulerian velocities of flow from one anatomical configuration in

{\mathbb {R} }^{3}

to another. It relates with shape statistics and morphometrics, with the distinction that diffeomorphisms are used to map coordinate systems, whose study is known as diffeomorphometry.

Computational biomodeling

Computational biomodeling is a field concerned with building computer models of biological systems. Computational biomodeling aims to develop and use visual simulations in order to assess the complexity of biological systems. This is accomplished through the use of specialized algorithms, and visualization software. These models allow for prediction of how systems will react under different environments. This is useful for determining if a system is robust. A robust biological system is one that “maintain their state and functions against external and internal perturbations”, which is essential for a biological system to survive. Computational biomodeling generates a large archive of such data, allowing for analysis from multiple users. While current techniques focus on small biological systems, researchers are working on approaches that will allow for larger networks to be analyzed and modeled. A majority of researchers believe that this will be essential in developing modern medical approaches to creating new drugs and gene therapy. A useful modelling approach is to use Petri nets via tools such as esyN

Computational genomics (Computational genetics)

A partially sequenced genome.

Computational genomics is a field within genomics which studies the genomes of cells and organisms. It is sometimes referred to as Computational and Statistical Genetics and encompasses much of bioinformatics. The Human Genome Project is one example of computational genomics. This project looks to sequence the entire human genome into a set of data. Once fully implemented, this could allow for doctors to analyze the genome of an individual patient. This opens the possibility of personalized medicine, prescribing treatments based on an individual’s pre-existing genetic patterns. This project has created many similar programs. Researchers are looking to sequence the genomes of animals, plants, bacteria, and all other types of life.

One of the main ways that genomes are compared is by sequence homology. Homology is the study of biological structures and nucleotide sequences in different organisms that come from a common ancestor. Research suggests that between 80 and 90% of genes in newly sequenced prokaryotic genomes can be identified this way.

This field is still in development. An untouched project in the development of computational genomics is the analysis of intergenic regions. Studies show that roughly 97% of the human genome consists of these regions. Researchers in computational genomics are working on understanding the functions of non-coding regions of the human genome through the development of computational and statistical methods and via large consortia projects such as ENCODE (The Encyclopedia of DNA Elements) and the Roadmap Epigenomics Project.

Computational neuroscience

Computational neuroscience is the study of brain function in terms of the information processing properties of the structures that make up the nervous system. It is a subset of the field of neuroscience, and looks to analyze brain data to create practical applications. It looks to model the brain in order to examine specific types aspects of the neurological system. Various types of models of the brain include:

Realistic Brain Models: These models look to represent every aspect of the brain, including as much detail at the cellular level as possible. Realistic models provide the most information about the brain, but also have the largest margin for error. More variables in a brain model create the possibility for more error to occur. These models do not account for parts of the cellular structure that scientists do not know about. Realistic brain models are the most computationally heavy and the most expensive to implement.
Simplifying Brain Models: These models look to limit the scope of a model in order to assess a specific physical property of the neurological system. This allows for the intensive computational problems to be solved, and reduces the amount of potential error from a realistic brain model.

It is the work of computational neuroscientists to improve the algorithms and data structures currently used to increase the speed of such calculations.

Computational pharmacology

Computational pharmacology (from a computational biology perspective) is “the study of the effects of genomic data to find links between specific genotypes and diseases and then screening drug data”. The pharmaceutical industry requires a shift in methods to analyze drug data. Pharmacologists were able to use Microsoft Excel to compare chemical and genomic data related to the effectiveness of drugs. However, the industry has reached what is referred to as the Excel barricade. This arises from the limited number of cells accessible on a spreadsheet. This development led to the need for computational pharmacology. Scientists and researchers develop computational methods to analyze these massive data sets. This allows for an efficient comparison between the notable data points and allows for more accurate drugs to be developed.

Analysts project that if major medications fail due to patents, that computational biology will be necessary to replace current drugs on the market. Doctoral students in computational biology are being encouraged to pursue careers in industry rather than take Post-Doctoral positions. This is a direct result of major pharmaceutical companies needing more qualified analysts of the large data sets required for producing new drugs.

Computational evolutionary biology

Computational biology has assisted the field of evolutionary biology in many capacities. This includes:

Using DNA data to reconstruct the tree of life with computational phylogenetics
Fitting population genetics models (either forward time or backward time) to DNA data to make inferences about demographic or selective history
Building population genetics models of evolutionary systems from first principles in order to predict what is likely to evolve.

Cancer computational biology

Cancer computational biology is a field that aims to determine the future mutations in cancer through an algorithmic approach to analyzing data. Research in this field has led to the use of high-throughput measurement. High throughput measurement allows for the gathering of millions of data points using robotics and other sensing devices. This data is collected from DNA, RNA, and other biological structures. Areas of focus include determining the characteristics of tumors, analyzing molecules that are deterministic in causing cancer, and understanding how the human genome relates to the causation of tumors and cancer.

Computational neuropsychiatry

Computational neuropsychiatry is the emerging field that uses mathematical and computer-assisted modeling of brain mechanisms involved in mental disorders. It was already demonstrated by several initiatives that computational modeling is an important contribution to understand neuronal circuits that could generate mental functions and dysfunctions.

Software and tools

Computational Biologists use a wide range of software. These range from command line programs to graphical and web-based programs.

Open source software

Open source software provides a platform to develop computational biological methods. Specifically, open source means that every person and/or entity can access and benefit from software developed in research. PLOS cites four main reasons for the use of open source software including:

Reproducibility: This allows for researchers to use the exact methods used to calculate the relations between biological data.
Faster Development: developers and researchers do not have to reinvent existing code for minor tasks. Instead they can use pre-existing programs to save time on the development and implementation of larger projects.
Increased quality: Having input from multiple researchers studying the same topic provides a layer of assurance that errors will not be in the code.
Long-term availability: Open source programs are not tied to any businesses or patents. This allows for them to be posted to multiple web pages and ensure that they are available in the future.

Conferences

There are several large conferences that are concerned with computational biology. Some notable examples are Intelligent Systems for Molecular Biology (ISMB), European Conference on Computational Biology (ECCB) and Research in Computational Molecular Biology (RECOMB).

Journals

There are numerous journals dedicated to computational biology. Some notable examples include Journal of Computational Biology and PLOS Computational Biology. The PLOS computational biology journal is a peer-reviewed journal that has many notable research projects in the field of computational biology. They provide reviews on software, tutorials for open source software, and display information on upcoming computational biology conferences. PLOS Computational Biology is an open access journal. The publication may be openly used provided the author is cited. Recently a new open access journal Computational Molecular Biology was launched.

Related fields

Computational biology, bioinformatics and mathematical biology are all interdisciplinary approaches to the life sciences that draw from quantitative disciplines such as mathematics and information science. The NIH describes computational/mathematical biology as the use of computational/mathematical approaches to address theoretical and experimental questions in biology and, by contrast, bioinformatics as the application of information science to understand complex life-sciences data.

Specifically, the NIH defines

Computational biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

While each field is distinct, there may be significant overlap at their interface.

Personalized medicine

From Wikipedia, the free encyclopedia

Personalized medicine, precision medicine, or theranostics is a medical model that separates people into different groups—with medical decisions, practices, interventions and/or products being tailored to the individual patient based on their predicted response or risk of disease. The terms personalized medicine, precision medicine, stratified medicine and P4 medicine are used interchangeably to describe this concept though some authors and organisations use these expressions separately to indicate particular nuances.

While the tailoring of treatment to patients dates back at least to the time of Hippocrates, the term has risen in usage in recent years given the growth of new diagnostic and informatics approaches that provide understanding of the molecular basis of disease, particularly genomics. This provides a clear evidence base on which to stratify (group) related patients.

Development of concept

In personalised medicine, diagnostic testing is often employed for selecting appropriate and optimal therapies based on the context of a patient’s genetic content or other molecular or cellular analysis. The use of genetic information has played a major role in certain aspects of personalized medicine (e.g. pharmacogenomics), and the term was first coined in the context of genetics, though it has since broadened to encompass all sorts of personalization measures.

Background

Basics

Every person has a unique variation of the human genome. Although most of the variation between individuals has no effect on health, an individual's health stems from genetic variation with behaviors and influences from the environment.

Modern advances in personalized medicine rely on technology that confirms a patient's fundamental biology, DNA, RNA, or protein, which ultimately leads to confirming disease. For example, personalised techniques such as genome sequencing can reveal mutations in DNA that influence diseases ranging from cystic fibrosis to cancer. Another method, called RNA-seq, can show which RNA molecules are involved with specific diseases. Unlike DNA, levels of RNA can change in response to the environment. Therefore, sequencing RNA can provide a broader understanding of a person’s state of health. Recent studies have linked genetic differences between individuals to RNA expression, translation, and protein levels.

The concepts of personalised medicine can be applied to new and transformative approaches to health care. Personalised health care is based on the dynamics of systems biology and uses predictive tools to evaluate health risks and to design personalised health plans to help patients mitigate risks, prevent disease and to treat it with precision when it occurs. The concepts of personalised health care are receiving increasing acceptance with the Veterans Administration committing to personalised, proactive patient driven care for all veterans.

Method

In order for physicians to know if a mutation is connected to a certain disease, researchers often do a study called a “genome-wide association study” (GWAS). A GWAS study will look at one disease, and then sequence the genome of many patients with that particular disease to look for shared mutations in the genome. Mutations that are determined to be related to a disease by a GWAS study can then be used to diagnose that disease in future patients, by looking at their genome sequence to find that same mutation. The first GWAS, conducted in 2005, studied patients with age-related macular degeneration (ARMD). It found two different mutations, each containing only a variation in only one nucleotide (called single nucleotide polymorphisms, or SNPs), which were associated with ARMD. GWAS studies like this have been very successful in identifying common genetic variations associated with diseases. As of early 2014, over 1,300 GWAS studies have been completed.

Disease risk assessment

Multiple genes collectively influence the likelihood of developing many common and complex diseases. Personalised medicine can also be used to predict a person’s risk for a particular disease, based on one or even several genes. This approach uses the same sequencing technology to focus on the evaluation of disease risk, allowing the physician to initiate preventative treatment before the disease presents itself in their patient. For example, if it is found that a DNA mutation increases a person’s risk of developing Type 2 Diabetes, this individual can begin lifestyle changes that will lessen their chances of developing Type 2 Diabetes later in life.

Applications

Advances in personalised medicine will create a more unified treatment approach specific to the individual and their genome. Personalised medicine may provide better diagnoses with earlier intervention, and more efficient drug development and therapies.

Diagnosis and intervention

Having the ability to look at a patient on an individual basis will allow for a more accurate diagnosis and specific treatment plan. Genotyping is the process of obtaining an individual’s DNA sequence by using biological assays. By having a detailed account of an individual’s DNA sequence, their genome can then be compared to a reference genome, like that of the Human Genome Project, to assess the existing genetic variations that can account for possible diseases. A number of private companies, such as 23andMe, Navigenics, and Illumina, have created Direct-to-Consumer genome sequencing accessible to the public. Having this information from individuals can then be applied to effectively treat them. An individual’s genetic make-up also plays a large role in how well they respond to a certain treatment, and therefore, knowing their genetic content can change the type of treatment they receive.

An aspect of this is pharmacogenomics, which uses an individual’s genome to provide a more informed and tailored drug prescription. Often, drugs are prescribed with the idea that it will work relatively the same for everyone, but in the application of drugs, there are a number of factors that must be considered. The detailed account of genetic information from the individual will help prevent adverse events, allow for appropriate dosages, and create maximum efficacy with drug prescriptions. The pharmacogenomic process for discovery of genetic variants that predict adverse events to a specific drug has been termed toxgnostics.

An aspect of a theranostic platform applied to personalized medicine can be the use of diagnostic tests to guide therapy. The tests may involve medical imaging such as MRI contrast agents (T1 and T2 agents), fluorescent markers (organic dyes and inorganic quantum dots), and nuclear imaging agents (PET radiotracers or SPECT agents). or in vitro lab test including DNA sequencing and often involve deep learning algorithms that weigh the result of testing for several biomarkers.

In addition to specific treatment, personalised medicine can greatly aid the advancements of preventive care. For instance, many women are already being genotyped for certain mutations in the BRCA1 and BRCA2 gene if they are predisposed because of a family history of breast cancer or ovarian cancer. As more causes of diseases are mapped out according to mutations that exist within a genome, the easier they can be identified in an individual. Measures can then be taken to prevent a disease from developing. Even if mutations were found within a genome, having the details of their DNA can reduce the impact or delay the onset of certain diseases. Having the genetic content of an individual will allow better guided decisions in determining the source of the disease and thus treating it or preventing its progression. This will be extremely useful for diseases like Alzheimer’s or cancers that are thought to be linked to certain mutations in our DNA.

A tool that is being used now to test efficacy and safety of a drug specific to a targeted patient group/sub-group is companion diagnostics. This technology is an assay that is developed during or after a drug is made available on the market and is helpful in enhancing the therapeutic treatment available based on the individual. These companion diagnostics have incorporated the pharmacogenomic information related to the drug into their prescription label in an effort to assist in making the most optimal treatment decision possible for the patient.

Drug development and usage

Having an individual’s genomic information can be significant in the process of developing drugs as they await approval from the FDA for public use. Having a detailed account of an individual’s genetic make-up can be a major asset in deciding if a patient can be chosen for inclusion or exclusion in the final stages of a clinical trial. Being able to identify patients who will benefit most from a clinical trial will increase the safety of patients from adverse outcomes caused by the product in testing, and will allow smaller and faster trials that lead to lower overall costs. In addition, drugs that are deemed ineffective for the larger population can gain approval by the FDA by using personal genomes to qualify the effectiveness and need for that specific drug or therapy even though it may only be needed by a small percentage of the population.

Today in medicine, it is common that physicians often use a trial and error strategy until they find the treatment therapy that is most effective for their patient. With personalised medicine, these treatments can be more specifically tailored to an individual and give insight into how their body will respond to the drug and if that drug will work based on their genome. The personal genotype can allow physicians to have more detailed information that will guide them in their decision in treatment prescriptions, which will be more cost-effective and accurate. As quoted from the article Pharmacogenomics: The Promise of Personalised Medicine, “therapy with the right drug at the right dose in the right patient” is a description of how personalized medicine will affect the future of treatment. For instance, tamoxifen used to be a drug commonly prescribed to women with ER+ breast cancer, but 65% of women initially taking it developed resistance. After some research by people such as David Flockhart, it was discovered that women with certain mutation in their CYP2D6 gene, a gene that encodes the metabolizing enzyme, were not able to efficiently break down Tamoxifen, making it an ineffective treatment for their cancer. Since then, women are now genotyped for those specific mutations, so that immediately these women can have the most effective treatment therapy.
Screening for these mutations is carried out via high-throughput screening or phenotypic screening. Several drug discovery and pharmaceutical companies are currently utilizing these technologies to not only advance the study of personalised medicine, but also to amplify genetic research; these companies include Alacris Theranostics, Persomics, Flatiron Health, Novartis, OncoDNA and Foundation Medicine, among others. Alternative multi-target approaches to the traditional approach of "forward" transfection library screening can entail reverse transfection or chemogenomics.

Pharmacy compounding is yet another application of personalised medicine. Though not necessarily utilizing genetic information, the customized production of a drug whose various properties (e.g. dose level, ingredient selection, route of administration, etc.) are selected and crafted for an individual patient is accepted as an area of personalised medicine (in contrast to mass-produced unit doses or fixed-dose combinations).

Respiratory Proteomics

Respiratory diseases affect humanity globally, with chronic lung diseases (e.g., asthma, chronic obstructive pulmonary disease, idiopathic pulmonary fibrosis, among others)and lung cancer causing extensive morbidity and mortality. These conditions are highly heterogeneous and require an early diagnosis. However, initial symptoms are nonspecific,and the clinical diagnosis is made late frequently. Over the last few years, personalized medicine has emerged as a medical care approach that uses novel technology aiming to personalize treatments according to the particular patient's medical needs.

Cancer genomics

Over recent decades cancer research has discovered a great deal about the genetic variety of types of cancer that appear the same in traditional pathology. There has also been increasing awareness of tumour heterogeneity, or genetic diversity within a single tumour. Among other prospects, these discoveries raise the possibility of finding that drugs that have not given good results applied to a general population of cases may yet be successful for a proportion of cases with particular genetic profiles.

Cancer Genomics, or “Oncogenomics,” is the application of genomics and personalized medicine to cancer research and treatment. High-throughput sequencing methods are used to characterize genes associated with cancer to better understand disease pathology and improve drug development. Oncogenomics is one of the most promising branches of genomics, particularly because of its implications in drug therapy. Examples of this include:

Trastuzumab (trade names Herclon, Herceptin) is a monoclonal antibody drug that interferes with the HER2/neu receptor. Its main use is to treat certain breast cancers. This drug is only used if a patient's cancer is tested for over-expression of the HER2/neu receptor. Two tissue-typing tests are used to screen patients for possible benefit from Herceptin treatment. The tissue tests are immunohistochemistry(IHC) and Fluorescence In Situ Hybridization(FISH) Only Her2+ patients will be treated with Herceptin therapy (trastuzumab)
Tyrosine kinase inhibitors such as imatinib (marketed as Gleevec) have been developed to treat chronic myeloid leukemia (CML), in which the BCR-ABL fusion gene (the product of a reciprocal translocation between chromosome 9 and chromosome 22) is present in >95% of cases and produces hyperactivated abl-driven protein signaling. These medications specifically inhibit the Ableson tyrosine kinase (ABL) protein and are thus a prime example of "rational drug design" based on knowledge of disease pathophysiology.

Challenges

As personalised medicine is practiced more widely, a number of challenges arise. The current approaches to intellectual property rights, reimbursement policies, patient privacy and confidentiality as well as regulatory oversight will have to be redefined and restructured to accommodate the changes personalised medicine will bring to healthcare. Furthermore, the analysis of acquired diagnostic data is a recent challenge of personalized medicine and its adoption. For example, genetic data obtained from next-generation sequencing requires computer-intensive data processing prior to its analysis. In the future, adequate tools will be required to accelerate the adoption of personalised medicine to further fields of medicine, which requires the interdisciplinary cooperation of experts from specific fields of research, such as medicine, clinical oncology, biology, and artificial intelligence.

Regulatory oversight

The FDA has already started to take initiatives to integrate personalised medicine into their regulatory policies. An FDA report in October 2013 entitled, “Paving the Way for Personalized Medicine: FDA’s role in a New Era of Medical Product Development,” in which they outlined steps they would have to take to integrate genetic and biomarker information for clinical use and drug development. They determined that they would have to develop specific regulatory science standards, research methods, reference material and other tools in order to incorporate personalised medicine into their current regulatory practices. For example, they are working on a “genomic reference library” for regulatory agencies to compare and test the validity of different sequencing platforms in an effort to uphold reliability.

Intellectual property rights

As with any innovation in medicine, investment and interest in personalised medicine is influenced by intellectual property rights. There has been a lot of controversy regarding patent protection for diagnostic tools, genes, and biomarkers. In June 2013, the U.S Supreme Court ruled that natural occurring genes cannot be patented, while “synthetic DNA” that is edited or artificially- created can still be patented. The Patent Office is currently reviewing a number of issues related to patent laws for personalised medicine, such as whether “confirmatory” secondary genetic tests post initial diagnosis, can have full immunity from patent laws. Those who oppose patents argue that patents on DNA sequences are an impediment to ongoing research while proponents point to research exemption and stress that patents are necessary to entice and protect the financial investments required for commercial research and the development and advancement of services offered.

Supply chain

There can be significant challenges in producing and delivering personalised medicine to patients from a supply chain perspective. Historically, pharmaceutical manufacturing is done in large batch production. For many of the personalised medicine therapies, the technology is single-use requiring flexible and small batch sized production. Also, because these therapies are individual, there is no room for error in production, storage, and transportation of these medicines. Controls need to be put into the supply chain to eliminate the potential for errors - an expensive proposition for pharmaceutical companies. In addition to the impact on the production process, the extended supply chain will also be significantly impacted. Third-party providers, like logistics services, will need to get closer to the patient to coordinate and schedule pickup and delivery from apheresis sites.

Reimbursement policies

Reimbursement policies will have to be redefined to fit the changes that personalised medicine will bring to the healthcare system. Some of the factors that should be considered are the level of efficacy of various genetic tests in the general population, cost-effectiveness relative to benefits, how to deal with payment systems for extremely rare conditions, and how to redefine the insurance concept of “shared risk” to incorporate the effect of the newer concept of “individual risk factors".

Patient privacy and confidentiality

Perhaps the most critical issue with the commercialization of personalised medicine is the protection of patients. One of the largest issues is the fear and potential consequences for patients who are predisposed after genetic testing or found to be non-responsive towards certain treatments. This includes the psychological effects on patients due to genetic testing results. The right of family members who do not directly consent is another issue, considering that genetic predispositions and risks are inheritable. The implications for certain ethnic groups and presence of a common allele would also have to be considered. In 2008, the Genetic Information Nondiscrimination Act (GINA) was passed in an effort to minimize the fear of patients participating in genetic research by ensuring that their genetic information will not be misused by employers or insurers. On February 19, 2015 FDA issued a press release titled: "FDA permits marketing of first direct-to-consumer genetic carrier test for Bloom syndrome.

Wednesday, October 24, 2018

Whole genome sequencing

From Wikipedia, the free encyclopedia

Electropherograms are commonly used to sequence portions of genomes.

An image of the 46 chromosomes, making up the diploid genome of human male. (The mitochondrial chromosome is not shown.)

Whole genome sequencing (also known as WGS, full genome sequencing, complete genome sequencing, or entire genome sequencing) is ostensibly the process of determining the complete DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and, for plants, in the chloroplast. In practice, genome sequences that are nearly complete are also called whole genome sequences.

Whole genome sequencing has largely been used as a research tool, but is currently being introduced to clinics. In the future of personalized medicine, whole genome sequence data will be an important tool to guide therapeutic intervention. The tool of gene sequencing at SNP level is also used to pinpoint functional variants from association studies and improve the knowledge available to researchers interested in evolutionary biology, and hence may lay the foundation for predicting disease susceptibility and drug response.

Whole genome sequencing should not be confused with DNA profiling, which only determines the likelihood that genetic material came from a particular individual or group, and does not contain additional information on genetic relationships, origin or susceptibility to specific diseases. In addition, whole genome sequencing should not be confused with methods that sequence specific subsets of the genome - such methods include whole exome sequencing (1% of the genome) or SNP genotyping (less than 0.1% of the genome).

As of 2017 there were no complete genomes for any mammals, including humans. Between 4% to 9% of the human genome, mostly satellite DNA, had not been sequenced.

History

The first whole genome to be sequenced was of the bacterium Haemophilus influenzae.

The worm Caenorhabditis elegans was the first animal to have its whole genome sequenced.

Drosophila melanogaster's whole genome was sequenced in 2000.

Arabidopsis thaliana was the first plant genome sequenced.

The genome of the lab mouse Mus musculus was published in 2002.

It took 10 years and 50 scientists spanning the globe to sequence the genome of Elaeis guineensis (oil palm). This genome was particularly difficult to sequence because it had many repeated sequences which are difficult to organise.

The DNA sequencing methods used in the 1970s and 1980s were manual, for example Maxam-Gilbert sequencing and Sanger sequencing. The shift to more rapid, automated sequencing methods in the 1990s finally allowed for sequencing of whole genomes.

The first organism to have its entire genome sequenced was Haemophilus influenzae in 1995. After it, the genomes of other bacteria and some archaea were first sequenced, largely due to their small genome size. H. influenzae has a genome of 1,830,140 base pairs of DNA. In contrast, eukaryotes, both unicellular and multicellular such as Amoeba dubia and humans (Homo sapiens) respectively, have much larger genomes (see C-value paradox). Amoeba dubia has a genome of 700 billion nucleotide pairs spread across thousands of chromosomes. Humans contain fewer nucleotide pairs (about 3.2 billion in each germ cell - note the exact size of the human genome is still being revised) than A. dubia however their genome size far outweighs the genome size of individual bacteria.

The first bacterial and archaeal genomes, including that of H. influenzae, were sequenced by Shotgun sequencing. In 1996 the first eukaryotic genome (Saccharomyces cerevisiae) was sequenced. S. cerevisiae, a model organism in biology has a genome of only around 12 million nucleotide pairs, and was the first unicellular eukaryote to have its whole genome sequenced. The first multicellular eukaryote, and animal, to have its whole genome sequenced was the nematode worm: Caenorhabditis elegans in 1998. Eukaryotic genomes are sequenced by several methods including Shotgun sequencing of short DNA fragments and sequencing of larger DNA clones from DNA libraries such as bacterial artificial chromosomes (BACs) and yeast artificial chromosomes (YACs).

In 1999, the entire DNA sequence of human chromosome 22, the shortest human autosome, was published. By the year 2000, the second animal and second invertebrate (yet first insect) genome was sequenced - that of the fruit fly Drosophila melanogaster - a popular choice of model organism in experimental research. The first plant genome - that of the model organism Arabidopsis thaliana - was also fully sequenced by 2000. By 2001, a draft of the entire human genome sequence was published. The genome of the laboratory mouse Mus musculus was completed in 2002.

In 2004, the Human Genome Project published an incomplete version of the human genome.

Currently thousands of genomes have been wholly or partially sequenced.

Experimental details

Cells used for sequencing

Almost any biological sample containing a full copy of the DNA—even a very small amount of DNA or ancient DNA—can provide the genetic material necessary for full genome sequencing. Such samples may include saliva, epithelial cells, bone marrow, hair (as long as the hair contains a hair follicle), seeds, plant leaves, or anything else that has DNA-containing cells.

The genome sequence of a single cell selected from a mixed population of cells can be determined using techniques of single cell genome sequencing. This has important advantages in environmental microbiology in cases where a single cell of a particular microorganism species can be isolated from a mixed population by microscopy on the basis of its morphological or other distinguishing characteristics. In such cases the normally necessary steps of isolation and growth of the organism in culture may be omitted, thus allowing the sequencing of a much greater spectrum of organism genomes.

Single cell genome sequencing is being tested as a method of preimplantation genetic diagnosis, wherein a cell from the embryo created by in vitro fertilization is taken and analyzed before embryo transfer into the uterus. After implantation, cell-free fetal DNA can be taken by simple venipuncture from the mother and used for whole genome sequencing of the fetus.

Early techniques

An ABI PRISM 3100 Genetic Analyzer. Such capillary sequencers automated the early efforts of sequencing genomes.

Sequencing of nearly an entire human genome was first accomplished in 2000 partly through the use of shotgun sequencing technology. While full genome shotgun sequencing for small (4000–7000 base pair) genomes was already in use in 1979, broader application benefited from pairwise end sequencing, known colloquially as double-barrel shotgun sequencing. As sequencing projects began to take on longer and more complicated genomes, multiple groups began to realize that useful information could be obtained by sequencing both ends of a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two sequences were oriented in opposite directions and were about the length of a fragment apart from each other was valuable in reconstructing the sequence of the original target fragment.

The first published description of the use of paired ends was in 1990 as part of the sequencing of the human HPRT locus, although the use of paired ends was limited to closing gaps after the application of a traditional shotgun sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming fragments of constant length, was in 1991. In 1995 the innovation of using fragments of varying sizes was introduced, and demonstrated that a pure pairwise end-sequencing strategy would be possible on large targets. The strategy was subsequently adopted by The Institute for Genomic Research (TIGR) to sequence the entire genome of the bacterium Haemophilus influenzae in 1995, and then by Celera Genomics to sequence the entire fruit fly genome in 2000, and subsequently the entire human genome. Applied Biosystems, now called Life Technologies, manufactured the automated capillary sequencers utilized by both Celera Genomics and The Human Genome Project.

Current techniques

While capillary sequencing was the first approach to successfully sequence a nearly full human genome, it is still too expensive and takes too long for commercial purposes. Since 2005 capillary sequencing has been progressively displaced by high-throughput (formerly "next-generation") sequencing technologies such as Illumina dye sequencing, pyrosequencing, and SMRT sequencing. All of these technologies continue to employ the basic shotgun strategy, namely, parallelization and template generation via genome fragmentation.

Other technologies are emerging, including nanopore technology. Though nanopore sequencing technology is still being refined, its portability and potential capability of generating long reads are of relevance to whole-genome sequencing applications.

Analysis

In principle, full genome sequencing can provide the raw nucleotide sequence of an individual organism's DNA. However, further analysis must be performed to provide the biological or medical meaning of this sequence, such as how this knowledge can be used to help prevent disease. Methods for analysing sequencing data are being developed and refined.

Because sequencing generates a lot of data (for example, there are approximately six billion base pairs in each human diploid genome), its output is stored electronically and requires a large amount of computing power and storage capacity.

While analysis of WGS data can be slow, it is possible to speed up this step by using dedicated hardware.

Commercialization

Total cost of sequencing a whole human genome as calculated by the NHGRI.

A number of public and private companies are competing to develop a full genome sequencing platform that is commercially robust for both research and clinical use, including Illumina, Knome, Sequenom, 454 Life Sciences, Pacific Biosciences, Complete Genomics, Helicos Biosciences, GE Global Research (General Electric), Affymetrix, IBM, Intelligent Bio-Systems, Life Technologies and Oxford Nanopore Technologies. These companies are heavily financed and backed by venture capitalists, hedge funds, and investment banks.

A commonly-referenced commercial target for sequencing cost is the $1,000 genome.

Incentive

In October 2006, the X Prize Foundation, working in collaboration with the J. Craig Venter Science Foundation, established the Archon X Prize for Genomics, intending to award $10 million to "the first team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 1,000,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $1,000 per genome".

The Archon X Prize for Genomics was cancelled in 2013, before its official start date.

History

In 2007, Applied Biosystems started selling a new type of sequencer called SOLiD System. The technology allowed users to sequence 60 gigabases per run.

In June 2009, Illumina announced that they were launching their own Personal Full Genome Sequencing Service at a depth of 30× for $48,000 per genome.

In August 2009, the founder of Helicos Biosciences, Stephen Quake, stated that using the company's Single Molecule Sequencer he sequenced his own full genome for less than $50,000.

In November 2009, Complete Genomics published a peer-reviewed paper in Science demonstrating its ability to sequence a complete human genome for $1,700.

In May 2011, Illumina lowered its Full Genome Sequencing service to $5,000 per human genome, or $4,000 if ordering 50 or more. Helicos Biosciences, Pacific Biosciences, Complete Genomics, Illumina, Sequenom, ION Torrent Systems, Halcyon Molecular, NABsys, IBM, and GE Global appear to all be going head to head in the race to commercialize full genome sequencing.

With sequencing costs declining, a number of companies began claiming that their equipment would soon achieve the $1,000 genome: these companies included Life Technologies in January 2012, Oxford Nanopore Technologies in February 2012 and Illumina in February 2014. As of 2015, the NHGRI estimates the cost of obtaining a whole-genome sequence at around $1,500.

In 2016, Veritas Corp. began selling whole gene sequencing, including a report as to some of the information in the sequencing for $999. Effective use of whole gene sequencing can cost considerably more. Note, also, that there remain parts of the human genome that have not been fully sequenced.

Comparison with other technologies

DNA microarrays

Full genome sequencing provides information on a genome that is orders of magnitude larger than by DNA arrays, the previous leader in genotyping technology.

For humans, DNA arrays currently provide genotypic information on up to one million genetic variants, while full genome sequencing will provide information on all six billion bases in the human genome, or 3,000 times more data. Because of this, full genome sequencing is considered a disruptive innovation to the DNA array markets as the accuracy of both range from 99.98% to 99.999% (in non-repetitive DNA regions) and their consumables cost of $5000 per 6 billion base pairs is competitive (for some applications) with DNA arrays ($500 per 1 million basepairs).

Applications

Mutation frequencies

Whole genome sequencing has established the mutation frequency for whole human genomes. The mutation frequency in the whole genome between generations for humans (parent to child) is about 70 new mutations per generation. An even lower level of variation was found comparing whole genome sequencing in blood cells for a pair of monozygotic (identical twins) 100-year-old centenarians. Only 8 somatic differences were found, though somatic variation occurring in less than 20% of blood cells would be undetected.

In the specifically protein coding regions of the human genome, it is estimated that there are about 0.35 mutations that would change the protein sequence between parent/child generations (less than one mutated protein per generation).

In cancer, mutation frequencies are much higher, due to genome instability. This frequency can further depend on patient age, exposure to DNA damaging agents (such as UV-irradiation or components of tobacco smoke) and the activity/inactivity of DNA repair mechanisms. Furthermore, mutation frequency can vary between cancer types: in germline cells, mutation rates occur at approximately 0.023 mutations per megabase, but this number is much higher in breast cancer (1.18-1.66 somatic mutations per Mb), in lung cancer (17.7) or in melanomas (~33). Since the haploid human genome consists of approximately 3,200 megabases, this translates into about 74 mutations (mostly in noncoding regions) in germline DNA per generation, but 3,776-5,312 somatic mutations per haploid genome in breast cancer, 56,640 in lung cancer and 105,600 in melanomas.
The distribution of somatic mutations across the human genome is very uneven, such that the gene-rich, early-replicating regions receive fewer mutations than gene-poor, late-replicating heterochromatin, likely due to differential DNA repair activity. In particular, the histone modification H3K9me3 is associated with high, and H3K36me3 with low mutation frequencies.

Genome-wide association studies

In research, whole-genome sequencing can be used in a Genome-Wide Association Study (GWAS) - a project aiming to determine the genetic variant or variants associated with a disease or some other phenotype.

Diagnostic use

In 2009, Illumina released its first whole genome sequencers that were approved for clinical as opposed to research-only use and doctors at academic medical centers began quietly using them to try to diagnose what was wrong with people whom standard approaches had failed to help. The price to sequence a genome at that time was US$19,500, which was billed to the patient but usually paid for out of a research grant; one person at that time had applied for reimbursement from their insurance company. For example, one child had needed around 100 surgeries by the time he was three years old, and his doctor turned to whole genome sequencing to determine the problem; it took a team of around 30 people that included 12 bioinformatics experts, three sequencing technicians, five physicians, two genetic counsellors and two ethicists to identify a rare mutation in the XIAP that was causing widespread problems.

Currently available newborn screening for childhood diseases allows detection of rare disorders that can be prevented or better treated by early detection and intervention. Specific genetic tests are also available to determine an etiology when a child's symptoms appear to have a genetic basis. Full genome sequencing, in addition has the potential to reveal a large amount of information (such as carrier status for autosomal recessive disorders, genetic risk factors for complex adult-onset diseases, and other predictive medical and non-medical information) that is currently not completely understood, may not be clinically useful to the child during childhood, and may not necessarily be wanted by the individual upon reaching adulthood.

Due to recent cost reductions (see above) whole genome sequencing has become a realistic application in DNA diagnostics. In 2013, the 3Gb-TEST consortium obtained funding from the European Union to prepare the health care system for these innovations in DNA diagnostics. Quality assessment schemes, Health technology assessment and guidelines have to be in place. The 3Gb-TEST consortium has identified the analysis and interpretation of sequence data as the most complicated step in the diagnostic process. At the Consortium meeting in Athens in September 2014, the Consortium coined the word genotranslation for this crucial step. This step leads to a so-called genoreport. Guidelines are needed to determine the required content of these reports.

Genomes2People (G2P), an initiative of Brigham and Women's Hospital and Harvard Medical School was created in 2011 to examine the integration of genomic sequencing into clinical care of adults and children. G2P's director, Robert C. Green, had previously led the REVEAL study — Risk Evaluation and Education for Alzheimer’s Disease – a series of clinical trials exploring patient reactions to the knowledge of their genetic risk for Alzheimer’s.

Ethical concerns

The introduction of whole genome sequencing may have ethical implications. On one hand, genetic testing can potentially diagnose preventable diseases, both in the individual undergoing genetic testing and in their relatives. On the other hand, genetic testing has potential downsides such as genetic discrimination, loss of anonymity, and psychological impacts such as discovery of non-paternity.

Some ethicists insist that the privacy of individuals undergoing genetic testing must be protected. Indeed, privacy issues can be of particular concern when minors undergo genetic testing. Illumina's CEO, Jay Flatley, claimed in February 2009 that "by 2019 it will have become routine to map infants' genes when they are born". This potential use of genome sequencing is highly controversial, as it runs counter to established ethical norms for predictive genetic testing of asymptomatic minors that have been well established in the fields of medical genetics and genetic counseling. The traditional guidelines for genetic testing have been developed over the course of several decades since it first became possible to test for genetic markers associated with disease, prior to the advent of cost-effective, comprehensive genetic screening.

When an individual undergoes whole genome sequencing, they reveal information about not only their own DNA sequences, but also about probable DNA sequences of their close genetic relatives. This information can further reveal useful predictive information about relatives' present and future health risks. Hence, there are important questions about what obligations, if any, are owed to the family members of the individuals who are undergoing genetic testing. In Western/European society, tested individuals are usually encouraged to share important information on any genetic diagnoses with their close relatives, since the importance of the genetic diagnosis for offspring and other close relatives is usually one of the reasons for seeking a genetic testing in the first place. Nevertheless, a major ethical dilemma can develop when the patients refuse to share information on a diagnosis that is made for serious genetic disorder that is highly preventable and where there is a high risk to relatives carrying the same disease mutation. Under such circumstances, the clinician may suspect that the relatives would rather know of the diagnosis and hence the clinician can face a conflict of interest with respect to patient-doctor confidentiality.

Privacy concerns can also arise when whole genome sequencing is used in scientific research studies. Researchers often need to put information on patient's genotypes and phenotypes into public scientific databases, such as locus specific databases. Although only anonymous patient data are submitted to locus specific databases, patients might still be identifiable by their relatives in the case of finding a rare disease or a rare missense mutation.

People with public genome sequences

The first nearly complete human genomes sequenced were two Americans of predominantly Northwestern European ancestry in 2007 (J. Craig Venter at 7.5-fold coverage, and James Watson at 7.4-fold). This was followed in 2008 by sequencing of an anonymous Han Chinese man (at 36-fold), a Yoruban man from Nigeria (at 30-fold), and a female caucasian Leukemia patient (at 33 and 14-fold coverage for tumor and normal tissues). Steve Jobs was among the first 20 people to have their whole genome sequenced, reportedly for the cost of $100,000. As of June 2012, there were 69 nearly complete human genomes publicly available. In November 2013, a Spanish family made their personal genomics data publicly available under a Creative Commons public domain license. The work was led by Manuel Corpas and the data obtained by direct-to-consumer genetic testing with 23andMe and the Beijing Genomics Institute). This is believed to be the first such public genomics dataset for a whole family.

Search This Blog

Thursday, October 25, 2018

Computational phylogenetics

Types of phylogenetic trees and networks

Coding characters and defining homology

Morphological analysis

Molecular analysis

Distance-matrix methods

UPGMA and WPGMA

Neighbor-joining

Fitch-Margoliash method

Using outgroups

Maximum parsimony

Branch and bound

Sankoff-Morel-Cedergren algorithm

MALIGN and POY

Maximum likelihood

Bayesian inference

Model selection

Types of models

Choosing the best model

Evaluating tree support

Nodal support

Consensus tree

Bootstrapping and jackknifing

Posterior probability

Step counting methods

Shortcomings

Limitations and workarounds

Homoplasy

Horizontal gene transfer

Hybrids, speciation, introgressions and incomplete lineage sorting

Taxon sampling

Phylogenetic signal

Continuous characters

Missing data

The role of fossils

Computational biology

Introduction

Subfields

Computational anatomy

Computational biomodeling

Computational genomics (Computational genetics)

Computational neuroscience

Computational pharmacology

Computational evolutionary biology

Cancer computational biology

Computational neuropsychiatry

Software and tools

Open source software

Conferences

Journals

Related fields

Personalized medicine

Development of concept

Background

Basics

Method

Disease risk assessment

Applications

Diagnosis and intervention

Drug development and usage

Respiratory Proteomics

Cancer genomics

Challenges

Regulatory oversight

Intellectual property rights

Supply chain

Reimbursement policies

Patient privacy and confidentiality

Wednesday, October 24, 2018

Whole genome sequencing

History

Experimental details

Cells used for sequencing

Early techniques

Current techniques

Analysis

Commercialization

Incentive