The tips of a phylogenetic tree can be living taxa or fossils,
which represent the present time or "end" of an evolutionary lineage,
respectively. A phylogenetic diagram can be rooted or unrooted. A rooted
tree diagram indicates the hypothetical common ancestor
of the tree. An unrooted tree diagram (a network) makes no assumption
about the ancestral line, and does not show the origin or "root" of the
taxa in question or the direction of inferred evolutionary
transformations.
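The difference between rooted and unrooted diagrams can be illustrated with a small sketch. It assumes rooted trees are written as nested Python tuples and reduces each tree to its set of bipartitions (splits), which is the information an unrooted tree retains. The helper names `tips` and `splits` and the taxa A to D are hypothetical choices for this example, not part of any standard library.

```python
def tips(tree):
    """Return the set of tip labels below a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of the tips implied by a rooted tree.

    Each split is an unordered pair of tip sets, so the position of the
    root plays no part in the result.
    """
    all_tips = tips(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        for child in node:
            side = tips(child)
            other = all_tips - side
            # Splits that isolate a single tip occur in every tree; skip them.
            if len(side) > 1 and len(other) > 1:
                found.add(frozenset([side, other]))
            visit(child)

    visit(tree)
    return found


# Two different rootings of the same unrooted tree (hypothetical taxa).
rooted_1 = (("A", "B"), ("C", "D"))
rooted_2 = ("A", ("B", ("C", "D")))
# A tree with a genuinely different branching pattern.
other_tree = (("A", "C"), ("B", "D"))

print(splits(rooted_1) == splits(rooted_2))    # same unrooted tree
print(splits(rooted_1) == splits(other_tree))  # different unrooted tree
```

Both rootings reduce to the single split separating A and B from C and D, so they describe the same unrooted tree even though they propose different ancestors.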
In addition to their use for inferring phylogenetic patterns
among taxa, phylogenetic analyses are often employed to represent
relationships among genes or individual organisms. Such uses have become
central to understanding biodiversity, evolution, ecology, and genomes.
Phylogenetics is a component of systematics
that uses similarities and differences of the characteristics of
species to interpret their evolutionary relationships and origins.
Phylogenetics focuses on whether the characteristics of a species
reinforce a phylogenetic inference that it diverged from the most recent
common ancestor of a taxonomic group.
In the field of cancer research, phylogenetics can be used to study the clonal evolution of tumors and molecular chronology, predicting and showing how cell populations vary throughout the progression of the disease and during treatment, using whole genome sequencing techniques.
The evolutionary processes behind cancer progression are quite
different from those in most species and are important to phylogenetic
inference; these differences manifest in several areas: the types of
aberrations that occur, the rates of mutation, the high heterogeneity (variability) of tumor cell subclones, and the absence of genetic recombination.
Phylogenetics can also aid in drug design
and discovery. Phylogenetics allows scientists to organize species and
can show which species are likely to have inherited particular traits
that are medically useful, such as producing biologically active
compounds - those that have effects on the human body. For example, in
drug discovery, venom-producing animals are particularly useful. Venoms from these animals produce several important drugs, e.g., ACE inhibitors and Prialt (Ziconotide).
To find new venoms, scientists turn to phylogenetics to screen for
closely related species that may have the same useful traits. The
phylogenetic tree shows which species of fish
have an origin of venom and which related fish may also carry the trait.
Using this approach in studying venomous fish, biologists are able to
identify the fish species that may be venomous. Biologists have used this
approach in many groups, such as snakes and lizards.
In forensic science,
phylogenetic tools are useful to assess DNA evidence for court cases.
The simple phylogenetic tree of viruses A-E shows the relationships
between the viruses; e.g., all viruses are descendants of Virus A.
HIV forensics uses phylogenetic analysis to track the differences in HIV
genes and determine the relatedness of two samples. Phylogenetic
analysis has been used in criminal trials to exonerate or hold
individuals. HIV forensics has limitations: it cannot be
the sole proof of transmission between individuals, and phylogenetic
analysis that shows the relatedness of samples does not indicate the
direction of transmission.
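As a rough illustration of what "relatedness of two samples" means computationally, the sketch below calculates the proportion of differing sites (p-distance) between aligned sequences. The sequences are invented toy strings and `p_distance` is a hypothetical helper; real forensic analyses rely on substitution models and full tree inference rather than this simple measure.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ.

    Sites where either sequence has a gap ("-") are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Invented toy sequences, not real HIV data.
sample_1 = "ACGTACGTAC"
sample_2 = "ACGTACGTAT"
sample_3 = "ACTTACCTGT"

print(p_distance(sample_1, sample_2))  # 1 of 10 sites differ
print(p_distance(sample_1, sample_3))  # 4 of 10 sites differ
```

The measure is symmetric: `p_distance(a, b)` equals `p_distance(b, a)`. This mirrors the limitation noted above, since relatedness by itself says nothing about who infected whom.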
One small clade of fish, showing how venom has evolved multiple times.
Taxonomy is the identification, naming, and classification of organisms. Compared to systemization, classification emphasizes whether a species has characteristics of a taxonomic group. The Linnaean classification system developed in the 1700s by Carolus Linnaeus
is the foundation for modern classification methods. Linnaean
classification relies on an organism's phenotype or physical
characteristics to group and organize species. With the emergence of biochemistry, organism classifications are now usually based on phylogenetic data, and many systematists contend that only monophyletic taxa should be recognized as named groups. The degree to which classification depends on inferred evolutionary history differs depending on the school of taxonomy: phenetics ignores phylogenetic speculation altogether, trying to represent the similarity between organisms instead; cladistics
(phylogenetic systematics) tries to reflect phylogeny in its
classifications by only recognizing groups based on shared, derived
characters (synapomorphies); evolutionary taxonomy tries to take into account both the branching pattern and "degree of difference" to find a compromise between them.
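The requirement that named groups be monophyletic can be stated as a simple test: a group is monophyletic on a given rooted tree if some node has exactly that group as its set of descendant tips. The sketch below assumes trees written as nested tuples; the function names are hypothetical, and the five-taxon tree is a deliberately simplified illustration rather than a complete phylogeny.

```python
def tip_set(node):
    """Return the set of tip labels below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    result = frozenset()
    for child in node:
        result |= tip_set(child)
    return result


def clades(tree):
    """Yield the tip set of every node; each one is a clade of the tree."""
    yield tip_set(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if the group is exactly the set of tips descending from one node."""
    return frozenset(group) in set(clades(tree))


# Simplified illustrative tree (turtles and many other groups omitted).
tree = ((("lizard", "snake"), ("crocodile", "bird")), "mammal")

print(is_monophyletic(tree, {"crocodile", "bird"}))             # True
print(is_monophyletic(tree, {"lizard", "snake", "crocodile"}))  # False
```

On this tree, a group containing lizards, snakes, and crocodiles but excluding birds fails the test, which is the kind of grouping that cladistics declines to name.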
Phenetics, popular in the mid-20th century but now largely obsolete, used distance matrix-based methods to construct trees based on overall similarity in morphology or similar observable traits (i.e. in the phenotype or the overall similarity of DNA, not the DNA sequence), which was often assumed to approximate phylogenetic relationships.
Prior to 1950, phylogenetic inferences were generally presented as narrative scenarios. Such methods are often ambiguous and lack explicit criteria for evaluating alternative hypotheses.
Impacts of taxon sampling
In phylogenetic analysis, taxon sampling selects a small group of taxa to
represent the evolutionary history of a broader population. This process is also known as stratified sampling or clade-based sampling. The practice is necessary because the resources available to compare and analyze every species within a target population are limited.
Based on the representative group selected, the construction and
accuracy of phylogenetic trees vary, which impacts derived phylogenetic
inferences.
Unavailable datasets, such as an organism's incomplete DNA and protein amino acid sequences in genomic databases, directly restrict taxonomic sampling. Consequently, inadequate taxon sampling is a significant source of error
in phylogenetic analysis.
Accuracy may be improved by increasing the number of genetic samples
within the monophyletic group of interest. Conversely, increasing sampling from
outgroups extraneous to the target stratified population may decrease
accuracy. Long branch attraction
is one proposed explanation for this effect: unrelated branches
are incorrectly grouped together, implying a shared evolutionary
history.
Percentage of inter-ordinal branches reconstructed with a constant number of bases
and four phylogenetic tree construction models: neighbor-joining (NJ),
minimum evolution (ME), unweighted maximum parsimony (MP), and maximum
likelihood (ML). The figure demonstrates that phylogenetic analysis with fewer taxa and
more genes per taxon matches the replicable consensus
tree more often. The dotted line marks equal accuracy between
the two taxon sampling methods. Figure is property of Michael S.
Rosenberg and Sudhir Kumar as presented in the journal article Taxon Sampling, Bioinformatics, and Phylogenomics.
It is debated whether increasing the number of taxa sampled improves
phylogenetic accuracy more than increasing the number of genes sampled
per taxon. The two sampling strategies differ in the number of
nucleotide sites used in a sequence alignment, which may contribute
to the disagreement. For example, phylogenetic trees constructed using a
larger number of total nucleotides are generally more
accurate, as supported by the bootstrap
replicability of the trees under random resampling.
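Bootstrap replicability can be sketched in a few lines: alignment columns are resampled with replacement, the analysis is repeated on each pseudo-replicate, and the fraction of replicates recovering a given grouping is reported as its support. In the sketch below, `closest_pair` is a deliberately crude stand-in for tree inference, and the alignment, taxon names, and function names are all hypothetical.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement, keeping the original length."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


def closest_pair(alignment):
    """Stand-in for tree inference: the two taxa with the fewest differences.

    Ties are broken by alphabetical order of the taxon names.
    """
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    best = min(
        combinations(sorted(alignment), 2),
        key=lambda pair: hamming(alignment[pair[0]], alignment[pair[1]]),
    )
    return frozenset(best)


def bootstrap_support(alignment, pair, replicates=200, seed=1):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    target = frozenset(pair)
    hits = sum(
        closest_pair(resample_columns(alignment, rng)) == target
        for _ in range(replicates)
    )
    return hits / replicates


# Invented toy alignment of four taxa and 16 sites.
toy_alignment = {
    "taxon_A": "ACGTACGTACGTACGT",
    "taxon_B": "ACGTACGTACGTACGA",
    "taxon_C": "TCGAACGGACTTACCA",
    "taxon_D": "TCGAACGGACTAAGCA",
}

support = bootstrap_support(toy_alignment, {"taxon_A", "taxon_B"})
print(f"bootstrap support for A+B: {support:.2f}")
```

With more sites in the alignment, the resampled datasets vary less from one another, which is one intuition for why trees built from more total nucleotides tend to replicate better.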
The graphic presented in Taxon Sampling, Bioinformatics, and Phylogenomics
compares the correctness of phylogenetic trees generated using fewer
taxa and more sites per taxon on the x-axis to more taxa and fewer sites
per taxon on the y-axis. With fewer taxa, more genes are sampled
within the taxonomic group; in comparison, as more taxa are added to the
taxonomic sampling group, fewer genes are sampled. Each method has the
same total number of nucleotide sites sampled. The dotted
line represents a 1:1 accuracy between the two sampling methods. As seen
in the graphic, most of the plotted points are located below the dotted
line, which indicates a tendency toward increased accuracy when
sampling fewer taxa with more sites per taxon. The research
uses four different phylogenetic tree construction models to test
the theory: neighbor-joining (NJ), minimum evolution (ME), unweighted
maximum parsimony (MP), and maximum likelihood (ML). In the majority of
models, sampling fewer taxa with more sites per taxon demonstrated
higher accuracy.
Generally, with the alignment of a relatively equal number of
total nucleotide sites, sampling more genes per taxon has higher
bootstrapping replicability than sampling more taxa. However, unbalanced
datasets within genomic databases make it increasingly difficult to compare
more genes per taxon in uncommonly sampled organisms.
History
Overview
The term "phylogeny" derives from the German Phylogenie, introduced by Haeckel in 1866, and the Darwinian approach to classification became known as the "phyletic" approach. The parsimony principle that underlies much phylogenetic inference can be traced back to Aristotle, who wrote in his Posterior Analytics,
"We may assume the superiority ceteris paribus [other things being
equal] of the demonstration which derives from fewer postulates or
hypotheses."
Ernst Haeckel's recapitulation theory
The modern concept of phylogenetics evolved primarily as a disproof of a
previously widely accepted theory. During the late 19th century, Ernst Haeckel's recapitulation theory, or "biogenetic fundamental law", was widely popular. It was often expressed as "ontogeny
recapitulates phylogeny", i.e. the development of a single organism
during its lifetime, from germ to adult, successively mirrors the adult
stages of successive ancestors of the species to which it belongs. But
this theory has long been rejected. Instead, ontogeny evolves –
the phylogenetic history of a species cannot be read directly from its
ontogeny, as Haeckel thought would be possible, but characters from
ontogeny can be (and have been) used as data for phylogenetic analyses;
the more closely related two species are, the more apomorphies their embryos share.
Timeline of key points
Branching tree diagram from Heinrich Georg Bronn's work (1858)
Phylogenetic tree suggested by Haeckel (1866)
14th century, lex parsimoniae (parsimony principle), William of Ockham, English philosopher, theologian, and Franciscan friar; a precursor concept, although the idea goes back to Aristotle. He introduced the concept of Occam's razor,
which is the problem solving principle that recommends searching for
explanations constructed with the smallest possible set of elements.
Though he did not use these exact words, the principle can be summarized
as "Entities must not be multiplied beyond necessity." The principle
advocates that when presented with competing hypotheses about the same
prediction, one should prefer the one that requires fewest assumptions.
1763, Bayesian probability, Rev. Thomas Bayes,
a precursor concept. Bayesian probability began a resurgence in the
1950s, allowing scientists in the computing field to pair traditional
Bayesian statistics with other more modern techniques. It is now used as
a blanket term for several related interpretations of probability as an
amount of epistemic confidence.
18th century, Pierre Simon (Marquis de Laplace), perhaps first to
use ML (maximum likelihood), precursor concept. His work gave rise to the
Laplace distribution, which can be directly linked to least absolute deviations.
1809, evolutionary theory, Philosophie Zoologique, Jean-Baptiste de Lamarck,
precursor concept, foreshadowed in the 17th and 18th centuries by
Voltaire, Descartes, and Leibniz, with Leibniz even proposing
evolutionary changes to account for observed gaps, suggesting that many
species had become extinct, others transformed, and different species
that share common traits may have at one time been a single race; also foreshadowed by some early Greek philosophers such as Anaximander in the 6th century BC and the atomists of the 5th century BC, who proposed rudimentary theories of evolution.
1837, Darwin's notebooks show an evolutionary tree
1840, American geologist Edward Hitchcock published what is
considered to be the first paleontological "Tree of Life". Many
critiques, modifications, and explanations would follow.
This chart displays one of the first published attempts at a paleontological
"Tree of Life" by geologist Edward Hitchcock (1840).
1843, distinction between homology and analogy (the latter now referred to as homoplasy),
Richard Owen, precursor concept. Homology is the term used to
characterize the similarity of features that can be parsimoniously
explained by common ancestry. Homoplasy is the term used to describe a
feature that has been gained or lost independently in separate lineages
over the course of evolution.
1858, paleontologist Heinrich Georg Bronn (1800–1862) published a
hypothetical tree illustrating the paleontological "arrival" of new,
similar species following the extinction of an older species. Bronn did
not propose a mechanism responsible for such phenomena; precursor
concept.
1858, elaboration of evolutionary theory, Darwin and Wallace, also in Origin of Species by Darwin the following year, precursor concept.
1866, Ernst Haeckel
first published his phylogeny-based evolutionary tree, precursor
concept. Haeckel introduced the now-disproved recapitulation theory. He
introduced the term "Cladus" as a taxonomic category just below
subphylum.
1893, Dollo's Law of Character State Irreversibility,
precursor concept. Dollo's Law of Irreversibility states that "an
organism never comes back exactly to its previous state due to the
indestructible nature of the past, it always retains some trace of the
transitional stages through which it has passed."
1912, ML (maximum likelihood) recommended, analyzed, and popularized by Ronald Fisher,
precursor concept. Fisher is one of the main contributors to the early
20th-century revival of Darwinism, and has been called the "greatest of
Darwin's successors" for his contributions to the revision of the theory
of evolution and his use of mathematics to combine Mendelian genetics and natural selection in the 20th century "modern synthesis".
1921, Tillyard uses the term "phylogenetic" and distinguishes between
archaic and specialized characters in his classification system.
1940, Lucien Cuénot coined the term "clade": "terme nouveau de clade (du grec κλάδος, branche) [A new term clade (from the Greek word klados, meaning branch)]". He used it for evolutionary branching.
1947, Bernhard Rensch introduced the term Kladogenesis in his German book Neuere Probleme der Abstammungslehre Die transspezifische Evolution, translated into English in 1959 as Evolution Above the Species Level (still using the same spelling).
1949, Jackknife resampling, Maurice Quenouille (foreshadowed in '46 by Mahalanobis and extended in '58 by Tukey), precursor concept.
1950, Willi Hennig's classic formalization.
Hennig is considered the founder of phylogenetic systematics, and
published his first works in German in this year. He also asserted a
version of the parsimony principle, stating that the presence of
apomorphous characters in different species 'is always reason for
suspecting kinship, and that their origin by convergence should not be
presumed a priori'. This has been considered a foundational view of phylogenetic inference.
1952, William Wagner's ground plan divergence method.
1957, Julian Huxley adopted Rensch's terminology as "cladogenesis" with a full definition: "Cladogenesis
I have taken over directly from Rensch, to denote all splitting, from
subspeciation through adaptive radiation to the divergence of phyla and
kingdoms." With it he introduced the word "clades", defining it as:
"Cladogenesis results in the formation of delimitable monophyletic
units, which may be called clades."
1963, first attempt to use ML (maximum likelihood) for phylogenetics, Edwards and Cavalli-Sforza.
1965
Camin-Sokal parsimony, first parsimony (optimization) criterion
and first computer program/algorithm for cladistic analysis both by
Camin and Sokal.
Character compatibility method, also called clique analysis, introduced independently by Camin and Sokal (loc. cit.) and E. O. Wilson.
1966
English translation of Hennig.
"Cladistics" and "cladogram" coined (Webster's, loc. cit.)
1969
Dynamic and successive weighting, James Farris.
Wagner parsimony, Kluge and Farris.
CI (consistency index), Kluge and Farris.
Introduction of pairwise compatibility for clique analysis, Le Quesne.
1970, Wagner parsimony generalized by Farris.
1971
First successful application of ML (maximum likelihood) to phylogenetics (for protein sequences), Neyman.
Fitch parsimony, Walter M. Fitch. These gave rise to the most basic ideas of maximum parsimony. Fitch is known for his work on reconstructing phylogenetic trees from protein and DNA sequences. His definition of orthologous sequences has been referenced in many research publications.
NNI (nearest neighbour interchange), first branch-swapping search strategy, developed independently by Robinson and Moore et al.
ME (minimum evolution), Kidd and Sgaramella-Zonta
(it is unclear if this is the pairwise distance method or related to ML
as Edwards and Cavalli-Sforza call ML "minimum evolution").
1980, PHYLIP, first software package for phylogenetic analysis, Joseph Felsenstein. A free computational phylogenetics package of programs for inferring evolutionary trees (phylogenies).
One of its programs, called "drawgram", draws
rooted trees. The figure below shows a drawgram
produced by PHYLIP.
1981
Majority consensus, Margush and McMorris.
Strict consensus, Sokal and Rohlf.
First computationally efficient ML (maximum likelihood) algorithm, Felsenstein.
Felsenstein created the Felsenstein Maximum Likelihood method, used for
the inference of phylogeny, which evaluates a hypothesis about
evolutionary history in terms of the probability that the proposed model
and the hypothesized history would give rise to the observed data set.
This image depicts a PHYLIP-generated drawgram. This drawgram is an example
of one of the possible trees the software is capable of generating.
1982
PHYSIS, Mickevich and Farris.
Branch and bound, Hendy and Penny.
1985
First cladistic analysis of eukaryotes based on combined phenotypic and genotypic evidence, Diana Lipscomb.
First issue of Cladistics.
First phylogenetic application of bootstrap, Felsenstein.
First phylogenetic application of jackknife, Scott Lanyon.
1986, MacClade, Maddison and Maddison.
1987, neighbor-joining method, Saitou and Nei.
1988, Hennig86 (version 1.5), Farris.
Bremer support (decay index), Bremer.
1989
RI (retention index), RCI (rescaled consistency index), Farris.
1996, first working methods for BI (Bayesian inference), independently developed by Li, by Mau, and by Rannala and Yang, all using MCMC (Markov chain Monte Carlo).
1998, TNT (Tree Analysis Using New Technology), Goloboff, Farris, and Nixon.
1999, Winclada, Nixon.
2003, symmetrical resampling, Goloboff.
2004, 2005, similarity metric (using an approximation to Kolmogorov
complexity) or NCD (normalized compression distance), Li et al., Cilibrasi and Vitanyi.
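Several timeline entries above concern parsimony criteria, which score a tree by the number of character changes it requires. As an illustration, the sketch below follows the bottom-up pass usually associated with Fitch parsimony for a single character on a rooted binary tree. The nested-tuple tree encoding, the function name `fitch_score`, and the toy character are assumptions made for this example.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    `tree` is a nested tuple of tip names; `states` maps each tip to its state.
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = node
        left_set, right_set = state_set(left), state_set(right)
        shared = left_set & right_set
        if shared:
            # The children agree on at least one state: no change is needed here.
            return shared
        # No state in common: one change is required on a branch below this node.
        changes += 1
        return left_set | right_set

    state_set(tree)
    return changes


# Two candidate trees for four hypothetical taxa and one toy character.
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
character = {"A": "G", "B": "G", "C": "T", "D": "T"}

print(fitch_score(tree_1, character))  # 1 change
print(fitch_score(tree_2, character))  # 2 changes
```

Under the parsimony criterion, `tree_1` is preferred for this character because it explains the observed states with fewer changes. A full analysis would sum such scores over many characters and search over many trees.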
Uses of phylogenetic analysis
Pharmacology
One use of phylogenetic analysis involves the pharmacological examination of closely related groups of organisms. Advances in cladistic
analysis through faster computer programs and improved molecular
techniques have increased the precision of phylogenetic determination,
allowing for the identification of species with pharmacological
potential.
Historically, phylogenetic screens for pharmacological purposes were used in a basic manner, such as studying the Apocynaceae family of plants, which includes alkaloid-producing species like Catharanthus, known for producing vincristine,
an antileukemia drug. Modern techniques now enable researchers to study
close relatives of a species to uncover either a higher abundance of
important bioactive compounds (e.g., species of Taxus for taxol) or natural variants of known pharmaceuticals (e.g., species of Catharanthus for different forms of vincristine or vinblastine).
Biodiversity
Phylogenetic analysis has also been applied to biodiversity studies within the fungal
kingdom. Phylogenetic analysis helps to understand the evolutionary history
of various groups of organisms, identify relationships between
different species, and predict future evolutionary changes. Emerging
imagery systems and new analysis techniques allow for the discovery of
more genetic relationships in biodiverse fields, which can aid in
conservation efforts by identifying rare species that could benefit
ecosystems globally.
Phylogenetic Subtree of fungi containing different biodiverse sections of the fungi group.
Infectious disease epidemiology
Whole-genome sequence
data from outbreaks or epidemics of infectious diseases can provide
important insights into transmission dynamics and inform public health
strategies. Traditionally, studies have combined genomic and
epidemiological data to reconstruct transmission events. However, recent
research has explored deducing transmission patterns solely from
genomic data using phylodynamics,
which involves analyzing the properties of pathogen phylogenies.
Phylodynamics uses theoretical models to compare predicted branch
lengths with actual branch lengths in phylogenies to infer transmission
patterns. Additionally, coalescent theory,
which describes probability distributions on trees based on population
size, has been adapted for epidemiological purposes. Another source of
information within phylogenies that has been explored is "tree shape."
These approaches, while computationally intensive, have the potential to
provide valuable insights into pathogen transmission dynamics.
Pathogen Transmission Trees
The structure of the host contact network significantly impacts the
dynamics of outbreaks, and management strategies rely on understanding
these transmission patterns. Pathogen genomes spreading through
different contact network structures, such as chains, homogeneous
networks, or networks with super-spreaders, accumulate mutations in
distinct patterns, resulting in noticeable differences in the shape of
phylogenetic trees, as illustrated in Fig. 1. Researchers have analyzed
the structural characteristics of phylogenetic trees generated from
simulated bacterial genome evolution across multiple types of contact
networks. By examining simple topological properties of these trees,
researchers can classify them into chain-like, homogeneous, or
super-spreading dynamics, revealing transmission patterns. These
properties form the basis of a computational classifier used to analyze
real-world outbreaks. Computational predictions of transmission dynamics
for each outbreak often align with known epidemiological data.
Graphical Representation of Phylogenetic Tree analysis
Different transmission networks result in quantitatively different
tree shapes. To determine whether tree shapes captured information about
underlying disease transmission patterns, researchers simulated the
evolution of a bacterial genome over three types of outbreak contact
networks—homogeneous, super-spreading, and chain-like. They summarized
the resulting phylogenies with five metrics describing tree shape.
Figures 2 and 3 illustrate the distributions of these metrics across the
three types of outbreaks, revealing clear differences in tree topology
depending on the underlying host contact network.
Super-spreader networks give rise to phylogenies with higher
Colless imbalance, longer ladder patterns, lower Δw, and deeper trees
than those from homogeneous contact networks. Trees from chain-like
networks are less variable, deeper, more imbalanced, and narrower than
those from other networks.
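Of the tree-shape metrics mentioned above, Colless imbalance is the simplest to compute: for every internal node of a binary tree, take the absolute difference between the number of tips on its two sides, and sum these values. The sketch below assumes trees written as nested tuples; the function name and the two four-tip example trees are hypothetical.

```python
def colless_index(tree):
    """Sum over internal nodes of |tips on the left - tips on the right|."""
    total = 0

    def count_tips(node):
        nonlocal total
        if isinstance(node, str):
            return 1
        left, right = node
        n_left, n_right = count_tips(left), count_tips(right)
        total += abs(n_left - n_right)
        return n_left + n_right

    count_tips(tree)
    return total


# A perfectly balanced tree and a ladder-like (caterpillar) tree on four tips.
balanced = (("a", "b"), ("c", "d"))
ladder = ((("a", "b"), "c"), "d")

print(colless_index(balanced))  # 0
print(colless_index(ladder))    # 3
```

The ladder-like tree scores higher than the balanced one, which is the direction of the difference reported above for super-spreader and chain-like networks relative to homogeneous ones.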
Scatter plots can be used to visualize the relationship between
two variables in pathogen transmission analysis, such as the number of
infected individuals and the time since infection. These plots can help
identify trends and patterns, such as whether the spread of the pathogen
is increasing or decreasing over time, and can highlight potential
transmission routes or super-spreader events. Box plots,
which display the range, median, quartiles, and potential outliers
of datasets, can also be valuable for analyzing pathogen transmission data,
helping to identify important features in the data distribution. They
may be used to quickly identify differences or similarities in the
transmission data.
Disciplines other than biology
Phylogeny of Indo-European languages
Phylogenetic tools and representations (trees and networks) can also be applied to philology, the study of the evolution of oral languages and written text and manuscripts, such as in the field of quantitative comparative linguistics.
Computational phylogenetics can be used to investigate a language
as an evolutionary system. The evolution of human language closely
corresponds with human biological evolution, which allows phylogenetic
methods to be applied. The concept of a "tree" serves as an efficient
way to represent relationships between languages and language splits. It
also serves as a way of testing hypotheses about the connections and
ages of language families. For example, relationships among languages
can be shown by using cognates as characters.
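Using cognates as characters typically means recording, for each language, which cognate classes are present, and then treating the resulting presence/absence matrix like any other character matrix. The sketch below builds such a matrix and a simple pairwise distance from it. The language names, cognate class labels, and function names are placeholders invented for the example, not real linguistic data.

```python
from itertools import combinations

# Placeholder data: each language maps to the cognate classes attested in it.
cognate_sets = {
    "language_1": {"c01", "c02", "c03", "c04"},
    "language_2": {"c01", "c02", "c03", "c05"},
    "language_3": {"c01", "c06", "c07", "c08"},
}


def character_matrix(cognate_sets):
    """Return (sorted cognate classes, {language: 0/1 row}) for the data."""
    classes = sorted(set().union(*cognate_sets.values()))
    matrix = {
        language: [int(c in present) for c in classes]
        for language, present in cognate_sets.items()
    }
    return classes, matrix


def jaccard_distance(a, b):
    """One minus the share of cognate classes the two languages have in common."""
    return 1 - len(a & b) / len(a | b)


classes, matrix = character_matrix(cognate_sets)
for language, row in matrix.items():
    print(language, row)

for first, second in combinations(sorted(cognate_sets), 2):
    distance = jaccard_distance(cognate_sets[first], cognate_sets[second])
    print(first, second, round(distance, 3))
```

In this toy data, languages 1 and 2 share most of their cognate classes and so end up closer to each other than either is to language 3. A distance-based or character-based tree method could then be applied to the matrix.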
The phylogenetic tree of Indo-European languages shows the
relationships between several of the languages in a timeline, as well as
the similarity between words and word order.
There are three types of criticism of the use of phylogenetics in
philology. The first argues that languages and species are different
entities, so the same methods cannot be used to study both. The
second concerns how phylogenetic methods are applied to linguistic
data. The third concerns the types of data used to
construct the trees.
Bayesian phylogenetic methods, which are sensitive to how treelike the data are, allow for the
reconstruction of relationships among languages, locally and globally.
The two main reasons for the use of Bayesian phylogenetics are that (1)
diverse scenarios can be included in calculations and (2) the output is a
sample of trees rather than a single tree claimed to be true.
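Because the output is a sample of trees, support for a particular grouping is usually summarized as the fraction of sampled trees that contain it. The sketch below computes these clade frequencies for a hand-written sample of four trees. The sample, the labels A to D, and the function names are hypothetical; in practice the sample would come from an MCMC run and contain thousands of trees.

```python
from collections import Counter


def collect_clades(node, found):
    """Add the tip set of every internal node below `node` to `found`."""
    if isinstance(node, str):
        return frozenset([node])
    members = frozenset()
    for child in node:
        members |= collect_clades(child, found)
    found.add(members)
    return members


def clade_frequencies(tree_sample):
    """Map each clade to the fraction of sampled trees in which it appears."""
    counts = Counter()
    for tree in tree_sample:
        found = set()
        collect_clades(tree, found)
        counts.update(found)
    return {clade: n / len(tree_sample) for clade, n in counts.items()}


# Hand-written stand-in for a posterior sample of trees over four languages.
sample = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]

frequencies = clade_frequencies(sample)
print(frequencies[frozenset({"A", "B"})])  # 0.75
print(frequencies[frozenset({"C", "D"})])  # 0.5
```

Reporting such frequencies keeps the uncertainty visible: the grouping of A with B appears in three of the four sampled trees rather than being asserted outright.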
The same process can be applied to texts and manuscripts. In paleography,
the study of historical writings and manuscripts, texts were replicated
by scribes who copied from their source and alterations - i.e.,
'mutations' - occurred when the scribe did not precisely copy the
source.
Phylogenetics has been applied to archaeological artefacts such as the early hominin hand-axes, late Palaeolithic figurines, Neolithic stone arrowheads, Bronze Age ceramics, and historical-period houses.
Bayesian methods have also been employed by archaeologists in an
attempt to quantify uncertainty in the tree topology and divergence
times of stone projectile point shapes in the European Final
Palaeolithic and earliest Mesolithic.