
Monday, June 25, 2018

Models of DNA evolution

From Wikipedia, the free encyclopedia

A number of different Markov models of DNA sequence evolution have been proposed. These substitution models differ in the parameters used to describe the rates at which one nucleotide replaces another during evolution. These models are frequently used in molecular phylogenetic analyses. In particular, they are used during the calculation of the likelihood of a tree (in Bayesian and maximum likelihood approaches to tree estimation) and to estimate the evolutionary distance between sequences from the observed differences between them.

Introduction

These models are phenomenological descriptions of the evolution of DNA as a string of four discrete states.[1] These Markov models do not explicitly depict the mechanism of mutation nor the action of natural selection. Rather they describe the relative rates of different changes. For example, mutational biases and purifying selection favoring conservative changes are probably both responsible for the relatively high rate of transitions compared to transversions in evolving sequences. However, the Kimura (K80) model described below merely attempts to capture the effect of both forces in a parameter that reflects the relative rate of transitions to transversions.

Evolutionary analyses of sequences are conducted on a wide variety of time scales. Thus, it is convenient to express these models in terms of the instantaneous rates of change between different states (the Q matrices below). If we are given a starting (ancestral) state at one position, the model's Q matrix and a branch length expressing the expected number of changes to have occurred since the ancestor, then we can derive the probability of the descendant sequence having each of the four states. The mathematical details of this transformation from rate matrix to probability matrix are described in the mathematics of substitution models section of the substitution model page. By expressing models in terms of the instantaneous rates of change we can avoid estimating a large number of parameters for each branch on a phylogenetic tree (or each comparison if the analysis involves many pairwise sequence comparisons).

The models described on this page describe the evolution of a single site within a set of sequences. They are often used for analyzing the evolution of an entire locus by making the simplifying assumption that different sites evolve independently and are identically distributed. This assumption may be justifiable if the sites can be assumed to be evolving neutrally. If the primary effect of natural selection on the evolution of the sequences is to constrain some sites, then models of among-site rate-heterogeneity can be used. This approach allows one to estimate only one matrix of relative rates of substitution, and another set of parameters describing the variance in the total rate of substitution across sites.

DNA evolution as a continuous-time Markov chain

Continuous-time Markov chains

Continuous-time Markov chains have the usual transition matrices which are, in addition, parameterized by time, t. Specifically, if E_1, E_2, E_3, E_4 are the states, then the transition matrix is
P(t) = (P_{ij}(t)), where each individual entry P_{ij}(t) refers to the probability that state E_i will change to state E_j in time t.
Example: We would like to model the substitution process in DNA sequences (i.e. Jukes–Cantor, Kimura, etc.) in a continuous-time fashion. The corresponding transition matrices will look like:
P(t) = \begin{pmatrix} p_{AA}(t) & p_{AG}(t) & p_{AC}(t) & p_{AT}(t) \\ p_{GA}(t) & p_{GG}(t) & p_{GC}(t) & p_{GT}(t) \\ p_{CA}(t) & p_{CG}(t) & p_{CC}(t) & p_{CT}(t) \\ p_{TA}(t) & p_{TG}(t) & p_{TC}(t) & p_{TT}(t) \end{pmatrix}
where the top-left and bottom-right 2 × 2 blocks correspond to transition probabilities and the top-right and bottom-left 2 × 2 blocks correspond to transversion probabilities.

Assumption: If at some time t_0 the Markov chain is in state E_i, then the probability that at time t_0 + t it will be in state E_j depends only upon i, j and t. This then allows us to write that probability as p_{ij}(t).

Theorem: Continuous-time transition matrices satisfy:
P(t + \tau) = P(t)P(\tau)
Note: There is a possible confusion here between two meanings of the word transition. (i) In the context of Markov chains, transition is the general term for a change between two states. (ii) In the context of nucleotide changes in DNA sequences, transition is a specific term for the exchange between either the two purines (A ↔ G) or the two pyrimidines (C ↔ T) (for additional details, see the article about transitions in genetics). By contrast, an exchange between one purine and one pyrimidine is called a transversion.

Deriving the dynamics of substitution

Consider a DNA sequence of fixed length m evolving in time by base replacement. Assume that the processes followed by the m sites are Markovian, independent and identically distributed, and that the process is constant over time. For a particular site, let
\mathbf{P}(t) = (p_A(t), p_G(t), p_C(t), p_T(t))
denote the probabilities of states A, G, C and T at time t. Let
\mathcal{E} = \{A, G, C, T\}
be the state-space. For two distinct x, y \in \mathcal{E}, let \mu_{xy} be the transition rate from state x to state y. Similarly, for any x, let the total rate of change away from x be:
\mu_x = \sum_{y \neq x} \mu_{xy}
The changes in the probability distribution p_A(t) for small increments of time \Delta t are given by:
p_A(t + \Delta t) = p_A(t) - p_A(t)\mu_A \Delta t + \sum_{x \neq A} p_x(t)\mu_{xA} \Delta t
In other words (in frequentist language), the frequency of A's at time t + \Delta t is equal to the frequency at time t minus the frequency of the lost A's plus the frequency of the newly created A's.
Similarly for the probabilities p_G(t), p_C(t) and p_T(t). We can write these compactly as:
\mathbf{P}(t + \Delta t) = \mathbf{P}(t) + \mathbf{P}(t)Q\Delta t
where,
Q = \begin{pmatrix} -\mu_A & \mu_{AG} & \mu_{AC} & \mu_{AT} \\ \mu_{GA} & -\mu_G & \mu_{GC} & \mu_{GT} \\ \mu_{CA} & \mu_{CG} & -\mu_C & \mu_{CT} \\ \mu_{TA} & \mu_{TG} & \mu_{TC} & -\mu_T \end{pmatrix}
or, alternately:
\mathbf{P}'(t) = \mathbf{P}(t)Q
where Q is the rate matrix. Note that by definition, the sum of the entries in each row of matrix Q is equal to zero. For a stationary process, where Q does not depend upon time t, this differential equation can be solved using matrix exponentiation:
P(t) = \exp(Qt) and
\mathbf{P}(t) = \mathbf{P}(0)P(t) = \mathbf{P}(0)\exp(Qt).
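
A minimal numerical sketch of this last relationship, assuming NumPy and SciPy are available; the rate matrix is the Jukes–Cantor Q built from the definitions below, and the values of mu and t are purely illustrative.

```python
# Sketch: computing P(t) = exp(Qt) numerically and propagating an ancestral state.
import numpy as np
from scipy.linalg import expm

mu = 1.0  # overall substitution rate (illustrative value)

# Jukes-Cantor rate matrix, state order A, G, C, T:
# off-diagonal rates mu/4, diagonal -3*mu/4, so every row sums to zero
Q = (mu / 4.0) * (np.ones((4, 4)) - 4.0 * np.eye(4))
print(Q.sum(axis=1))             # ~[0, 0, 0, 0]

t = 0.5                          # elapsed time, in the same units as 1/mu
P = expm(Q * t)                  # transition probability matrix P(t); rows sum to 1

# probability distribution of the descendant state given ancestor A,
# i.e. P(0) exp(Qt) with P(0) = (1, 0, 0, 0)
p0 = np.array([1.0, 0.0, 0.0, 0.0])
print(p0 @ P)
```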

Ergodicity

If all the transition rates \mu_{xy} are positive, i.e. if all states x, y \in \mathcal{E} communicate, then the Markov chain has a unique stationary distribution \mathbf{\Pi} = \{\pi_x, x \in \mathcal{E}\}, where each \pi_x is the proportion of time spent in state x after the Markov chain has run for infinite time. Such a Markov chain is called ergodic. In DNA evolution, under the assumption of a common process for each site, the stationary frequencies \pi_A, \pi_G, \pi_C, \pi_T correspond to equilibrium base compositions.

When the current distribution \mathbf{P}(t) is the stationary distribution \mathbf{\Pi}, then it follows that \mathbf{\Pi}Q = 0, using the differential equation above:
\mathbf{\Pi}Q = \mathbf{P}(t)Q = \frac{d\mathbf{P}(t)}{dt} = 0.

Time reversibility

Definition: A stationary Markov process is time reversible if (in the steady state) the amount of change from state x to y is equal to the amount of change from y to x (although the two states may occur with different frequencies). This means that:
\pi_x \mu_{xy} = \pi_y \mu_{yx}
Not all stationary processes are reversible; however, the most commonly used DNA evolution models assume time reversibility, which is considered to be a reasonable assumption.

Under the time reversibility assumption, let s_{xy} = \mu_{xy}/\pi_y; then it is easy to see that:
s_{xy} = s_{yx}
Definition: The symmetric term s_{xy} is called the exchangeability between states x and y. In other words, s_{xy} is the fraction of the frequency of state x that is the result of transitions from state y to state x.

Corollary: The 12 off-diagonal entries of the rate matrix Q (note the off-diagonal entries determine the diagonal entries, since the rows of Q sum to zero) can be completely determined by 9 numbers: 6 exchangeability terms and 3 stationary frequencies \pi_x (since the stationary frequencies sum to 1).
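
A minimal sketch of this construction, assuming NumPy is available: a time-reversible Q is assembled from 6 exchangeabilities and the stationary frequencies, and detailed balance \pi_x \mu_{xy} = \pi_y \mu_{yx} is checked. The numerical values are arbitrary, for illustration only.

```python
# Sketch: building a reversible rate matrix from exchangeabilities and frequencies.
import numpy as np

states = ["A", "G", "C", "T"]
pi = np.array([0.3, 0.2, 0.2, 0.3])          # stationary frequencies, sum to 1

# symmetric exchangeabilities s_xy = s_yx (6 free numbers, arbitrary here)
S = np.array([[0.0, 4.0, 1.0, 1.0],
              [4.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 4.0],
              [1.0, 1.0, 4.0, 0.0]])

# off-diagonal rates mu_xy = s_xy * pi_y; diagonals make each row sum to zero
Q = S * pi[np.newaxis, :]
np.fill_diagonal(Q, -Q.sum(axis=1))

# detailed balance check: pi_x * mu_xy == pi_y * mu_yx for every pair x, y
flux = pi[:, np.newaxis] * Q
assert np.allclose(flux, flux.T), "process is not time reversible"
print(Q)
```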

Scaling of branch lengths

By comparing extant sequences, one can determine the amount of sequence divergence. This raw measurement of divergence provides information about the number of changes that have occurred along the path separating the sequences. The simple count of differences (the Hamming distance) between sequences will often underestimate the number of substitutions because of multiple hits (see homoplasy). Trying to estimate the exact number of changes that have occurred is difficult, and usually not necessary. Instead, branch lengths (and path lengths) in phylogenetic analyses are usually expressed as the expected number of changes per site. The path length is the product of the duration of the path in time and the mean rate of substitutions. While their product can be estimated, the rate and time are not separately identifiable from sequence divergence.

The descriptions of rate matrices on this page accurately reflect the relative magnitude of different substitutions, but these rate matrices are not scaled such that a branch length of 1 yields one expected change. This scaling can be accomplished by multiplying every element of the matrix by the same factor, or simply by scaling the branch lengths. If we use β to denote the scaling factor, and ν to denote the branch length measured in the expected number of substitutions per site, then βν is used in the transition probability formulae below in place of μt. Note that ν is a parameter to be estimated from data, and is referred to as the branch length, while β is simply a number that can be calculated from the rate matrix (it is not a separate free parameter).

The value of β can be found by forcing the expected rate of flux of states to be 1. The diagonal entries of the rate matrix (the Q matrix) represent −1 times the rate of leaving each state. For time-reversible models, we know the equilibrium state frequencies (these are simply the πi parameter value for state i). Thus we can find the expected rate of change by calculating the sum of the flux out of each state weighted by the proportion of sites that are expected to be in that state. Setting β to be the reciprocal of this sum will guarantee that the scaled process has an expected flux of 1:
\beta = 1/\left(-\sum_i \pi_i \mu_{ii}\right)
For example, in the Jukes–Cantor model the scaling factor would be 4/(3μ), because the rate of leaving each state is 3μ/4.
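
A minimal sketch of this scaling factor, assuming NumPy is available and reusing the Jukes–Cantor rate matrix from the earlier sketch; the value of mu is illustrative.

```python
# Sketch: beta = 1 / (-sum_i pi_i * mu_ii), checked against the JC value 4/(3*mu).
import numpy as np

mu = 2.0
Q = (mu / 4.0) * (np.ones((4, 4)) - 4.0 * np.eye(4))   # JC69 rate matrix
pi = np.full(4, 0.25)                                   # equal base frequencies

beta = 1.0 / (-np.sum(pi * np.diag(Q)))
print(beta, 4.0 / (3.0 * mu))          # both print the same value

# scaling Q by beta gives an expected flux of exactly 1, so a branch length of
# nu corresponds to nu expected substitutions per site
Q_scaled = beta * Q
print(-np.sum(pi * np.diag(Q_scaled)))                  # 1.0
```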

Most common models of DNA evolution

JC69 model (Jukes and Cantor 1969)

JC69, the Jukes and Cantor 1969 model,[2] is the simplest substitution model. It assumes equal base frequencies \left(\pi_A = \pi_G = \pi_C = \pi_T = {1 \over 4}\right) and equal mutation rates. The only parameter of this model is therefore \mu, the overall substitution rate. As previously mentioned, this variable becomes a constant when we normalize the mean rate to 1.
Q = \begin{pmatrix} * & {\mu \over 4} & {\mu \over 4} & {\mu \over 4} \\ {\mu \over 4} & * & {\mu \over 4} & {\mu \over 4} \\ {\mu \over 4} & {\mu \over 4} & * & {\mu \over 4} \\ {\mu \over 4} & {\mu \over 4} & {\mu \over 4} & * \end{pmatrix}

Figure: Probability P_{ij} of changing from initial state i to final state j as a function of the branch length (\nu) for JC69. Red curve: nucleotide states i and j are different. Blue curve: initial and final states are the same. After a long time, probabilities tend to the nucleotide equilibrium frequencies (0.25: dashed line).
P = \begin{pmatrix} {1 \over 4} + {3 \over 4}e^{-t\mu} & {1 \over 4} - {1 \over 4}e^{-t\mu} & {1 \over 4} - {1 \over 4}e^{-t\mu} & {1 \over 4} - {1 \over 4}e^{-t\mu} \\ {1 \over 4} - {1 \over 4}e^{-t\mu} & {1 \over 4} + {3 \over 4}e^{-t\mu} & {1 \over 4} - {1 \over 4}e^{-t\mu} & {1 \over 4} - {1 \over 4}e^{-t\mu} \\ {1 \over 4} - {1 \over 4}e^{-t\mu} & {1 \over 4} - {1 \over 4}e^{-t\mu} & {1 \over 4} + {3 \over 4}e^{-t\mu} & {1 \over 4} - {1 \over 4}e^{-t\mu} \\ {1 \over 4} - {1 \over 4}e^{-t\mu} & {1 \over 4} - {1 \over 4}e^{-t\mu} & {1 \over 4} - {1 \over 4}e^{-t\mu} & {1 \over 4} + {3 \over 4}e^{-t\mu} \end{pmatrix}
When branch length, \nu , is measured in the expected number of changes per site then:
P_{ij}(\nu) = \begin{cases} {1 \over 4} + {3 \over 4}e^{-4\nu/3} & \mbox{if } i = j \\ {1 \over 4} - {1 \over 4}e^{-4\nu/3} & \mbox{if } i \neq j \end{cases}
It is worth noticing that \nu = {3 \over 4}t\mu = ({\mu \over 4} + {\mu \over 4} + {\mu \over 4})t, i.e. the sum of the off-diagonal entries of any row (or column) of the matrix Q multiplied by time, and thus the expected number of substitutions per site in time t (the branch duration) when the rate of substitution equals \mu.

Given the proportion p of sites that differ between the two sequences, the Jukes–Cantor estimate of the evolutionary distance (in terms of the expected number of changes) between two sequences is given by
\hat{d} = -{3 \over 4}\ln\left(1 - {4 \over 3}p\right) = \hat{\nu}
The p in this formula is frequently referred to as the p-distance. It is a sufficient statistic for calculating the Jukes-Cantor distance correction, but is not sufficient for the calculation of the evolutionary distance under the more complex models that follow (also note that p used in subsequent formulae is not identical to the "p-distance").
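
A minimal sketch of this correction in plain Python; the aligned sequences are made up for illustration.

```python
# Sketch: Jukes-Cantor distance correction from the observed p-distance.
import math

def jc69_distance(seq1: str, seq2: str) -> float:
    """Expected substitutions per site under JC69, from two aligned sequences."""
    assert len(seq1) == len(seq2)
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)   # p-distance
    if p >= 0.75:
        raise ValueError("p-distance too large; JC69 correction is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(jc69_distance("ACGTACGTAC", "ACGTACGAAC"))   # p = 0.1 -> d ~ 0.107
```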

K80 model (Kimura 1980)

K80, the Kimura 1980 model,[3] distinguishes between transitions (A \leftrightarrow G, i.e. from purine to purine, or C \leftrightarrow T, i.e. from pyrimidine to pyrimidine) and transversions (from purine to pyrimidine or vice versa). In Kimura's original description of the model, α and β were used to denote the rates of these two types of substitution, but it is now more common to set the rate of transversions to 1 and use κ to denote the transition/transversion rate ratio (as is done below). The K80 model assumes that all of the bases are equally frequent (\pi_A = \pi_G = \pi_C = \pi_T = 0.25).

Rate matrix Q = \begin{pmatrix} * & \kappa & 1 & 1 \\ \kappa & * & 1 & 1 \\ 1 & 1 & * & \kappa \\ 1 & 1 & \kappa & * \end{pmatrix}

The Kimura two-parameter distance is given by:
K = -{1 \over 2}\ln\left((1 - 2p - q)\sqrt{1 - 2q}\right)
where p is the proportion of sites that show transitional differences and q is the proportion of sites that show transversional differences.
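
A minimal sketch of this distance in plain Python, with p and q counted directly from a pair of aligned sequences; the example sequences are made up.

```python
# Sketch: Kimura two-parameter (K80) distance from transition and transversion counts.
import math

PURINES = {"A", "G"}

def k80_distance(seq1: str, seq2: str) -> float:
    n = len(seq1)
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # a transition keeps the purine/pyrimidine class; a transversion changes it
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))

# one transition (A<->G) and one transversion (C<->A) in 10 sites
print(k80_distance("ACGTACGTAC", "GCGTACGTAA"))
```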

F81 model (Felsenstein 1981)

F81, Felsenstein's 1981 model,[4] is an extension of the JC69 model in which base frequencies are allowed to vary from 0.25 (\pi_A \neq \pi_G \neq \pi_C \neq \pi_T).

Rate matrix:
Q = \begin{pmatrix} * & \pi_G & \pi_C & \pi_T \\ \pi_A & * & \pi_C & \pi_T \\ \pi_A & \pi_G & * & \pi_T \\ \pi_A & \pi_G & \pi_C & * \end{pmatrix}
When branch length, ν, is measured in the expected number of changes per site then:
\beta = 1/(1 - \pi_A^2 - \pi_C^2 - \pi_G^2 - \pi_T^2)
P_{ij}(\nu) = \begin{cases} e^{-\beta\nu} + \pi_j\left(1 - e^{-\beta\nu}\right) & \mbox{if } i = j \\ \pi_j\left(1 - e^{-\beta\nu}\right) & \mbox{if } i \neq j \end{cases}
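
A minimal sketch of these F81 transition probabilities in plain Python; the base frequencies and branch length are arbitrary example values.

```python
# Sketch: F81 transition probability P_ij(nu) with the beta scaling above.
import math

def f81_prob(i: str, j: str, nu: float, pi: dict) -> float:
    """Transition probability P_ij(nu) under the F81 model."""
    beta = 1.0 / (1.0 - sum(f * f for f in pi.values()))
    e = math.exp(-beta * nu)
    if i == j:
        return e + pi[j] * (1.0 - e)
    return pi[j] * (1.0 - e)

pi = {"A": 0.35, "G": 0.15, "C": 0.15, "T": 0.35}
nu = 0.2
# the probabilities out of any state sum to 1
print(sum(f81_prob("A", j, nu, pi) for j in "AGCT"))
```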

HKY85 model (Hasegawa, Kishino and Yano 1985)

HKY85, the Hasegawa, Kishino and Yano 1985 model,[5] can be thought of as combining the extensions made in the K80 and F81 models. Namely, it distinguishes between the rate of transitions and transversions (using the κ parameter), and it allows unequal base frequencies (\pi_A \neq \pi_G \neq \pi_C \neq \pi_T). [Felsenstein described a similar (but not equivalent) model in 1984 using a different parameterization;[6] that latter model is referred to as the F84 model.[7]]

Rate matrix Q = \begin{pmatrix} * & \kappa\pi_G & \pi_C & \pi_T \\ \kappa\pi_A & * & \pi_C & \pi_T \\ \pi_A & \pi_G & * & \kappa\pi_T \\ \pi_A & \pi_G & \kappa\pi_C & * \end{pmatrix}

If we express the branch length, ν in terms of the expected number of changes per site then:
\beta = \frac{1}{2(\pi_A + \pi_G)(\pi_C + \pi_T) + 2\kappa[(\pi_A\pi_G) + (\pi_C\pi_T)]}
P_{AA}(\nu, \kappa, \pi) = \left[\pi_A\left(\pi_A + \pi_G + (\pi_C + \pi_T)e^{-\beta\nu}\right) + \pi_G e^{-(1 + (\pi_A + \pi_G)(\kappa - 1))\beta\nu}\right]/(\pi_A + \pi_G)
P_{AC}(\nu, \kappa, \pi) = \pi_C\left(1 - e^{-\beta\nu}\right)
P_{AG}(\nu, \kappa, \pi) = \left[\pi_G\left(\pi_A + \pi_G + (\pi_C + \pi_T)e^{-\beta\nu}\right) - \pi_G e^{-(1 + (\pi_A + \pi_G)(\kappa - 1))\beta\nu}\right]/\left(\pi_A + \pi_G\right)
P_{AT}(\nu, \kappa, \pi) = \pi_T\left(1 - e^{-\beta\nu}\right)
and formulae for the other combinations of states can be obtained by substituting in the appropriate base frequencies.

T92 model (Tamura 1992)

T92, the Tamura 1992 model,[8] is a mathematical method developed to estimate the number of nucleotide substitutions per site between two DNA sequences, by extending Kimura’s (1980) two-parameter method to the case where a G+C content bias exists. This method will be useful when there are strong transition-transversion and G+C-content biases, as in the case of Drosophila mitochondrial DNA.[8]

T92 involves a single compound base frequency parameter \theta \in (0, 1) (also written \pi_{GC}), where \theta = \pi_{GC} = \pi_G + \pi_C = 1 - (\pi_A + \pi_T).

As T92 echoes Chargaff's second parity rule (pairing nucleotides have the same frequency on a single DNA strand: G and C on the one hand, and A and T on the other hand), it follows that the four base frequencies can be expressed as a function of \pi_{GC}:


\pi_G = \pi_C = {\pi_{GC} \over 2} and \pi_A = \pi_T = {(1 - \pi_{GC}) \over 2}

Rate matrix Q = \begin{pmatrix} * & \kappa\pi_{GC}/2 & \pi_{GC}/2 & (1-\pi_{GC})/2 \\ \kappa(1-\pi_{GC})/2 & * & \pi_{GC}/2 & (1-\pi_{GC})/2 \\ (1-\pi_{GC})/2 & \pi_{GC}/2 & * & \kappa(1-\pi_{GC})/2 \\ (1-\pi_{GC})/2 & \pi_{GC}/2 & \kappa\pi_{GC}/2 & * \end{pmatrix}

The evolutionary distance between two DNA sequences according to this model is given by
d = -h\ln\left(1 - {p \over h} - q\right) - {1 \over 2}(1 - h)\ln(1 - 2q)
where h = 2\theta(1 - \theta) and \theta is the G+C content (\pi_{GC} = \pi_G + \pi_C).
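
A minimal sketch of this distance in plain Python; the values of p, q and theta are made up for illustration.

```python
# Sketch: Tamura (1992) distance from transition proportion p, transversion
# proportion q, and G+C content theta.
import math

def t92_distance(p: float, q: float, theta: float) -> float:
    h = 2.0 * theta * (1.0 - theta)
    return -h * math.log(1.0 - p / h - q) - 0.5 * (1.0 - h) * math.log(1.0 - 2.0 * q)

print(t92_distance(p=0.10, q=0.05, theta=0.60))
```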

TN93 model (Tamura and Nei 1993)

TN93, the Tamura and Nei 1993 model,[9] distinguishes between the two different types of transition: A \leftrightarrow G is allowed to have a different rate from C \leftrightarrow T. Transversions are all assumed to occur at the same rate, but that rate is allowed to differ from both of the transition rates.
TN93 also allows unequal base frequencies (\pi_A \neq \pi_G \neq \pi_C \neq \pi_T).

Rate matrix Q = \begin{pmatrix} * & \kappa_1\pi_G & \pi_C & \pi_T \\ \kappa_1\pi_A & * & \pi_C & \pi_T \\ \pi_A & \pi_G & * & \kappa_2\pi_T \\ \pi_A & \pi_G & \kappa_2\pi_C & * \end{pmatrix}

GTR model (Tavaré 1986)

GTR, the generalised time-reversible model, is the most general neutral, independent, finite-sites, time-reversible model possible. It was first described in a general form by Simon Tavaré in 1986.[10]

GTR parameters consist of an equilibrium base frequency vector, \Pi = (\pi_A, \pi_G, \pi_C, \pi_T), giving the frequency at which each base occurs at each site, and the rate matrix
Q = \begin{pmatrix} -(\alpha\pi_G + \beta\pi_C + \gamma\pi_T) & \alpha\pi_G & \beta\pi_C & \gamma\pi_T \\ \alpha\pi_A & -(\alpha\pi_A + \delta\pi_C + \epsilon\pi_T) & \delta\pi_C & \epsilon\pi_T \\ \beta\pi_A & \delta\pi_G & -(\beta\pi_A + \delta\pi_G + \eta\pi_T) & \eta\pi_T \\ \gamma\pi_A & \epsilon\pi_G & \eta\pi_C & -(\gamma\pi_A + \epsilon\pi_G + \eta\pi_C) \end{pmatrix}
where

\begin{aligned} \alpha &= r(A \rightarrow G) = r(G \rightarrow A) \\ \beta &= r(A \rightarrow C) = r(C \rightarrow A) \\ \gamma &= r(A \rightarrow T) = r(T \rightarrow A) \\ \delta &= r(G \rightarrow C) = r(C \rightarrow G) \\ \epsilon &= r(G \rightarrow T) = r(T \rightarrow G) \\ \eta &= r(C \rightarrow T) = r(T \rightarrow C) \end{aligned}

are the substitution rate parameters.

Therefore, GTR (for four characters, as is often the case in phylogenetics) requires 6 substitution rate parameters, as well as 4 equilibrium base frequency parameters. However, this is usually reduced to 9 parameters plus \mu, the overall number of substitutions per unit time. When measuring time in substitutions (\mu = 1) only 8 free parameters remain.

In general, to compute the number of parameters, one must count the number of entries above the diagonal in the matrix, i.e. for n trait values per site {n^2 - n \over 2}, and then add n for the equilibrium base frequencies, and subtract 1 because \mu is fixed. One gets
{n^2 - n \over 2} + n - 1 = {1 \over 2}n^2 + {1 \over 2}n - 1.
For example, for an amino acid sequence (there are 20 "standard" amino acids that make up proteins), one would find that there are 209 parameters. However, when studying coding regions of the genome, it is more common to work with a codon substitution model (a codon is three bases and codes for one amino acid in a protein). There are 4^3 = 64 codons, but the rates for transitions between codons which differ by more than one base are assumed to be zero. Hence, there are {20 \times 19 \times 3 \over 2} + 64 - 1 = 633 parameters.
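
A quick numerical check of the counting formula above in plain Python; the codon figure printed at the end is the one quoted in the text.

```python
# Sketch: parameter counts for a general time-reversible model with n states.
def gtr_params(n: int) -> int:
    """(n^2 - n)/2 exchangeabilities + n frequencies - 1 for the fixed rate mu."""
    return (n * n - n) // 2 + n - 1

print(gtr_params(4))    # 9 by this formula for the 4 nucleotides
print(gtr_params(20))   # 209 for the 20 standard amino acids

# codon model count quoted above (single-base changes only among 64 codons)
print((20 * 19 * 3) // 2 + 64 - 1)   # 633
```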

Molecular clock

From Wikipedia, the free encyclopedia

The molecular clock is a technique that uses the mutation rate of biomolecules to deduce the time in prehistory when two or more life forms diverged. The biomolecular data used for such calculations are usually nucleotide sequences for DNA or amino acid sequences for proteins. The benchmarks for determining the mutation rate are often fossil or archaeological dates. The molecular clock was first tested in 1962 on the hemoglobin protein variants of various animals, and is commonly used in molecular evolution to estimate times of speciation or radiation. It is sometimes called a gene clock or an evolutionary clock.

Early discovery and genetic equidistance

The notion of the existence of a so-called "molecular clock" was first attributed to Émile Zuckerkandl and Linus Pauling who, in 1962, noticed that the number of amino acid differences in hemoglobin between different lineages changes roughly linearly with time, as estimated from fossil evidence.[1] They generalized this observation to assert that the rate of evolutionary change of any specified protein was approximately constant over time and over different lineages (based on the molecular clock hypothesis (MCH)).

The genetic equidistance phenomenon was first noted in 1963 by Emanuel Margoliash, who wrote: "It appears that the number of residue differences between cytochrome c of any two species is mostly conditioned by the time elapsed since the lines of evolution leading to these two species originally diverged. If this is correct, the cytochrome c of all mammals should be equally different from the cytochrome c of all birds. Since fish diverges from the main stem of vertebrate evolution earlier than either birds or mammals, the cytochrome c of both mammals and birds should be equally different from the cytochrome c of fish. Similarly, all vertebrate cytochrome c should be equally different from the yeast protein."[2] For example, the difference between the cytochrome c of a carp and a frog, turtle, chicken, rabbit, and horse is a very constant 13% to 14%. Similarly, the difference between the cytochrome c of a bacterium and yeast, wheat, moth, tuna, pigeon, and horse ranges from 64% to 69%. Together with the work of Emile Zuckerkandl and Linus Pauling, the genetic equidistance result directly led to the formal postulation of the molecular clock hypothesis in the early 1960s.[3]

Relationship with neutral theory

The observation of a clock-like rate of molecular change was originally purely phenomenological. Later, the work of Motoo Kimura[4] developed the neutral theory of molecular evolution, which predicted a molecular clock. Let there be N individuals, and to keep this calculation simple, let the individuals be haploid (i.e. have one copy of each gene). Let the rate of neutral mutations (i.e. mutations with no effect on fitness) in a new individual be \mu . The probability that this new mutation will become fixed in the population is then 1/N, since each copy of the gene is as good as any other. Every generation, each individual can have new mutations, so there are \mu N new neutral mutations in the population as a whole. That means that each generation, \mu new neutral mutations will become fixed. If most changes seen during molecular evolution are neutral, then fixations in a population will accumulate at a clock-rate that is equal to the rate of neutral mutations in an individual.

Calibration

The molecular clock alone can only say that one time period is twice as long as another: it cannot assign concrete dates. For viral phylogenetics and ancient DNA studies—two areas of evolutionary biology where it is possible to sample sequences over an evolutionary timescale—the dates of the intermediate samples can be used to more precisely calibrate the molecular clock. However, most phylogenies require that the molecular clock be calibrated against independent evidence about dates, such as the fossil record.[5] There are two general methods for calibrating the molecular clock using fossil data: node calibration and tip calibration.[6]

Node calibration

Sometimes referred to as node dating, node calibration is a method for phylogeny calibration that is done by placing fossil constraints at nodes. A node calibration fossil is the oldest discovered representative of that clade, which is used to constrain its minimum age. Due to the fragmentary nature of the fossil record, the true most recent common ancestor of a clade will likely never be found.[6] In order to account for this in node calibration analyses, a maximum clade age must be estimated. Determining the maximum clade age is challenging because it relies on negative evidence—the absence of older fossils in that clade. There are a number of methods for deriving the maximum clade age using birth-death models, fossil stratigraphic distribution analyses, or taphonomic controls.[7] Alternatively, instead of a maximum and a minimum, a prior probability of the divergence time can be established and used to calibrate the clock. There are several prior probability distributions (including normal, lognormal, exponential, gamma, and uniform) that can be used to express the probability of the true age of divergence relative to the age of the fossil;[8] however, there are very few methods for estimating the shape and parameters of the probability distribution empirically.[9] The placement of calibration nodes on the tree informs the placement of the unconstrained nodes, giving divergence date estimates across the phylogeny. Historical methods of clock calibration could only make use of a single fossil constraint (non-parametric rate smoothing),[10] while modern analyses (BEAST[11] and r8s[12]) allow for the use of multiple fossils to calibrate the molecular clock. Simulation studies have shown that increasing the number of fossil constraints increases the accuracy of divergence time estimation.[13]

Tip calibration

Sometimes referred to as tip dating, tip calibration is a method of molecular clock calibration in which fossils are treated as taxa and placed on the tips of the tree. This is achieved by creating a matrix that includes a molecular dataset for the extant taxa along with a morphological dataset for both the extinct and the extant taxa.[7] Unlike node calibration, this method reconstructs the tree topology and places the fossils simultaneously. Molecular and morphological models work together simultaneously, allowing morphology to inform the placement of fossils.[6] Tip calibration makes use of all relevant fossil taxa during clock calibration, rather than relying on only the oldest fossil of each clade. This method does not rely on the interpretation of negative evidence to infer maximum clade ages.[7]

Total evidence dating

This approach to tip calibration goes a step further by simultaneously estimating fossil placement, topology, and the evolutionary timescale. In this method, the age of a fossil can inform its phylogenetic position in addition to morphology. By allowing all aspects of tree reconstruction to occur simultaneously, the risk of biased results is decreased.[6] This approach has been improved upon by pairing it with different models. One current method of molecular clock calibration is total evidence dating paired with the fossilized birth-death (FBD) model and a model of morphological evolution.[14] The FBD model is novel in that it allows for “sampled ancestors,” which are fossil taxa that are the direct ancestor of a living taxon or lineage. This allows fossils to be placed on a branch above an extant organism, rather than being confined to the tips.[15]

Methods

Bayesian methods can provide more appropriate estimates of divergence times, especially if large datasets—such as those yielded by phylogenomics—are employed.[16]

Non-constant rate of molecular clock

Sometimes only a single divergence date can be estimated from fossils, with all other dates inferred from that. Other sets of species have abundant fossils available, allowing the MCH of constant divergence rates to be tested. DNA sequences experiencing low levels of negative selection showed divergence rates of 0.7–0.8% per Myr in bacteria, mammals, invertebrates, and plants.[17] In the same study, genomic regions experiencing very high negative or purifying selection (encoding rRNA) were considerably slower (1% per 50 Myr).

In addition to such variation in rate with genomic position, since the early 1990s variation among taxa has proven fertile ground for research too,[18] even over comparatively short periods of evolutionary time (for example mockingbirds[19]). Tube-nosed seabirds have molecular clocks that on average run at half the speed of many other birds,[20] possibly due to long generation times, and many turtles have a molecular clock running at one-eighth the speed it does in small mammals, or even slower.[21] Effects of small population size are also likely to confound molecular clock analyses. Researchers such as Francisco Ayala have more fundamentally challenged the molecular clock hypothesis.[22][23] According to Ayala's 1999 study, five factors combine to limit the application of molecular clock models:
  • Changing generation times (If the rate of new mutations depends at least partly on the number of generations rather than the number of years)
  • Population size (Genetic drift is stronger in small populations, and so more mutations are effectively neutral)
  • Species-specific differences (due to differing metabolism, ecology, evolutionary history, ...)
  • Change in function of the protein studied (can be avoided in closely related species by utilizing non-coding DNA sequences or emphasizing silent mutations)
  • Changes in the intensity of natural selection.
Figure: Phylogram showing three groups, one of which has strikingly longer branches than the other two. Woody bamboos (tribes Arundinarieae and Bambuseae) have long generation times and lower mutation rates, as expressed by short branches in the phylogenetic tree, than the fast-evolving herbaceous bamboos (Olyreae).

Molecular clock users have developed workaround solutions using a number of statistical approaches including maximum likelihood techniques and later Bayesian modeling. In particular, models that take into account rate variation across lineages have been proposed in order to obtain better estimates of divergence times. These models are called relaxed molecular clocks[24] because they represent an intermediate position between the 'strict' molecular clock hypothesis and Joseph Felsenstein's many-rates model[25] and are made possible through MCMC techniques that explore a weighted range of tree topologies and simultaneously estimate parameters of the chosen substitution model. It must be remembered that divergence dates inferred using a molecular clock are based on statistical inference and not on direct evidence.

The molecular clock runs into particular challenges at very short and very long timescales. At long timescales, the problem is saturation. When enough time has passed, many sites have undergone more than one change, but it is impossible to detect more than one. This means that the observed number of changes is no longer linear with time, but instead flattens out. Even at intermediate genetic distances, with phylogenetic data still sufficient to estimate topology, signal for the overall scale of the tree can be weak under complex likelihood models, leading to highly uncertain molecular clock estimates.[26]

At very short time scales, many differences between samples do not represent fixation of different sequences in the different populations. Instead, they represent alternative alleles that were both present as part of a polymorphism in the common ancestor. The inclusion of differences that have not yet become fixed leads to a potentially dramatic inflation of the apparent rate of the molecular clock at very short timescales.[27][28]

Uses

The molecular clock technique is an important tool in molecular systematics, the use of molecular genetics information to determine the correct scientific classification of organisms or to study variation in selective forces. Knowledge of approximately constant rate of molecular evolution in particular sets of lineages also facilitates establishing the dates of phylogenetic events, including those not documented by fossils, such as the divergence of living taxa and the formation of the phylogenetic tree. In these cases—especially over long stretches of time—the limitations of MCH (above) must be considered; such estimates may be off by 50% or more.

Introduction to entropy

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Introduct...