A Medley of Potpourri

Thursday, October 25, 2018

Fossil

From Wikipedia, the free encyclopedia

A fossil (from Classical Latin fossilis; literally, "obtained by digging") is any preserved remains, impression, or trace of any once-living thing from a past geological age. Examples include bones, shells, exoskeletons, stone imprints of animals or microbes, objects preserved in amber, hair, petrified wood, oil, coal, and DNA remnants. The totality of fossils is known as the fossil record.

Paleontology is the study of fossils: their age, method of formation, and evolutionary significance. Specimens are usually considered to be fossils if they are over 10,000 years old. The oldest fossils are between about 3.48 billion and 4.1 billion years old. The observation in the 19th century that certain fossils were associated with certain rock strata led to the recognition of a geological timescale and the relative ages of different fossils. The development of radiometric dating techniques in the early 20th century allowed scientists to quantitatively measure the absolute ages of rocks and the fossils they host.

There are many processes that lead to fossilization, including permineralization, casts and molds, authigenic mineralization, replacement and recrystallization, adpression, carbonization, and bioimmuration.

Fossil of a Seymouria (extinct)

Fossils vary in size from bacteria one micrometer across to dinosaurs and trees many meters long and weighing many tons. A fossil normally preserves only a portion of the deceased organism, usually that portion that was partially mineralized during life, such as the bones and teeth of vertebrates, or the chitinous or calcareous exoskeletons of invertebrates. Fossils may also consist of the marks left behind by the organism while it was alive, such as animal tracks or feces (coprolites). These types of fossil are called trace fossils or ichnofossils, as opposed to body fossils. Some fossils are biochemical and are called chemofossils or biosignatures.

Fossilization processes

The process of fossilization varies according to tissue type and external conditions.

Permineralization

Silicified (replaced with silica) fossils from the Road Canyon Formation (Middle Permian of Texas)

Permineralization is a process of fossilization that occurs when an organism is buried. The empty spaces within an organism (spaces filled with liquid or gas during life) become filled with mineral-rich groundwater. Minerals precipitate from the groundwater, occupying the empty spaces. This process can occur in very small spaces, such as within the cell wall of a plant cell. Small scale permineralization can produce very detailed fossils. For permineralization to occur, the organism must become covered by sediment soon after death or soon after the initial decay process. The degree to which the remains are decayed when covered determines the later details of the fossil. Some fossils consist only of skeletal remains or teeth; other fossils contain traces of skin, feathers or even soft tissues. This is a form of diagenesis.

Casts and molds

External mold of a bivalve from the Logan Formation, Lower Carboniferous, Ohio

In some cases the original remains of the organism completely dissolve or are otherwise destroyed. The remaining organism-shaped hole in the rock is called an external mold. If this hole is later filled with other minerals, it is a cast. An endocast or internal mold is formed when sediments or minerals fill the internal cavity of an organism, such as the inside of a bivalve or snail or the hollow of a skull.

Authigenic mineralization

This is a special form of cast and mold formation. If the chemistry is right, the organism (or fragment of organism) can act as a nucleus for the precipitation of minerals such as siderite, resulting in a nodule forming around it. If this happens rapidly before significant decay to the organic tissue, very fine three-dimensional morphological detail can be preserved. Nodules from the Carboniferous Mazon Creek fossil beds of Illinois, USA, are among the best documented examples of such mineralization.

Replacement and recrystallization

Recrystallized scleractinian coral (aragonite to calcite) from the Jurassic of southern Israel

Replacement occurs when the shell, bone or other tissue is replaced with another mineral. In some cases mineral replacement of the original shell occurs so gradually and at such fine scales that microstructural features are preserved despite the total loss of original material. A shell is said to be recrystallized when the original skeletal compounds are still present but in a different crystal form, as from aragonite to calcite.

Adpression (compression-impression)

Compression fossils, such as those of fossil ferns, are the result of chemical reduction of the complex organic molecules composing the organism's tissues. In this case the fossil consists of original material, albeit in a geochemically altered state. This chemical change is an expression of diagenesis. Often what remains is a carbonaceous film known as a phytoleim, in which case the fossil is known as a compression. Often, however, the phytoleim is lost and all that remains is an impression of the organism in the rock—an impression fossil. In many cases, however, compressions and impressions occur together. For instance, when the rock is broken open, the phytoleim will often be attached to one part (compression), whereas the counterpart will just be an impression. For this reason, one term covers the two modes of preservation: adpression.

Soft tissue, cell and molecular preservation

The discovery of soft tissue in dinosaur fossils, including blood vessels, together with the isolation of proteins and evidence for DNA fragments, has been an unexpected exception, given the fossils' antiquity, to the alteration of an organism's tissues by chemical reduction of complex organic molecules during fossilization. In 2014, Mary Schweitzer and her colleagues reported the presence of iron particles (goethite, α-FeO(OH)) associated with soft tissues recovered from dinosaur fossils. Based on various experiments that studied the interaction of iron in haemoglobin with blood vessel tissue, they proposed that solution hypoxia coupled with iron chelation enhances the stability and preservation of soft tissue and provides the basis for an explanation for the unforeseen preservation of fossil soft tissues. However, a slightly older study based on eight taxa ranging in time from the Devonian to the Jurassic found that reasonably well-preserved fibrils that probably represent collagen were preserved in all these fossils, and that the quality of preservation depended mostly on the arrangement of the collagen fibers, with tight packing favoring good preservation. There seemed to be no correlation between geological age and quality of preservation, within that timeframe.

Carbonization

Carbonaceous films are thin coatings which consist predominantly of the chemical element carbon. The soft tissues of organisms are made largely of organic carbon compounds and during diagenesis under reducing conditions only a thin film of carbon residue is left which forms a silhouette of the original organism.

Bioimmuration

The star-shaped holes (Catellocaula vallata) in this Upper Ordovician bryozoan represent a soft-bodied organism preserved by bioimmuration in the bryozoan skeleton.

Bioimmuration occurs when a skeletal organism overgrows or otherwise subsumes another organism, preserving the latter, or an impression of it, within the skeleton. Usually it is a sessile skeletal organism, such as a bryozoan or an oyster, which grows along a substrate, covering other sessile sclerobionts. Sometimes the bioimmured organism is soft-bodied and is then preserved in negative relief as a kind of external mold. There are also cases where an organism settles on top of a living skeletal organism that grows upwards, preserving the settler in its skeleton. Bioimmuration is known in the fossil record from the Ordovician to the Recent.

Dating

Estimating dates

Paleontology seeks to map out how life evolved across geologic time. A substantial hurdle is the difficulty of working out fossil ages. Beds that preserve fossils typically lack the radioactive elements needed for radiometric dating. This technique is our only means of giving rocks greater than about 50 million years old an absolute age, and can be accurate to within 0.5% or better. Although radiometric dating requires careful laboratory work, its basic principle is simple: the rates at which various radioactive elements decay are known, and so the ratio of the radioactive element to its decay products shows how long ago the radioactive element was incorporated into the rock. Radioactive elements are common only in rocks with a volcanic origin, and so the only fossil-bearing rocks that can be dated radiometrically are volcanic ash layers, which may provide termini for the intervening sediments.
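The basic principle described above can be sketched numerically. This is a minimal illustration rather than a laboratory procedure: it assumes a closed system in which no decay product was present when the rock formed, and the function name `radiometric_age`, the half-life and the sample ratio are all hypothetical values chosen for the example.

```python
import math


def radiometric_age(daughter_to_parent, half_life_years):
    """Age implied by a measured ratio of decay product to radioactive parent.

    Illustrative assumptions: the rock has been a closed system and
    contained no daughter atoms when it formed.
    """
    if daughter_to_parent < 0 or half_life_years <= 0:
        raise ValueError("ratio must be non-negative and half-life positive")
    decay_constant = math.log(2) / half_life_years
    # The parent decays as P(t) = P0 * exp(-lambda * t) and D = P0 - P,
    # so D / P = exp(lambda * t) - 1, which rearranges to the line below.
    return math.log1p(daughter_to_parent) / decay_constant


# Hypothetical sample: equal amounts of daughter and parent correspond to
# one half-life; three times as much daughter corresponds to two.
HALF_LIFE = 1.0e9  # illustrative value, not a specific isotope
for ratio in (1.0, 3.0):
    print(ratio, radiometric_age(ratio, HALF_LIFE))
```

Real measurements must also correct for any decay product present at formation and for later loss or gain of either element, which is part of the careful laboratory work the text mentions.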

Stratigraphy

Consequently, palaeontologists rely on stratigraphy to date fossils. Stratigraphy is the science of deciphering the "layer-cake" that is the sedimentary record. Rocks normally form relatively horizontal layers, with each layer younger than the one underneath it. If a fossil is found between two layers whose ages are known, the fossil's age is claimed to lie between the two known ages. Because rock sequences are not continuous, but may be broken up by faults or periods of erosion, it is very difficult to match up rock beds that are not directly adjacent. However, fossils of species that survived for a relatively short time can be used to match isolated rocks: this technique is called biostratigraphy. For instance, the conodont Eoplacognathus pseudoplanus has a short range in the Middle Ordovician period. If rocks of unknown age have traces of E. pseudoplanus, they have a mid-Ordovician age. Such index fossils must be distinctive, be globally distributed and occupy a short time range to be useful. Misleading results are produced if the index fossils are incorrectly dated. Stratigraphy and biostratigraphy can in general provide only relative dating (A was before B), which is often sufficient for studying evolution. However, this is difficult for some time periods, because of the problems involved in matching rocks of the same age across continents. Family-tree relationships also help to narrow down the date when lineages first appeared. For instance, if fossils of B or C date to X million years ago and the calculated "family tree" says A was an ancestor of B and C, then A must have evolved earlier.
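The bracketing argument, in which a fossil found between two dated layers must be of intermediate age, can be written down in a few lines. The sketch below assumes an undisturbed sequence in which each layer is younger than the one beneath it; the function name `bracket_age` and the layer ages are hypothetical.

```python
def bracket_age(layer_above_ma, layer_below_ma):
    """Return (youngest, oldest) possible fossil age in millions of years.

    layer_above_ma: dated age of the layer lying above the fossil.
    layer_below_ma: dated age of the layer lying below the fossil.
    Assumes an undisturbed sequence, so the upper layer is the younger one.
    """
    if layer_above_ma > layer_below_ma:
        # An older layer on top suggests faulting or overturning.
        raise ValueError("upper layer is older than lower layer")
    return (layer_above_ma, layer_below_ma)


# Hypothetical ash layers dated at 460 and 470 million years.
youngest, oldest = bracket_age(460.0, 470.0)
print(f"fossil age lies between {youngest} and {oldest} million years")
```

The result is only as good as the dates of the bounding layers and the assumption that the sequence has not been disturbed.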

It is also possible to estimate how long ago two living clades diverged, in other words approximately how long ago their last common ancestor must have lived, by assuming that DNA mutations accumulate at a constant rate. These "molecular clocks", however, are fallible, and provide only approximate timing: for example, they are not sufficiently precise and reliable for estimating when the groups that feature in the Cambrian explosion first evolved, and estimates produced by different techniques may vary by a factor of two.
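As a rough sketch of the molecular clock idea, the time since two lineages split can be estimated from their sequence divergence and an assumed constant substitution rate. The function name `divergence_time`, the 2% divergence and the rates below are hypothetical, and the strict constant-rate assumption is exactly what makes such estimates fallible.

```python
def divergence_time(substitutions_per_site, rate_per_site_per_year):
    """Years since two lineages diverged under a strict molecular clock.

    Differences accumulate independently along both lineages, so the
    observed divergence is divided by twice the per-lineage rate.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return substitutions_per_site / (2.0 * rate_per_site_per_year)


# Hypothetical inputs: 2% divergence and two candidate rates.
for rate in (1.0e-9, 2.0e-9):
    years = divergence_time(0.02, rate)
    print(f"rate {rate:g}: about {years / 1e6:.1f} million years")
```

Doubling the assumed rate halves the estimated age, which illustrates how estimates from different calibrations can differ by a factor of two.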

Limitations

Some of the most remarkable gaps in the fossil record (as of October 2013) show a bias toward organisms with hard parts.

Organisms are only rarely preserved as fossils in the best of circumstances, and only a fraction of such fossils have been discovered. This is illustrated by the fact that the number of species known through the fossil record is less than 5% of the number of known living species, suggesting that the number of species known through fossils must be far less than 1% of all the species that have ever lived. Because of the specialized and rare circumstances required for a biological structure to fossilize, only a small percentage of life-forms can be expected to be represented in discoveries, and each discovery represents only a snapshot of the process of evolution. The transition itself can only be illustrated and corroborated by transitional fossils, which will never demonstrate an exact half-way point.
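The step from "less than 5% of known living species" to "far less than 1% of all species that have ever lived" relies on living species being only a small share of all species that have ever existed. The sketch below makes that arithmetic explicit; the function name and the 1% share are hypothetical illustrations, not figures taken from the text.

```python
def fossil_share_upper_bound(fossil_to_living_ratio, living_share_of_all):
    """Upper bound on fossil-known species as a share of all species ever.

    fossil_to_living_ratio: fossil species divided by known living species.
    living_share_of_all: assumed share of all species ever that are alive now.
    """
    return fossil_to_living_ratio * living_share_of_all


# 5% comes from the text; the 1% living share is an assumed value.
bound = fossil_share_upper_bound(0.05, 0.01)
print(f"upper bound: {bound:.4%} of all species that have ever lived")
```

Any living share below 20% already pushes the bound under 1%, so the conclusion does not depend on the exact value assumed.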

The fossil record is strongly biased toward organisms with hard parts, leaving most groups of soft-bodied organisms with little to no representation. It is replete with the mollusks, the vertebrates, the echinoderms, the brachiopods and some groups of arthropods.

Sites

Lagerstätten

Fossil sites with exceptional preservation, sometimes including preserved soft tissues, are known as Lagerstätten (German for "storage places"). These formations may have resulted from carcass burial in an anoxic environment with minimal bacteria, thus slowing decomposition. Lagerstätten span geological time from the Cambrian period to the present. Worldwide, some of the best examples of near-perfect fossilization are the Cambrian Maotianshan shales and Burgess Shale, the Devonian Hunsrück Slates, the Jurassic Solnhofen limestone, and the Carboniferous Mazon Creek localities.

Stromatolites

Lower Proterozoic stromatolites from Bolivia, South America

Stromatolites are layered accretionary structures formed in shallow water by the trapping, binding and cementation of sedimentary grains by biofilms of microorganisms, especially cyanobacteria. Stromatolites provide some of the most ancient fossil records of life on Earth, dating back more than 3.5 billion years.

Stromatolites were much more abundant in Precambrian times. While older, Archean fossil remains are presumed to be colonies of cyanobacteria, younger (that is, Proterozoic) fossils may be primordial forms of the eukaryote chlorophytes (that is, green algae). One genus of stromatolite very common in the geologic record is Collenia. The earliest stromatolite of confirmed microbial origin dates to 2.724 billion years ago.

A 2009 discovery provides strong evidence of microbial stromatolites extending as far back as 3.45 billion years ago.

Stromatolites are a major constituent of the fossil record for life's first 3.5 billion years, peaking about 1.25 billion years ago. They subsequently declined in abundance and diversity, which by the start of the Cambrian had fallen to 20% of their peak. The most widely supported explanation is that stromatolite builders fell victim to grazing creatures (the Cambrian substrate revolution), implying that sufficiently complex organisms were common over 1 billion years ago.

The connection between grazer and stromatolite abundance is well documented in the younger Ordovician evolutionary radiation; stromatolite abundance also increased after the end-Ordovician and end-Permian extinctions decimated marine animals, falling back to earlier levels as marine animals recovered. Fluctuations in metazoan population and diversity may not have been the only factor in the reduction in stromatolite abundance. Factors such as the chemistry of the environment may have been responsible for changes.

While prokaryotic cyanobacteria themselves reproduce asexually through cell division, they were instrumental in priming the environment for the evolutionary development of more complex eukaryotic organisms. Cyanobacteria (as well as extremophile Gammaproteobacteria) are thought to be largely responsible for increasing the amount of oxygen in the primeval earth's atmosphere through their continuing photosynthesis. Cyanobacteria use water, carbon dioxide and sunlight to create their food. A layer of mucus often forms over mats of cyanobacterial cells. In modern microbial mats, debris from the surrounding habitat can become trapped within the mucus, which can be cemented by calcium carbonate to grow thin laminations of limestone. These laminations can accrete over time, resulting in the banded pattern common to stromatolites. The domal morphology of biological stromatolites is the result of the vertical growth necessary for the continued infiltration of sunlight to the organisms for photosynthesis. Layered spherical growth structures termed oncolites are similar to stromatolites and are also known from the fossil record. Thrombolites are poorly laminated or non-laminated clotted structures formed by cyanobacteria, common in the fossil record and in modern sediments.

The Zebra River Canyon area of the Kubis platform in the deeply dissected Zaris Mountains of southwestern Namibia provides an extremely well exposed example of the thrombolite-stromatolite-metazoan reefs that developed during the Proterozoic period, the stromatolites here being better developed in updip locations under conditions of higher current velocities and greater sediment influx.

Types

Index

Examples of index fossils

Index fossils (also known as guide fossils, indicator fossils or zone fossils) are fossils used to define and identify geologic periods (or faunal stages). They work on the premise that, although different sediments may look different depending on the conditions under which they were deposited, they may include the remains of the same species of fossil. The shorter the species' time range, the more precisely different sediments can be correlated, and so rapidly evolving species' fossils are particularly valuable. The best index fossils are common, easy to identify at species level and have a broad distribution—otherwise the likelihood of finding and recognizing one in the two sediments is poor.
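The link between a species' time range and the precision of correlation can be illustrated by intersecting the ranges of the index fossils found together in one bed. This is a minimal sketch; the function name `constrain_age` and all ranges below are hypothetical and do not describe real species.

```python
def constrain_age(ranges_ma):
    """Intersect the time ranges of index fossils found in the same bed.

    ranges_ma: list of (oldest_ma, youngest_ma) tuples, in millions of
    years ago, so the larger number is the older bound.
    Returns the overlapping (oldest, youngest) interval, or None if the
    ranges do not overlap.
    """
    if not ranges_ma:
        raise ValueError("at least one range is required")
    oldest = min(r[0] for r in ranges_ma)    # the bed is no older than this
    youngest = max(r[1] for r in ranges_ma)  # and no younger than this
    if oldest < youngest:
        return None
    return (oldest, youngest)


# Two long-ranging hypothetical species leave a 10-million-year window.
print(constrain_age([(470.0, 455.0), (465.0, 450.0)]))
# Adding a short-ranged species narrows the window to its own range.
print(constrain_age([(470.0, 455.0), (465.0, 450.0), (463.0, 461.0)]))
```

A result of `None` would suggest that a fossil has been misidentified, reworked from older rock, or assigned an incorrect range.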

Trace

Cambrian trace fossils including Rusophycus, made by a trilobite

A coprolite of a carnivorous dinosaur found in southwestern Saskatchewan

Trace fossils consist mainly of tracks and burrows, but also include coprolites (fossil feces) and marks left by feeding. Trace fossils are particularly significant because they represent a data source that is not limited to animals with easily fossilized hard parts, and they reflect animal behaviours. Many traces date from significantly earlier than the body fossils of animals that are thought to have been capable of making them. Whilst exact assignment of trace fossils to their makers is generally impossible, traces may for example provide the earliest physical evidence of the appearance of moderately complex animals (comparable to earthworms).

Coprolites are classified as trace fossils as opposed to body fossils, as they give evidence for the animal's behaviour (in this case, diet) rather than morphology. They were first described by William Buckland in 1829. Prior to this they were known as "fossil fir cones" and "bezoar stones." They serve a valuable purpose in paleontology because they provide direct evidence of the predation and diet of extinct organisms. Coprolites may range in size from a few millimetres to over 60 centimetres.

Transitional

A transitional fossil is any fossilized remains of a life form that exhibits traits common to both an ancestral group and its derived descendant group. This is especially important where the descendant group is sharply differentiated by gross anatomy and mode of living from the ancestral group. Because of the incompleteness of the fossil record, there is usually no way to know exactly how close a transitional fossil is to the point of divergence. These fossils serve as a reminder that taxonomic divisions are human constructs that have been imposed in hindsight on a continuum of variation.

Microfossils

Microfossils about 1 mm

Microfossil is a descriptive term applied to fossilized plants and animals whose size is just at or below the level at which the fossil can be analyzed by the naked eye. A commonly applied cutoff point between "micro" and "macro" fossils is 1 mm. Microfossils may either be complete (or near-complete) organisms in themselves (such as the marine plankters foraminifera and coccolithophores) or component parts (such as small teeth or spores) of larger animals or plants. Microfossils are of critical importance as a reservoir of paleoclimate information, and are also commonly used by biostratigraphers to assist in the correlation of rock units.

Resin

Leptofoenus pittfieldae trapped in Dominican amber, from 20 to 16 million years ago

Fossil resin (colloquially called amber) is a natural polymer found in many types of strata throughout the world, even the Arctic. The oldest fossil resin dates to the Triassic, though most dates to the Cenozoic. The excretion of the resin by certain plants is thought to be an evolutionary adaptation for protection from insects and to seal wounds. Fossil resin often contains other fossils called inclusions that were captured by the sticky resin. These include bacteria, fungi, other plants, and animals. Animal inclusions are usually small invertebrates, predominantly arthropods such as insects and spiders, and only extremely rarely a vertebrate such as a small lizard. Preservation of inclusions can be exquisite, including small fragments of DNA.

Derived

Eroded Jurassic plesiosaur vertebral centrum found in the Lower Cretaceous Faringdon Sponge Gravels in Faringdon, England. An example of a remanié fossil

A derived, reworked or remanié fossil is a fossil found in rock that accumulated significantly later than when the fossilized animal or plant died. Reworked fossils are created when erosion exhumes (frees) fossils from the rock formation in which they were originally deposited and they are redeposited in a younger sedimentary deposit.

Wood

Petrified wood. The internal structure of the tree and bark are maintained in the permineralization process

Polished section of petrified wood showing annual rings

Fossil wood is wood that is preserved in the fossil record. Wood is usually the part of a plant that is best preserved (and most easily found). Fossil wood may or may not be petrified. The fossil wood may be the only part of the plant that has been preserved: therefore such wood may get a special kind of botanical name. This will usually include "xylon" and a term indicating its presumed affinity, such as Araucarioxylon (wood of Araucaria or some related genus), Palmoxylon (wood of an indeterminate palm), or Castanoxylon (wood of an indeterminate chinkapin).

Subfossil

A subfossil dodo skeleton

The term subfossil can be used to refer to remains, such as bones, nests, or defecations, whose fossilization process is not complete, either because the length of time since the animal involved was living is too short (less than 10,000 years) or because the conditions in which the remains were buried were not optimal for fossilization. Subfossils are often found in caves or other shelters where they can be preserved for thousands of years. The main importance of subfossil vs. fossil remains is that the former contain organic material, which can be used for radiocarbon dating or extraction and sequencing of DNA, protein, or other biomolecules. Additionally, isotope ratios can provide much information about the ecological conditions under which extinct animals lived. Subfossils are useful for studying the evolutionary history of an environment and can be important to studies in paleoclimatology.

Subfossils are often found in depositional environments, such as lake sediments, oceanic sediments, and soils. Once deposited, physical and chemical weathering can alter the state of preservation.

Chemical fossils

Chemical fossils, or chemofossils, are chemicals found in rocks and fossil fuels (petroleum, coal, and natural gas) that provide an organic signature for ancient life. Molecular fossils and isotope ratios represent two types of chemical fossils. The oldest traces of life on Earth are fossils of this type, including carbon isotope anomalies found in zircons that imply the existence of life as early as 4.1 billion years ago.

Astrobiology

It has been suggested that biominerals could be important indicators of extraterrestrial life and thus could play an important role in the search for past or present life on the planet Mars. Furthermore, organic components (biosignatures) that are often associated with biominerals are believed to play crucial roles in both pre-biotic and biotic reactions.

On 24 January 2014, NASA reported that current studies by the Curiosity and Opportunity rovers on Mars will now be searching for evidence of ancient life, including a biosphere based on autotrophic, chemotrophic and/or chemolithoautotrophic microorganisms, as well as ancient water, including fluvio-lacustrine environments (plains related to ancient rivers or lakes) that may have been habitable. The search for evidence of habitability, taphonomy (related to fossils), and organic carbon on the planet Mars is now a primary NASA objective.

Pseudofossils

An example of a pseudofossil: Manganese dendrites on a limestone bedding plane from Solnhofen, Germany; scale in mm

Pseudofossils are visual patterns in rocks that are produced by geologic processes rather than biologic processes. They can easily be mistaken for real fossils. Some pseudofossils, such as dendrites, are formed by naturally occurring fissures in the rock that get filled up by percolating minerals. Other types of pseudofossils are kidney ore (round shapes in iron ore) and moss agates, which look like moss or plant leaves. Concretions, spherical or ovoid-shaped nodules found in some sedimentary strata, were once thought to be dinosaur eggs, and are often mistaken for fossils as well.

History of the study of fossils

Gathering fossils dates at least to the beginning of recorded history. The fossils themselves are referred to as the fossil record. The fossil record was one of the early sources of data underlying the study of evolution and continues to be relevant to the history of life on Earth. Paleontologists examine the fossil record to understand the process of evolution and the way particular species have evolved.

Before Darwin

Many early explanations relied on folktales or mythologies. In China the fossil bones of ancient mammals including Homo erectus were often mistaken for "dragon bones" and used as medicine and aphrodisiacs. In addition, some of these fossil bones were collected as "art" by scholars, who left inscriptions on them recording when they acquired the piece. One good example is the famous scholar Huang Tingjian of the Northern Song dynasty during the 11th century, who kept a seashell fossil with his poem engraved on it. In the West fossilized sea creatures on mountainsides were seen as proof of the biblical deluge.

In 1027, the Persian Avicenna explained fossils' stoniness in The Book of Healing:

If what is said concerning the petrifaction of animals and plants is true, the cause of this (phenomenon) is a powerful mineralizing and petrifying virtue which arises in certain stony spots, or emanates suddenly from the earth during earthquake and subsidences, and petrifies whatever comes into contact with it. As a matter of fact, the petrifaction of the bodies of plants and animals is not more extraordinary than the transformation of waters.

The Greek scholar Aristotle realized that fossil seashells from rocks were similar to those found on the beach, indicating the fossils were once living animals. Aristotle previously explained it in terms of vaporous exhalations, which Avicenna modified into the theory of petrifying fluids (succus lapidificatus), later elaborated by Albert of Saxony in the 14th century and accepted in some form by most naturalists by the 16th century.

More scientific views of fossils emerged during the Renaissance. Leonardo da Vinci concurred with Aristotle's view that fossils were the remains of ancient life. For example, da Vinci noticed discrepancies with the biblical flood narrative as an explanation for fossil origins:

If the Deluge had carried the shells for distances of three and four hundred miles from the sea it would have carried them mixed with various other natural objects all heaped up together; but even at such distances from the sea we see the oysters all together and also the shellfish and the cuttlefish and all the other shells which congregate together, found all together dead; and the solitary shells are found apart from one another as we see them every day on the sea-shores.

And we find oysters together in very large families, among which some may be seen with their shells still joined together, indicating that they were left there by the sea and that they were still living when the strait of Gibraltar was cut through. In the mountains of Parma and Piacenza multitudes of shells and corals with holes may be seen still sticking to the rocks....

Ichthyosaurus and Plesiosaurus from the 1834 Czech edition of Cuvier's Discours sur les revolutions de la surface du globe

Robert Hooke (1635–1703) included micrographs of fossils in his Micrographia and was among the first to observe fossil forams. His observations on fossils, which he stated to be the petrified remains of creatures some of which no longer existed, were published posthumously in 1705.

William Smith (1769–1839), an English canal engineer, observed that rocks of different ages (based on the law of superposition) preserved different assemblages of fossils, and that these assemblages succeeded one another in a regular and determinable order. He observed that rocks from distant locations could be correlated based on the fossils they contained. He termed this the principle of faunal succession. This principle became one of Darwin's chief pieces of evidence that biological evolution was real.

Georges Cuvier came to believe that most if not all the animal fossils he examined were remains of extinct species. This led Cuvier to become an active proponent of the geological school of thought called catastrophism. Near the end of his 1796 paper on living and fossil elephants he said:

All of these facts, consistent among themselves, and not opposed by any report, seem to me to prove the existence of a world previous to ours, destroyed by some kind of catastrophe.

Interest in fossils, and geology more generally, expanded during the early nineteenth century. In Britain, Mary Anning's discoveries of fossils, including the first complete ichthyosaur and a complete plesiosaurus skeleton, sparked both public and scholarly interest.

Linnaeus and Darwin

Early naturalists well understood the similarities and differences of living species, leading Linnaeus to develop a hierarchical classification system still in use today. Darwin and his contemporaries first linked the hierarchical structure of the tree of life with the then very sparse fossil record. Darwin eloquently described a process of descent with modification, or evolution, whereby organisms either adapt to natural and changing environmental pressures, or they perish.

When Darwin wrote On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life, the oldest animal fossils were those from the Cambrian Period, now known to be about 540 million years old. He worried about the absence of older fossils because of the implications for the validity of his theories, but he expressed hope that such fossils would be found, noting that "only a small portion of the world is known with accuracy." Darwin also pondered the sudden appearance of many groups (i.e. phyla) in the oldest known Cambrian fossiliferous strata.

After Darwin

Since Darwin's time, the fossil record has been extended to between 2.3 and 3.5 billion years. Most of these Precambrian fossils are microscopic bacteria or microfossils. However, macroscopic fossils are now known from the late Proterozoic. The Ediacara biota (also called Vendian biota) dating from 575 million years ago collectively constitutes a richly diverse assembly of early multicellular eukaryotes.

The fossil record and faunal succession form the basis of the science of biostratigraphy, or determining the age of rocks based on embedded fossils. For the first 150 years of geology, biostratigraphy and superposition were the only means for determining the relative age of rocks. The geologic time scale was developed based on the relative ages of rock strata as determined by the early paleontologists and stratigraphers.

Since the early years of the twentieth century, absolute dating methods, such as radiometric dating (including potassium/argon, argon/argon, uranium series, and, for very recent fossils, radiocarbon dating) have been used to verify the relative ages obtained by fossils and to provide absolute ages for many fossils. Radiometric dating has shown that the earliest known stromatolites are over 3.4 billion years old.

Modern era

The fossil record is life's evolutionary epic that unfolded over four billion years as environmental conditions and genetic potential interacted in accordance with natural selection.

The Virtual Fossil Museum

Paleontology has joined with evolutionary biology to share the interdisciplinary task of outlining the tree of life, which inevitably leads backwards in time to Precambrian microscopic life when cell structure and functions evolved. Earth's deep time in the Proterozoic and deeper still in the Archean is only "recounted by microscopic fossils and subtle chemical signals." Molecular biologists, using phylogenetics, can compare protein amino acid or nucleotide sequence homology (i.e., similarity) to evaluate taxonomy and evolutionary distances among organisms, with limited statistical confidence. The study of fossils, on the other hand, can more specifically pinpoint when and in what organism a mutation first appeared. Phylogenetics and paleontology work together in the clarification of science's still dim view of the appearance of life and its evolution.

Phacopid trilobite Eldredgeops rana crassituberculata. The genus is named after Niles Eldredge

Crinoid columnals (Isocrinus nicoleti) from the Middle Jurassic Carmel Formation at Mount Carmel Junction, Utah

Niles Eldredge's study of the Phacops trilobite genus supported the hypothesis that modifications to the arrangement of the trilobite's eye lenses proceeded by fits and starts over millions of years during the Devonian. Eldredge's interpretation of the Phacops fossil record was that the aftermaths of the lens changes, but not the rapidly occurring evolutionary process, were fossilized. This and other data led Stephen Jay Gould and Niles Eldredge to publish their seminal paper on punctuated equilibrium in 1972.

Synchrotron X-ray tomographic analysis of early Cambrian bilaterian embryonic microfossils yielded new insights into metazoan evolution at its earliest stages. The tomography technique provides previously unattainable three-dimensional resolution at the limits of fossilization. Fossils of two enigmatic bilaterians, the worm-like Markuelia and a putative, primitive protostome, Pseudooides, provide a peek at germ layer embryonic development. These 543-million-year-old embryos support the emergence of some aspects of arthropod development earlier than previously thought in the late Proterozoic. The preserved embryos from China and Siberia underwent rapid diagenetic phosphatization resulting in exquisite preservation, including cell structures. This research is a notable example of how knowledge encoded by the fossil record continues to contribute otherwise unattainable information on the emergence and development of life on Earth. For example, the research suggests Markuelia has closest affinity to priapulid worms, and is adjacent to the evolutionary branching of Priapulida, Nematoda and Arthropoda.

Trading and collecting

Fossil trading is the practice of buying and selling fossils. This is often done illegally with artifacts stolen from research sites, costing many important scientific specimens each year. The problem is quite pronounced in China, where many specimens have been stolen.

Fossil collecting (sometimes, in a non-scientific sense, fossil hunting) is the collection of fossils for scientific study, hobby, or profit. Fossil collecting, as practiced by amateurs, is the predecessor of modern paleontology, and many people still collect and study fossils as amateurs. Professionals and amateurs alike collect fossils for their scientific value.

Gallery

Three small ammonite fossils, each approximately 1.5 cm across
Eocene fossil fish Priscacara liops from the Green River Formation of Wyoming
A permineralized trilobite, Asaphus kowalewskii
Megalodon and Carcharodontosaurus teeth. The latter was found in the Sahara Desert
Fossil shrimp (Cretaceous)
Petrified wood in Petrified Forest National Park, Arizona
Petrified cone of Araucaria mirabilis from Patagonia, Argentina dating from the Jurassic Period (approx. 210 Ma)
A fossil gastropod from the Pliocene of Cyprus. A serpulid worm is attached
Silurian Orthoceras fossil
Eocene fossil flower from Florissant, Colorado
Micraster echinoid fossil from England
Productid brachiopod ventral valve; Roadian, Guadalupian (Middle Permian); Glass Mountains, Texas
Agatized coral from the Hawthorn Group (Oligocene–Miocene), Florida. An example of preservation by replacement
Fossils from beaches of the Baltic Sea island of Gotland, placed on paper with 7 mm (0.28 inch) squares

Computational phylogenetics

From Wikipedia, the free encyclopedia

Computational phylogenetics is the application of computational algorithms, methods, and programs to phylogenetic analyses. The goal is to assemble a phylogenetic tree representing a hypothesis about the evolutionary ancestry of a set of genes, species, or other taxa. For example, these techniques have been used to explore the family tree of hominid species and the relationships between specific genes shared by many types of organisms. Traditional phylogenetics relies on morphological data obtained by measuring and quantifying the phenotypic properties of representative organisms, while the more recent field of molecular phylogenetics uses nucleotide sequences encoding genes or amino acid sequences encoding proteins as the basis for classification. Many forms of molecular phylogenetics are closely related to and make extensive use of sequence alignment in constructing and refining phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The phylogenetic trees constructed by computational methods are unlikely to perfectly reproduce the evolutionary tree that represents the historical relationships between the species being analyzed. The historical species tree may also differ from the historical tree of an individual homologous gene shared by those species.

Types of phylogenetic trees and networks

Phylogenetic trees generated by computational phylogenetics can be either rooted or unrooted depending on the input data and the algorithm used. A rooted tree is a directed graph that explicitly identifies a most recent common ancestor (MRCA), usually an imputed sequence that is not represented in the input. Genetic distance measures can be used to plot a tree with the input sequences as leaf nodes and their distances from the root proportional to their genetic distance from the hypothesized MRCA. Identification of a root usually requires the inclusion in the input data of at least one "outgroup" known to be only distantly related to the sequences of interest.

By contrast, unrooted trees plot the distances and relationships between input sequences without making assumptions regarding their descent. An unrooted tree can always be produced from a rooted tree, but a root cannot usually be placed on an unrooted tree without additional data on divergence rates, such as the assumption of the molecular clock hypothesis.

The set of all possible phylogenetic trees for a given group of input sequences can be conceptualized as a discretely defined multidimensional "tree space" through which search paths can be traced by optimization algorithms. Although counting the total number of trees for a nontrivial number of input sequences can be complicated by variations in the definition of a tree topology, it is always true that there are more rooted than unrooted trees for a given number of inputs and choice of parameters.
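As a rough illustration of how quickly tree space grows, the sketch below counts trees under one common convention: strictly bifurcating trees with labeled leaves, for which the number of unrooted topologies is (2n-5)!! and the number of rooted topologies is (2n-3)!!. Other definitions of a tree topology give different counts; the function names are illustrative only.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 or 2; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_leaves: int) -> int:
    """Number of unrooted, strictly bifurcating trees on n labeled leaves."""
    if n_leaves < 3:
        return 1
    return double_factorial(2 * n_leaves - 5)


def count_rooted_trees(n_leaves: int) -> int:
    """Number of rooted, strictly bifurcating trees on n labeled leaves."""
    if n_leaves < 2:
        return 1
    return double_factorial(2 * n_leaves - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Under this convention, every unrooted tree on n leaves has 2n-3 branches on which a root could be placed, which is why the rooted count is always the larger of the two.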

Both rooted and unrooted phylogenetic trees can be further generalized to rooted or unrooted phylogenetic networks, which allow for the modeling of evolutionary phenomena such as hybridization or horizontal gene transfer.

Coding characters and defining homology

Morphological analysis

The basic problem in morphological phylogenetics is the assembly of a matrix representing a mapping from each of the taxa being compared to representative measurements for each of the phenotypic characteristics being used as a classifier. The types of phenotypic data used to construct this matrix depend on the taxa being compared; for individual species, they may involve measurements of average body size, lengths or sizes of particular bones or other physical features, or even behavioral manifestations. Of course, since not every possible phenotypic characteristic could be measured and encoded for analysis, the selection of which features to measure is a major inherent obstacle to the method. The decision of which traits to use as a basis for the matrix necessarily represents a hypothesis about which traits of a species or higher taxon are evolutionarily relevant. Morphological studies can be confounded by examples of convergent evolution of phenotypes. A major challenge in constructing useful classes is the high likelihood of inter-taxon overlap in the distribution of the phenotype's variation. The inclusion of extinct taxa in morphological analysis is often difficult due to absence of or incomplete fossil records, but has been shown to have a significant effect on the trees produced; in one study only the inclusion of extinct species of apes produced a morphologically derived tree that was consistent with that produced from molecular data.

Some phenotypic classifications, particularly those used when analyzing very diverse groups of taxa, are discrete and unambiguous; classifying organisms as possessing or lacking a tail, for example, is straightforward in the majority of cases, as is counting features such as eyes or vertebrae. However, the most appropriate representation of continuously varying phenotypic measurements is a controversial problem without a general solution. A common method is simply to sort the measurements of interest into two or more classes, rendering continuous observed variation as discretely classifiable (e.g., all examples with humerus bones longer than a given cutoff are scored as members of one state, and all members whose humerus bones are shorter than the cutoff are scored as members of a second state). This results in an easily manipulated data set but has been criticized for poor reporting of the basis for the class definitions and for sacrificing information compared to methods that use a continuous weighted distribution of measurements.
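The cutoff-based scoring described above can be sketched in a few lines. The taxon names, measurements, and cutoff are invented for illustration, and the handling of a value exactly equal to a cutoff (scored into the lower state here) is an assumption that a real study would need to state explicitly.

```python
import bisect


def score_states(measurements, cutoffs):
    """Map each continuous measurement to a discrete state index.

    measurements: dict of taxon -> measured value
    cutoffs: ascending list of boundaries; a value strictly greater than
             k of the cutoffs is scored as state k.
    """
    return {
        taxon: bisect.bisect_left(cutoffs, value)
        for taxon, value in measurements.items()
    }


if __name__ == "__main__":
    # Hypothetical humerus lengths in centimetres.
    humerus_cm = {"taxon_a": 28.5, "taxon_b": 31.2, "taxon_c": 30.0}
    print(score_states(humerus_cm, cutoffs=[30.0]))
```

Recording the cutoffs alongside the resulting matrix addresses one of the criticisms mentioned above, namely that the basis for class definitions is often poorly reported.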

Because morphological data is extremely labor-intensive to collect, whether from literature sources or from field observations, reuse of previously compiled data matrices is not uncommon, although this may propagate flaws in the original matrix into multiple derivative analyses.

Molecular analysis

The problem of character coding is very different in molecular analyses, as the characters in biological sequence data are immediate and discretely defined - distinct nucleotides in DNA or RNA sequences and distinct amino acids in protein sequences. However, defining homology can be challenging due to the inherent difficulties of multiple sequence alignment. For a given gapped MSA, several rooted phylogenetic trees can be constructed that vary in their interpretations of which changes are "mutations" versus ancestral characters, and which events are insertion mutations or deletion mutations. For example, given only a pairwise alignment with a gap region, it is impossible to determine whether one sequence bears an insertion mutation or the other carries a deletion. The problem is magnified in MSAs with unaligned and nonoverlapping gaps. In practice, sizable regions of a calculated alignment may be discounted in phylogenetic tree construction to avoid integrating noisy data into the tree calculation.

Distance-matrix methods

Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore they require an MSA as an input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches. Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of multiple sequence alignments. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.
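A minimal sketch of the mismatch-fraction distance and the all-to-all matrix follows. It assumes the input is already an MSA given as equal-length strings with "-" marking gaps; the function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gaps_as_mismatch=False):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            # A gap in exactly one sequence counts as a mismatch only on request;
            # columns where both sequences have a gap are always skipped.
            if gaps_as_mismatch and a != b:
                compared += 1
                mismatches += 1
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def distance_matrix(alignment, gaps_as_mismatch=False):
    """Return a symmetric dict-of-dicts of pairwise distances."""
    matrix = {name: {name: 0.0} for name in alignment}
    for a, b in combinations(alignment, 2):
        d = p_distance(alignment[a], alignment[b], gaps_as_mismatch)
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


if __name__ == "__main__":
    msa = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACG-TCGA"}
    for name, row in distance_matrix(msa).items():
        print(name, row)
```

The two gap policies can give noticeably different distances for the same pair, so the choice should be reported with any tree built from the matrix.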

UPGMA and WPGMA

The UPGMA (Unweighted Pair Group Method with Arithmetic mean) and WPGMA (Weighted Pair Group Method with Arithmetic mean) methods produce rooted trees and require a constant-rate assumption - that is, they assume an ultrametric tree in which the distances from the root to every branch tip are equal.
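The following is a small sketch of UPGMA clustering, not a reference implementation. It assumes a symmetric distance matrix given as a list of lists, represents the tree as nested tuples, and places each merged node at half the distance between the clusters it joins. The function name and the example matrix are illustrative.

```python
def upgma(labels, matrix):
    """Cluster labels by UPGMA; return (nested-tuple tree, root height)."""
    # Active clusters: id -> (subtree, number of leaves, node height).
    clusters = {i: (label, 1, 0.0) for i, label in enumerate(labels)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {
        (i, j): matrix[i][j]
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
    }
    next_id = len(labels)
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        tree_a, size_a, _ = clusters.pop(a)
        tree_b, size_b, _ = clusters.pop(b)
        # Distance to every other cluster is the size-weighted mean, so each
        # original leaf contributes equally ("unweighted" in the UPGMA sense).
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {p: v for p, v in dist.items() if a not in p and b not in p}
        for c, v in new_dist.items():
            dist[(c, next_id)] = v  # new ids are always the largest
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b, d_ab / 2)
        next_id += 1
    (tree, _, height), = clusters.values()
    return tree, height


if __name__ == "__main__":
    names = ["A", "B", "C", "D"]
    d = [
        [0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0],
    ]
    print(upgma(names, d))
```

WPGMA differs only in the update step: it would use the plain average (d_ac + d_bc) / 2 regardless of cluster sizes.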

Neighbor-joining

Neighbor-joining methods apply general cluster analysis techniques to sequence analysis using genetic distance as a clustering metric. The simple neighbor-joining method produces unrooted trees, and it does not assume a constant rate of evolution (i.e., a molecular clock) across lineages.
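To make the clustering step concrete, here is a sketch of a single pair-selection step in the commonly described form of neighbor-joining: for n taxa with row sums r, it chooses the pair minimizing Q(i, j) = (n - 2) d(i, j) - r_i - r_j and assigns the two new branch lengths. A full run would repeat this after replacing the pair with a new node. The function name and example matrix are illustrative.

```python
def nj_select_pair(labels, d):
    """Pick the pair of taxa to join next and the branch lengths to the new node.

    labels: list of taxon names
    d: symmetric distance matrix as a list of lists
    Returns (label_i, label_j, length_i, length_j).
    """
    n = len(labels)
    if n < 3:
        raise ValueError("pair selection needs at least three taxa")
    row_sums = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # The branch lengths need not be equal, which is how the method avoids
    # assuming a constant rate across lineages.
    length_i = d[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return labels[i], labels[j], length_i, length_j


if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]
    dist = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_select_pair(taxa, dist))
```

Note that the pair chosen is not simply the closest pair: in the example, d and e are the closest taxa, yet a and b are joined first because the criterion also accounts for how far each taxon is from everything else.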

Fitch-Margoliash method

The Fitch-Margoliash method uses a weighted least squares method for clustering based on genetic distance. Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The distances calculated by this method must be linear; the linearity criterion for distances requires that the expected values of the branch lengths for two individual branches must equal the expected value of the sum of the two branch distances - a property that applies to biological sequences only when they have been corrected for the possibility of back mutations at individual sites. This correction is done through the use of a substitution matrix such as that derived from the Jukes-Cantor model of DNA evolution. The distance correction is only necessary in practice when the evolution rates differ among branches. Another modification of the algorithm can be helpful, especially in the case of concentrated distances (see the concentration of measure phenomenon and the curse of dimensionality); that modification has been shown to improve the efficiency of the algorithm and its robustness.

The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is NP-complete, so heuristic search methods like those used in maximum-parsimony analysis are applied to the search through tree space.
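The weighted least-squares criterion can be written as a scoring function over pairs of sequences. The sketch below assumes the usual weighting in which each squared error is divided by the observed distance raised to a power, with a power of 2 giving the Fitch-Margoliash weighting and a power of 0 giving unweighted least squares. Computing the tree-implied ("fitted") distances and searching over trees are outside its scope, and all names and numbers are illustrative.

```python
def weighted_least_squares_score(observed, fitted, power=2):
    """Sum over pairs of (observed - fitted)^2 / observed^power.

    observed: dict of (taxon, taxon) -> distance measured from the sequences
    fitted:   dict with the same keys -> path length between the taxa on a tree
    """
    total = 0.0
    for pair, d_obs in observed.items():
        d_fit = fitted[pair]
        # Dividing by a power of the observed distance down-weights errors on
        # large, less reliable distances.
        total += (d_obs - d_fit) ** 2 / d_obs ** power
    return total


if __name__ == "__main__":
    obs = {("a", "b"): 2.0, ("a", "c"): 4.0}
    fit = {("a", "b"): 1.0, ("a", "c"): 4.0}
    print(weighted_least_squares_score(obs, fit))
    print(weighted_least_squares_score(obs, fit, power=0))
```

A tree search would evaluate this score for each candidate topology after fitting its branch lengths and keep the topology with the lowest value.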

Using outgroups

Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one outgroup sequence known to be only distantly related to the sequences of interest in the query set. This usage can be seen as a type of experimental control. If the outgroup has been appropriately chosen, it will have a much greater genetic distance and thus a longer branch length than any other sequence, and it will appear near the root of a rooted tree. Choosing an appropriate outgroup requires the selection of a sequence that is moderately related to the sequences of interest; too close a relationship defeats the purpose of the outgroup and too distant adds noise to the analysis. Care should also be taken to avoid situations in which the species from which the sequences were taken are distantly related, but the gene encoded by the sequences is highly conserved across lineages. Horizontal gene transfer, especially between otherwise divergent bacteria, can also confound outgroup usage.

Maximum parsimony

Maximum parsimony (MP) is a method of identifying the potential phylogenetic tree that requires the smallest total number of evolutionary events to explain the observed sequence data. Some ways of scoring trees also include a "cost" associated with particular types of evolutionary events and attempt to locate the tree with the smallest total cost. This is a useful approach in cases where not every possible type of event is equally likely - for example, when particular nucleotides or amino acids are known to be more mutable than others.

The most naive way of identifying the most parsimonious tree is simple enumeration - considering each possible tree in succession and searching for the tree with the smallest score. However, this is only possible for a relatively small number of sequences or species because the problem of identifying the most parsimonious tree is known to be NP-hard; consequently a number of heuristic search methods for optimization have been developed to locate a highly parsimonious tree, if not the best in the set. Most such methods involve a steepest descent-style minimization mechanism operating on a tree rearrangement criterion.
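For a fixed tree, the minimum number of changes at an unweighted character can be counted with Fitch's algorithm, and for four taxa the three possible unrooted topologies are few enough to enumerate directly. The sketch below assumes rooted, nested 2-tuples as the tree representation (the root position does not affect the Fitch count) and equal cost for every change; the names and the toy alignment are hypothetical.

```python
def fitch_site_score(tree, states):
    """Minimum number of state changes for one character on a binary tree."""
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # The children can share a state, so no extra change is needed here.
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]


def parsimony_score(tree, alignment):
    """Total Fitch score summed over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


def best_by_enumeration(candidate_trees, alignment):
    """Score every candidate tree and return (best tree, best score)."""
    scored = [(parsimony_score(t, alignment), t) for t in candidate_trees]
    best_score = min(score for score, _ in scored)
    best_tree = next(t for score, t in scored if score == best_score)
    return best_tree, best_score


if __name__ == "__main__":
    msa = {"A": "GA", "B": "GA", "C": "TC", "D": "TC"}
    four_taxon_trees = [
        (("A", "B"), ("C", "D")),
        (("A", "C"), ("B", "D")),
        (("A", "D"), ("B", "C")),
    ]
    print(best_by_enumeration(four_taxon_trees, msa))
```

Enumeration stops being practical very quickly as taxa are added, which is the motivation for the heuristic searches described above.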

Branch and bound

The branch and bound algorithm is a general method used to increase the efficiency of searches for near-optimal solutions of NP-hard problems; it was first applied to phylogenetics in the early 1980s. Branch and bound is particularly well suited to phylogenetic tree construction because it inherently requires dividing a problem into a tree structure as it subdivides the problem space into smaller regions. As its name implies, it requires as input both a branching rule (in the case of phylogenetics, the addition of the next species or sequence to the tree) and a bound (a rule that excludes certain regions of the search space from consideration, thereby assuming that the optimal solution cannot occupy that region). Identifying a good bound is the most challenging aspect of the algorithm's application to phylogenetics. A simple way of defining the bound is a maximum number of assumed evolutionary changes allowed per tree. A set of criteria known as Zharkikh's rules severely limit the search space by defining characteristics shared by all candidate "most parsimonious" trees. The two most basic rules require the elimination of all but one redundant sequence (for cases where multiple observations have produced identical data) and the elimination of character sites at which two or more states do not occur in at least two species. Under ideal conditions these rules and their associated algorithm would completely define a tree.

Sankoff-Morel-Cedergren algorithm

The Sankoff-Morel-Cedergren algorithm was among the first published methods to simultaneously produce an MSA and a phylogenetic tree for nucleotide sequences. The method uses a maximum parsimony calculation in conjunction with a scoring function that penalizes gaps and mismatches, thereby favoring the tree that introduces a minimal number of such events (an alternative view holds that the trees to be favored are those that maximize the amount of sequence similarity that can be interpreted as homology, a point of view that may lead to different optimal trees). The imputed sequences at the interior nodes of the tree are scored and summed over all the nodes in each possible tree. The lowest-scoring tree sum provides both an optimal tree and an optimal MSA given the scoring function. Because the method is highly computationally intensive, an approximate method is used in which initial guesses for the interior alignments are refined one node at a time. Both the full and the approximate version are in practice calculated by dynamic programming.

MALIGN and POY

More recent phylogenetic tree/MSA methods use heuristics to isolate high-scoring, but not necessarily optimal, trees. The MALIGN method uses a maximum-parsimony technique to compute a multiple alignment by maximizing a cladogram score, and its companion POY uses an iterative method that couples the optimization of the phylogenetic tree with improvements in the corresponding MSA. However, the use of these methods in constructing evolutionary hypotheses has been criticized as biased due to the deliberate construction of trees reflecting minimal evolutionary events. This, in turn, has been countered by the view that such methods should be seen as heuristic approaches to find the trees that maximize the amount of sequence similarity that can be interpreted as homology.

Maximum likelihood

The maximum likelihood method uses standard statistical techniques for inferring probability distributions to assign probabilities to particular possible phylogenetic trees. The method requires a substitution model to assess the probability of particular mutations; roughly, a tree that requires more mutations at interior nodes to explain the observed phylogeny will be assessed as having a lower probability. This is broadly similar to the maximum-parsimony method, but maximum likelihood allows additional statistical flexibility by permitting varying rates of evolution across both lineages and sites. In fact, the method requires that evolution at different sites and along different lineages must be statistically independent. Maximum likelihood is thus well suited to the analysis of distantly related sequences, but finding the maximum-likelihood tree is NP-hard and is therefore believed to be computationally intractable in general.

The "pruning" algorithm, a variant of dynamic programming, is often used to reduce the search space by efficiently calculating the likelihood of subtrees. The method calculates the likelihood for each site in a "linear" manner, starting at a node whose only descendants are leaves (that is, the tips of the tree) and working backwards toward the "bottom" node in nested sets. However, the trees produced by the method are only rooted if the substitution model is irreversible, which is not generally true of biological systems. The search for the maximum-likelihood tree also includes a branch length optimization component that is difficult to improve upon algorithmically; general numerical optimization tools such as the Newton-Raphson method are often used.
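The pruning recursion can be sketched for a single site as follows. The example assumes the Jukes-Cantor model with branch lengths measured in expected substitutions per site and a uniform base distribution at the root; the tree encoding (a leaf is a name, an internal node is a tuple of (child, branch_length) pairs) and the function names are choices made for this illustration.

```python
import math

BASES = "ACGT"


def jc_prob(same, t):
    """Jukes-Cantor probability of ending in the same or a different base after time t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial_likelihoods(node, states):
    """Return base -> probability of the data below `node` given that base at `node`."""
    if isinstance(node, str):
        # A leaf is certain: only the observed base is compatible with the data.
        return {b: 1.0 if b == states[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        child_partial = partial_likelihoods(child, states)
        for parent_base in BASES:
            # Sum over every base the child could have, weighted by the
            # probability of that change along the connecting branch.
            result[parent_base] *= sum(
                jc_prob(parent_base == child_base, t) * child_partial[child_base]
                for child_base in BASES
            )
    return result


def site_likelihood(tree, states):
    """Likelihood of one alignment column, averaging over the root base."""
    root = partial_likelihoods(tree, states)
    return sum(0.25 * root[b] for b in BASES)


if __name__ == "__main__":
    tree = (((("a", 0.1), ("b", 0.2)), 0.05), ("c", 0.3))
    print(site_likelihood(tree, {"a": "A", "b": "A", "c": "G"}))
```

Because the Jukes-Cantor model is reversible, moving the root along a branch leaves the value unchanged, which is consistent with the remark above that rooted trees are only obtained with irreversible models. The likelihood of a whole alignment is the product of the per-site values, usually accumulated as a sum of logarithms.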

Bayesian inference

Bayesian inference can be used to produce phylogenetic trees in a manner closely related to the maximum likelihood methods. Bayesian methods assume a prior probability distribution of the possible trees, which may simply be the probability of any one tree among all the possible trees that could be generated from the data, or may be a more sophisticated estimate derived from the assumption that divergence events such as speciation occur as stochastic processes. The choice of prior distribution is a point of contention among users of Bayesian-inference phylogenetics methods.

Implementations of Bayesian methods generally use Markov chain Monte Carlo sampling algorithms, although the choice of move set varies; selections used in Bayesian phylogenetics include circularly permuting leaf nodes of a proposed tree at each step and swapping descendant subtrees of a random internal node between two related trees. The use of Bayesian methods in phylogenetics has been controversial, largely due to incomplete specification of the choice of move set, acceptance criterion, and prior distribution in published work. Bayesian methods are generally held to be superior to parsimony-based methods; they can be more prone to long-branch attraction than maximum likelihood techniques, although they are better able to accommodate missing data.

Whereas likelihood methods find the tree that maximizes the probability of the data, a Bayesian approach recovers a tree that represents the most likely clades, by drawing on the posterior distribution. However, estimates of the posterior probability of clades (measuring their 'support') can be quite wide of the mark, especially in clades that are not overwhelmingly likely. As such, other methods have been put forward to estimate posterior probability.

Model selection

Molecular phylogenetics methods rely on a defined substitution model that encodes a hypothesis about the relative rates of mutation at various sites along the gene or amino acid sequences being studied. At their simplest, substitution models aim to correct for differences in the rates of transitions and transversions in nucleotide sequences. The use of substitution models is necessitated by the fact that the genetic distance between two sequences increases linearly only for a short time after the two sequences diverge from each other (alternatively, the distance is linear only shortly before coalescence). The longer the amount of time after divergence, the more likely it becomes that two mutations occur at the same nucleotide site. Simple genetic distance calculations will thus undercount the number of mutation events that have occurred in evolutionary history. The extent of this undercount increases with increasing time since divergence, which can lead to the phenomenon of long branch attraction, or the misassignment of two distantly related but convergently evolving sequences as closely related. The maximum parsimony method is particularly susceptible to this problem due to its explicit search for a tree representing a minimum number of distinct evolutionary events.
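The undercount can be illustrated with the Jukes-Cantor correction, which converts an observed mismatch fraction p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch assumes equal base frequencies and equal rates for all changes, as that model does; the function name is illustrative.

```python
import math


def jukes_cantor_distance(p):
    """Estimated substitutions per site from an observed mismatch fraction p."""
    if not 0.0 <= p < 0.75:
        # At p >= 0.75 the sequences look random under this model and the
        # correction is undefined.
        raise ValueError("mismatch fraction must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.3, 0.6):
        print(p, round(jukes_cantor_distance(p), 4))
```

For small p the corrected and raw distances nearly coincide, while for larger p the corrected value grows much faster, reflecting the multiple substitutions at the same site that a raw mismatch count cannot see.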

Types of models

All substitution models assign a set of weights to each possible change of state represented in the sequence. The most common model types are implicitly reversible because they assign the same weight to, for example, a G>C nucleotide mutation as to a C>G mutation. The simplest possible model, the Jukes-Cantor model, assigns an equal probability to every possible change of state for a given nucleotide base. The rate of change between any two distinct nucleotides will be one-third of the overall substitution rate. More advanced models distinguish between transitions and transversions. The most general possible time-reversible model, called the GTR model, has six mutation rate parameters. An even more generalized model known as the general 12-parameter model breaks time-reversibility, at the cost of much additional complexity in calculating genetic distances that are consistent among multiple lineages. One possible variation on this theme adjusts the rates so that overall GC content - an important measure of DNA double helix stability - varies over time.
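One common way to write these models is as a rate matrix whose off-diagonal entry from base x to base y is an exchangeability parameter shared by the pair, multiplied by the equilibrium frequency of y. The sketch below builds such a matrix for the six-parameter time-reversible case and recovers Jukes-Cantor as the special case with equal parameters and equal frequencies. The matrix is left unnormalized, and the function name and numeric values are illustrative assumptions.

```python
from itertools import combinations

BASES = "ACGT"


def gtr_rate_matrix(exchange, freqs):
    """Build a time-reversible rate matrix as a dict of dicts.

    exchange: dict of frozenset({x, y}) -> exchangeability (six entries)
    freqs:    dict of base -> equilibrium frequency (summing to 1)
    """
    q = {x: {y: 0.0 for y in BASES} for x in BASES}
    for x, y in combinations(BASES, 2):
        s = exchange[frozenset((x, y))]
        q[x][y] = s * freqs[y]
        q[y][x] = s * freqs[x]
    for x in BASES:
        # Each row sums to zero: the diagonal is minus the total rate of leaving x.
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    return q


if __name__ == "__main__":
    equal = {frozenset(pair): 1.0 for pair in combinations(BASES, 2)}
    uniform = {b: 0.25 for b in BASES}
    jc = gtr_rate_matrix(equal, uniform)
    # Each specific change occurs at one-third of the total rate of leaving a base.
    print(jc["A"]["C"], -jc["A"]["A"])
```

Reversibility shows up as the detailed-balance condition freqs[x] * q[x][y] == freqs[y] * q[y][x], which holds by construction because both sides equal the shared parameter times the product of the two frequencies.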

Models may also allow for the variation of rates with positions in the input sequence. The most obvious example of such variation follows from the arrangement of nucleotides in protein-coding genes into three-base codons. If the location of the open reading frame (ORF) is known, rates of mutation can be adjusted for position of a given site within a codon, since it is known that wobble base pairing can allow for higher mutation rates in the third nucleotide of a given codon without affecting the codon's meaning in the genetic code. A less hypothesis-driven example that does not rely on ORF identification simply assigns to each site a rate randomly drawn from a predetermined distribution, often the gamma distribution or log-normal distribution. Finally, a more conservative estimate of rate variations known as the covarion method allows autocorrelated variations in rates, so that the mutation rate of a given site is correlated across sites and lineages.

Choosing the best model

The selection of an appropriate model is critical for the production of good phylogenetic analyses, both because underparameterized or overly restrictive models may produce aberrant behavior when their underlying assumptions are violated, and because overly complex or overparameterized models are computationally expensive and the parameters may be overfit. The most common method of model selection is the likelihood ratio test (LRT), which produces a likelihood estimate that can be interpreted as a measure of "goodness of fit" between the model and the input data. However, care must be taken in using these results, since a more complex model with more parameters will always have a likelihood at least as high as a simplified (nested) version of the same model, which can lead to the naive selection of models that are overly complex. For this reason model selection computer programs will choose the simplest model that is not significantly worse than more complex substitution models. A significant disadvantage of the LRT is the necessity of making a series of pairwise comparisons between models; it has been shown that the order in which the models are compared has a major effect on the one that is eventually selected.
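A single pairwise comparison can be sketched as follows, assuming two nested models that differ by exactly one free parameter, so that twice the difference in log-likelihood is compared against a chi-square distribution with one degree of freedom. The log-likelihood values are invented for illustration, and comparisons involving more parameters would need the chi-square distribution for the corresponding degrees of freedom.

```python
import math


def lrt_statistic(lnl_simple, lnl_complex):
    """Twice the log-likelihood gain of the complex model over the nested simple one."""
    return 2.0 * (lnl_complex - lnl_simple)


def chi2_sf_one_df(x):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    if x <= 0.0:
        return 1.0
    # A chi-square variable with one degree of freedom is a squared standard normal.
    return math.erfc(math.sqrt(x / 2.0))


def prefer_complex_model(lnl_simple, lnl_complex, alpha=0.05):
    """True if the extra parameter gives a significantly better fit."""
    return chi2_sf_one_df(lrt_statistic(lnl_simple, lnl_complex)) < alpha


if __name__ == "__main__":
    # Hypothetical log-likelihoods for a simple model and one with one extra parameter.
    print(lrt_statistic(-1234.5, -1230.0))
    print(chi2_sf_one_df(lrt_statistic(-1234.5, -1230.0)))
    print(prefer_complex_model(-1234.5, -1230.0))
```

Choosing among many models this way means chaining such tests, which is where the order dependence mentioned above comes from.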

An alternative model selection method is the Akaike information criterion (AIC), formally an estimate of the Kullback–Leibler divergence between the true model and the model being tested. It can be interpreted as a likelihood estimate with a correction factor to penalize overparameterized models. The AIC is calculated on an individual model rather than a pair, so it is independent of the order in which models are assessed. A related alternative, the Bayesian information criterion (BIC), has a similar basic interpretation but penalizes complex models more heavily.
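Both criteria are simple functions of the maximized log-likelihood: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters and n the sample size (taken here to be the number of alignment sites, which is itself a convention). The model names, log-likelihoods, and parameter counts below are made up for illustration, and the counts ignore branch lengths.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def rank_models(candidates, n_sites):
    """Return model names sorted by AIC and by BIC.

    candidates: dict of model name -> (log-likelihood, number of free parameters)
    """
    by_aic = sorted(candidates, key=lambda m: aic(*candidates[m]))
    by_bic = sorted(candidates, key=lambda m: bic(*candidates[m], n_sites))
    return by_aic, by_bic


if __name__ == "__main__":
    # Hypothetical fits of three substitution models to one alignment of 500 sites.
    fits = {"JC": (-1250.0, 0), "K80": (-1240.0, 1), "GTR": (-1238.5, 8)}
    print(rank_models(fits, n_sites=500))
```

Each score is computed from one model alone, so the ranking does not depend on the order in which the models are listed. With more than about seven sites, ln(n) exceeds 2 and BIC charges more per parameter than AIC.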

A comprehensive step-by-step protocol for constructing a phylogenetic tree, including DNA/amino acid contiguous sequence assembly, multiple sequence alignment, model testing (finding the best-fitting substitution model), and phylogeny reconstruction using maximum likelihood and Bayesian inference, is available at Nature Protocol.

A nontraditional way of evaluating a phylogenetic tree is to compare it with a clustering result. One can use a multidimensional scaling technique, so-called Interpolative Joining, to reduce dimensionality and visualize the clustering result for the sequences in 3D, and then map the phylogenetic tree onto the clustering result. A better tree usually has a higher correlation with the clustering result.

Evaluating tree support

As with all statistical analysis, the estimation of phylogenies from character data requires an evaluation of confidence. A number of methods exist to test the amount of support for a phylogenetic tree, either by evaluating the support for each sub-tree in the phylogeny (nodal support) or evaluating whether the phylogeny is significantly different from other possible trees (alternative tree hypothesis tests).

Nodal support

The most common method for assessing tree support is to evaluate the statistical support for each node on the tree. Typically, a node with very low support is not considered valid in further analysis, and visually may be collapsed into a polytomy to indicate that relationships within a clade are unresolved.

Consensus tree

Many methods for assessing nodal support involve consideration of multiple phylogenies. The consensus tree summarizes the nodes that are shared among a set of trees. In a strict consensus, only nodes found in every tree are shown, and the rest are collapsed into an unresolved polytomy. Less conservative methods, such as the majority-rule consensus tree, consider nodes that are supported by a given percentage of trees under consideration (such as at least 50%).

For example, in maximum parsimony analysis, there may be many trees with the same parsimony score. A strict consensus tree would show which nodes are found in all equally parsimonious trees, and which nodes differ. Consensus trees are also used to evaluate support on phylogenies reconstructed with Bayesian inference (see below).

Bootstrapping and jackknifing

In statistics, the bootstrap is a method for inferring the variability of data that has an unknown distribution using pseudoreplications of the original data. For example, given a set of 100 data points, a pseudoreplicate is a data set of the same size (100 points) randomly sampled from the original data, with replacement. That is, each original data point may be represented more than once in the pseudoreplicate, or not at all. Statistical support involves evaluation of whether the original data has similar properties to a large set of pseudoreplicates.

In phylogenetics, bootstrapping is conducted using the columns of the character matrix. Each pseudoreplicate contains the same number of species (rows) and characters (columns) randomly sampled from the original matrix, with replacement. A phylogeny is reconstructed from each pseudoreplicate, with the same methods used to reconstruct the phylogeny from the original data. For each node on the phylogeny, the nodal support is the percentage of pseudoreplicates containing that node.
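
The procedure described above can be sketched as follows. The resampling and counting steps follow the description; the tree-building step is replaced by a deliberately trivial stand-in (`closest_pair_clade`, a hypothetical helper that joins the two most similar rows) so that the example runs on its own. In a real analysis that step would be the same reconstruction method used on the original data.

```python
import random

def bootstrap_matrix(matrix, rng):
    """Resample the columns (characters) of the matrix with replacement."""
    n_chars = len(matrix[0])
    cols = [rng.randrange(n_chars) for _ in range(n_chars)]
    return [[row[c] for c in cols] for row in matrix]

def closest_pair_clade(matrix):
    """Toy stand-in for tree building: the two rows with the fewest
    mismatches form the only clade. Not a real phylogenetic method."""
    best = None
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            dist = sum(a != b for a, b in zip(matrix[i], matrix[j]))
            if best is None or dist < best[0]:
                best = (dist, frozenset((i, j)))
    return {best[1]}

def nodal_support(matrix, build_clades, n_reps=100, seed=0):
    """Percentage of pseudoreplicates that recover each clade of the
    tree built from the original matrix."""
    rng = random.Random(seed)
    reference = build_clades(matrix)
    hits = {clade: 0 for clade in reference}
    for _ in range(n_reps):
        replicate_clades = build_clades(bootstrap_matrix(matrix, rng))
        for clade in reference:
            if clade in replicate_clades:
                hits[clade] += 1
    return {clade: 100.0 * k / n_reps for clade, k in hits.items()}

# Hypothetical 3-taxon, 8-character matrix (rows are species).
matrix = ["AAAAAAAA", "AAAAAAAT", "TTTTTTTT"]
support = nodal_support(matrix, closest_pair_clade, n_reps=200, seed=42)
print(support)
```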

The statistical rigor of the bootstrap test has been empirically evaluated using viral populations with known evolutionary histories, finding that 70% bootstrap support corresponds to a 95% probability that the clade exists. However, this was tested under ideal conditions (e.g. no change in evolutionary rates, symmetric phylogenies). In practice, nodes with values above 70% are generally treated as supported, and it is left to the researcher or reader to evaluate confidence. Nodes with support lower than 70% are typically considered unresolved.

Jackknifing in phylogenetics is a similar procedure, except the columns of the matrix are sampled without replacement. Pseudoreplicates are generated by randomly subsampling the data—for example, a "10% jackknife" would involve randomly sampling 10% of the matrix many times to evaluate nodal support.
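
The only change from the bootstrap is the sampling step, which can be sketched as below. The function name `jackknife_matrix` and the example matrix are hypothetical; the fraction is the share of columns retained in each pseudoreplicate, following the description above.

```python
import random

def jackknife_matrix(matrix, fraction, rng):
    """Keep a random subset of columns, sampled without replacement,
    so no character can appear twice in a pseudoreplicate."""
    n_chars = len(matrix[0])
    n_keep = max(1, round(fraction * n_chars))
    cols = sorted(rng.sample(range(n_chars), n_keep))
    return [[row[c] for c in cols] for row in matrix]

# Hypothetical matrix whose first row holds the column indices,
# which makes it easy to see that no column is repeated.
matrix = [list(range(20)), ["A"] * 20, ["T"] * 20]
replicate = jackknife_matrix(matrix, fraction=0.1, rng=random.Random(7))
print(replicate[0])
```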

Posterior probability

Reconstruction of phylogenies using Bayesian inference generates a posterior distribution of highly probable trees given the data and evolutionary model, rather than a single "best" tree. The trees in the posterior distribution generally have many different topologies. Most Bayesian inference methods utilize a Markov-chain Monte Carlo iteration, and the initial steps of this chain are not considered reliable reconstructions of the phylogeny. Trees generated early in the chain are usually discarded as burn-in. The most common method of evaluating nodal support in a Bayesian phylogenetic analysis is to calculate the percentage of trees in the posterior distribution (post-burn-in) which contain the node.
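
As a minimal sketch of that calculation, the code below discards the first part of a chain as burn-in and reports, for each clade, the fraction of the remaining trees that contain it. The 25% burn-in fraction, the function name, and the eight-tree chain are assumptions chosen for illustration; real chains contain thousands of samples and burn-in is usually chosen by inspecting the run.

```python
from collections import Counter

def posterior_clade_support(sampled_trees, burn_in_fraction=0.25):
    """Fraction of post-burn-in trees containing each clade.

    Each sampled tree is a set of clades (frozensets of taxon labels).
    """
    start = int(len(sampled_trees) * burn_in_fraction)
    kept = sampled_trees[start:]
    counts = Counter(clade for tree in kept for clade in tree)
    return {clade: k / len(kept) for clade, k in counts.items()}

AB, ABC, ABD = frozenset("AB"), frozenset("ABC"), frozenset("ABD")
AC, ACD = frozenset("AC"), frozenset("ACD")
early = {AC, ACD}   # unreliable trees from the start of the chain
t1 = {AB, ABC}
t2 = {AB, ABD}

# Hypothetical chain of 8 sampled trees; the first 2 fall in the burn-in.
chain = [early, early, t1, t1, t1, t2, t1, t1]
support = posterior_clade_support(chain)
print(support)
```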

The statistical support for a node in Bayesian inference is expected to reflect the probability that a clade really exists given the data and evolutionary model. Therefore, the threshold for accepting a node as supported is generally higher than for bootstrapping.

Step counting methods

Bremer support counts the number of extra steps needed to contradict a clade.
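
One common way to obtain this count is to subtract the length of the shortest tree overall from the length of the shortest tree that lacks the clade. The sketch below does this over a small list of candidate trees. The tree lengths and clades are hypothetical, and the sketch assumes fully resolved trees, so that a tree lacking a clade also contradicts it.

```python
def bremer_support(scored_trees, clade):
    """Extra parsimony steps needed before `clade` is lost.

    scored_trees: list of (length_in_steps, set_of_clades) pairs.
    Returns None if no candidate tree lacks the clade.
    """
    best = min(length for length, _ in scored_trees)
    without = [length for length, clades in scored_trees if clade not in clades]
    if not without:
        return None
    return min(without) - best

AB, ABC, ABD = frozenset("AB"), frozenset("ABC"), frozenset("ABD")
AC, ACD = frozenset("AC"), frozenset("ACD")

# Hypothetical candidate trees with their parsimony lengths.
scored_trees = [
    (10, {AB, ABC}),   # most parsimonious tree
    (11, {AB, ABD}),
    (13, {AC, ACD}),
]
print(bremer_support(scored_trees, AB))   # 3
print(bremer_support(scored_trees, ABC))  # 1
```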

Shortcomings

These measures each have their weaknesses. For example, smaller or larger clades tend to attract larger support values than mid-sized clades, simply as a result of the number of taxa in them.

Bootstrap support can provide high estimates of node support as a result of noise in the data rather than the true existence of a clade.

Limitations and workarounds

Ultimately, there is no way to measure whether a particular phylogenetic hypothesis is accurate or not, unless the true relationships among the taxa being examined are already known (which may happen with bacteria or viruses under laboratory conditions). The best result an empirical phylogeneticist can hope to attain is a tree with branches that are well supported by the available evidence. Several potential pitfalls have been identified:

Homoplasy

Certain characters are more likely to evolve convergently than others; logically, such characters should be given less weight in the reconstruction of a tree. Weights in the form of a model of evolution can be inferred from sets of molecular data, so that maximum likelihood or Bayesian methods can be used to analyze them. For molecular sequences, this problem is exacerbated when the taxa under study have diverged substantially. As the time since the divergence of two taxa increases, so does the probability of multiple substitutions at the same site, or of back mutations, all of which result in homoplasies. For morphological data, unfortunately, the only objective way to determine convergence is by the construction of a tree, a somewhat circular method. Even so, weighting homoplasious characters does indeed lead to better-supported trees. Further refinement can be achieved by weighting changes in one direction higher than changes in another; for instance, the presence of thoracic wings almost guarantees placement among the pterygote insects because, although wings are often lost secondarily, there is no evidence that they have been gained more than once.

Horizontal gene transfer

In general, organisms can inherit genes in two ways: vertical gene transfer and horizontal gene transfer. Vertical gene transfer is the passage of genes from parent to offspring. Horizontal (also called lateral) gene transfer occurs when genes jump between unrelated organisms, a phenomenon that is especially common in prokaryotes. A good example is acquired antibiotic resistance, where gene exchange between various bacteria leads to multi-drug-resistant bacterial species. There have also been well-documented cases of horizontal gene transfer between eukaryotes.

Horizontal gene transfer has complicated the determination of phylogenies of organisms, and inconsistencies in phylogeny have been reported among specific groups of organisms depending on the genes used to construct evolutionary trees. The only way to determine which genes have been acquired vertically and which horizontally is to parsimoniously assume that the largest set of genes that have been inherited together have been inherited vertically; this requires analyzing a large number of genes.

Hybrids, speciation, introgressions and incomplete lineage sorting

The basic assumption underlying the mathematical model of cladistics is a situation where species split neatly in bifurcating fashion. While such an assumption may hold on a larger scale (bar horizontal gene transfer, see above), speciation is often much less orderly. Research since the cladistic method was introduced has shown that hybrid speciation, once thought rare, is in fact quite common, particularly in plants. Paraphyletic speciation is also common, making the assumption of a bifurcating pattern unsuitable and leading to phylogenetic networks rather than trees. Introgression can also move genes between otherwise distinct species and sometimes even genera, complicating phylogenetic analysis based on genes. This phenomenon can contribute to "incomplete lineage sorting" and is thought to be common across a number of groups. In species-level analysis this can be dealt with by larger sampling or better whole-genome analysis. Often the problem is avoided by restricting the analysis to fewer, not closely related specimens.

Taxon sampling

Owing to the development of advanced sequencing techniques in molecular biology, it has become feasible to gather large amounts of data (DNA or amino acid sequences) to infer phylogenetic hypotheses. For example, it is not rare to find studies with character matrices based on whole mitochondrial genomes (~16,000 nucleotides, in many animals). However, simulations have shown that it is more important to increase the number of taxa in the matrix than to increase the number of characters, because the more taxa there are, the more accurate and more robust is the resulting phylogenetic tree. This may be partly due to the breaking up of long branches.

Phylogenetic signal

Another important factor that affects the accuracy of tree reconstruction is whether the data analyzed actually contain a useful phylogenetic signal, a term that is used generally to denote whether a character evolves slowly enough to have the same state in closely related taxa as opposed to varying randomly. Tests for phylogenetic signal exist.

Continuous characters

Morphological characters that sample a continuum may contain phylogenetic signal, but are hard to code as discrete characters. Several methods have been used, one of which is gap coding, and there are variations on gap coding. In the original form of gap coding:

group means for a character are first ordered by size. The pooled within-group standard deviation is calculated ... and differences between adjacent means ... are compared relative to this standard deviation. Any pair of adjacent means is considered different and given different integer scores ... if the means are separated by a "gap" greater than the within-group standard deviation ... times some arbitrary constant.

If more taxa are added to the analysis, the gaps between taxa may become so small that all information is lost. Generalized gap coding works around that problem by comparing individual pairs of taxa rather than considering one set that contains all of the taxa.
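
The original form of gap coding quoted above can be sketched as follows. This is an illustrative reading of the quoted description, not a reference implementation: the function name `gap_code`, the default constant of 1.0, and the measurements are all hypothetical, and each group is assumed to have at least two measurements so that the pooled standard deviation is defined.

```python
import math

def gap_code(groups, constant=1.0):
    """Assign integer character states to groups by gap coding.

    groups: dict mapping a taxon name to a list of measurements.
    Adjacent ordered means receive different states only when they are
    separated by more than constant * pooled within-group SD.
    """
    means = {name: sum(vals) / len(vals) for name, vals in groups.items()}

    # Pooled within-group standard deviation.
    sum_sq = sum(
        (x - means[name]) ** 2 for name, vals in groups.items() for x in vals
    )
    dof = sum(len(vals) - 1 for vals in groups.values())
    pooled_sd = math.sqrt(sum_sq / dof)

    ordered = sorted(means, key=means.get)  # order group means by size
    codes = {ordered[0]: 0}
    state = 0
    for prev, cur in zip(ordered, ordered[1:]):
        if means[cur] - means[prev] > constant * pooled_sd:
            state += 1  # a gap: start a new character state
        codes[cur] = state
    return codes

# Hypothetical measurements of one continuous character in three taxa.
groups = {
    "sp1": [1.0, 1.2, 0.8],
    "sp2": [1.1, 1.3, 0.9],
    "sp3": [3.0, 3.2, 2.8],
}
codes = gap_code(groups)
print(codes)
```

Here sp1 and sp2 share a state because their means differ by less than the pooled standard deviation, while sp3 is separated from them by a clear gap.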

Missing data

In general, the more data that are available when constructing a tree, the more accurate and reliable the resulting tree will be. Missing data are no more detrimental than simply having fewer data, although the impact is greatest when most of the missing data are in a small number of taxa. Concentrating the missing data across a small number of characters produces a more robust tree.

The role of fossils

Because many characters are embryological, soft-tissue, or molecular characters that (at best) hardly ever fossilize, and because the interpretation of fossils is more ambiguous than that of living taxa, extinct taxa almost invariably have higher proportions of missing data than living ones. However, despite these limitations, the inclusion of fossils is invaluable, as they can provide information in sparse areas of trees, breaking up long branches and constraining intermediate character states; thus, fossil taxa contribute as much to tree resolution as modern taxa. Fossils can also constrain the age of lineages and thus demonstrate how consistent a tree is with the stratigraphic record; stratocladistics incorporates age information into data matrices for phylogenetic analyses.
