A Medley of Potpourri

Tuesday, December 24, 2024

Occam's razor

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Occam%27s_razor

In philosophy, Occam's razor (also spelled Ockham's razor or Ocham's razor; Latin: novacula Occami) is the problem-solving principle that recommends searching for explanations constructed with the smallest possible set of elements. It is also known as the principle of parsimony or the law of parsimony (Latin: lex parsimoniae). Attributed to William of Ockham, a 14th-century English philosopher and theologian, it is frequently cited as Entia non sunt multiplicanda praeter necessitatem, which translates as "Entities must not be multiplied beyond necessity", although Occam never used these exact words. Popularly, the principle is sometimes paraphrased as "The simplest explanation is usually the best one."

This philosophical razor advocates that when presented with competing hypotheses about the same prediction and both hypotheses have equal explanatory power, one should prefer the hypothesis that requires the fewest assumptions, and that this is not meant to be a way of choosing between hypotheses that make different predictions. Similarly, in science, Occam's razor is used as an abductive heuristic in the development of theoretical models rather than as a rigorous arbiter between candidate models.

History

The phrase Occam's razor did not appear until a few centuries after William of Ockham's death in 1347. Libert Froidmont, in his On Christian Philosophy of the Soul, gives him credit for the phrase, speaking of "novacula occami". Ockham did not invent this principle, but its fame—and its association with him—may be due to the frequency and effectiveness with which he used it. Ockham stated the principle in various ways, but the most popular version, "Entities are not to be multiplied without necessity" (Non sunt multiplicanda entia sine necessitate) was formulated by the Irish Franciscan philosopher John Punch in his 1639 commentary on the works of Duns Scotus.

Formulations before William of Ockham

Part of a page from John Duns Scotus's book *Commentaria oxoniensia ad IV libros magistri Sententiarus*, showing the words: "*Pluralitas non est ponenda sine necessitate*", i.e., "Plurality is not to be posited without necessity"

The origins of what has come to be known as Occam's razor are traceable to the works of earlier philosophers such as John Duns Scotus (1265–1308), Robert Grosseteste (1175–1253), Maimonides (Moses ben-Maimon, 1138–1204), and even Aristotle (384–322 BC). Aristotle writes in his Posterior Analytics, "We may assume the superiority ceteris paribus [other things being equal] of the demonstration which derives from fewer postulates or hypotheses." Ptolemy (c. AD 90 – c. 168) stated, "We consider it a good principle to explain the phenomena by the simplest hypothesis possible."

Phrases such as "It is vain to do with more what can be done with fewer" and "A plurality is not to be posited without necessity" were commonplace in 13th-century scholastic writing. Robert Grosseteste, in Commentary on [Aristotle's] the Posterior Analytics Books (Commentarius in Posteriorum Analyticorum Libros) (c. 1217–1220), declares: "That is better and more valuable which requires fewer, other circumstances being equal... For if one thing were demonstrated from many and another thing from fewer equally known premises, clearly that is better which is from fewer because it makes us know quickly, just as a universal demonstration is better than particular because it produces knowledge from fewer premises. Similarly in natural science, in moral science, and in metaphysics the best is that which needs no premises and the better that which needs the fewer, other circumstances being equal."

The Summa Theologica of Thomas Aquinas (1225–1274) states that "it is superfluous to suppose that what can be accounted for by a few principles has been produced by many." Aquinas uses this principle to construct an objection to God's existence, an objection that he in turn answers and refutes generally (cf. quinque viae), and specifically, through an argument based on causality. Hence, Aquinas acknowledges the principle that today is known as Occam's razor, but prefers causal explanations to other simple explanations (cf. also Correlation does not imply causation).

William of Ockham

William of Ockham (circa 1287–1347) was an English Franciscan friar and theologian, an influential medieval philosopher and a nominalist. His popular fame as a great logician rests chiefly on the maxim attributed to him and known as Occam's razor. The term razor refers to distinguishing between two hypotheses either by "shaving away" unnecessary assumptions or cutting apart two similar conclusions.

While it has been claimed that Occam's razor is not found in any of William's writings, one can cite statements such as Numquam ponenda est pluralitas sine necessitate ("Plurality must never be posited without necessity"), which occurs in his theological work on the Sentences of Peter Lombard (Quaestiones et decisiones in quattuor libros Sententiarum Petri Lombardi; ed. Lugd., 1495, i, dist. 27, qu. 2, K).

Nevertheless, the precise words sometimes attributed to William of Ockham, Entia non sunt multiplicanda praeter necessitatem (Entities must not be multiplied beyond necessity), are absent in his extant works; this particular phrasing comes from John Punch, who described the principle as a "common axiom" (axioma vulgare) of the Scholastics. William of Ockham himself seems to restrict the operation of this principle in matters pertaining to miracles and God's power, considering a plurality of miracles possible in the Eucharist simply because it pleases God.

This principle is sometimes phrased as Pluralitas non est ponenda sine necessitate ("Plurality should not be posited without necessity"). In his Summa Totius Logicae, i. 12, William of Ockham cites the principle of economy, Frustra fit per plura quod potest fieri per pauciora ("It is futile to do with more things that which can be done with fewer"; Thorburn, 1918, pp. 352–53; Kneale and Kneale, 1962, p. 243.)

Later formulations

To quote Isaac Newton, "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances. Therefore, to the same natural effects we must, as far as possible, assign the same causes."In the sentence hypotheses non fingo, Newton affirms the success of this approach.

Bertrand Russell offers a particular version of Occam's razor: "Whenever possible, substitute constructions out of known entities for inferences to unknown entities."

Around 1960, Ray Solomonoff founded the theory of universal inductive inference, the theory of prediction based on observations – for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. This theory is a mathematical formalization of Occam's razor.

Another technical approach to Occam's razor is ontological parsimony. Parsimony means spareness and is also referred to as the Rule of Simplicity. This is considered a strong version of Occam's razor. A variation used in medicine is called the "Zebra": a physician should reject an exotic medical diagnosis when a more commonplace explanation is more likely, derived from Theodore Woodward's dictum "When you hear hoofbeats, think of horses not zebras".

Ernst Mach formulated the stronger version of Occam's razor into physics, which he called the Principle of Economy stating: "Scientists must use the simplest means of arriving at their results and exclude everything not perceived by the senses."

This principle goes back at least as far as Aristotle, who wrote "Nature operates in the shortest way possible." The idea of parsimony or simplicity in deciding between theories, though not the intent of the original expression of Occam's razor, has been assimilated into common culture as the widespread layman's formulation that "the simplest explanation is usually the correct one."

Justifications

Aesthetic

Prior to the 20th century, it was a commonly held belief that nature itself was simple and that simpler hypotheses about nature were thus more likely to be true. This notion was deeply rooted in the aesthetic value that simplicity holds for human thought and the justifications presented for it often drew from theology. Thomas Aquinas made this argument in the 13th century, writing, "If a thing can be done adequately by means of one, it is superfluous to do it by means of several; for we observe that nature does not employ two instruments [if] one suffices."

Beginning in the 20th century, epistemological justifications based on induction, logic, pragmatism, and especially probability theory have become more popular among philosophers.

Empirical

Occam's razor has gained strong empirical support in helping to converge on better theories (see Uses section below for some examples).

In the related concept of overfitting, excessively complex models are affected by statistical noise (a problem also known as the bias–variance tradeoff), whereas simpler models may capture the underlying structure better and may thus have better predictive performance. It is, however, often difficult to deduce which part of the data is noise (cf. model selection, test set, minimum description length, Bayesian inference, etc.).

Testing the razor

The razor's statement that "other things being equal, simpler explanations are generally better than more complex ones" is amenable to empirical testing. Another interpretation of the razor's statement would be that "simpler hypotheses are generally better than the complex ones". The procedure to test the former interpretation would compare the track records of simple and comparatively complex explanations. If one accepts the first interpretation, the validity of Occam's razor as a tool would then have to be rejected if the more complex explanations were more often correct than the less complex ones (while the converse would lend support to its use). If the latter interpretation is accepted, the validity of Occam's razor as a tool could possibly be accepted if the simpler hypotheses led to correct conclusions more often than not.

Even if some increases in complexity are sometimes necessary, there still remains a justified general bias toward the simpler of two competing explanations. To understand why, consider that for each accepted explanation of a phenomenon, there is always an infinite number of possible, more complex, and ultimately incorrect, alternatives. This is so because one can always burden a failing explanation with an ad hoc hypothesis. Ad hoc hypotheses are justifications that prevent theories from being falsified.

For example, if a man, accused of breaking a vase, makes supernatural claims that leprechauns were responsible for the breakage, a simple explanation might be that the man did it, but ongoing ad hoc justifications (e.g., "... and that's not me breaking it on the film; they tampered with that, too") could successfully prevent complete disproof. This endless supply of elaborate competing explanations, called saving hypotheses, cannot be technically ruled out – except by using Occam's razor.

Any more complex theory might still possibly be true. A study of the predictive validity of Occam's razor found 32 published papers that included 97 comparisons of economic forecasts from simple and complex forecasting methods. None of the papers provided a balance of evidence that complexity of method improved forecast accuracy. In the 25 papers with quantitative comparisons, complexity increased forecast errors by an average of 27 percent.

Practical considerations and pragmatism

Mathematical

One justification of Occam's razor is a direct result of basic probability theory. By definition, all assumptions introduce possibilities for error; if an assumption does not improve the accuracy of a theory, its only effect is to increase the probability that the overall theory is wrong.

There have also been other attempts to derive Occam's razor from probability theory, including notable attempts made by Harold Jeffreys and E. T. Jaynes. The probabilistic (Bayesian) basis for Occam's razor is elaborated by David J. C. MacKay in chapter 28 of his book Information Theory, Inference, and Learning Algorithms, where he emphasizes that a prior bias in favor of simpler models is not required.

William H. Jefferys and James O. Berger (1991) generalize and quantify the original formulation's "assumptions" concept as the degree to which a proposition is unnecessarily accommodating to possible observable data. They state, "A hypothesis with fewer adjustable parameters will automatically have an enhanced posterior probability, due to the fact that the predictions it makes are sharp." The use of "sharp" here is not only a tongue-in-cheek reference to the idea of a razor, but also indicates that such predictions are more accurate than competing predictions. The model they propose balances the precision of a theory's predictions against their sharpness, preferring theories that sharply make correct predictions over theories that accommodate a wide range of other possible results. This, again, reflects the mathematical relationship between key concepts in Bayesian inference (namely marginal probability, conditional probability, and posterior probability).

The bias–variance tradeoff is a framework that incorporates the Occam's razor principle in its balance between overfitting (associated with lower bias but higher variance) and underfitting (associated with lower variance but higher bias).

Other philosophers

Karl Popper

Karl Popper argues that a preference for simple theories need not appeal to practical or aesthetic considerations. Our preference for simplicity may be justified by its falsifiability criterion: we prefer simpler theories to more complex ones "because their empirical content is greater; and because they are better testable". The idea here is that a simple theory applies to more cases than a more complex one, and is thus more easily falsifiable. This is again comparing a simple theory to a more complex theory where both explain the data equally well.

Elliott Sober

The philosopher of science Elliott Sober once argued along the same lines as Popper, tying simplicity with "informativeness": The simplest theory is the more informative, in the sense that it requires less information to a question. He has since rejected this account of simplicity, purportedly because it fails to provide an epistemic justification for simplicity. He now believes that simplicity considerations (and considerations of parsimony in particular) do not count unless they reflect something more fundamental. Philosophers, he suggests, may have made the error of hypostatizing simplicity (i.e., endowed it with a sui generis existence), when it has meaning only when embedded in a specific context (Sober 1992). If we fail to justify simplicity considerations on the basis of the context in which we use them, we may have no non-circular justification: "Just as the question 'why be rational?' may have no non-circular answer, the same may be true of the question 'why should simplicity be considered in evaluating the plausibility of hypotheses?'"

Richard Swinburne

Richard Swinburne argues for simplicity on logical grounds:

... the simplest hypothesis proposed as an explanation of phenomena is more likely to be the true one than is any other available hypothesis, that its predictions are more likely to be true than those of any other available hypothesis, and that it is an ultimate a priori epistemic principle that simplicity is evidence for truth.
— Swinburne 1997

According to Swinburne, since our choice of theory cannot be determined by data (see Underdetermination and Duhem–Quine thesis), we must rely on some criterion to determine which theory to use. Since it is absurd to have no logical method for settling on one hypothesis amongst an infinite number of equally data-compliant hypotheses, we should choose the simplest theory: "Either science is irrational [in the way it judges theories and predictions probable] or the principle of simplicity is a fundamental synthetic a priori truth."

Ludwig Wittgenstein

From the Tractatus Logico-Philosophicus:

3.328 "If a sign is not necessary then it is meaningless. That is the meaning of Occam's Razor."

(If everything in the symbolism works as though a sign had meaning, then it has meaning.)

4.04 "In the proposition, there must be exactly as many things distinguishable as there are in the state of affairs, which it represents. They must both possess the same logical (mathematical) multiplicity (cf. Hertz's Mechanics, on Dynamic Models)."
5.47321 "Occam's Razor is, of course, not an arbitrary rule nor one justified by its practical success. It simply says that unnecessary elements in a symbolism mean nothing. Signs which serve one purpose are logically equivalent; signs which serve no purpose are logically meaningless."

and on the related concept of "simplicity":

6.363 "The procedure of induction consists in accepting as true the simplest law that can be reconciled with our experiences."

Uses

Science and the scientific method

Andreas Cellarius's illustration of the Copernican system, from the *Harmonia Macrocosmica* (1660). Future positions of the sun, moon and other solar system bodies can be calculated using a geocentric model (the earth is at the centre) or using a heliocentric model (the sun is at the centre). Both work, but the geocentric model requires a much more complex system of calculations than the heliocentric model. This was pointed out in a preface to Copernicus's first edition of *De revolutionibus orbium coelestium*.

In science, Occam's razor is used as a heuristic to guide scientists in developing theoretical models rather than as an arbiter between published models. In physics, parsimony was an important heuristic in the development and application of the principle of least action by Pierre Louis Maupertuis and Leonhard Euler,^[43] in Albert Einstein's formulation of special relativity, and in the development of quantum mechanics by Max Planck, Werner Heisenberg and Louis de Broglie.

In chemistry, Occam's razor is often an important heuristic when developing a model of a reaction mechanism. Although it is useful as a heuristic in developing models of reaction mechanisms, it has been shown to fail as a criterion for selecting among some selected published models. In this context, Einstein himself expressed caution when he formulated Einstein's Constraint: "It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience." An often-quoted version of this constraint (which cannot be verified as posited by Einstein himself) reduces this to "Everything should be kept as simple as possible, but not simpler."

In the scientific method, Occam's razor is not considered an irrefutable principle of logic or a scientific result; the preference for simplicity in the scientific method is based on the falsifiability criterion. For each accepted explanation of a phenomenon, there may be an extremely large, perhaps even incomprehensible, number of possible and more complex alternatives. Since failing explanations can always be burdened with ad hoc hypotheses to prevent them from being falsified, simpler theories are preferable to more complex ones because they tend to be more testable. As a logical principle, Occam's razor would demand that scientists accept the simplest possible theoretical explanation for existing data. However, science has shown repeatedly that future data often support more complex theories than do existing data. Science prefers the simplest explanation that is consistent with the data available at a given time, but the simplest explanation may be ruled out as new data become available. That is, science is open to the possibility that future experiments might support more complex theories than demanded by current data and is more interested in designing experiments to discriminate between competing theories than favoring one theory over another based merely on philosophical principles.

When scientists use the idea of parsimony, it has meaning only in a very specific context of inquiry. Several background assumptions are required for parsimony to connect with plausibility in a particular research problem. The reasonableness of parsimony in one research context may have nothing to do with its reasonableness in another. It is a mistake to think that there is a single global principle that spans diverse subject matter.

It has been suggested that Occam's razor is a widely accepted example of extraevidential consideration, even though it is entirely a metaphysical assumption. Most of the time, however, Occam's razor is a conservative tool, cutting out "crazy, complicated constructions" and assuring "that hypotheses are grounded in the science of the day", thus yielding "normal" science: models of explanation and prediction. There are, however, notable exceptions where Occam's razor turns a conservative scientist into a reluctant revolutionary. For example, Max Planck interpolated between the Wien and Jeans radiation laws and used Occam's razor logic to formulate the quantum hypothesis, even resisting that hypothesis as it became more obvious that it was correct.

Appeals to simplicity were used to argue against the phenomena of meteorites, ball lightning, continental drift, and reverse transcriptase. One can argue for atomic building blocks for matter, because it provides a simpler explanation for the observed reversibility of both mixingand chemical reactions as simple separation and rearrangements of atomic building blocks. At the time, however, the atomic theory was considered more complex because it implied the existence of invisible particles that had not been directly detected. Ernst Mach and the logical positivists rejected John Dalton's atomic theory until the reality of atoms was more evident in Brownian motion, as shown by Albert Einstein.

In the same way, postulating the aether is more complex than transmission of light through a vacuum. At the time, however, all known waves propagated through a physical medium, and it seemed simpler to postulate the existence of a medium than to theorize about wave propagation without a medium. Likewise, Isaac Newton's idea of light particles seemed simpler than Christiaan Huygens's idea of waves, so many favored it. In this case, as it turned out, neither the wave—nor the particle—explanation alone suffices, as light behaves like waves and like particles.

Three axioms presupposed by the scientific method are realism (the existence of objective reality), the existence of natural laws, and the constancy of natural law. Rather than depend on provability of these axioms, science depends on the fact that they have not been objectively falsified. Occam's razor and parsimony support, but do not prove, these axioms of science. The general principle of science is that theories (or models) of natural law must be consistent with repeatable experimental observations. This ultimate arbiter (selection criterion) rests upon the axioms mentioned above.

If multiple models of natural law make exactly the same testable predictions, they are equivalent and there is no need for parsimony to choose a preferred one. For example, Newtonian, Hamiltonian and Lagrangian classical mechanics are equivalent. Physicists have no interest in using Occam's razor to say the other two are wrong. Likewise, there is no demand for simplicity principles to arbitrate between wave and matrix formulations of quantum mechanics. Science often does not demand arbitration or selection criteria between models that make the same testable predictions.

Biology

Biologists or philosophers of biology use Occam's razor in either of two contexts both in evolutionary biology: the units of selection controversy and systematics. George C. Williams in his book Adaptation and Natural Selection (1966) argues that the best way to explain altruism among animals is based on low-level (i.e., individual) selection as opposed to high-level group selection. Altruism is defined by some evolutionary biologists (e.g., R. Alexander, 1987; W. D. Hamilton, 1964) as behavior that is beneficial to others (or to the group) at a cost to the individual, and many posit individual selection as the mechanism that explains altruism solely in terms of the behaviors of individual organisms acting in their own self-interest (or in the interest of their genes, via kin selection). Williams was arguing against the perspective of others who propose selection at the level of the group as an evolutionary mechanism that selects for altruistic traits (e.g., D. S. Wilson & E. O. Wilson, 2007). The basis for Williams's contention is that of the two, individual selection is the more parsimonious theory. In doing so he is invoking a variant of Occam's razor known as Morgan's Canon: "In no case is an animal activity to be interpreted in terms of higher psychological processes, if it can be fairly interpreted in terms of processes which stand lower in the scale of psychological evolution and development." (Morgan 1903).

However, more recent biological analyses, such as Richard Dawkins's The Selfish Gene, have contended that Morgan's Canon is not the simplest and most basic explanation. Dawkins argues the way evolution works is that the genes propagated in most copies end up determining the development of that particular species, i.e., natural selection turns out to select specific genes, and this is really the fundamental underlying principle that automatically gives individual and group selection as emergent features of evolution.

Zoology provides an example. Muskoxen, when threatened by wolves, form a circle with the males on the outside and the females and young on the inside. This is an example of a behavior by the males that seems to be altruistic. The behavior is disadvantageous to them individually but beneficial to the group as a whole; thus, it was seen by some to support the group selection theory. Another interpretation is kin selection: if the males are protecting their offspring, they are protecting copies of their own alleles. Engaging in this behavior would be favored by individual selection if the cost to the male musk ox is less than half of the benefit received by his calf – which could easily be the case if wolves have an easier time killing calves than adult males. It could also be the case that male musk oxen would be individually less likely to be killed by wolves if they stood in a circle with their horns pointing out, regardless of whether they were protecting the females and offspring. That would be an example of regular natural selection – a phenomenon called "the selfish herd".

Systematics is the branch of biology that attempts to establish patterns of relationship among biological taxa, today generally thought to reflect evolutionary history. It is also concerned with their classification. There are three primary camps in systematics: cladists, pheneticists, and evolutionary taxonomists. Cladists hold that classification should be based on synapomorphies (shared, derived character states), pheneticists contend that overall similarity (synapomorphies and complementary symplesiomorphies) is the determining criterion, while evolutionary taxonomists say that both genealogy and similarity count in classification (in a manner determined by the evolutionary taxonomist).

It is among the cladists that Occam's razor is applied, through the method of cladistic parsimony. Cladistic parsimony (or maximum parsimony) is a method of phylogenetic inference that yields phylogenetic trees (more specifically, cladograms). Cladograms are branching, diagrams used to represent hypotheses of relative degree of relationship, based on synapomorphies. Cladistic parsimony is used to select as the preferred hypothesis of relationships the cladogram that requires the fewest implied character state transformations (or smallest weight, if characters are differentially weighted). Critics of the cladistic approach often observe that for some types of data, parsimony could produce the wrong results, regardless of how much data is collected (this is called statistical inconsistency, or long branch attraction). However, this criticism is also potentially true for any type of phylogenetic inference, unless the model used to estimate the tree reflects the way that evolution actually happened. Because this information is not empirically accessible, the criticism of statistical inconsistency against parsimony holds no force. For a book-length treatment of cladistic parsimony, see Elliott Sober's Reconstructing the Past: Parsimony, Evolution, and Inference (1988). For a discussion of both uses of Occam's razor in biology, see Sober's article "Let's Razor Ockham's Razor" (1990).

Other methods for inferring evolutionary relationships use parsimony in a more general way. Likelihood methods for phylogeny use parsimony as they do for all likelihood tests, with hypotheses requiring fewer differing parameters (i.e., numbers or different rates of character change or different frequencies of character state transitions) being treated as null hypotheses relative to hypotheses requiring more differing parameters. Thus, complex hypotheses must predict data much better than do simple hypotheses before researchers reject the simple hypotheses. Recent advances employ information theory, a close cousin of likelihood, which uses Occam's razor in the same way. The choice of the "shortest tree" relative to a not-so-short tree under any optimality criterion (smallest distance, fewest steps, or maximum likelihood) is always based on parsimony.

Francis Crick has commented on potential limitations of Occam's razor in biology. He advances the argument that because biological systems are the products of (an ongoing) natural selection, the mechanisms are not necessarily optimal in an obvious sense. He cautions: "While Ockham's razor is a useful tool in the physical sciences, it can be a very dangerous implement in biology. It is thus very rash to use simplicity and elegance as a guide in biological research." This is an ontological critique of parsimony.

In biogeography, parsimony is used to infer ancient vicariant events or migrations of species or populations by observing the geographic distribution and relationships of existing organisms. Given the phylogenetic tree, ancestral population subdivisions are inferred to be those that require the minimum amount of change.

Religion

In the philosophy of religion, Occam's razor is sometimes applied to the existence of God. William of Ockham himself was a Christian. He believed in God, and in the authority of Christian scripture; he writes that "nothing ought to be posited without a reason given, unless it is self-evident (literally, known through itself) or known by experience or proved by the authority of Sacred Scripture." Ockham believed that an explanation has no sufficient basis in reality when it does not harmonize with reason, experience, or the Bible. Unlike many theologians of his time, though, Ockham did not believe God could be logically proven with arguments. To Ockham, science was a matter of discovery; theology was a matter of revelation and faith. He states: "Only faith gives us access to theological truths. The ways of God are not open to reason, for God has freely chosen to create a world and establish a way of salvation within it apart from any necessary laws that human logic or rationality can uncover."

Thomas Aquinas, in the Summa Theologica, uses a formulation of Occam's razor to construct an objection to the idea that God exists, which he refutes directly with a counterargument:

Further, it is superfluous to suppose that what can be accounted for by a few principles has been produced by many. But it seems that everything we see in the world can be accounted for by other principles, supposing God did not exist. For all natural things can be reduced to one principle which is nature; and all voluntary things can be reduced to one principle which is human reason, or will. Therefore there is no need to suppose God's existence.

In turn, Aquinas answers this with the quinque viae, and addresses the particular objection above with the following answer:

Since nature works for a determinate end under the direction of a higher agent, whatever is done by nature must needs be traced back to God, as to its first cause. So also whatever is done voluntarily must also be traced back to some higher cause other than human reason or will, since these can change or fail; for all things that are changeable and capable of defect must be traced back to an immovable and self-necessary first principle, as was shown in the body of the Article.

Rather than argue for the necessity of a god, some theists base their belief upon grounds independent of, or prior to, reason, making Occam's razor irrelevant. This was the stance of Søren Kierkegaard, who viewed belief in God as a leap of faith that sometimes directly opposed reason. This is also the doctrine of Gordon Clark's presuppositional apologetics, with the exception that Clark never thought the leap of faith was contrary to reason (see also Fideism).

Various arguments in favor of God establish God as a useful or even necessary assumption. Contrastingly some anti-theists hold firmly to the belief that assuming the existence of God introduces unnecessary complexity (e.g., the Ultimate Boeing 747 gambit from Dawkins's The God Delusion).

Another application of the principle is to be found in the work of George Berkeley (1685–1753). Berkeley was an idealist who believed that all of reality could be explained in terms of the mind alone. He invoked Occam's razor against materialism, stating that matter was not required by his metaphysics and was thus eliminable. One potential problem with this belief is that it's possible, given Berkeley's position, to find solipsism itself more in line with the razor than a God-mediated world beyond a single thinker.

Occam's razor may also be recognized in the apocryphal story about an exchange between Pierre-Simon Laplace and Napoleon. It is said that in praising Laplace for one of his recent publications, the emperor asked how it was that the name of God, which featured so frequently in the writings of Lagrange, appeared nowhere in Laplace's. At that, he is said to have replied, "It's because I had no need of that hypothesis." Though some points of this story illustrate Laplace's atheism, more careful consideration suggests that he may instead have intended merely to illustrate the power of methodological naturalism, or even simply that the fewer logical premises one assumes, the stronger is one's conclusion.

Philosophy of mind

In his article "Sensations and Brain Processes" (1959), J. J. C. Smart invoked Occam's razor with the aim to justify his preference of the mind-brain identity theory over spirit-body dualism. Dualists state that there are two kinds of substances in the universe: physical (including the body) and spiritual, which is non-physical. In contrast, identity theorists state that everything is physical, including consciousness, and that there is nothing nonphysical. Though it is impossible to appreciate the spiritual when limiting oneself to the physical, Smart maintained that identity theory explains all phenomena by assuming only a physical reality. Subsequently, Smart has been severely criticized for his use (or misuse) of Occam's razor and ultimately retracted his advocacy of it in this context. Paul Churchland (1984) states that by itself Occam's razor is inconclusive regarding duality. In a similar way, Dale Jacquette (1994) stated that Occam's razor has been used in attempts to justify eliminativism and reductionism in the philosophy of mind. Eliminativism is the thesis that the ontology of folk psychology including such entities as "pain", "joy", "desire", "fear", etc., are eliminable in favor of an ontology of a completed neuroscience.

Penal ethics

In penal theory and the philosophy of punishment, parsimony refers specifically to taking care in the distribution of punishment in order to avoid excessive punishment. In the utilitarian approach to the philosophy of punishment, Jeremy Bentham's "parsimony principle" states that any punishment greater than is required to achieve its end is unjust. The concept is related but not identical to the legal concept of proportionality. Parsimony is a key consideration of the modern restorative justice, and is a component of utilitarian approaches to punishment, as well as the prison abolition movement. Bentham believed that true parsimony would require punishment to be individualised to take account of the sensibility of the individual—an individual more sensitive to punishment should be given a proportionately lesser one, since otherwise needless pain would be inflicted. Later utilitarian writers have tended to abandon this idea, in large part due to the impracticality of determining each alleged criminal's relative sensitivity to specific punishments.

Probability theory and statistics

Marcus Hutter's universal artificial intelligence builds upon Solomonoff's mathematical formalization of the razor to calculate the expected value of an action.

There are various papers in scholarly journals deriving formal versions of Occam's razor from probability theory, applying it in statistical inference, and using it to come up with criteria for penalizing complexity in statistical inference. Papers have suggested a connection between Occam's razor and Kolmogorov complexity.

One of the problems with the original formulation of the razor is that it only applies to models with the same explanatory power (i.e., it only tells us to prefer the simplest of equally good models). A more general form of the razor can be derived from Bayesian model comparison, which is based on Bayes factors and can be used to compare models that do not fit the observations equally well. These methods can sometimes optimally balance the complexity and power of a model. Generally, the exact Occam factor is intractable, but approximations such as Akaike information criterion, Bayesian information criterion, Variational Bayesian methods, false discovery rate, and Laplace's method are used. Many artificial intelligence researchers are now employing such techniques, for instance through work on Occam Learning or more generally on the Free energy principle.

Statistical versions of Occam's razor have a more rigorous formulation than what philosophical discussions produce. In particular, they must have a specific definition of the term simplicity, and that definition can vary. For example, in the Kolmogorov–Chaitin minimum description length approach, the subject must pick a Turing machine whose operations describe the basic operations believed to represent "simplicity" by the subject. However, one could always choose a Turing machine with a simple operation that happened to construct one's entire theory and would hence score highly under the razor. This has led to two opposing camps: one that believes Occam's razor is objective, and one that believes it is subjective.

Objective razor

The minimum instruction set of a universal Turing machine requires approximately the same length description across different formulations, and is small compared to the Kolmogorov complexity of most practical theories. Marcus Hutter has used this consistency to define a "natural" Turing machine of small size as the proper basis for excluding arbitrarily complex instruction sets in the formulation of razors. Describing the program for the universal program as the "hypothesis", and the representation of the evidence as program data, it has been formally proven under Zermelo–Fraenkel set theory that "the sum of the log universal probability of the model plus the log of the probability of the data given the model should be minimized." Interpreting this as minimising the total length of a two-part message encoding model followed by data given model gives us the minimum message length (MML) principle.

One possible conclusion from mixing the concepts of Kolmogorov complexity and Occam's razor is that an ideal data compressor would also be a scientific explanation/formulation generator. Some attempts have been made to re-derive known laws from considerations of simplicity or compressibility.

According to Jürgen Schmidhuber, the appropriate mathematical theory of Occam's razor already exists, namely, Solomonoff's theory of optimal inductive inference and its extensions. See discussions in David L. Dowe's "Foreword re C. S. Wallace" for the subtle distinctions between the algorithmic probability work of Solomonoff and the MML work of Chris Wallace, and see Dowe's "MML, hybrid Bayesian network graphical models, statistical consistency, invariance and uniqueness" both for such discussions and for (in section 4) discussions of MML and Occam's razor. For a specific example of MML as Occam's razor in the problem of decision tree induction, see Dowe and Needham's "Message Length as an Effective Ockham's Razor in Decision Tree Induction".

Mathematical arguments against Occam's razor

The no free lunch (NFL) theorems for inductive inference prove that Occam's razor must rely on ultimately arbitrary assumptions concerning the prior probability distribution found in our world. Specifically, suppose one is given two inductive inference algorithms, A and B, where A is a Bayesian procedure based on the choice of some prior distribution motivated by Occam's razor (e.g., the prior might favor hypotheses with smaller Kolmogorov complexity). Suppose that B is the anti-Bayes procedure, which calculates what the Bayesian algorithm A based on Occam's razor will predict – and then predicts the exact opposite. Then there are just as many actual priors (including those different from the Occam's razor prior assumed by A) in which algorithm B outperforms A as priors in which the procedure A based on Occam's razor comes out on top. In particular, the NFL theorems show that the "Occam factors" Bayesian argument for Occam's razor must make ultimately arbitrary modeling assumptions.

Software development

In software development, the rule of least power argues the correct programming language to use is the one that is simplest while also solving the targeted software problem. In that form the rule is often credited to Tim Berners-Lee since it appeared in his design guidelines for the original Hypertext Transfer Protocol. Complexity in this context is measured either by placing a language into the Chomsky hierarchy or by listing idiomatic features of the language and comparing according to some agreed to scale of difficulties between idioms. Many languages once thought to be of lower complexity have evolved or later been discovered to be more complex than originally intended; so, in practice this rule is applied to the relative ease of a programmer to obtain the power of the language, rather than the precise theoretical limits of the language.

Controversial aspects

Occam's razor is not an embargo against the positing of any kind of entity, or a recommendation of the simplest theory come what may. Occam's razor is used to adjudicate between theories that have already passed "theoretical scrutiny" tests and are equally well-supported by evidence. Furthermore, it may be used to prioritize empirical testing between two equally plausible but unequally testable hypotheses; thereby minimizing costs and wastes while increasing chances of falsification of the simpler-to-test hypothesis.

Another contentious aspect of the razor is that a theory can become more complex in terms of its structure (or syntax), while its ontology (or semantics) becomes simpler, or vice versa. Quine, in a discussion on definition, referred to these two perspectives as "economy of practical expression" and "economy in grammar and vocabulary", respectively.

Galileo Galilei lampooned the misuse of Occam's razor in his Dialogue. The principle is represented in the dialogue by Simplicio. The telling point that Galileo presented ironically was that if one really wanted to start from a small number of entities, one could always consider the letters of the alphabet as the fundamental entities, since one could construct the whole of human knowledge out of them.

Instances of using Occam's razor to justify belief in less complex and more simple theories have been criticized as using the razor inappropriately. For instance Francis Crick stated that "While Occam's razor is a useful tool in the physical sciences, it can be a very dangerous implement in biology. It is thus very rash to use simplicity and elegance as a guide in biological research."

Anti-razors

Occam's razor has met some opposition from people who consider it too extreme or rash. Walter Chatton (c. 1290–1343) was a contemporary of William of Ockham who took exception to Occam's razor and Ockham's use of it. In response he devised his own anti-razor: "If three things are not enough to verify an affirmative proposition about things, a fourth must be added and so on." Although there have been several philosophers who have formulated similar anti-razors since Chatton's time, no one anti-razor has perpetuated as notably as Chatton's anti-razor, although this could be the case of the Late Renaissance Italian motto of unknown attribution Se non è vero, è ben trovato ("Even if it is not true, it is well conceived") when referred to a particularly artful explanation.

Anti-razors have also been created by Gottfried Wilhelm Leibniz (1646–1716), Immanuel Kant (1724–1804), and Karl Menger (1902–1985). Leibniz's version took the form of a principle of plenitude, as Arthur Lovejoy has called it: the idea being that God created the most varied and populous of possible worlds. Kant felt a need to moderate the effects of Occam's razor and thus created his own counter-razor: "The variety of beings should not rashly be diminished."

Karl Menger found mathematicians to be too parsimonious with regard to variables so he formulated his Law Against Miserliness, which took one of two forms: "Entities must not be reduced to the point of inadequacy" and "It is vain to do with fewer what requires more." A less serious but even more extremist anti-razor is 'Pataphysics, the "science of imaginary solutions" developed by Alfred Jarry (1873–1907). Perhaps the ultimate in anti-reductionism, "'Pataphysics seeks no less than to view each event in the universe as completely unique, subject to no laws but its own." Variations on this theme were subsequently explored by the Argentine writer Jorge Luis Borges in his story/mock-essay "Tlön, Uqbar, Orbis Tertius". Physicist R. V. Jones contrived Crabtree's Bludgeon, which states that "[n]o set of mutually inconsistent observations can exist for which some human intellect cannot conceive a coherent explanation, however complicated."

Recently, American physicist Igor Mazin argued that because high-profile physics journals prefer publications offering exotic and unusual interpretations, the Occam's razor principle is being replaced by an "Inverse Occam's razor", implying that the simplest possible explanation is usually rejected.

Other

Since 2012, The Skeptic magazine annually awards the Ockham Awards, or simply the Ockhams, named after Occam's razor, at QED. The Ockhams were introduced by editor-in-chief Deborah Hyde to "recognise the effort and time that have gone into the community's favourite skeptical blogs, skeptical podcasts, skeptical campaigns and outstanding contributors to the skeptical cause." The trophies, designed by Neil Davies and Karl Derrick, carry the upper text "Ockham's" and the lower text "The Skeptic. Shaving away unnecessary assumptions since 1285." Between the texts, there is an image of a double-edged safety razorblade, and both lower corners feature an image of William of Ockham's face.

Monday, December 23, 2024

Bayesian inference

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Bayesian_inference

Bayesian inference (/ˈbeɪziən/ BAY-zee-ən or /ˈbeɪʒən/ BAY-zhən) is a method of statistical inference in which Bayes' theorem is used to calculate a probability of a hypothesis, given prior evidence, and update it as more information becomes available. Fundamentally, Bayesian inference uses a prior distribution to estimate posterior probabilities. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

Introduction to Bayes' rule

Formal explanation

Contingency table
Hypothesis Evidence	Satisfies hypothesis H	Violates hypothesis ¬H	Total
Has evidence E	P(H\|E)·P(E) = P(E\|H)·P(H)	P(¬H\|E)·P(E) = P(E\|¬H)·P(¬H)	P(E)
No evidence ¬E	P(H\|¬E)·P(¬E) = P(¬E\|H)·P(H)	P(¬H\|¬E)·P(¬E) = P(¬E\|¬H)·P(¬H)	P(¬E) = 1−P(E)

Total	P(H)	P(¬H) = 1−P(H)	1

Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem: $P (H ∣ E) = \frac{P (E ∣ H) \cdot P (H)}{P (E)},$ where

$H$ stands for any hypothesis whose probability may be affected by data (called evidence below). Often there are competing hypotheses, and the task is to determine which is the most probable.
$P (H)$ , the prior probability, is the estimate of the probability of the hypothesis $H$ before the data $E$ , the current evidence, is observed.
$E$ , the evidence, corresponds to new data that were not used in computing the prior probability.
$P (H ∣ E)$ , the posterior probability, is the probability of $H$ given $E$ , i.e., after $E$ is observed. This is what we want to know: the probability of a hypothesis given the observed evidence.
$P (E ∣ H)$ is the probability of observing $E$ given $H$ and is called the likelihood. As a function of $E$ with $H$ fixed, it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, $E$ , while the posterior probability is a function of the hypothesis, $H$ .
$P (E)$ is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered (as is evident from the fact that the hypothesis $H$ does not appear anywhere in the symbol, unlike for all the other factors) and hence does not factor into determining the relative probabilities of different hypotheses.
$P (E) > 0$ (Else one has $0 / 0$ .)

For different values of $H$ , only the factors $P (H)$ and $P (E ∣ H)$ , both in the numerator, affect the value of $P (H ∣ E)$ – the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).

In cases where $\neg H$ ("not $H$ "), the logical negation of $H$ , is a valid likelihood, Bayes' rule can be rewritten as follows: $\begin{aligned} P (H ∣ E) & = \frac{P (E ∣ H) P (H)}{P (E)} \\ = \frac{P (E ∣ H) P (H)}{P (E ∣ H) P (H) + P (E ∣ \neg H) P (\neg H)} \\ = \frac{1}{1 + (\frac{1}{P (H)} - 1) \frac{P (E ∣ \neg H)}{P (E ∣ H)}} \end{aligned}$ because $P (E) = P (E ∣ H) P (H) + P (E ∣ \neg H) P (\neg H)$ and $P (H) + P (\neg H) = 1.$ This focuses attention on the term $(\frac{1}{P (H)} - 1) \frac{P (E ∣ \neg H)}{P (E ∣ H)} .$ If that term is approximately 1, then the probability of the hypothesis given the evidence, $P (H ∣ E)$ , is about $\frac{1}{2}$ , about 50% likely - equally likely or not likely. If that term is very small, close to zero, then the probability of the hypothesis, given the evidence, $P (H ∣ E)$ is close to 1 or the conditional hypothesis is quite likely. If that term is very large, much larger than 1, then the hypothesis, given the evidence, is quite unlikely. If the hypothesis (without consideration of evidence) is unlikely, then $P (H)$ is small (but not necessarily astronomically small) and $\frac{1}{P (H)}$ is much larger than 1 and this term can be approximated as $\frac{P (E ∣ \neg H)}{P (E ∣ H) \cdot P (H)}$ and relevant probabilities can be compared directly to each other.

One quick and easy way to remember the equation would be to use rule of multiplication: $P (E \cap H) = P (E ∣ H) P (H) = P (H ∣ E) P (E) .$

Alternatives to Bayesian updating

Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational.

Ian Hacking noted that traditional "Dutch book" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote: "And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour."

Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "probability kinematics") following the publication of Richard C. Jeffrey's rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability. The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.

Inference over exclusive and exhaustive possibilities

If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.

General formulation

Suppose a process is generating independent and identically distributed events $E_{n}, n = 1, 2, 3, \dots$ , but the probability distribution is unknown. Let the event space $Ω$ represent the current state of belief for this process. Each model is represented by event $M_{m}$ . The conditional probabilities $P (E_{n} ∣ M_{m})$ are specified to define the models. $P (M_{m})$ is the degree of belief in $M_{m}$ . Before the first inference step, ${P (M_{m})}$ is a set of initial prior probabilities. These must sum to 1, but are otherwise arbitrary.

Suppose that the process is observed to generate $E \in {E_{n}}$ . For each $M \in {M_{m}}$ , the prior $P (M)$ is updated to the posterior $P (M ∣ E)$ . From Bayes' theorem:

$P (M ∣ E) = \frac{P (E ∣ M)}{\sum_{m} P (E ∣ M_{m}) P (M_{m})} \cdot P (M) .$

Upon observation of further evidence, this procedure may be repeated.

Multiple observations

For a sequence of independent and identically distributed observations $E = (e_{1}, \dots, e_{n})$ , it can be shown by induction that repeated application of the above is equivalent to $P (M ∣ E) = \frac{P (E ∣ M)}{\sum_{m} P (E ∣ M_{m}) P (M_{m})} \cdot P (M),$ where $P (E ∣ M) = \prod_{k} P (e_{k} ∣ M) .$

Parametric formulation: motivating the formal description

By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is, however, equally applicable to discrete distributions.

Let the vector $θ$ span the parameter space. Let the initial prior distribution over $θ$ be $p (θ ∣ α)$ , where $α$ is a set of parameters to the prior itself, or hyperparameters. Let $E = (e_{1}, \dots, e_{n})$ be a sequence of independent and identically distributed event observations, where all $e_{i}$ are distributed as $p (e ∣ θ)$ for some $θ$ . Bayes' theorem is applied to find the posterior distribution over $θ$ :

$\begin{aligned} p (θ ∣ E, α) & = \frac{p (E ∣ θ, α)}{p (E ∣ α)} \cdot p (θ ∣ α) \\ = \frac{p (E ∣ θ, α)}{\int p (E ∣ θ, α) p (θ ∣ α) d θ} \cdot p (θ ∣ α), \end{aligned}$ where $p (E ∣ θ, α) = \prod_{k} p (e_{k} ∣ θ) .$

Formal description of Bayesian inference

Definitions

$x$ , a data point in general. This may in fact be a vector of values.
$θ$ , the parameter of the data point's distribution, i.e., $x \sim p (x ∣ θ)$ . This may be a vector of parameters.
$α$ , the hyperparameter of the parameter distribution, i.e., $θ \sim p (θ ∣ α)$ . This may be a vector of hyperparameters.
$X$ is the sample, a set of $n$ observed data points, i.e., $x_{1}, \dots, x_{n}$ .
$\tilde{x}$ , a new data point whose distribution is to be predicted.

Bayesian inference

The prior distribution is the distribution of the parameter(s) before any data is observed, i.e. $p (θ ∣ α)$ . The prior distribution might not be easily determined; in such a case, one possibility may be to use the Jeffreys prior to obtain a prior distribution before updating it with newer observations.
The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. $p (X ∣ θ)$ . This is also termed the likelihood, especially when viewed as a function of the parameter(s), sometimes written $L (θ ∣ X) = p (X ∣ θ)$ .
The marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalized over the parameter(s), i.e. $p (X ∣ α) = \int p (X ∣ θ) p (θ ∣ α) d θ .$ It quantifies the agreement between data and expert opinion, in a geometric sense that can be made precise.^[6] If the marginal likelihood is 0 then there is no agreement between the data and expert opinion and Bayes' rule cannot be applied.
The posterior distribution is the distribution of the parameter(s) after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference: $p (θ ∣ X, α) = \frac{p (θ, X, α)}{p (X, α)} = \frac{p (X ∣ θ, α) p (θ, α)}{p (X ∣ α) p (α)} = \frac{p (X ∣ θ, α) p (θ ∣ α)}{p (X ∣ α)} \propto p (X ∣ θ, α) p (θ ∣ α) .$ This is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
In practice, for almost all complex Bayesian models used in machine learning, the posterior distribution $p (θ ∣ X, α)$ is not obtained in a closed form distribution, mainly because the parameter space for $θ$ can be very high, or the Bayesian model retains certain hierarchical structure formulated from the observations $X$ and parameter $θ$ . In such situations, we need to resort to approximation techniques.
General case: Let $P_{Y}^{x}$ be the conditional distribution of $Y$ given $X = x$ and let $P_{X}$ be the distribution of $X$ . The joint distribution is then $P_{X, Y} (d x, d y) = P_{Y}^{x} (d y) P_{X} (d x)$ . The conditional distribution $P_{X}^{y}$ of $X$ given $Y = y$ is then determined by

$P_{X}^{y} (A) = E (1_{A} (X) | Y = y)$ Existence and uniqueness of the needed conditional expectation is a consequence of the Radon–Nikodym theorem. This was formulated by Kolmogorov in his famous book from 1933. Kolmogorov underlines the importance of conditional probability by writing "I wish to call attention to ... and especially the theory of conditional probabilities and conditional expectations ..." in the Preface. The Bayes theorem determines the posterior distribution from the prior distribution. Uniqueness requires continuity assumptions. Bayes' theorem can be generalized to include improper prior distributions such as the uniform distribution on the real line. Modern Markov chain Monte Carlo methods have boosted the importance of Bayes' theorem including cases with improper priors.

Bayesian prediction

The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior: $p (\tilde{x} ∣ X, α) = \int p (\tilde{x} ∣ θ) p (θ ∣ X, α) d θ$
The prior predictive distribution is the distribution of a new data point, marginalized over the prior: $p (\tilde{x} ∣ α) = \int p (\tilde{x} ∣ θ) p (θ ∣ α) d θ$

Bayesian theory calls for the use of the posterior predictive distribution to do predictive inference, i.e., to predict the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned. Only this way is the entire posterior distribution of the parameter(s) used. By comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s)—e.g., by maximum likelihood or maximum a posteriori estimation (MAP)—and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the variance of the predictive distribution.

In some instances, frequentist statistics can work around this problem. For example, confidence intervals and prediction intervals in frequentist statistics when constructed from a normal distribution with unknown mean and variance are constructed using a Student's t-distribution. This correctly estimates the variance, due to the facts that (1) the average of normally distributed random variables is also normally distributed, and (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly—or at least to an arbitrary level of precision when numerical methods are used.

Both types of predictive distributions have the form of a compound probability distribution (as does the marginal likelihood). In fact, if the prior distribution is a conjugate prior, such that the prior and posterior distributions come from the same family, it can be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.

Mathematical properties

Interpretation of factor

$\frac{P (E ∣ M)}{P (E)} > 1 \Rightarrow P (E ∣ M) > P (E)$ . That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change, $\frac{P (E ∣ M)}{P (E)} = 1 \Rightarrow P (E ∣ M) = P (E)$ . That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.

Cromwell's rule

If $P (M) = 0$ then $P (M ∣ E) = 0$ . If $P (M) = 1$ and $P (E) > 0$ , then $P (M | E) = 1$ . This can be interpreted to mean that hard convictions are insensitive to counter-evidence.

The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not $M$ " in place of " $M$ ", yielding "if $1 - P (M) = 0$ , then $1 - P (M ∣ E) = 0$ ", from which the result immediately follows.

Asymptotic behaviour of posterior

Consider the behaviour of a belief distribution as it is updated a large number of times with independent and identically distributed trials. For sufficiently nice prior probabilities, the Bernstein-von Mises theorem gives that in the limit of infinite trials, the posterior converges to a Gaussian distribution independent of the initial prior under some conditions firstly outlined and rigorously proven by Joseph L. Doob in 1948, namely if the random variable in consideration has a finite probability space. The more general results were obtained later by the statistician David A. Freedman who published in two seminal research papers in 1963 and 1965 when and under what circumstances the asymptotic behaviour of posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable probability space (i.e., corresponding to a die with infinite many faces) the 1965 paper demonstrates that for a dense subset of priors the Bernstein-von Mises theorem is not applicable. In this case there is almost surely no asymptotic convergence. Later in the 1980s and 1990s Freedman and Persi Diaconis continued to work on the case of infinite countable probability spaces. To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.

Conjugate priors

In parameterized form, the prior distribution is often assumed to come from a family of distributions called conjugate priors. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in closed form.

Estimates of parameters and predictions

It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select measurements of central tendency from the posterior distribution.

For one-dimensional problems, a unique median exists for practical continuous problems. The posterior median is attractive as a robust estimator.

If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation. $\tilde{θ} = E [θ] = \int θ p (θ ∣ X, α) d θ$

Taking a value with the greatest probability defines maximum a posteriori (MAP) estimates: ${θ_{MAP}} \subset \arg max_{θ} p (θ ∣ X, α) .$

There are examples where no maximum is attained, in which case the set of MAP estimates is empty.

There are other methods of estimation that minimize the posterior risk (expected-posterior loss) with respect to a loss function, and these are of interest to statistical decision theory using the sampling distribution ("frequentist statistics").

The posterior predictive distribution of a new observation $\tilde{x}$ (that is independent of previous observations) is determined by $p (\tilde{x} | X, α) = \int p (\tilde{x}, θ ∣ X, α) d θ = \int p (\tilde{x} ∣ θ) p (θ ∣ X, α) d θ .$

Examples

Probability of a hypothesis

Contingency table
Bowl Cookie	#1 H₁	#2 H₂	Total
Plain, E	30	20	50
Choc, ¬E	10	20	30
Total	40	40	80
P(H₁\|E) = 30 / 50 = 0.6

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?

Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let $H_{1}$ correspond to bowl #1, and $H_{2}$ to bowl #2. It is given that the bowls are identical from Fred's point of view, thus $P (H_{1}) = P (H_{2})$ , and the two must add up to 1, so both are equal to 0.5. The event $E$ is the observation of a plain cookie. From the contents of the bowls, we know that $P (E ∣ H_{1}) = 30 / 40 = 0.75$ and $P (E ∣ H_{2}) = 20 / 40 = 0.5.$ Bayes' formula then yields $\begin{aligned} P (H_{1} ∣ E) & = \frac{P (E ∣ H_{1}) P (H_{1})}{P (E ∣ H_{1}) P (H_{1}) + P (E ∣ H_{2}) P (H_{2})} \\ = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} \\ = 0.6 \end{aligned}$

Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, $P (H_{1})$ , which was 0.5. After observing the cookie, we must revise the probability to $P (H_{1} ∣ E)$ , which is 0.6.

Making a prediction

An archaeologist is working at a site thought to be from the medieval period, between the 11th century to the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed?

The degree of belief in the continuous variable $C$ (century) is to be calculated, with the discrete set of events ${G D, G \bar{D}, \bar{G} D, \bar{G} \bar{D}}$ as evidence. Assuming linear variation of glaze and decoration with time, and that these variables are independent,

$P (E = G D ∣ C = c) = (0.01 + \frac{0.81 - 0.01}{16 - 11} (c - 11)) (0.5 - \frac{0.5 - 0.05}{16 - 11} (c - 11))$ $P (E = G \bar{D} ∣ C = c) = (0.01 + \frac{0.81 - 0.01}{16 - 11} (c - 11)) (0.5 + \frac{0.5 - 0.05}{16 - 11} (c - 11))$ $P (E = \bar{G} D ∣ C = c) = ((1 - 0.01) - \frac{0.81 - 0.01}{16 - 11} (c - 11)) (0.5 - \frac{0.5 - 0.05}{16 - 11} (c - 11))$ $P (E = \bar{G} \bar{D} ∣ C = c) = ((1 - 0.01) - \frac{0.81 - 0.01}{16 - 11} (c - 11)) (0.5 + \frac{0.5 - 0.05}{16 - 11} (c - 11))$

Assume a uniform prior of $f_{C} (c) = 0.2$ , and that trials are independent and identically distributed. When a new fragment of type $e$ is discovered, Bayes' theorem is applied to update the degree of belief for each $c$ : $f_{C} (c ∣ E = e) = \frac{P (E = e ∣ C = c)}{P (E = e)} f_{C} (c) = \frac{P (E = e ∣ C = c)}{\int_{11}^{16} P (E = e ∣ C = c) f_{C} (c) d c} f_{C} (c)$

A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or $c = 15.2$ . By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about 1% chance that it was inhabited during the 13th century, 63% chance during the 14th century and 36% during the 15th century. The Bernstein-von Mises theorem asserts here the asymptotic convergence to the "true" distribution because the probability space corresponding to the discrete set of events ${G D, G \bar{D}, \bar{G} D, \bar{G} \bar{D}}$ is finite (see above section on asymptotic behaviour of the posterior).

In frequentist statistics and decision theory

A decision-theoretic justification of the use of Bayesian inference was given by Abraham Wald, who proved that every unique Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures.

Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of frequentist inference as parameter estimation, hypothesis testing, and computing confidence intervals. For example:

"Under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility."
"In decision theory, a quite general method for proving admissibility consists in exhibiting a procedure as a unique Bayes solution."
"In the first chapters of this work, prior distributions with finite support and the corresponding Bayes procedures were used to establish some of the main theorems relating to the comparison of experiments. Bayes procedures with respect to more general prior distributions have played a very important role in the development of statistics, including its asymptotic theory." "There are many problems where a glance at posterior distributions, for suitable priors, yields immediately interesting information. Also, this technique can hardly be avoided in sequential analysis."
"A useful fact is that any Bayes decision rule obtained by taking a proper prior over the whole parameter space must be admissible"
"An important area of investigation in the development of admissibility ideas has been that of conventional sampling-theory procedures, and many interesting results have been obtained."

Model selection

Bayesian methodology also plays a role in model selection where the aim is to select one model from a set of competing models that represents most closely the underlying process that generated the observed data. In Bayesian model comparison, the model with the highest posterior probability given the data is selected. The posterior probability of a model depends on the evidence, or marginal likelihood, which reflects the probability that the data is generated by the model, and on the prior belief of the model. When two competing models are a priori considered to be equiprobable, the ratio of their posterior probabilities corresponds to the Bayes factor. Since Bayesian model comparison is aimed on selecting the model with the highest posterior probability, this methodology is also referred to as the maximum a posteriori (MAP) selection rule or the MAP probability rule.

Probabilistic programming

While conceptually simple, Bayesian methods can be mathematically and numerically challenging. Probabilistic programming languages (PPLs) implement functions to easily build Bayesian models together with efficient automatic inference methods. This helps separate the model building from the inference, allowing practitioners to focus on their specific problems and leaving PPLs to handle the computational details for them.

Applications

Statistical data analysis

See the separate Wikipedia entry on Bayesian statistics, specifically the statistical modeling section in that page.

Computer applications

Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s. There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms like the Gibbs sampling and other Metropolis–Hastings algorithm schemes. Recently Bayesian inference has gained popularity among the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.

As applied to statistical classification, Bayesian inference has been used to develop algorithms for identifying e-mail spam. Applications which make use of Bayesian inference for spam filtering include CRM114, DSPAM, Bogofilter, SpamAssassin, SpamBayes, Mozilla, XEAMS, and others. Spam classification is treated in more detail in the article on the naïve Bayes classifier.

Solomonoff's Inductive inference is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. It is a formal inductive framework that combines two well-studied principles of inductive inference: Bayesian statistics and Occam's Razor. Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.

Bioinformatics and healthcare applications

Bayesian inference has been applied in different Bioinformatics applications, including differential gene expression analysis. Bayesian inference is also used in a general cancer risk model, called CIRI (Continuous Individualized Risk Index), where serial measurements are incorporated to update a Bayesian model which is primarily built from prior knowledge.

In the courtroom

Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for "beyond a reasonable doubt". Bayes' theorem is applied successively to all evidence presented, with the posterior from one stage becoming the prior for the next. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence. It may be appropriate to explain Bayes' theorem to jurors in odds form, as betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach, replacing multiplication with addition, might be easier for a jury to handle.

If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population. For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.

The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence expert witness explained Bayes' theorem to the jury in R v Adams. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task."

Gardner-Medwin argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:

A – the known facts and testimony could have arisen if the defendant is guilty.

B – the known facts and testimony could have arisen if the defendant is innocent.

C – the defendant is guilty.

Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.

Bayesian epistemology

Bayesian epistemology is a movement that advocates for Bayesian inference as a means of justifying the rules of inductive logic.

Karl Popper and David Miller have rejected the idea of Bayesian rationalism, i.e. using Bayes rule to make epistemological inferences: It is prone to the same vicious circle as any other justificationist epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.

Other

The scientific method is sometimes interpreted as an application of Bayesian inference. In this view, Bayes' rule guides (or should guide) the updating of probabilities about hypotheses conditional on new observations or experiments. The Bayesian inference has also been applied to treat stochastic scheduling problems with incomplete information by Cai et al. (2009).
Bayesian search theory is used to search for lost objects.
Bayesian inference in phylogeny
Bayesian tool for methylation analysis
Bayesian approaches to brain function investigate the brain as a Bayesian mechanism.
Bayesian inference in ecological studies
Bayesian inference is used to estimate parameters in stochastic chemical kinetic models
Bayesian inference in econophysics for currency or prediction of trend changes in financial quotations
Bayesian inference in marketing
Bayesian inference in motor learning
Bayesian inference is used in probabilistic numerics to solve numerical problems

Bayes and Bayesian inference

The problem considered by Bayes in Proposition 9 of his essay, "An Essay Towards Solving a Problem in the Doctrine of Chances", is the posterior distribution for the parameter a (the success rate) of the binomial distribution.

History

The term Bayesian refers to Thomas Bayes (1701–1761), who proved that probabilistic limits could be placed on an unknown event. However, it was Pierre-Simon Laplace (1749–1827) who introduced (as Principle VI) what is now called Bayes' theorem and used it to address problems in celestial mechanics, medical statistics, reliability, and jurisprudence. Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes). After the 1920s, "inverse probability" was largely supplanted by a collection of methods that came to be called frequentist statistics.

In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objective or "non-informative" current, the statistical analysis depends on only the model assumed, the data analyzed, and the method assigning the prior, which differs from one objective Bayesian practitioner to another. In the subjective or "informative" current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.

In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications. Despite growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics. Nonetheless, Bayesian methods are widely accepted and used, such as for example in the field of machine learning.

Search This Blog

Tuesday, December 24, 2024

Occam's razor

History

Formulations before William of Ockham

William of Ockham

Later formulations

Justifications

Aesthetic

Empirical

Testing the razor

Practical considerations and pragmatism

Mathematical

Other philosophers

Karl Popper

Elliott Sober

Richard Swinburne

Ludwig Wittgenstein

Uses

Science and the scientific method

Biology

Religion

Philosophy of mind

Penal ethics

Probability theory and statistics

Objective razor

Mathematical arguments against Occam's razor

Software development

Controversial aspects

Anti-razors

Other

Monday, December 23, 2024

Bayesian inference

Introduction to Bayes' rule

Formal explanation

Alternatives to Bayesian updating

Inference over exclusive and exhaustive possibilities

General formulation

Multiple observations

Parametric formulation: motivating the formal description

Formal description of Bayesian inference

Definitions

Bayesian inference

Bayesian prediction

Mathematical properties

Interpretation of factor

Cromwell's rule

Asymptotic behaviour of posterior

Conjugate priors

Estimates of parameters and predictions

Examples

Probability of a hypothesis

Making a prediction

In frequentist statistics and decision theory

Model selection

Probabilistic programming

Applications

Statistical data analysis

Computer applications

Bioinformatics and healthcare applications

In the courtroom

Bayesian epistemology

Other

Bayes and Bayesian inference

History

Energy flow (ecology)