
Tuesday, November 6, 2018

History of statistics

From Wikipedia, the free encyclopedia

The history of statistics in the modern sense dates from the mid-17th century, with the term statistics itself coined in 1749 in German, although there have been changes to the interpretation of the word over time. The development of statistics is intimately connected on the one hand with the development of sovereign states, particularly European states following the Peace of Westphalia (1648), and on the other hand with the development of probability theory, which put statistics on a firm theoretical basis.

In early times, the meaning was restricted to information about states, particularly demographics such as population. This was later extended to include all collections of information of all types, and later still it was extended to include the analysis and interpretation of such data. In modern terms, "statistics" means both sets of collected information, as in national accounts and temperature records, and analytical work which requires statistical inference. Statistical activities are often associated with models expressed using probabilities, hence the connection with probability theory. The large requirements of data processing have made statistics a key application of computing; see history of computing hardware. A number of statistical concepts have an important impact on a wide range of sciences. These include the design of experiments and approaches to statistical inference such as Bayesian inference, each of which can be considered to have their own sequence in the development of the ideas underlying modern statistics.

Introduction

By the 18th century, the term "statistics" designated the systematic collection of demographic and economic data by states. For at least two millennia, these data were mainly tabulations of human and material resources that might be taxed or put to military use. In the early 19th century, collection intensified, and the meaning of "statistics" broadened to include the discipline concerned with the collection, summary, and analysis of data. Today, data is collected and statistics are computed and widely distributed in government, business, most of the sciences and sports, and even for many pastimes. Electronic computers have expedited more elaborate statistical computation even as they have facilitated the collection and aggregation of data. A single data analyst may have available a set of data-files with millions of records, each with dozens or hundreds of separate measurements. These are collected over time from computer activity (for example, a stock exchange) or from computerized sensors, point-of-sale registers, and so on. Computers then produce simple, accurate summaries, and allow more tedious analyses, such as those that require inverting a large matrix or performing hundreds of steps of iteration, that would never be attempted by hand. Faster computing has allowed statisticians to develop "computer-intensive" methods which may look at all permutations, or use randomization to look at 10,000 permutations of a problem, to estimate answers that are not easy to quantify by theory alone.
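As a rough illustration of one such computer-intensive method, the sketch below (Python, with hypothetical measurements) approximates a permutation test: the observed difference in group means is compared with the differences produced by 10,000 random relabellings of the data.

    import random

    # Hypothetical measurements for two groups (illustration only).
    group_a = [5.1, 4.8, 6.0, 5.5, 5.9, 6.2]
    group_b = [4.2, 4.9, 4.4, 5.0, 4.6, 4.1]

    def mean(xs):
        return sum(xs) / len(xs)

    observed = mean(group_a) - mean(group_b)
    pooled = group_a + group_b
    n_a = len(group_a)

    trials = 10_000
    extreme = 0
    for _ in range(trials):
        random.shuffle(pooled)                      # one random relabelling of the data
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):              # at least as extreme as what was observed
            extreme += 1

    print(extreme / trials)                         # Monte Carlo estimate of the p-value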

The term "mathematical statistics" designates the mathematical theories of probability and statistical inference, which are used in statistical practice. The relation between statistics and probability theory developed rather late, however. In the 19th century, statistics increasingly used probability theory, whose initial results were found in the 17th and 18th centuries, particularly in the analysis of games of chance (gambling). By 1800, astronomy used probability models and statistical theories, particularly the method of least squares. Early probability theory and statistics was systematized in the 19th century and statistical reasoning and probability models were used by social scientists to advance the new sciences of experimental psychology and sociology, and by physical scientists in thermodynamics and statistical mechanics. The development of statistical reasoning was closely associated with the development of inductive logic and the scientific method, which are concerns that move statisticians away from the narrower area of mathematical statistics. Much of the theoretical work was readily available by the time computers were available to exploit them. By the 1970s, Johnson and Kotz produced a four-volume Compendium on Statistical Distributions (1st ed., 1969-1972), which is still an invaluable resource.

Applied statistics can be regarded as not a field of mathematics but an autonomous mathematical science, like computer science and operations research. Unlike mathematics, statistics had its origins in public administration. Applications arose early in demography and economics; large areas of micro- and macro-economics today are "statistics" with an emphasis on time-series analyses. With its emphasis on learning from data and making best predictions, statistics also has been shaped by areas of academic research including psychological testing, medicine and epidemiology. The ideas of statistical testing have considerable overlap with decision science. With its concerns with searching and effectively presenting data, statistics has overlap with information science and computer science.

Etymology

The term statistics is ultimately derived from the New Latin statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician"). The German Statistik, first introduced by Gottfried Achenwall (1749), originally designated the analysis of data about the state, signifying the "science of state" (then called political arithmetic in English). It acquired the meaning of the collection and classification of data generally in the early 19th century. It was introduced into English in 1791 by Sir John Sinclair when he published the first of 21 volumes titled Statistical Account of Scotland.

Thus, the original principal purpose of Statistik was data to be used by governmental and (often centralized) administrative bodies. The collection of data about states and localities continues, largely through national and international statistical services. In particular, censuses provide frequently updated information about the population.

The first book to have 'statistics' in its title was "Contributions to Vital Statistics" (1845) by Francis GP Neison, actuary to the Medical Invalid and General Life Office.

Origins in probability theory

Basic forms of statistics have been used since the beginning of civilization. Early empires often collated censuses of the population or recorded the trade in various commodities. The Roman Empire was one of the first states to extensively gather data on the size of the empire's population, geographical area and wealth.

The use of statistical methods dates back at least to the 5th century BCE. The historian Thucydides in his History of the Peloponnesian War describes how the Athenians calculated the height of the wall of Plataea by counting the number of bricks in an unplastered section of the wall sufficiently near them to be able to count them. The count was repeated several times by a number of soldiers. The most frequent value (in modern terminology, the mode) so determined was taken to be the most likely value of the number of bricks. Multiplying this value by the height of the bricks used in the wall allowed the Athenians to determine the height of the ladders necessary to scale the walls.

The earliest writing on statistics was found in a 9th-century book entitled: "Manuscript on Deciphering Cryptographic Messages", written by Al-Kindi (801–873 CE). In his book, Al-Kindi gave a detailed description of how to use statistics and frequency analysis to decipher encrypted messages. This text arguably gave rise to the birth of both statistics and cryptanalysis.

The Trial of the Pyx is a test of the purity of the coinage of the Royal Mint which has been held on a regular basis since the 12th century. The Trial itself is based on statistical sampling methods. After minting a series of coins - originally from ten pounds of silver - a single coin was placed in the Pyx - a box in Westminster Abbey. After a given period - now once a year - the coins are removed and weighed. A sample of coins removed from the box is then tested for purity.

The Nuova Cronica, a 14th-century history of Florence by the Florentine banker and official Giovanni Villani, includes much statistical information on population, ordinances, commerce and trade, education, and religious facilities and has been described as the first introduction of statistics as a positive element in history, though neither the term nor the concept of statistics as a specific field yet existed. This characterization was later revised in light of the rediscovery of Al-Kindi's much earlier book on frequency analysis.

The arithmetic mean, although a concept known to the Greeks, was not generalised to more than two values until the 16th century. The invention of the decimal system by Simon Stevin in 1585 seems likely to have facilitated these calculations. This method was first adopted in astronomy by Tycho Brahe who was attempting to reduce the errors in his estimates of the locations of various celestial bodies.

The idea of the median originated in Edward Wright's book on navigation (Certaine Errors in Navigation) in 1599 in a section concerning the determination of location with a compass. Wright felt that this value was the most likely to be the correct value in a series of observations.

Sir William Petty, a 17th-century economist who used early statistical methods to analyse demographic data.

The birth of statistics is often dated to 1662, when John Graunt, along with William Petty, developed early human statistical and census methods that provided a framework for modern demography. He produced the first life table, giving probabilities of survival to each age. His book Natural and Political Observations Made upon the Bills of Mortality used analysis of the mortality rolls to make the first statistically based estimation of the population of London. He knew that there were around 13,000 funerals per year in London and that three people died per eleven families per year. He estimated from the parish records that the average family size was 8 and calculated that the population of London was about 384,000; this is the first known use of a ratio estimator. Laplace in 1802 estimated the population of France with a similar method.
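The arithmetic behind Graunt's ratio estimate can be reconstructed from the figures quoted above; the short sketch below (Python) is only an illustration of the method.

    # Graunt's ratio estimate, reconstructed from the figures quoted above.
    funerals_per_year = 13_000        # burials recorded in the Bills of Mortality
    deaths_per_family = 3 / 11        # roughly three deaths per eleven families per year
    persons_per_family = 8            # average family size from the parish records

    families = funerals_per_year / deaths_per_family
    population = families * persons_per_family
    print(round(population))          # about 381,000, close to Graunt's reported 384,000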

Although the original scope of statistics was limited to data useful for governance, the approach was extended to many fields of a scientific or commercial nature during the 19th century. The mathematical foundations for the subject drew heavily on the new probability theory, pioneered in the 16th century by Gerolamo Cardano and in the 17th century by Pierre de Fermat and Blaise Pascal. Christiaan Huygens (1657) gave the earliest known scientific treatment of the subject. Jakob Bernoulli's Ars Conjectandi (posthumous, 1713) and Abraham de Moivre's The Doctrine of Chances (1718) treated the subject as a branch of mathematics. In his book Bernoulli introduced the idea of representing complete certainty as one and probability as a number between zero and one.

A key early application of statistics in the 18th century was to the human sex ratio at birth. John Arbuthnot studied this question in 1710. Arbuthnot examined birth records in London for each of the 82 years from 1629 to 1710. In every year, the number of males born in London exceeded the number of females. Considering more male or more female births as equally likely, the probability of the observed outcome is 0.5^82, or about 1 in 4.8 × 10^24; in modern terms, this is the p-value. This is vanishingly small, leading Arbuthnot to conclude that this was not due to chance but to divine providence: "From whence it follows, that it is Art, not Chance, that governs." This and other work by Arbuthnot is credited as "the first use of significance tests", the first example of reasoning about statistical significance and moral certainty, and "… perhaps the first published report of a nonparametric test …", specifically the sign test.
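The probability Arbuthnot computed is simple to reproduce; the snippet below (Python) is just a check of the arithmetic.

    from fractions import Fraction

    # Probability of 82 consecutive years of male-dominant births if each
    # outcome were equally likely (Arbuthnot's sign-test argument).
    p = Fraction(1, 2) ** 82
    print(float(p))                   # about 2.07e-25, i.e. roughly 1 in 4.8 * 10**24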

The formal study of theory of errors may be traced back to Roger Cotes' Opera Miscellanea (posthumous, 1722), but a memoir prepared by Thomas Simpson in 1755 (printed 1756) first applied the theory to the discussion of errors of observation. The reprint (1757) of this memoir lays down the axioms that positive and negative errors are equally probable, and that there are certain assignable limits within which all errors may be supposed to fall; continuous errors are discussed and a probability curve is given. Simpson discussed several possible distributions of error. He first considered the uniform distribution and then the discrete symmetric triangular distribution, followed by the continuous symmetric triangular distribution. Tobias Mayer, in his study of the libration of the moon (Kosmographische Nachrichten, Nuremberg, 1750), invented the first formal method for estimating the unknown quantities by generalizing the averaging of observations under identical circumstances to the averaging of groups of similar equations.

Roger Joseph Boscovich, in his 1755 work on the shape of the earth, De Litteraria expeditione per pontificiam ditionem ad dimetiendos duos meridiani gradus a PP. Maire et Boscovicli, proposed that the true value of a series of observations would be that which minimises the sum of absolute errors. In modern terminology this value is the median. The first example of what later became known as the normal curve was studied by Abraham de Moivre, who plotted this curve on November 12, 1733. De Moivre was studying the number of heads that occurred when a 'fair' coin was tossed.
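The claim that the sum of absolute errors is minimised at the median is easy to verify numerically; the sketch below (Python, hypothetical observations) searches a grid of candidate values.

    # Numerical check that the sum of absolute errors is minimised at the median
    # (hypothetical observations, illustration only).
    obs = [2.0, 3.5, 4.0, 7.0, 11.0]

    def sum_abs_error(c):
        return sum(abs(x - c) for x in obs)

    candidates = [k / 100 for k in range(200, 1101)]   # grid from 2.00 to 11.00
    best = min(candidates, key=sum_abs_error)
    print(best)                                        # 4.0, the median of the observations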

In 1761 Thomas Bayes proved Bayes' theorem and in 1765 Joseph Priestley invented the first timeline charts.

Johann Heinrich Lambert in his 1765 book Anlage zur Architectonic proposed the semicircle as a distribution of errors, with density proportional to √(1 − x²) for −1 < x < 1.

Probability density plots for the Laplace distribution.

Pierre-Simon Laplace (1774) made the first attempt to deduce a rule for the combination of observations from the principles of the theory of probabilities. He represented the law of probability of errors by a curve and deduced a formula for the mean of three observations.

Laplace in 1774 noted that the frequency of an error could be expressed as an exponential function of its magnitude once its sign was disregarded. This distribution is now known as the Laplace distribution. Lagrange proposed a parabolic distribution of errors in 1776.

Laplace in 1778 published his second law of errors wherein he noted that the frequency of an error was proportional to the exponential of the square of its magnitude. This was subsequently rediscovered by Gauss (possibly in 1795) and is now best known as the normal distribution which is of central importance in statistics. This distribution was first referred to as the normal distribution by C. S. Peirce in 1873 who was studying measurement errors when an object was dropped onto a wooden base. He chose the term normal because of its frequent occurrence in naturally occurring variables.
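In modern notation the two error laws mentioned above correspond to the following densities (the scale parameters b and σ are modern conventions, not part of Laplace's original statements):

    % Laplace's first law of errors (the Laplace distribution), with scale b:
    f(x) = \frac{1}{2b} \exp\left( -\frac{|x|}{b} \right)

    % Laplace's second law of errors (the normal distribution), with standard deviation \sigma:
    f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{x^{2}}{2\sigma^{2}} \right)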

Lagrange also suggested in 1781 two other distributions for errors - a raised cosine distribution and a logarithmic distribution.

Laplace gave (1781) a formula for the law of facility of error (a term due to Joseph Louis Lagrange, 1774), but one which led to unmanageable equations. Daniel Bernoulli (1778) introduced the principle of the maximum product of the probabilities of a system of concurrent errors.

In 1786 William Playfair (1759-1823) introduced the idea of graphical representation into statistics. He invented the line chart, bar chart and histogram and incorporated them into his works on economics, the Commercial and Political Atlas. This was followed in 1795 by his invention of the pie chart and circle chart which he used to display the evolution of England's imports and exports. These latter charts came to general attention when he published examples in his Statistical Breviary in 1801.

Laplace, in an investigation of the motions of Saturn and Jupiter in 1787, generalized Mayer's method by using different linear combinations of a single group of equations.

In 1791 Sir John Sinclair introduced the term 'statistics' into English in his Statistical Accounts of Scotland.

In 1802 Laplace estimated the population of France to be 28,328,612. He calculated this figure using the number of births in the previous year and census data for three communities. The census data of these communities showed that they had 2,037,615 persons and that the number of births was 71,866. Assuming that these samples were representative of France, Laplace produced his estimate for the entire population.
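Laplace's calculation is another ratio estimate. The national birth total is not quoted above, but the ratio implied by the figures given can be reconstructed; the sketch below (Python) only illustrates the method.

    # Laplace's ratio estimate, using the figures quoted above.
    sample_population = 2_037_615     # persons in the three communities
    sample_births = 71_866            # births in those communities

    ratio = sample_population / sample_births    # about 28.35 persons per birth
    print(ratio)

    # Working backwards, the reported estimate of 28,328,612 implies roughly
    # one million births in France in the preceding year:
    print(28_328_612 / ratio)                    # about 999,000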

Carl Friedrich Gauss, mathematician who developed the method of least squares in 1809.

The method of least squares, which was used to minimize errors in data measurement, was published independently by Adrien-Marie Legendre (1805), Robert Adrain (1808), and Carl Friedrich Gauss (1809). Gauss had used the method in his famous 1801 prediction of the location of the dwarf planet Ceres. The observations that Gauss based his calculations on were made by the Italian monk Piazzi.
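For illustration, a minimal sketch of ordinary least squares for a straight-line fit, using NumPy and hypothetical observations (not the astronomical data discussed above):

    import numpy as np

    # Hypothetical noisy observations of a straight line (illustration only).
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

    # Least squares chooses the coefficients that minimise the sum of squared residuals.
    A = np.column_stack([np.ones_like(x), x])        # design matrix: intercept and slope
    coeffs, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
    print(coeffs)                                    # fitted intercept and slope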

The term probable error (der wahrscheinliche Fehler) - the median deviation from the mean - was introduced in 1815 by the German astronomer Friedrich Wilhelm Bessel. Antoine Augustin Cournot in 1843 was the first to use the term median (valeur médiane) for the value that divides a probability distribution into two equal halves.

Other contributors to the theory of errors were Ellis (1844), De Morgan (1864), Glaisher (1872), and Giovanni Schiaparelli (1875). Peters's (1856) formula for the "probable error" of a single observation was widely used and inspired early robust statistics.

In the 19th century authors on statistical theory included Laplace, S. Lacroix (1816), Littrow (1833), Dedekind (1860), Helmert (1872), Laurent (1873), Liagre, Didion, De Morgan and Boole.

Gustav Theodor Fechner used the median (Centralwerth) in the study of sociological and psychological phenomena. It had earlier been used only in astronomy and related fields. Francis Galton used the English term median for the first time in 1881, having earlier used the terms middle-most value in 1869 and the medium in 1880.

Adolphe Quetelet (1796–1874), another important founder of statistics, introduced the notion of the "average man" (l'homme moyen) as a means of understanding complex social phenomena such as crime rates, marriage rates, and suicide rates.

The first tests of the normal distribution were invented by the German statistician Wilhelm Lexis in the 1870s. Of the data sets available to him, the only ones he was able to show to be normally distributed were birth rates.

Development of modern statistics

Although the origins of statistical theory lie in the 18th-century advances in probability, the modern field of statistics only emerged in the late-19th and early-20th century, in three stages. The first wave, at the turn of the century, was led by the work of Francis Galton and Karl Pearson, who transformed statistics into a rigorous mathematical discipline used for analysis, not just in science, but in industry and politics as well. The second wave of the 1910s and 20s was initiated by William Gosset, and reached its culmination in the insights of Ronald Fisher. This involved the development of better models for the design of experiments, hypothesis testing, and techniques for use with small data samples. The final wave, which mainly saw the refinement and expansion of earlier developments, emerged from the collaborative work between Egon Pearson and Jerzy Neyman in the 1930s. Today, statistical methods are applied in all fields that involve decision making, for making accurate inferences from a collated body of data and for making decisions in the face of uncertainty based on statistical methodology.

The original logo of the Royal Statistical Society, founded in 1834.

The first statistical bodies were established in the early 19th century. The Royal Statistical Society was founded in 1834 and Florence Nightingale, its first female member, pioneered the application of statistical analysis to health problems for the furtherance of epidemiological understanding and public health practice. However, the methods then used would not be considered as modern statistics today.

The Oxford scholar Francis Ysidro Edgeworth's book, Metretike: or The Method of Measuring Probability and Utility (1887) dealt with probability as the basis of inductive reasoning, and his later works focused on the 'philosophy of chance'. His first paper on statistics (1883) explored the law of error (normal distribution), and his Methods of Statistics (1885) introduced an early version of the t distribution, the Edgeworth expansion, the Edgeworth series, the method of variate transformation and the asymptotic theory of maximum likelihood estimates.

The Norwegian Anders Nicolai Kiær introduced the concept of stratified sampling in 1895. Arthur Lyon Bowley introduced new methods of data sampling in 1906 when working on social statistics. Although statistical surveys of social conditions had started with Charles Booth's "Life and Labour of the People in London" (1889-1903) and Seebohm Rowntree's "Poverty, A Study of Town Life" (1901), Bowley's key innovation consisted of the use of random sampling techniques. His efforts culminated in his New Survey of London Life and Labour.

Francis Galton is credited as one of the principal founders of statistical theory. His contributions to the field included introducing the concepts of standard deviation, correlation, regression and the application of these methods to the study of the variety of human characteristics - height, weight, eyelash length among others. He found that many of these could be fitted to a normal curve distribution.

Galton submitted a paper to Nature in 1907 on the usefulness of the median. He examined the accuracy of 787 guesses of the weight of an ox at a country fair. The actual weight was 1208 pounds: the median guess was 1198. The guesses were markedly non-normally distributed.


Galton's publication of Natural Inheritance in 1889 sparked the interest of a brilliant mathematician, Karl Pearson, then working at University College London, and he went on to found the discipline of mathematical statistics. He emphasised the statistical foundation of scientific laws and promoted its study, and his laboratory attracted students from around the world, including Udny Yule, drawn by his new methods of analysis. His work grew to encompass the fields of biology, epidemiology, anthropometry, medicine and social history. In 1901, with Walter Weldon, founder of biometry, and Galton, he founded the journal Biometrika as the first journal of mathematical statistics and biometry.

His work, and that of Galton, underpins many of the 'classical' statistical methods which are in common use today, including the correlation coefficient, defined as a product-moment; the method of moments for the fitting of distributions to samples; Pearson's system of continuous curves that forms the basis of the now conventional continuous probability distributions; chi distance, a precursor and special case of the Mahalanobis distance; and the p-value, defined as the probability measure of the complement of the ball with the hypothesized value as center point and chi distance as radius. He also introduced the term 'standard deviation'.
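In modern notation, the product-moment definition of the correlation coefficient for n paired observations (x_i, y_i) with means x̄ and ȳ is:

    r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2} \, \sum_{i=1}^{n} (y_i - \bar{y})^{2}}}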

He also founded the statistical hypothesis testing theory, Pearson's chi-squared test and principal component analysis. In 1911 he founded the world's first university statistics department at University College London.

Ronald Fisher, "A genius who almost single-handedly created the foundations for modern statistical science"
The second wave of mathematical statistics was pioneered by Ronald Fisher, who wrote two textbooks, Statistical Methods for Research Workers, published in 1925, and The Design of Experiments in 1935, that were to define the academic discipline in universities around the world. He also systematized previous results, putting them on a firm mathematical footing. His seminal 1918 paper The Correlation between Relatives on the Supposition of Mendelian Inheritance made the first use of the statistical term variance. In 1919, at Rothamsted Experimental Station he started a major study of the extensive collections of data recorded over many years. This resulted in a series of reports under the general title Studies in Crop Variation. In 1930 he published The Genetical Theory of Natural Selection, where he applied statistics to evolution.

Over the next seven years, he pioneered the principles of the design of experiments (see below) and elaborated his studies of analysis of variance. He furthered his studies of the statistics of small samples. Perhaps even more important, he began his systematic approach of the analysis of real data as the springboard for the development of new statistical methods. He developed computational algorithms for analyzing data from his balanced experimental designs. In 1925, this work resulted in the publication of his first book, Statistical Methods for Research Workers. This book went through many editions and translations in later years, and it became the standard reference work for scientists in many disciplines. In 1935, this book was followed by The Design of Experiments, which was also widely used.

In addition to analysis of variance, Fisher named and promoted the method of maximum likelihood estimation. Fisher also originated the concepts of sufficiency, ancillary statistics, Fisher's linear discriminator and Fisher information. His article On a distribution yielding the error functions of several well known statistics (1924) presented Pearson's chi-squared test and William Gosset's t in the same framework as the Gaussian distribution, together with his own parameter in the analysis of variance, Fisher's z-distribution (more commonly used decades later in the form of the F distribution). The 5% level of significance appears to have been introduced by Fisher in 1925. Fisher stated that deviations exceeding twice the standard deviation are regarded as significant. Before this, deviations exceeding three times the probable error were considered significant. For a symmetrical distribution the probable error is half the interquartile range. For a normal distribution the probable error is approximately 2/3 the standard deviation. It appears that Fisher's 5% criterion was rooted in previous practice.
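The arithmetic connecting the two conventions can be checked directly; the sketch below (Python, assuming SciPy is available) computes the probable error of a normal distribution and the tail probability beyond two standard deviations.

    from scipy.stats import norm

    # For a normal distribution the probable error (median absolute deviation
    # from the mean) is about 0.6745 standard deviations.
    probable_error = norm.ppf(0.75)
    print(probable_error)                 # about 0.6745

    print(3 * probable_error)             # three probable errors = about 2.02 standard deviations

    # Two-sided tail probability beyond two standard deviations:
    print(2 * norm.sf(2.0))               # about 0.0455, close to Fisher's 5% level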

Other important contributions at this time included Charles Spearman's rank correlation coefficient that was a useful extension of the Pearson correlation coefficient. William Sealy Gosset, the English statistician better known under his pseudonym of Student, introduced Student's t-distribution, a continuous probability distribution useful in situations where the sample size is small and population standard deviation is unknown.

Egon Pearson (Karl's son) and Jerzy Neyman introduced the concepts of "Type II" error, power of a test and confidence intervals. Jerzy Neyman in 1934 showed that stratified random sampling was in general a better method of estimation than purposive (quota) sampling.

Design of experiments

James Lind carried out the first ever clinical trial in 1747, in an effort to find a treatment for scurvy.

In 1747, while serving as surgeon on HM Bark Salisbury, James Lind carried out a controlled experiment to develop a cure for scurvy. In this study his subjects' cases "were as similar as I could have them"; that is, he provided strict entry requirements to reduce extraneous variation. The men were paired, which provided blocking. From a modern perspective, the main thing that is missing is randomized allocation of subjects to treatments.

Lind is today often described as a one-factor-at-a-time experimenter. Similar one-factor-at-a-time (OFAT) experimentation was performed at the Rothamsted Research Station in the 1840s by Sir John Lawes to determine the optimal inorganic fertilizer for use on wheat.

A theory of statistical inference was developed by Charles S. Peirce in "Illustrations of the Logic of Science" (1877–1878) and "A Theory of Probable Inference" (1883), two publications that emphasized the importance of randomization-based inference in statistics. In another study, Peirce randomly assigned volunteers to a blinded, repeated-measures design to evaluate their ability to discriminate weights.

Peirce's experiment inspired other researchers in psychology and education, which developed a research tradition of randomized experiments in laboratories and specialized textbooks in the 1800s. Peirce also contributed the first English-language publication on an optimal design for regression-models in 1876. A pioneering optimal design for polynomial regression was suggested by Gergonne in 1815. In 1918 Kirstine Smith published optimal designs for polynomials of degree six (and less).

The use of a sequence of experiments, where the design of each may depend on the results of previous experiments, including the possible decision to stop experimenting, was pioneered by Abraham Wald in the context of sequential tests of statistical hypotheses. Surveys are available of optimal sequential designs, and of adaptive designs. One specific type of sequential design is the "two-armed bandit", generalized to the multi-armed bandit, on which early work was done by Herbert Robbins in 1952.

The term "design of experiments" (DOE) derives from early statistical work performed by Sir Ronald Fisher. He was described by Anders Hald as "a genius who almost single-handedly created the foundations for modern statistical science." Fisher initiated the principles of design of experiments and elaborated on his studies of "analysis of variance". Perhaps even more important, Fisher began his systematic approach to the analysis of real data as the springboard for the development of new statistical methods. He began to pay particular attention to the labour involved in the necessary computations performed by hand, and developed methods that were as practical as they were founded in rigour. In 1925, this work culminated in the publication of his first book, Statistical Methods for Research Workers. This went into many editions and translations in later years, and became a standard reference work for scientists in many disciplines.

A methodology for designing experiments was proposed by Ronald A. Fisher, in his innovative book The Design of Experiments (1935) which also became a standard. As an example, he described how to test the hypothesis that a certain lady could distinguish by flavour alone whether the milk or the tea was first placed in the cup. While this sounds like a frivolous application, it allowed him to illustrate the most important ideas of experimental design: see Lady tasting tea.

Agricultural science advances served to meet the combination of larger city populations and fewer farms. But for crop scientists to take due account of widely differing geographical growing climates and needs, it was important to differentiate local growing conditions. To extrapolate experiments on local crops to a national scale, they had to extend crop sample testing economically to overall populations. As statistical methods advanced (primarily the efficacy of designed experiments instead of one-factor-at-a-time experimentation), representative factorial design of experiments began to enable the meaningful extension, by inference, of experimental sampling results to the population as a whole. But it was hard to decide how representative the chosen crop sample was. Factorial design methodology showed how to estimate and correct for any random variation within the sample and also in the data collection procedures.

Bayesian statistics

Pierre-Simon, marquis de Laplace, one of the main early developers of Bayesian statistics.

The term Bayesian refers to Thomas Bayes (1702–1761), who proved a special case of what is now called Bayes' theorem. However it was Pierre-Simon Laplace (1749–1827) who introduced a general version of the theorem and applied it to celestial mechanics, medical statistics, reliability, and jurisprudence. When insufficient knowledge was available to specify an informed prior, Laplace used uniform priors, according to his "principle of insufficient reason". Laplace assumed uniform priors for mathematical simplicity rather than for philosophical reasons. Laplace also introduced primitive versions of conjugate priors and of the Bernstein–von Mises theorem, according to which the posteriors corresponding to initially differing priors ultimately agree as the number of observations increases. This early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes).

After the 1920s, inverse probability was largely supplanted by a collection of methods that were developed by Ronald A. Fisher, Jerzy Neyman and Egon Pearson. Their methods came to be called frequentist statistics. Fisher rejected the Bayesian view, writing that "the theory of inverse probability is founded upon an error, and must be wholly rejected". At the end of his life, however, Fisher expressed greater respect for the essay of Bayes, which Fisher believed to have anticipated his own fiducial approach to probability; Fisher still maintained that Laplace's views on probability were "fallacious rubbish". Neyman started out as a "quasi-Bayesian", but subsequently developed confidence intervals (a key method in frequentist statistics) because "the whole theory would look nicer if it were built from the start without reference to Bayesianism and priors". The word Bayesian appeared around 1950, and by the 1960s it became the term preferred by those dissatisfied with the limitations of frequentist statistics.

In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objectivist stream, the statistical analysis depends on only the model assumed and the data analysed. No subjective decisions need to be involved. In contrast, "subjectivist" statisticians deny the possibility of fully objective analysis for the general case.

In the further development of Laplace's ideas, subjective ideas predate objectivist positions. The idea that 'probability' should be interpreted as 'subjective degree of belief in a proposition' was proposed, for example, by John Maynard Keynes in the early 1920s. This idea was taken further by Bruno de Finetti in Italy (Fondamenti Logici del Ragionamento Probabilistico, 1930) and Frank Ramsey in Cambridge (The Foundations of Mathematics, 1931). The approach was devised to solve problems with the frequentist definition of probability but also with the earlier, objectivist approach of Laplace. The subjective Bayesian methods were further developed and popularized in the 1950s by L.J. Savage.

Objective Bayesian inference was further developed by Harold Jeffreys at the University of Cambridge. His seminal book "Theory of Probability" first appeared in 1939 and played an important role in the revival of the Bayesian view of probability. In 1957, Edwin Jaynes promoted the concept of maximum entropy for constructing priors, which is an important principle in the formulation of objective methods, mainly for discrete problems. In 1965, Dennis Lindley's 2-volume work "Introduction to Probability and Statistics from a Bayesian Viewpoint" brought Bayesian methods to a wide audience. In 1979, José-Miguel Bernardo introduced reference analysis, which offers a generally applicable framework for objective analysis. Other well-known proponents of Bayesian probability theory include I.J. Good, B.O. Koopman, Howard Raiffa, Robert Schlaifer and Alan Turing.

In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications. Despite growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics. Nonetheless, Bayesian methods are widely accepted and used, such as for example in the field of machine learning.

Foundations of statistics

From Wikipedia, the free encyclopedia

The foundations of statistics concern the epistemological debate in statistics over how one should conduct inductive inference from data. Among the issues considered in statistical inference are the question of Bayesian inference versus frequentist inference, the distinction between Fisher's "significance testing" and Neyman–Pearson "hypothesis testing", and whether the likelihood principle should be followed. Some of these issues have been debated for up to 200 years without resolution.

Bandyopadhyay & Forster describe four statistical paradigms: "(1) classical statistics or error statistics, (ii) Bayesian statistics, (iii) likelihood-based statistics, and (iv) the Akaikean-Information Criterion-based statistics".

Savage's text Foundations of Statistics has been cited over 15000 times on Google Scholar. It states the following:
It is unanimously agreed that statistics depends somehow on probability. But, as to what probability is and how it is connected with statistics, there has seldom been such complete disagreement and breakdown of communication since the Tower of Babel. Doubtless, much of the disagreement is merely terminological and would disappear under sufficiently sharp analysis.

Fisher's "significance testing" vs Neyman–Pearson "hypothesis testing"


In the development of classical statistics in the second quarter of the 20th century, two competing models of inductive statistical testing were developed. Their relative merits were hotly debated (for over 25 years) until Fisher's death. While a hybrid of the two methods is widely taught and used, the philosophical questions raised in the debate have not been resolved.

Significance testing

Fisher popularized significance testing, primarily in two popular and highly influential books. Fisher's writing style in these books was strong on examples and relatively weak on explanations. The books lacked proofs or derivations of significance test statistics (which placed statistical practice in advance of statistical theory). Fisher's more explanatory and philosophical writing was written much later. There appear to be some differences between his earlier practices and his later opinions.

Fisher was motivated to obtain scientific experimental results without the explicit influence of prior opinion. The significance test is a probabilistic version of Modus tollens, a classic form of deductive inference. The significance test might be simplistically stated, "If the evidence is sufficiently discordant with the hypothesis, reject the hypothesis". In application, a statistic is calculated from the experimental data, a probability of exceeding that statistic is determined and the probability is compared to a threshold. The threshold (the numeric version of "sufficiently discordant") is arbitrary (usually decided by convention). A common application of the method is deciding whether a treatment has a reportable effect based on a comparative experiment. Statistical significance is a measure of probability not practical importance. It can be regarded as a requirement placed on statistical signal/noise. The method is based on the assumed existence of an imaginary infinite population corresponding to the null hypothesis.

The significance test requires only one hypothesis. The result of the test is to reject the hypothesis (or not), a simple dichotomy. The test does not distinguish between the truth of the hypothesis and the insufficiency of the evidence to disprove it (so it is like a criminal trial in which the defendant is assumed innocent until proven guilty).
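As a concrete (and purely illustrative) sketch of the procedure described above, the following Python code runs a one-sample t-test on hypothetical data, assuming SciPy is available; the 0.05 threshold is the conventional choice, not part of the method itself.

    from scipy.stats import ttest_1samp

    # Hypothetical measurements; null hypothesis: the true mean is 0.
    data = [0.7, -0.2, 1.1, 0.4, 0.9, 0.3, 1.4, -0.1]

    result = ttest_1samp(data, popmean=0.0)
    print(result.statistic, result.pvalue)

    # Fisherian reading: if the p-value falls below the conventional threshold,
    # the data are declared discordant with the null hypothesis; otherwise the
    # evidence is judged insufficient to reject it.
    if result.pvalue < 0.05:
        print("reject the null hypothesis")
    else:
        print("insufficient evidence to reject the null hypothesis")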

Hypothesis testing

Neyman & Pearson collaborated on a different, but related, problem – selecting among competing hypotheses based on the experimental evidence alone. Of their joint papers the most cited was from 1933. The famous result of that paper is the Neyman–Pearson lemma. The lemma says that a ratio of probabilities is an excellent criterion for selecting a hypothesis (with the threshold for comparison being arbitrary). The paper proved an optimality of Student's t-test (one of the significance tests). Neyman expressed the opinion that hypothesis testing was a generalization of and an improvement on significance testing. The rationale for their methods is found in their joint papers.

Hypothesis testing requires multiple hypotheses. A hypothesis is always selected, a multiple choice. A lack of evidence is not an immediate consideration. The method is based on the assumption of a repeated sampling of the same population (the classical frequentist assumption).
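A minimal sketch of the Neyman–Pearson criterion for two simple hypotheses, with hypothetical data and an arbitrary threshold k (in practice k is chosen to fix the type I error rate):

    import math

    # Two simple hypotheses about the mean of normally distributed data
    # with known standard deviation 1 (hypothetical example).
    mu0, mu1, sigma = 0.0, 1.0, 1.0
    data = [0.8, 1.3, 0.4, 1.1, 0.9]

    def log_likelihood(mu):
        return sum(-0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
                   for x in data)

    # Neyman-Pearson criterion: compare the likelihood ratio to the threshold k.
    log_ratio = log_likelihood(mu1) - log_likelihood(mu0)
    k = 1.0
    print("choose H1" if log_ratio > math.log(k) else "choose H0")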

Grounds of disagreement

The length of the dispute allowed the debate of a wide range of issues regarded as foundational to statistics.

An example exchange from 1955-1956

Fisher's attack: Repeated sampling of the same population
  • Such sampling is the basis of frequentist probability
  • Fisher preferred fiducial inference
Neyman's rebuttal: Fisher's theory of fiducial inference is flawed
  • Paradoxes are common
Discussion: Fisher's attack on the basis of frequentist probability failed, but was not without result. He identified a specific case (2x2 table) where the two schools of testing reach different results. This case is one of several that are still troubling. Commentators believe that the "right" answer is context dependent. Fiducial probability has not fared well, being virtually without advocates, while frequentist probability remains a mainstream interpretation.

Fisher's attack: Type II errors
  • Which result from an alternative hypothesis
Neyman's rebuttal: A purely probabilistic theory of tests requires an alternative hypothesis
Discussion: Fisher's attack on type II errors has faded with time. In the intervening years statistics has separated the exploratory from the confirmatory. In the current environment, the concept of type II errors is used in power calculations for confirmatory hypothesis test sample size determination.

Fisher's attack: Inductive behavior
Discussion: Fisher's attack on inductive behavior has been largely successful because of his selection of the field of battle. While operational decisions are routinely made on a variety of criteria (such as cost), scientific conclusions from experimentation are typically made on the basis of probability alone.
In this exchange, Fisher also discussed the requirements for inductive inference, with specific criticism of cost functions penalizing faulty judgments. Neyman countered that Gauss and Laplace used them. This exchange of arguments occurred 15 years after textbooks began teaching a hybrid theory of statistical testing.

Fisher and Neyman were in disagreement about the foundations of statistics (although united in vehement opposition to the Bayesian view):
  • The interpretation of probability
    • The disagreement over Fisher's inductive reasoning vs Neyman's inductive behavior contained elements of the Bayesian/Frequentist divide. Fisher was willing to alter his opinion (reaching a provisional conclusion) on the basis of a calculated probability while Neyman was more willing to change his observable behavior (making a decision) on the basis of a computed cost.
  • The proper formulation of scientific questions with special concern for modeling
  • Whether it is reasonable to reject a hypothesis based on a low probability without knowing the probability of an alternative
  • Whether a hypothesis could ever be accepted on the basis of data
    • In mathematics, deduction proves, counter-examples disprove
    • In the Popperian philosophy of science, advancements are made when theories are disproven
  • Subjectivity: While Fisher and Neyman struggled to minimize subjectivity, both acknowledged the importance of "good judgment". Each accused the other of subjectivity.
    • Fisher subjectively chose the null hypothesis.
    • Neyman–Pearson subjectively chose the criterion for selection (which was not limited to a probability).
    • Both subjectively determined numeric thresholds.
Fisher and Neyman were separated by attitudes and perhaps language. Fisher was a scientist and an intuitive mathematician, to whom inductive reasoning came naturally. Neyman was a rigorous mathematician, convinced by deductive reasoning rather than by a probability calculation based on an experiment. Thus there was an underlying clash between applied and theoretical, between science and mathematics.

Related history

Neyman, who had occupied the same building in England as Fisher, accepted a position on the west coast of the United States of America in 1938. His move effectively ended his collaboration with Pearson and their development of hypothesis testing. Further development was continued by others.

Textbooks provided a hybrid version of significance and hypothesis testing by 1940. None of the principals had any known personal involvement in the further development of the hybrid taught in introductory statistics today.

Statistics later developed in different directions including decision theory (and possibly game theory), Bayesian statistics, exploratory data analysis, robust statistics and nonparametric statistics. Neyman–Pearson hypothesis testing contributed strongly to decision theory which is very heavily used (in statistical quality control for example). Hypothesis testing readily generalized to accept prior probabilities which gave it a Bayesian flavor. Neyman–Pearson hypothesis testing has become an abstract mathematical subject taught in post-graduate statistics, while most of what is taught to under-graduates and used under the banner of hypothesis testing is from Fisher.

Contemporary opinion

No major battles between the two classical schools of testing have erupted for decades, but sniping continues (perhaps encouraged by partisans of other controversies). After generations of dispute, there is virtually no chance that either statistical testing theory will replace the other in the foreseeable future.

The hybrid of the two competing schools of testing can be viewed very differently – as the imperfect union of two mathematically complementary ideas or as the fundamentally flawed union of philosophically incompatible ideas. Fisher enjoyed some philosophical advantage, while Neyman & Pearson employed the more rigorous mathematics. Hypothesis testing is controversial among some users, but the most popular alternative (confidence intervals) is based on the same mathematics.

The history of the development left testing without a single citable authoritative source for the hybrid theory that reflects common statistical practice. The merged terminology is also somewhat inconsistent. There is strong empirical evidence that the graduates (and instructors) of an introductory statistics class have a weak understanding of the meaning of hypothesis testing.

Summary

  • The interpretation of probability has not been resolved (but fiducial probability is an orphan).
  • Neither test method has been rejected. Both are heavily used for different purposes.
  • Texts have merged the two test methods under the term hypothesis testing.
    • Mathematicians claim (with some exceptions) that significance tests are a special case of hypothesis tests.
    • Others treat the problems and methods as distinct (or incompatible).
  • The dispute has adversely affected statistical education.

Bayesian inference versus frequentist inference

Two different interpretations of probability (based on objective evidence and subjective degrees of belief) have long existed. Gauss and Laplace could have debated alternatives more than 200 years ago. Two competing schools of statistics have developed as a consequence. Classical inferential statistics was largely developed in the second quarter of the 20th Century, much of it in reaction to the (Bayesian) probability of the time which utilized the controversial principle of indifference to establish prior probabilities. The rehabilitation of Bayesian inference was a reaction to the limitations of frequentist probability. More reactions followed. While the philosophical interpretations are old, the statistical terminology is not. The current statistical terms "Bayesian" and "frequentist" stabilized in the second half of the 20th century. The (philosophical, mathematical, scientific, statistical) terminology is confusing: the "classical" interpretation of probability is Bayesian while "classical" statistics is frequentist. "Frequentist" also has varying interpretations—different in philosophy than in physics.

The nuances of philosophical probability interpretations are discussed elsewhere. In statistics the alternative interpretations enable the analysis of different data using different methods based on different models to achieve slightly different goals. Any statistical comparison of the competing schools considers pragmatic criteria beyond the philosophical.

Major contributors

Two major contributors to frequentist (classical) methods were Fisher and Neyman. Fisher's interpretation of probability was idiosyncratic (but strongly non-Bayesian). Neyman's views were rigorously frequentist. Three major contributors to 20th century Bayesian statistical philosophy, mathematics and methods were de Finetti, Jeffreys and Savage. Savage popularized de Finetti's ideas in the English-speaking world and made Bayesian mathematics rigorous. In 1965, Dennis Lindley's 2-volume work "Introduction to Probability and Statistics from a Bayesian Viewpoint" brought Bayesian methods to a wide audience. Statistics has advanced over the past three generations; the "authoritative" views of the early contributors are not all current.

Contrasting approaches

Frequentist inference

Frequentist inference is partially and tersely described above in (Fisher's "significance testing" vs Neyman–Pearson "hypothesis testing"). Frequentist inference combines several different views. The result is capable of supporting scientific conclusions, making operational decisions and estimating parameters with or without confidence intervals. Frequentist inference is based solely on the (one set of) evidence.

Bayesian inference

A classical frequency distribution describes the probability of the data. The use of Bayes' theorem allows a more abstract concept – the probability of a hypothesis (corresponding to a theory) given the data. The concept was once known as "inverse probability". Bayesian inference updates the probability estimate for a hypothesis as additional evidence is acquired. Bayesian inference is explicitly based on the evidence and prior opinion, which allows it to be based on multiple sets of evidence.
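A minimal sketch of Bayesian updating, using a conjugate Beta prior for a success probability (the prior, data and numbers are hypothetical; the uniform prior echoes Laplace's principle of insufficient reason mentioned earlier):

    # Bayesian updating of the probability of success in a coin-like experiment.
    prior_alpha, prior_beta = 1, 1            # uniform Beta(1, 1) prior

    successes, failures = 7, 3                # hypothetical observed data

    # With a conjugate Beta prior the posterior is again a Beta distribution.
    posterior_alpha = prior_alpha + successes
    posterior_beta = prior_beta + failures

    posterior_mean = posterior_alpha / (posterior_alpha + posterior_beta)
    print(posterior_mean)                     # 8/12 = 0.667; further evidence would update this again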

Comparisons of characteristics

Frequentists and Bayesians use different models of probability. Frequentists often consider parameters to be fixed but unknown, while Bayesians assign probability distributions to similar parameters. Consequently, Bayesians speak of probabilities that don't exist for frequentists; a Bayesian speaks of the probability of a theory while a true frequentist can speak only of the consistency of the evidence with the theory. Example: A frequentist does not say that there is a 95% probability that the true value of a parameter lies within a confidence interval, saying instead that 95% of confidence intervals contain the true value.

Efron's comparative adjectives

Bayes / Frequentist
  • Basis: Belief (prior) / Behavior (method)
  • Resulting Characteristic: Principled Philosophy / Opportunistic Methods
  • One distribution / Many distributions (bootstrap?)
  • Ideal Application: Dynamic (repeated sampling) / Static (one sample)
  • Target Audience: Individual (subjective) / Community (objective)
  • Modeling Characteristic: Aggressive / Defensive

Alternative comparison

Bayesian strengths
  • Complete
  • Coherent
  • Prescriptive
  • Strong inference from model
Bayesian weaknesses
  • Too subjective for scientific inference
  • Denies the role of randomization for design
  • Requires and relies on full specification of a model (likelihood and prior)
  • Weak model formulation & assessment
Frequentist strengths
  • Inferences well calibrated
  • No need to specify prior distributions
  • Flexible range of procedures
    • Unbiasedness, sufficiency, ancillarity...
    • Widely applicable and dependable
    • Asymptotic theory
    • Easy to interpret
    • Can be calculated by hand
  • Strong model formulation & assessment
Frequentist weaknesses
  • Incomplete
  • Ambiguous
  • Incoherent
  • Not prescriptive
  • No unified theory
  • (Over?)emphasis on asymptotic properties
  • Weak inference from model

Mathematical results

Neither school is immune from mathematical criticism and neither accepts it without a struggle. Stein's paradox (for example) illustrated that finding a "flat" or "uninformative" prior probability distribution in high dimensions is subtle. Bayesians regard that as peripheral to the core of their philosophy, while finding frequentism to be riddled with inconsistencies, paradoxes and bad mathematical behavior. Frequentists can explain most of these. Some of the "bad" examples are extreme situations - such as estimating the weight of a herd of elephants from measuring the weight of one ("Basu's elephants"), which allows no statistical estimate of the variability of weights. The likelihood principle has been a battleground.

Statistical results

Both schools have achieved impressive results in solving real-world problems. Classical statistics effectively has the longer record because numerous results were obtained with mechanical calculators and printed tables of special statistical functions. Bayesian methods have been highly successful in the analysis of information that is naturally sequentially sampled (radar and sonar). Many Bayesian methods and some recent frequentist methods (such as the bootstrap) require the computational power widely available only in the last several decades.

There is active discussion about combining Bayesian and frequentist methods, but reservations are expressed about the meaning of the results and reducing the diversity of approaches.

Philosophical results

Bayesians are united in opposition to the limitations of frequentism, but are philosophically divided into numerous camps (empirical, hierarchical, objective, personal, subjective), each with a different emphasis. One (frequentist) philosopher of statistics has noted a retreat from the statistical field to philosophical probability interpretations over the last two generations. There is a perception that successes in Bayesian applications do not justify the supporting philosophy. Bayesian methods often create useful models that are not used for traditional inference and which owe little to philosophy. None of the philosophical interpretations of probability (frequentist or Bayesian) appears robust. The frequentist view is too rigid and limiting while the Bayesian view can be simultaneously objective and subjective, etc.

Illustrative quotations

  • "carefully used, the frequentist approach yields broadly applicable if sometimes clumsy answers"
  • "To insist on unbiased [frequentist] techniques may lead to negative (but unbiased) estimates of a variance; the use of p-values in multiple tests may lead to blatant contradictions; conventional 0.95-confidence regions may actually consist of the whole real line. No wonder that mathematicians find it often difficult to believe that conventional statistical methods are a branch of mathematics."
  • "Bayesianism is a neat and fully principled philosophy, while frequentism is a grab-bag of opportunistic, individually optimal, methods."
  • "in multiparameter problems flat priors can yield very bad answers"
  • "[Bayes' rule] says there is a simple, elegant way to combine current information with prior experience in order to state how much is known. It implies that sufficiently good data will bring previously disparate observers to agreement. It makes full use of available information, and it produces decisions having the least possible error rate."
  • "Bayesian statistics is about making probability statements, frequentist statistics is about evaluating probability statements."
  • "[S]tatisticians are often put in a setting reminiscent of Arrow’s paradox, where we are asked to provide estimates that are informative and unbiased and confidence statements that are correct conditional on the data and also on the underlying true parameter." (These are conflicting requirements.)
  • "formal inferential aspects are often a relatively small part of statistical analysis"
  • "The two philosophies, Bayesian and frequentist, are more orthogonal than antithetical."
  • "An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure."

Summary

  • Bayesian theory has a mathematical advantage
    • Frequentist probability has existence and consistency problems
    • But, finding good priors to apply Bayesian theory remains (very?) difficult
  • Both theories have impressive records of successful application
  • Neither supporting philosophical interpretation of probability is robust
  • There is increasing skepticism of the connection between application and philosophy
  • Some statisticians are recommending active collaboration (beyond a cease fire)

The likelihood principle

Likelihood is a synonym for probability in common usage. In statistics it is reserved for probabilities that fail to meet the frequentist definition. A probability refers to variable data for a fixed hypothesis, while a likelihood refers to variable hypotheses for a fixed set of data. Repeated measurements of a fixed length with a ruler generate a set of observations. Each fixed set of observational conditions is associated with a probability distribution and each set of observations can be interpreted as a sample from that distribution – the frequentist view of probability. Alternatively, a set of observations may result from sampling any of a number of distributions (each resulting from a set of observational conditions). The probabilistic relationship between a fixed sample and a variable distribution (resulting from a variable hypothesis) is termed likelihood – a Bayesian view of probability. A set of length measurements may, for example, imply readings taken by careful, sober, rested, motivated observers in good lighting, or it may imply readings taken under quite different conditions; each set of conditions is a different hypothesis.
A likelihood is, in effect, a probability by another name, one that exists because of the limited frequentist definition of probability. Likelihood is a concept introduced and advanced by Fisher for more than 40 years (although prior references to the concept exist and Fisher's support was half-hearted). The concept was accepted and substantially changed by Jeffreys. In 1962 Birnbaum "proved" the likelihood principle from premises acceptable to most statisticians. The "proof" has been disputed by statisticians and philosophers. The principle says that all of the information in a sample is contained in the likelihood function, which Bayesians (but not frequentists) accept as a valid basis for probability statements about the hypotheses.
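
To make the distinction concrete, the sketch below (with invented length readings and an assumed, known measurement error under a normal model) holds the data fixed and evaluates the likelihood as the hypothesized true length is varied; the probability calculation, by contrast, fixes the hypothesis and asks about the data.

```python
import numpy as np
from scipy.stats import norm

readings = np.array([10.02, 9.98, 10.05, 10.01, 9.97])  # invented length measurements
sigma = 0.04                                             # assumed known measurement error

# Probability: fix the hypothesis (true length mu = 10.00) and ask about the data.
density_of_data = norm.pdf(readings, loc=10.00, scale=sigma).prod()

# Likelihood: fix the data and vary the hypothesis (the candidate true length mu).
mu_grid = np.linspace(9.90, 10.10, 401)
log_likelihood = np.array([norm.logpdf(readings, loc=mu, scale=sigma).sum() for mu in mu_grid])

print("maximum-likelihood estimate of the length:", round(mu_grid[np.argmax(log_likelihood)], 3))
```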

Some (frequentist) significance tests are not consistent with the likelihood principle. Bayesians accept the principle, which is consistent with their philosophy (perhaps encouraged by the discomfiture of frequentists). "[T]he likelihood approach is compatible with Bayesian statistical inference in the sense that the posterior Bayes distribution for a parameter is, by Bayes's Theorem, found by multiplying the prior distribution by the likelihood function." Frequentists interpret the principle adversely to Bayesians as implying no concern about the reliability of evidence. "The likelihood principle of Bayesian statistics implies that information about the experimental design from which evidence is collected does not enter into the statistical analysis of the data." Many Bayesians (Savage, for example) recognize that implication as a vulnerability.
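
The quoted relationship, posterior proportional to prior times likelihood, can be sketched numerically on a grid. Everything below (the readings, the measurement error and the prior) is invented purely for illustration.

```python
import numpy as np
from scipy.stats import norm

readings = np.array([10.02, 9.98, 10.05, 10.01, 9.97])  # invented data, as before
sigma = 0.04                                             # assumed measurement error

mu_grid = np.linspace(9.90, 10.10, 2001)
prior = norm.pdf(mu_grid, loc=10.00, scale=0.10)         # an assumed prior for the true length

# Likelihood of the fixed data at each candidate value of the mean
likelihood = np.array([norm.pdf(readings, loc=mu, scale=sigma).prod() for mu in mu_grid])

# Bayes' theorem on the grid: multiply prior by likelihood, then normalize
posterior = prior * likelihood
posterior /= posterior.sum()

print("posterior mean of the length:", round((mu_grid * posterior).sum(), 3))
```

With a fairly diffuse prior, the posterior here is dominated by the likelihood, which is why the two schools often reach numerically similar answers even while disagreeing about their interpretation.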

The likelihood principle has become an embarrassment to both major philosophical schools of statistics; it has weakened both rather than favoring either. Its strongest supporters claim that it offers a better foundation for statistics than either of the two schools. "[L]ikelihood looks very good indeed when it is compared with these [Bayesian and frequentist] alternatives." These supporters include statisticians and philosophers of science. The concept needs further development before it can be regarded as a serious challenge to either existing school, but it seems to offer a promising compromise position. While Bayesians acknowledge the importance of likelihood for calculation, they believe that the posterior probability distribution is the proper basis for inference.

Modeling

Inferential statistics is based on statistical models. Much of classical hypothesis testing, for example, was based on the assumed normality of the data. Robust and nonparametric statistics were developed to reduce the dependence on that assumption. Bayesian statistics interprets new observations from the perspective of prior knowledge – assuming a modeled continuity between past and present. The design of experiments assumes some knowledge of those factors to be controlled, varied, randomized and observed. Statisticians are well aware of the difficulties in proving causation (more of a modeling limitation than a mathematical one), saying "correlation does not imply causation".
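
A small hypothetical example of the dependence that robust methods were designed to reduce: a single gross error in an otherwise normal sample distorts the mean badly while leaving the median almost untouched.

```python
import numpy as np

rng = np.random.default_rng(2)
clean = rng.normal(loc=0.0, scale=1.0, size=99)   # nominally normal data
dirty = np.append(clean, 1000.0)                  # the same data plus one gross error

print("mean  :", round(clean.mean(), 2), "->", round(dirty.mean(), 2))          # jumps by roughly 10
print("median:", round(np.median(clean), 2), "->", round(np.median(dirty), 2))  # barely moves
```
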
More complex statistical analyses use more complex models, often with the intent of finding a latent structure underlying a set of variables. As models and data sets have grown in complexity, foundational questions have been raised about the justification of the models and the validity of inferences drawn from them. The range of conflicting opinion expressed about modeling is large.
  • Models can be based on scientific theory or on ad hoc data analysis; the two approaches use different methods, and each has its advocates.
  • Model complexity is a compromise. The Akaike information criterion and the Bayesian information criterion are two less subjective approaches to achieving that compromise (a minimal numerical sketch follows this list).
  • Fundamental reservations have been expressed about even simple regression models used in the social sciences. A long list of assumptions inherent to the validity of a model is typically neither mentioned nor checked. A favorable comparison between observations and model is often considered sufficient.
  • Bayesian statistics focuses so tightly on the posterior probability that it ignores the fundamental comparison of observations and model.
  • Traditional observation-based models are inadequate to solve many important problems. A much wider range of models, including algorithmic models, must be utilized. "If the model is a poor emulation of nature, the conclusions may be wrong."
  • Modeling is often poorly done (the wrong methods are used) and poorly reported.
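
As promised above, a minimal numerical sketch of that compromise (with invented data that are in truth a straight line plus noise): both criteria are computed for polynomial fits of increasing degree, and their penalty terms typically favor the simpler models here.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)   # invented data: a straight line plus noise

def gaussian_ic(y, y_hat, k):
    """AIC and BIC for a least-squares fit with k estimated parameters and Gaussian errors."""
    rss = np.sum((y - y_hat) ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)   # maximized Gaussian log-likelihood
    return 2 * k - 2 * log_lik, k * np.log(n) - 2 * log_lik

for degree in (1, 2, 4, 8):
    coefs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coefs, x)
    k = degree + 2                                   # polynomial coefficients plus the noise variance
    aic, bic = gaussian_ic(y, y_hat, k)
    print(f"degree {degree}: AIC = {aic:7.1f}   BIC = {bic:7.1f}")
```

BIC penalizes extra parameters more heavily than AIC once the sample size exceeds about eight, which is why it tends to choose the more parsimonious model.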
In the absence of a strong philosophical consensus on statistical modeling, many statisticians accept the cautionary words of the statistician George Box: "All models are wrong, but some are useful."

Other reading

For a short introduction to the foundations of statistics, see ch. 8 ("Probability and statistical inference") of Kendall's Advanced Theory of Statistics (6th edition, 1994).

In his book Statistics As Principled Argument, Robert P. Abelson articulates the position that statistics serves as a standardized means of settling disputes between scientists who could otherwise each argue the merits of their own positions ad infinitum. From this point of view, statistics is a form of rhetoric; as with any means of settling disputes, statistical methods can succeed only as long as all parties agree on the approach used.

Introduction to entropy

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Introduct...