In statistics, a population proportion, generally denoted by $P$ or the Greek letter $\pi$, is a parameter that describes a percentage value associated with a population. A census
can be conducted to determine the actual value of a population
parameter, but often a census is not practical due to its costs and time
consumption. For example, the 2010 United States census showed that 83.7% of the American population was identified as not being Hispanic or Latino;
the value of 0.837 is a population proportion. In general, the
population proportion and other population parameters are unknown.
A population proportion is usually estimated through an unbiased sample statistic obtained from an observational study or experiment, resulting in a sample proportion, generally denoted by $\hat{p}$ and in some textbooks by $p$. For example, the National Technological Literacy Conference conducted a
national survey of 2,000 adults to determine the percentage of adults
who are economically illiterate; the study showed that 1,440 out of the
2,000 adults sampled did not understand what a gross domestic product is. The value of 72% (or 1440/2000) is a sample proportion.
Mathematical definition
Figure: A Venn diagram illustration of a set $R$ and its subset $S$. The proportion can be calculated by measuring what share of $R$ is taken up by $S$.
A proportion is mathematically defined as being the ratio of the quantity of elements (a countable quantity) in a subset $S$ to the size of a set $R$:
$$P = \frac{X}{N}$$
where $X$ is the count of successes in the population, and $N$ is the size of the population.
This mathematical definition can be generalized to provide the definition for the sample proportion:
$$\hat{p} = \frac{x}{n}$$
where $x$ is the count of successes in the sample, and $n$ is the size of the sample obtained from the population.
One of the main focuses of study in inferential statistics
is determining the "true" value of a parameter. Generally the actual
value for a parameter will never be found, unless a census is conducted
on the population of study. However, there are statistical methods that
can be used to get a reasonable estimation for a parameter. These
methods include confidence intervals and hypothesis testing.
Estimating the value of a population proportion can be of great
importance in the areas of agriculture, business, economics, education,
engineering, environmental studies, medicine, law, political science,
psychology, and sociology.
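A population proportion can be estimated through the use of a confidence interval known as a one-sample proportion in the Z-interval, whose formula is given below:
$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
where $\hat{p}$ is the sample proportion, $n$ is the sample size, and $z^*$ is the upper $\frac{1 - C}{2}$ critical value of the standard normal distribution for a level of confidence $C$.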
To derive the formula for the one-sample proportion in the Z-interval, a sampling distribution
of sample proportions needs to be taken into consideration. The mean of
the sampling distribution of sample proportions is usually denoted as $\mu_{\hat{p}} = P$ and its standard deviation is denoted as:
$$\sigma_{\hat{p}} = \sqrt{\frac{P(1 - P)}{n}}$$
Since the value of $P$ is unknown, an unbiased statistic $\hat{p}$ will be used for $P$. The mean and standard deviation are rewritten respectively as:
$$\mu_{\hat{p}} = \hat{p}$$
and
$$\sigma_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Invoking the central limit theorem, the sampling distribution of sample proportions is approximately normal—provided that the sample is reasonably large and unskewed.
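Suppose the following probability is calculated:
$$P\left(-z^* < \frac{\hat{p} - P}{\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}} < z^*\right) = C,$$
where $0 < C < 1$ and $\pm z^*$ are the standard critical values.
The sampling distribution of sample proportions is approximately normal
when it satisfies the requirements of the central limit theorem.
Multiplying the inequality through by the standard deviation and solving for $P$ gives:
$$\hat{p} - z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < P < \hat{p} + z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
From this algebraic work, it is evident with a level of certainty $C$ that $P$ could fall in between the values of:
$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$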
Conditions for inference
In general the formula used for estimating a population proportion
requires substitutions of known numerical values. However, these
numerical values cannot be "blindly" substituted into the formula
because statistical inference
requires that the estimation of an unknown parameter be justifiable.
For a parameter's estimation to be justifiable, there are three
conditions that need to be verified:
The data's individual observations have to be obtained from a simple random sample of the population of interest.
The data's individual observations have to display normality. This can be assumed mathematically with the following definition:
Let $n$ be the sample size of a given random sample and let $\hat{p}$ be its sample proportion. If $n\hat{p} \geq 10$ and $n(1 - \hat{p}) \geq 10$, then the data's individual observations display normality.
The data's individual observations have to be independent of each other. This can be assumed mathematically with the following definition:
Let $N$ be the size of the population of interest and let $n$ be the sample size of a simple random sample of the population. If $N \geq 10n$, then the data's individual observations are independent of each other.
The conditions for SRS, normality, and independence are sometimes referred to as the conditions for the inference tool box in most statistical textbooks.
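The two numerical conditions above can be checked mechanically. The following is a minimal sketch, not a standard library routine: the function name `check_inference_conditions` and its `is_srs` flag are hypothetical, and whether a sample really is a simple random sample has to be judged from the study design, so it is passed in rather than computed.

```python
def check_inference_conditions(successes, n, population_size, is_srs=True):
    """Report which of the three conditions for inference hold.

    successes       -- count of successes in the sample (x)
    n               -- sample size
    population_size -- size N of the population of interest
    is_srs          -- caller's judgement that the data come from a simple
                       random sample (cannot be verified from the numbers)
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    return {
        "srs": is_srs,
        # n * p_hat >= 10 and n * (1 - p_hat) >= 10, written with integer
        # counts (n * p_hat equals successes) to avoid rounding error.
        "normality": successes >= 10 and n - successes >= 10,
        # The population must be at least ten times the sample size.
        "independence": population_size >= 10 * n,
    }


print(check_inference_conditions(successes=272, n=400, population_size=4000))
```

Returning a dictionary rather than a single boolean keeps it visible which condition failed when one does.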
Example
Suppose a presidential election is taking place in a democracy. A
random sample of 400 eligible voters in the democracy's voter population
shows that 272 voters support candidate B. A political scientist wants
to determine what percentage of the voter population support candidate
B.
To answer the political scientist's question, a one-sample
proportion in the Z-interval with a confidence level of 95% can be
constructed in order to determine the population proportion of eligible
voters in this democracy that support candidate B.
Solution
It is known from the random sample that $\hat{p} = \frac{272}{400} = 0.68$ with sample size $n = 400$. Before a confidence interval is constructed, the conditions for inference will be verified.
Since a random sample of 400 voters was obtained from the voting
population, the condition for a simple random sample has been met.
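Let $n = 400$ and $\hat{p} = 0.68$; it will be checked whether $n\hat{p} \geq 10$ and $n(1 - \hat{p}) \geq 10$:
$$(400)(0.68) = 272 \geq 10$$
and
$$(400)(1 - 0.68) = 128 \geq 10$$
The condition for normality has been met.
Let $N$ be the size of the voter population in this democracy, and let $n = 400$. If $N \geq 10n$, then there is independence.
$$N \geq 10(400) = 4000$$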
The population size for this democracy's voters can be assumed to be at least 4,000. Hence, the condition for independence has been met.
With the conditions for inference verified, it is permissible to construct a confidence interval.
Figure: The standard normal curve with $z^* = 1.96$, which gives an upper tail area of 0.0250 and an area of 0.9750 to its left. Table: standard normal probabilities.
By examining a standard normal bell curve, the value for $z^*$
can be determined by identifying which standard score gives the
standard normal curve an upper tail area of 0.0250 or an area of 1 –
0.0250 = 0.9750. The value for $z^*$ can also be found through a table of standard normal probabilities.
From a table of standard normal probabilities, the value of $z$ that gives an area of 0.9750 is 1.96. Hence, the value for $z^*$ is 1.96.
The values for $\hat{p} = 0.68$, $n = 400$, $z^* = 1.96$ can now be substituted into the formula for one-sample proportion in the Z-interval:
$$0.68 \pm 1.96 \sqrt{\frac{0.68(1 - 0.68)}{400}} \Rightarrow 0.63429 < P < 0.72571$$
Based on the conditions of inference and the formula for the
one-sample proportion in the Z-interval, it can be concluded with a 95%
confidence level that the percentage of the voter population in this
democracy supporting candidate B is between 63.429% and 72.571%.
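The arithmetic of this example can be reproduced with a short sketch. The function name `one_sample_z_interval` is hypothetical. The critical value is taken from Python's `statistics.NormalDist` rather than from a printed table, so it is about 1.95996 instead of the rounded 1.96, and digits beyond the third decimal place may differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_interval(successes, n, confidence=0.95):
    """Return (lower, upper) bounds of the one-sample proportion Z-interval."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    if not 0 < confidence < 1:
        raise ValueError("confidence must lie strictly between 0 and 1")
    p_hat = successes / n
    # z* leaves an upper tail area of (1 - C) / 2 under the standard normal curve.
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = one_sample_z_interval(successes=272, n=400, confidence=0.95)
print(f"{low:.3f} {high:.3f}")  # prints "0.634 0.726"
```

The sketch assumes the conditions for inference have already been verified; it does not check them.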
Value of the parameter in the confidence interval range
A commonly asked question in inferential statistics is whether the
parameter is included within a confidence interval. The only way to
answer this question is for a census to be conducted. Referring to the
example given above, the probability that the population proportion is
in the range of the confidence interval is either 1 or 0. That is, the
parameter is included in the interval range or it is not. The main
purpose of a confidence interval is to better illustrate what the ideal
value for a parameter could possibly be.
Common errors and misinterpretations from estimation
A very common error that arises from the construction of a confidence
interval is the belief that the level of confidence, such as $C = 95\%$,
means 95% chance. This is incorrect. The level of confidence is based
on a measure of certainty, not probability. Hence, the values of $C$ fall between 0 and 1, exclusively.
Estimation of P using ranked set sampling
A more precise estimate of P can be obtained by choosing ranked set sampling instead of simple random sampling.
In computer science, a universal Turing machine (UTM) is a Turing machine capable of computing any computable sequence, as described by Alan Turing in his seminal paper "On Computable Numbers, with an Application to the Entscheidungsproblem". Common sense might say that a universal machine is impossible, but Turing proved that it is possible. He suggested that we may compare a human in the process of computing a
real number to a machine that is only capable of a finite number of
conditions $q_1, q_2, \ldots, q_R$, which will be called "m-configurations". He then described the operation of such a machine, as described below, and argued:
It is my contention that these operations include all those which are used in the computation of a number.
Turing introduced the idea of such a machine in 1936–1937.
Martin Davis
makes a persuasive argument that Turing's conception of what is now
known as "the stored-program computer", of placing the "action
table"—the instructions for the machine—in the same "memory" as the
input data, strongly influenced John von Neumann's conception of the first American discrete-symbol (as opposed to analog) computer—the EDVAC. Davis quotes Time
magazine to this effect, that "everyone who taps at a keyboard ... is
working on an incarnation of a Turing machine", and that "John von
Neumann [built] on the work of Alan Turing".
Davis makes a case that Turing's Automatic Computing Engine (ACE) computer "anticipated" the notions of microprogramming (microcode) and RISC processors. Donald Knuth cites Turing's work on the ACE computer as designing "hardware to facilitate subroutine linkage"; Davis also references this work as Turing's use of a hardware "stack".
As the Turing machine was encouraging the construction of computers, the UTM was encouraging the development of the fledgling computer sciences. An early, if not the first, assembler was proposed "by a young hot-shot programmer" for the EDVAC. Von Neumann's "first serious program ... [was] to simply sort data efficiently". Knuth observes that the subroutine return embedded in the program
itself rather than in special registers is attributable to von Neumann
and Goldstine. Knuth furthermore states that
The first interpretive routine may
be said to be the "Universal Turing Machine" ... Interpretive routines
in the conventional sense were mentioned by John Mauchly in his lectures at the Moore School
in 1946 ... Turing took part in this development also; interpretive
systems for the Pilot ACE computer were written under his direction.
Davis briefly mentions operating systems and compilers as outcomes of the notion of program-as-data.
Mathematical theory
With this encoding of action tables as strings, it becomes possible,
in principle, for Turing machines to answer questions about the
behaviour of other Turing machines. Most of these questions, however,
are undecidable,
meaning that the function in question cannot be calculated
mechanically. For instance, the problem of determining whether an
arbitrary Turing machine will halt on a particular input, or on all
inputs, known as the Halting problem, was shown to be, in general, undecidable in Turing's original paper. Rice's theorem shows that any non-trivial question about the output of a Turing machine is undecidable.
A universal Turing machine can calculate any recursive function, decide any recursive language, and accept any recursively enumerable language. According to the Church–Turing thesis, the problems solvable by a universal Turing machine are exactly those problems solvable by an algorithm or an effective method of computation,
for any reasonable definition of those terms. For these reasons, a
universal Turing machine serves as a standard against which to compare
computational systems, and a system that can simulate a universal Turing
machine is called Turing complete.
An abstract version of the universal Turing machine is the universal function, a computable function that can be used to calculate any other computable function. The UTM theorem proves the existence of such a function.
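The idea of a single routine that can carry out any machine's computation when given that machine's description as data can be illustrated with a small interpreter. This is only a sketch in a host language, not a Turing machine itself: the function name `run_turing_machine`, the dictionary representation of the action table, and the step limit are all assumptions made for illustration. The sample table is a four-instruction machine that prints 0 and 1 on alternate squares.

```python
def run_turing_machine(table, start_state, steps, blank=" "):
    """Interpret an action table for at most `steps` steps.

    `table` maps (state, scanned_symbol) to (symbol_to_write, move, next_state),
    where move is "L", "R" or "N" (no move). The tape is stored sparsely as a
    dict from position to symbol; unwritten squares read as `blank`.
    """
    tape = {}
    head, state = 0, start_state
    for _ in range(steps):
        key = (state, tape.get(head, blank))
        if key not in table:  # no applicable instruction: stop early
            break
        write, move, state = table[key]
        tape[head] = write
        head += {"L": -1, "R": 1, "N": 0}[move]
    return tape, head, state


# The machine is plain data handed to the interpreter.
ALTERNATE = {
    ("q1", " "): ("0", "R", "q2"),
    ("q2", " "): (" ", "R", "q3"),
    ("q3", " "): ("1", "R", "q4"),
    ("q4", " "): (" ", "R", "q1"),
}

tape, head, state = run_turing_machine(ALTERNATE, "q1", steps=8)
# Show the first eight squares, with "." standing for a blank.
print("".join(tape.get(i, " ").replace(" ", ".") for i in range(8)))  # prints "0.1.0.1."
```

Because the table is an ordinary value, the same `run_turing_machine` function runs any other machine supplied in this format, which is the sense in which one interpreter stands in for all of them.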
Efficiency
Without loss of generality, the input of a Turing machine can be
assumed to be in the alphabet {0, 1}; any other finite alphabet can be
encoded over {0, 1}. The behavior of a Turing machine M is
determined by its transition function. This function can be easily
encoded as a string over the alphabet {0, 1} as well. The size of the
alphabet of M, the number of tapes it has, and the size of the
state space can be deduced from the transition function's table. The
distinguished states and symbols can be identified by their position,
e.g. the first two states can by convention be the start and stop
states. Consequently, every Turing machine can be encoded as a string
over the alphabet {0, 1}. Additionally, we adopt the convention that every invalid
encoding maps to a trivial Turing machine that immediately halts, and
that every Turing machine can have an infinite number of encodings by
padding the encoding with an arbitrary number of (say) 1's at the end,
just like comments work in a programming language. It should be no
surprise that we can achieve this encoding given the existence of a Gödel number and computational equivalence between Turing machines and μ-recursive functions. Similarly, our construction associates to every binary string $\alpha$ a Turing machine $M_\alpha$.
Starting from the above encoding, in 1966 F. C. Hennie and R. E. Stearns showed that given a Turing machine $M_\alpha$ that halts on input $x$ within $N$ steps, there exists a multi-tape universal Turing machine that halts on inputs $\alpha$, $x$ (given on different tapes) in $CN \log N$ steps, where $C$ is a machine-specific constant that does not depend on the length of the input $x$, but does depend on $M$'s alphabet size, number of tapes, and number of states. Effectively this is an $\mathcal{O}(N \log N)$ simulation, using Donald Knuth's Big O notation. The corresponding result for space-complexity rather than time-complexity is that we can simulate in a way that uses at most $CN$ cells at any stage of the computation, an $\mathcal{O}(N)$ simulation.
Smallest machines
When Alan Turing came up with the idea of a universal machine he had
in mind the simplest computing model powerful enough to calculate all
possible functions that can be calculated. Claude Shannon
first explicitly posed the question of finding the smallest possible
universal Turing machine in 1956. He showed that two symbols were
sufficient so long as enough states were used (or vice versa), and that
it was always possible to exchange states for symbols. He also showed
that no universal Turing machine of one state could exist.
Marvin Minsky discovered a 7-state 4-symbol universal Turing machine in 1962 using 2-tag systems.
Other small universal Turing machines have since been found by Yurii
Rogozhin and others by extending this approach of tag system simulation.
If we denote by (m, n) the class of UTMs with m states and n symbols, the following tuples have been found: (15, 2), (9, 3), (6, 4), (5, 5), (4, 6), (3, 9), and (2, 18). Rogozhin's (4, 6) machine uses only 22 instructions, and no standard UTM of lesser descriptional complexity is known.
However, generalizing the standard Turing machine model admits
even smaller UTMs. One such generalization is to allow an infinitely
repeated word on one or both sides of the Turing machine input, thus
extending the definition of universality and known as "semi-weak" or
"weak" universality, respectively. Small weakly universal Turing
machines that simulate the Rule 110 cellular automaton have been given for the (6, 2), (3, 3), and (2, 4) state–symbol pairs. The proof of universality for Wolfram's 2-state 3-symbol Turing machine
further extends the notion of weak universality by allowing certain
non-periodic initial configurations. Other variants on the standard
Turing machine model that yield small UTMs include machines with
multiple tapes or tapes of multiple dimension, and machines coupled with
a finite automaton.
Machines with no internal states
If multiple heads reading successive tape positions are allowed on a
Turing machine, then no internal states are required, as "states" can be
encoded in the tape. For example, consider a tape with 6 colours: 0, 1,
2, 0A, 1A, 2A. Consider a tape such as 0,0,1,2,2A,0,2,1 where a 3-headed
Turing machine is situated over the triple (2,2A,0). The rules then
convert any triple to another triple and move the 3-heads left or right.
For example, the rules might convert (2,2A,0) to (2,1,0) and move the
head left. Thus in this example, the machine acts like a 3-colour Turing
machine with internal states A and B (represented by no letter). The
case for a 2-headed Turing machine is very similar. Thus a 2-headed
Turing machine without internal states can be universal with 6 colours.
It is not known what the smallest number of colours needed for a
multi-headed Turing machine is or if a 2-colour universal Turing machine
without internal states is possible with multiple heads. It also means
that rewrite rules
are Turing complete since the triple rules are equivalent to rewrite
rules. Extending the tape to two dimensions with a head sampling a
letter and its 8 neighbours, only 2 colours are needed, as for example, a
colour can be encoded in a vertical triple pattern such as 110.
Also, if the distance between the two heads is variable (the tape has "slack" between the heads), then it can simulate any Post tag system, some of which are universal.
Example of coding
For those who would undertake the challenge of designing a UTM exactly as Turing specified, see the article by Davies in Copeland (2004).
Davies corrects the errors in the original and shows what a sample run
would look like. He successfully ran a (somewhat simplified) simulation.
Turing used seven symbols { A, C, D, R, L, N, ; } to encode each 5-tuple; as described in the article Turing machine, his 5-tuples are only of types N1, N2, and N3. The number of each "m‑configuration"
(instruction, state) is represented by "D" followed by a unary string
of A's, e.g. "q3" = DAAA. In a similar manner, he encodes the symbols
blank as "D", the symbol "0" as "DC", the symbol "1" as DCC, etc. The
symbols "R", "L", and "N" remain as is.
After encoding, each 5-tuple is "assembled" into a string in the order shown in the following table:

| Current m‑configuration | Tape symbol | Print-operation | Tape-motion | Final m‑configuration | Current m‑configuration code | Tape symbol code | Print-operation code | Tape-motion code | Final m‑configuration code | 5-tuple assembled code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| q1 | blank | P0 | R | q2 | DA | D | DC | R | DAA | DADDCRDAA |
| q2 | blank | E | R | q3 | DAA | D | D | R | DAAA | DAADDRDAAA |
| q3 | blank | P1 | R | q4 | DAAA | D | DCC | R | DAAAA | DAAADDCCRDAAAA |
| q4 | blank | E | R | q1 | DAAAA | D | D | R | DA | DAAAADDRDA |

Finally, the codes for all four 5-tuples are strung together into a code started by ";" and separated by ";" i.e.:
;DADDCRDAA;DAADDRDAAA;DAAADDCCRDAAAA;DAAAADDRDA
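The encoding rules described above are simple enough to sketch in a few lines. The function names below are hypothetical, and the sketch covers only the three tape symbols whose codes are given in the text (blank, "0" and "1"). It also treats the erase operation E as printing a blank, which is how the table above codes it.

```python
# Codes for tape symbols as given in the text: blank -> D, "0" -> DC, "1" -> DCC.
SYMBOL_CODES = {"blank": "D", "0": "DC", "1": "DCC"}


def encode_state(index):
    """Encode m-configuration q_index as "D" followed by `index` letters "A"."""
    return "D" + "A" * index


def encode_tuple(current, scanned, printed, move, final):
    """Assemble one 5-tuple: state, scanned symbol, printed symbol, move, next state."""
    if move not in ("R", "L", "N"):
        raise ValueError("move must be 'R', 'L' or 'N'")
    return (
        encode_state(current)
        + SYMBOL_CODES[scanned]
        + SYMBOL_CODES[printed]
        + move
        + encode_state(final)
    )


def encode_machine(tuples):
    """String the 5-tuple codes together, each one preceded by ';'."""
    return "".join(";" + encode_tuple(*t) for t in tuples)


# The four instructions of the table; E (erase) is written as printing "blank".
EXAMPLE = [
    (1, "blank", "0", "R", 2),
    (2, "blank", "blank", "R", 3),
    (3, "blank", "1", "R", 4),
    (4, "blank", "blank", "R", 1),
]

print(encode_machine(EXAMPLE))
# prints ";DADDCRDAA;DAADDRDAAA;DAAADDCCRDAAAA;DAAAADDRDA"
```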
This code he placed on alternate squares—the "F-squares" – leaving
the "E-squares" (those liable to erasure) empty. The final assembly of
the code on the tape for the U-machine consists of placing two special
symbols ("e") one after the other, then the code separated out on
alternate squares, and lastly the double-colon symbol "::" (blanks shown here with "." for clarity):
The U-machine's action-table (state-transition table) is responsible
for decoding the symbols. Turing's action table keeps track of its place
with markers "u", "v", "x", "y", "z" by placing them in "E-squares" to
the right of "the marked symbol" – for example, to mark the current
instruction, z is placed to the right of ";", while x keeps
the place with respect to the current "m‑configuration" DAA. The
U-machine's action table will shuttle these symbols around (erasing them
and placing them in different locations) as the computation progresses:
Turing's action-table for his U-machine is very involved.
Roger Penrose
provides examples of ways to encode instructions for the universal
machine using only binary symbols { 0, 1 }, or { blank, mark | }.
Penrose goes further and writes out his entire U-machine code. He
asserts that it truly is a U-machine code, an enormous number that spans
almost 2 full pages of 1's and 0's.
Asperti and Ricciotti described a multi-tape UTM defined by
composing elementary machines with very simple semantics, rather than
explicitly giving its full action table. This approach was sufficiently
modular to allow them to formally prove the correctness of the machine
in the Matita proof assistant.
Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.
Inferential statistics can be contrasted with descriptive statistics.
Descriptive statistics is solely concerned with properties of the
observed data, and it does not rest on the assumption that the data come
from a larger population. In machine learning, the term inference is sometimes used instead to mean "make a prediction, by evaluating an already trained model"; in this context inferring properties of the model is referred to as training or learning (rather than inference), and using a model for prediction is referred to as inference (instead of prediction); see also predictive inference.
Introduction
Statistical inference makes propositions about a population, using data drawn from the population with some form of sampling. Given a hypothesis about a population, for which we wish to draw inferences, statistical inference consists of (first) selecting a statistical model of the process that generates the data and (second) deducing propositions from the model.
Konishi and Kitagawa state "The majority of the problems in
statistical inference can be considered to be problems related to
statistical modeling". Relatedly, Sir David Cox
has said, "How [the] translation from subject-matter problem to
statistical model is done is often the most critical part of an
analysis".
The conclusion of a statistical inference is a statistical proposition. Some common forms of statistical proposition are the following:
a point estimate, i.e. a particular value that best approximates some parameter of interest;
an interval estimate, e.g. a confidence interval (or set estimate). A confidence interval
is an interval constructed using data from a sample, such that if the
procedure were repeated over many independent samples (mathematically,
by taking the limit), a fixed proportion (e.g., 95% for a 95% confidence
interval) of the resulting intervals would contain the true value of
the parameter, i.e., the population parameter;
a credible interval, i.e. a set of values containing, for example, 95% of posterior belief.
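The repeated-sampling reading of a confidence interval given in the list above can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not a prescribed method: it draws many independent samples from a population with a known proportion, builds a normal-approximation interval for a proportion from each, and reports the fraction of intervals that contain the true value. The function name `coverage_rate`, the true proportion of 0.68, the sample size of 400 and the fixed seed are all arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage_rate(true_p, n, confidence=0.95, repetitions=2000, seed=0):
    """Fraction of simulated intervals that contain the true proportion."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    hits = 0
    for _ in range(repetitions):
        # Draw one sample of size n and count its successes.
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / repetitions


print(coverage_rate(true_p=0.68, n=400))
```

The printed fraction should be close to, but not exactly, 0.95: the interval is itself an approximation, and a finite number of repetitions adds simulation noise.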
Any statistical inference requires some assumptions. A statistical model
is a set of assumptions concerning the generation of the observed data
and similar data. Descriptions of statistical models usually emphasize
the role of population quantities of interest, about which we wish to
draw inference. Descriptive statistics are typically used as a preliminary step before more formal inferences are drawn.
Degree of models/assumptions
Statisticians distinguish between three levels of modeling assumptions:
Fully parametric:
The probability distributions describing the data-generation process
are assumed to be fully described by a family of probability
distributions involving only a finite number of unknown parameters. For example, one may assume that the distribution of population values
is truly Normal, with unknown mean and variance, and that datasets are
generated by 'simple' random sampling. The family of generalized linear models is a widely used and flexible class of parametric models.
Non-parametric: The assumptions made about the process generating the data are much less than in parametric statistics and may be minimal. For example, every continuous probability distribution has a median, which may be estimated using the sample median or the Hodges–Lehmann–Sen estimator, which has good properties when the data arise from simple random sampling.
Semi-parametric:
This term typically implies assumptions 'in between' fully and
non-parametric approaches. For example, one may assume that a population
distribution has a finite mean. Furthermore, one may assume that the
mean response level in the population depends in a truly linear manner
on some covariate (a parametric assumption) but not make any parametric
assumption describing the variance around that mean (i.e. about the
presence or possible form of any heteroscedasticity).
More generally, semi-parametric models can often be separated into
'structural' and 'random variation' components. One component is treated
parametrically and the other non-parametrically. The well-known Cox model is a set of semi-parametric assumptions.
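As a concrete instance of the non-parametric estimators mentioned above, the following sketch computes the one-sample Hodges–Lehmann estimate as the median of all pairwise averages (the Walsh averages), with each observation also paired with itself. The function name `hodges_lehmann` is hypothetical, and the quadratic-time enumeration is chosen for clarity rather than efficiency.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(sample):
    """Median of the Walsh averages (a + b) / 2 over all pairs with repetition."""
    if not sample:
        raise ValueError("sample must not be empty")
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(sample, 2)]
    return median(walsh)


data = [1.0, 2.0, 3.0, 100.0]
print(median(data), hodges_lehmann(data))  # prints "2.5 2.75"
```

In this small data set both estimates stay near the bulk of the observations despite the outlying value of 100.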
Figure: A histogram assessing the assumption of normality, which can be illustrated through the even spread underneath the bell curve.
Whatever level of assumption is made, correctly calibrated inference,
in general, requires these assumptions to be correct; i.e. that the
data-generating mechanisms really have been correctly specified.
Incorrect assumptions of 'simple' random sampling can invalidate statistical inference. More complex semi- and fully parametric assumptions are also cause for
concern. For example, incorrectly assuming the Cox model can in some
cases lead to faulty conclusions. Incorrect assumptions of Normality in the population also invalidate some forms of regression-based inference. The use of any
parametric model is viewed skeptically by most experts in sampling
human populations: "most sampling statisticians, when they deal with
confidence intervals at all, limit themselves to statements about
[estimators] based on very large samples, where the central limit
theorem ensures that these [estimators] will have distributions that are
nearly normal." In particular, a normal distribution "would be a totally unrealistic
and catastrophically unwise assumption to make if we were dealing with
any kind of economic population." Here, the central limit theorem states that the distribution of the
sample mean "for very large samples" is approximately normally
distributed, if the distribution is not heavy-tailed.
Given the difficulty in specifying exact distributions of sample
statistics, many methods have been developed for approximating these.
With finite samples, approximation results measure how close a limiting distribution approaches the statistic's sample distribution: For example, with 10,000 independent samples the normal distribution approximates (to two digits of accuracy) the distribution of the sample mean for many population distributions, by the Berry–Esseen theorem. Yet for many practical purposes, the normal approximation provides a
good approximation to the sample-mean's distribution when there are 10
(or more) independent samples, according to simulation studies and
statisticians' experience. Following Kolmogorov's work in the 1950s, advanced statistics uses approximation theory and functional analysis to quantify the error of approximation. In this approach, the metric geometry of probability distributions is studied; this approach quantifies approximation error with, for example, the Kullback–Leibler divergence, Bregman divergence, and the Hellinger distance.
With indefinitely large samples, limiting results like the central limit theorem
describe the sample statistic's limiting distribution if one exists.
Limiting results are not statements about finite samples, and indeed are
irrelevant to finite samples. However, the asymptotic theory of limiting distributions is often
invoked for work with finite samples. For example, limiting results are
often invoked to justify the generalized method of moments and the use of generalized estimating equations, which are popular in econometrics and biostatistics.
The magnitude of the difference between the limiting distribution and
the true distribution (formally, the 'error' of the approximation) can
be assessed using simulation. The heuristic application of limiting results to finite samples is
common practice in many applications, especially with low-dimensional models with log-concave likelihoods (such as with one-parameter exponential families).
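Assessing the approximation error by simulation can be sketched as follows. This is an illustration under stated assumptions: the population is taken to be exponential with rate 1 (a skewed distribution chosen arbitrarily), the statistic is the sample mean, and the error is measured as the largest gap between the simulated distribution function of the standardized mean and the standard normal distribution function. The function name `kolmogorov_distance_of_mean` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def kolmogorov_distance_of_mean(n, repetitions=10000, seed=0):
    """Estimate sup |F_n(x) - Phi(x)| for the standardized mean of n Exp(1) draws."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Exp(1) has mean 1 and standard deviation 1, so the mean of n draws has
    # mean 1 and standard deviation 1 / sqrt(n).
    z = sorted(
        (sum(rng.expovariate(1.0) for _ in range(n)) / n - 1.0) * sqrt(n)
        for _ in range(repetitions)
    )
    worst = 0.0
    for i, value in enumerate(z):
        phi = std_normal.cdf(value)
        # The empirical distribution function jumps from i/N to (i+1)/N here.
        worst = max(worst,
                    abs((i + 1) / repetitions - phi),
                    abs(i / repetitions - phi))
    return worst


for n in (10, 100):
    print(n, round(kolmogorov_distance_of_mean(n), 3))
```

The reported distance includes Monte Carlo noise from the finite number of repetitions, so it is an estimate of the approximation error rather than its exact value.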
For a given dataset that was produced by a randomization design, the
randomization distribution of a statistic (under the null-hypothesis) is
defined by evaluating the test statistic
for all of the plans that could have been generated by the
randomization design. In frequentist inference, the randomization allows
inferences to be based on the randomization distribution rather than a
subjective model, and this is important especially in survey sampling
and design of experiments. Statistical inference from randomized studies is also more straightforward than many other situations. In Bayesian inference, randomization is also of importance: in survey sampling, use of sampling without replacement ensures the exchangeability of the sample with the population; in randomized experiments, randomization warrants a missing at random assumption for covariate information.
Objective randomization allows properly inductive procedures. Many statisticians prefer randomization-based analysis of data that was generated by well-defined randomization procedures. (However, it is true that in fields of science with developed
theoretical knowledge and experimental control, randomized experiments
may increase the costs of experimentation without improving the quality
of inferences.) Similarly, results from randomized experiments
are recommended by leading statistical authorities as allowing
inferences with greater reliability than do observational studies of the
same phenomena. However, a good observational study may be better than a bad randomized experiment.
The statistical analysis of a randomized experiment may be based
on the randomization scheme stated in the experimental protocol and does
not need a subjective model.
However, at any time, some hypotheses cannot be tested using
objective statistical models, which accurately describe randomized
experiments or random samples. In some cases, such randomized studies
are uneconomical or unethical.
Model-based analysis of randomized experiments
It is standard practice to refer to a statistical model, e.g., a
linear or logistic model, when analyzing data from randomized
experiments. However, the randomization scheme guides the choice of a statistical
model. It is not possible to choose an appropriate model without knowing
the randomization scheme. Seriously misleading results can be obtained analyzing data from
randomized experiments while ignoring the experimental protocol; common
mistakes include forgetting the blocking used in an experiment and
confusing repeated measurements on the same experimental unit with
independent replicates of the treatment applied to different
experimental units.
Model-free randomization inference
Model-free techniques provide a complement to model-based methods,
which employ reductionist strategies of reality-simplification. The
former combine, evolve, ensemble and train algorithms dynamically
adapting to the contextual affinities of a process and learning the
intrinsic characteristics of the observations.
For example, model-free simple linear regression is based either on:
a random design, where the pairs of observations $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ are independent and identically distributed (iid),
or a deterministic design, where the variables $X_1, X_2, \ldots, X_n$ are deterministic, but the corresponding response variables $Y_1, Y_2, \ldots, Y_n$ are random and independent with a common conditional distribution, i.e., $P(Y_j \le y \mid X_j = x) = D_x(y)$, which is independent of the index $j$.
In either case, the model-free randomization inference for features of the common conditional distribution $D_x(\cdot)$
relies on some regularity conditions, e.g. functional smoothness. For
instance, the population feature conditional mean, $\mu(x) = E(Y \mid X = x)$, can be consistently estimated via local averaging or local polynomial fitting, under the assumption that
$\mu(x)$ is smooth. Also, relying on asymptotic normality or resampling, we can
construct confidence intervals for the population feature, in this case,
the conditional mean, $\mu(x)$.
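A minimal sketch of local averaging for the conditional mean is shown below. It assumes a random design with simulated data; the true regression function (a sine curve), the noise level, the bandwidth and the function name `local_average` are all illustrative choices, not part of the method described above.

```python
import math
import random


def local_average(x0, xs, ys, bandwidth):
    """Estimate the conditional mean at x0 by averaging the responses
    whose predictor lies within `bandwidth` of x0 (a box-kernel smoother)."""
    neighbours = [y for x, y in zip(xs, ys) if abs(x - x0) <= bandwidth]
    if not neighbours:
        raise ValueError("no observations within the bandwidth")
    return sum(neighbours) / len(neighbours)


# Simulated random design: iid pairs with a smooth conditional mean.
rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(2000)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]

# The true conditional mean at x = 0.25 is sin(pi / 2) = 1.
estimate = local_average(0.25, xs, ys, bandwidth=0.05)
```

The bandwidth controls the usual trade-off: a wider window averages more observations and reduces noise, but it blurs the curve and adds bias wherever the conditional mean changes quickly.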
Paradigms for inference
Different schools of statistical inference have become established.
These schools—or "paradigms"—are not mutually exclusive, and methods
that work well under one paradigm often have attractive interpretations
under other paradigms.
The frequentist paradigm calibrates the plausibility of propositions by
considering (notional) repeated sampling of a population distribution to
produce datasets similar to the one at hand. By considering the
dataset's characteristics under repeated sampling, the frequentist
properties of a statistical proposition can be quantified—although in
practice this quantification may be challenging.
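The idea of notional repeated sampling can be sketched by simulation. The example below assumes a known true proportion and repeatedly draws samples, checking how often a standard Wald 95% confidence interval contains the true value. The true proportion, sample size and number of repetitions are illustrative assumptions.

```python
import math
import random


def wald_interval(successes, n, z=1.96):
    """Approximate 95% Wald confidence interval for a proportion."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(true_p, n, repetitions, seed=0):
    """Fraction of simulated samples whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # Draw one sample of size n from the assumed population.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wald_interval(successes, n)
        hits += low <= true_p <= high
    return hits / repetitions


rate = coverage(true_p=0.72, n=400, repetitions=2000)
```

The resulting rate estimates a frequentist property of the procedure (its coverage), not a property of any single interval computed from one dataset.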
Frequentist inference, objectivity, and decision theory
One interpretation of frequentist inference (or classical inference) is that it is applicable only in terms of frequency probability; that is, in terms of repeated sampling from a population. However, the approach of Neyman develops these procedures in terms of pre-experiment probabilities.
That is, before undertaking an experiment, one decides on a rule for
coming to a conclusion such that the probability of being correct is
controlled in a suitable way: such a probability need not have a
frequentist or repeated sampling interpretation. In contrast, Bayesian
inference works in terms of conditional probabilities (i.e.
probabilities conditional on the observed data), compared to the
marginal (but conditioned on unknown parameters) probabilities used in
the frequentist approach.
The frequentist procedures of significance testing and confidence intervals can be constructed without regard to utility functions. However, some elements of frequentist statistics, such as statistical decision theory, do incorporate utility functions. In particular, frequentist developments of optimal inference (such as minimum-variance unbiased estimators, or uniformly most powerful testing) make use of loss functions,
which play the role of (negative) utility functions. Loss functions
need not be explicitly stated for statistical theorists to prove that a
statistical procedure has an optimality property. However, loss-functions are often useful for stating optimality
properties: for example, median-unbiased estimators are optimal under absolute value loss functions, in that they minimize expected loss, and least squares estimators are optimal under squared error loss functions, in that they minimize expected loss.
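A simpler, related fact can be checked numerically: within a fixed sample, the median minimizes the average absolute error and the mean minimizes the average squared error. The sketch below is not a proof of the estimator optimality results stated above. It searches a grid of candidate values for the minimizer of each loss, using invented data.

```python
import statistics


def mean_loss(estimate, data, loss):
    """Average loss incurred by summarising `data` with `estimate`."""
    return sum(loss(x - estimate) for x in data) / len(data)


# Hypothetical sample with one outlier, invented for illustration.
data = [1.0, 2.0, 2.5, 3.0, 10.0]

# Candidate summaries on a grid from 0.00 to 10.00.
candidates = [i / 100 for i in range(0, 1001)]

best_abs = min(candidates, key=lambda c: mean_loss(c, data, abs))
best_sq = min(candidates, key=lambda c: mean_loss(c, data, lambda e: e * e))
```

Here `best_abs` coincides with the sample median and `best_sq` with the sample mean, which shows how the choice of loss function determines which summary counts as optimal.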
While statisticians using frequentist inference must choose for themselves the parameters of interest, and the estimators and test statistics
to be used, the absence of obviously explicit utilities and prior
distributions has helped frequentist procedures to become widely viewed
as 'objective'.
The Bayesian calculus describes degrees of belief using the
'language' of probability; beliefs are positive, integrate to one, and
obey probability axioms. Bayesian inference uses the available
posterior beliefs as the basis for making statistical propositions. There are several different justifications for using the Bayesian approach.
Bayesian inference, subjectivity and decision theory
Many informal Bayesian inferences are based on "intuitively
reasonable" summaries of the posterior. For example, the posterior mean,
median and mode, highest posterior density intervals, and Bayes Factors
can all be motivated in this way. While a user's utility function
need not be stated for this sort of inference, these summaries do all
depend (to some extent) on stated prior beliefs, and are generally
viewed as subjective conclusions. (Methods of prior construction which
do not require external input have been proposed but not yet fully developed.)
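The dependence of posterior summaries on the prior can be sketched with a conjugate Beta-binomial update for a proportion. The example reuses the survey counts quoted earlier (1,440 of 2,000 adults); the uniform Beta(1, 1) prior, the alternative Beta(50, 50) prior and the function names are assumptions made only for illustration.

```python
def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Conjugate update: a Beta(prior_a, prior_b) prior combined with
    binomial data gives a Beta(prior_a + successes, prior_b + failures)
    posterior."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_mode(a, b):
    """Mode of a Beta(a, b) distribution (valid for a > 1 and b > 1)."""
    return (a - 1) / (a + b - 2)


# Uniform prior (an assumption) with the survey counts 1440 / 2000.
a, b = beta_posterior(successes=1440, failures=560)
posterior_mean = beta_mean(a, b)
posterior_mode = beta_mode(a, b)

# A different, hypothetical prior centred on 0.5 shifts the summary.
a2, b2 = beta_posterior(successes=1440, failures=560, prior_a=50.0, prior_b=50.0)
posterior_mean_informative = beta_mean(a2, b2)
```

Under the uniform prior the posterior mode equals the sample proportion, while the second prior pulls the posterior mean toward 0.5, which is the sense in which these summaries depend on stated prior beliefs.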
Formally, Bayesian inference is calibrated with reference to an
explicitly stated utility, or loss function; the 'Bayes rule' is the one
which maximizes expected utility, averaged over the posterior
uncertainty. Formal Bayesian inference therefore automatically provides optimal decisions in a decision theoretic
sense. Given assumptions, data and utility, Bayesian inference can be
made for essentially any problem, although not every statistical
inference need have a Bayesian interpretation. Analyses which are not
formally Bayesian can be (logically) incoherent; a feature of Bayesian procedures which use proper priors (i.e. those integrable to one) is that they are guaranteed to be coherent. Some advocates of Bayesian inference assert that inference must take place in this decision-theoretic framework, and that Bayesian inference should not conclude with the evaluation and summarization of posterior beliefs.
Likelihood-based inference is a paradigm used to estimate the parameters of a statistical model based on observed data. Likelihoodism approaches statistics by using the likelihood function, denoted $L(x \mid \theta)$, which quantifies the probability of observing the given data $x$, assuming a specific set of parameter values $\theta$.
In likelihood-based inference, the goal is to find the set of parameter
values that maximizes the likelihood function, or equivalently,
maximizes the probability of observing the given data.
The process of likelihood-based inference usually involves the following steps:
Formulating the statistical model: A statistical model is
defined based on the problem at hand, specifying the distributional
assumptions and the relationship between the observed data and the
unknown parameters. The model can be simple, such as a normal
distribution with known variance, or complex, such as a hierarchical
model with multiple levels of random effects.
Constructing the likelihood function: Given the statistical model,
the likelihood function is constructed by evaluating the joint
probability density or mass function of the observed data as a function
of the unknown parameters. This function represents the probability of
observing the data for different values of the parameters.
Maximizing the likelihood function: The next step is to find the set
of parameter values that maximizes the likelihood function. This is
typically done with numerical optimization
algorithms. The estimated parameter values, often denoted as $\hat{\theta}$, are the maximum likelihood estimates (MLEs).
Assessing uncertainty: Once the MLEs are obtained, it is crucial to
quantify the uncertainty associated with the parameter estimates. This
can be done by calculating standard errors, confidence intervals, or conducting hypothesis tests based on asymptotic theory or simulation techniques such as bootstrapping.
Model checking: After obtaining the parameter estimates and
assessing their uncertainty, it is important to assess the adequacy of
the statistical model. This involves checking the assumptions made in
the model and evaluating the fit of the model to the data using
goodness-of-fit tests, residual analysis, or graphical diagnostics.
Inference and interpretation: Finally, based on the estimated
parameters and model assessment, statistical inference can be performed.
This involves drawing conclusions about the population parameters,
making predictions, or testing hypotheses based on the estimated model.
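The steps above can be sketched for the simplest case, a binomial model for a proportion, using the survey counts quoted earlier (1,440 of 2,000). The golden-section optimizer, the use of the observed information for the standard error and the Wald-type interval are illustrative choices; other optimizers and uncertainty measures would fit the same outline.

```python
import math


def log_likelihood(p, successes, n):
    """Binomial log-likelihood, up to a constant that does not depend on p."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)


def maximize(f, low, high, tol=1e-10):
    """Golden-section search for the maximum of a unimodal function."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = low, high
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


# Steps 1-2: model and likelihood for the observed counts.
successes, n = 1440, 2000

# Step 3: numerical maximization gives the MLE.
p_hat = maximize(lambda p: log_likelihood(p, successes, n), 1e-6, 1 - 1e-6)

# Step 4: uncertainty from the observed information (negative second
# derivative of the log-likelihood at the MLE).
observed_information = successes / p_hat**2 + (n - successes) / (1 - p_hat) ** 2
standard_error = 1 / math.sqrt(observed_information)
interval = (p_hat - 1.96 * standard_error, p_hat + 1.96 * standard_error)
```

For this model the maximizer has a closed form, the sample proportion, so the numerical search serves only to show the general workflow; model checking and interpretation (the last two steps) are not covered by this sketch.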
The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models
for a given set of data. Given a collection of models for the data, AIC
estimates the quality of each model, relative to each of the other
models. Thus, AIC provides a means for model selection.
AIC is founded on information theory:
it offers an estimate of the relative information lost when a given
model is used to represent the process that generated the data. (In
doing so, it deals with the trade-off between the goodness of fit of the model and the simplicity of the model.)
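As a small illustration, AIC is computed as $2k - 2\ln\hat{L}$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood. The sketch below compares two binomial models for the survey counts quoted earlier: one with the proportion fixed at 0.5 (no free parameter) and one with the proportion estimated from the data. The choice of candidate models is an assumption made for illustration.

```python
import math


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L_max)."""
    return 2 * n_parameters - 2 * log_likelihood


def binomial_log_likelihood(p, successes, n):
    """Log of the binomial probability of `successes` out of `n`."""
    return (
        math.log(math.comb(n, successes))
        + successes * math.log(p)
        + (n - successes) * math.log(1 - p)
    )


successes, n = 1440, 2000

# Model A: proportion fixed at 0.5, so no parameter is estimated.
aic_fixed = aic(binomial_log_likelihood(0.5, successes, n), n_parameters=0)

# Model B: proportion estimated by its MLE, one free parameter.
aic_free = aic(
    binomial_log_likelihood(successes / n, successes, n), n_parameters=1
)
```

The model with the lower AIC is preferred. Only differences in AIC between models fitted to the same data are meaningful, not the absolute values.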
The minimum description length (MDL) principle has been developed from ideas in information theory and the theory of Kolmogorov complexity. The MDL principle selects statistical models that maximally compress
the data; inference proceeds without assuming counterfactual or
non-falsifiable "data-generating mechanisms" or probability models for the data, as might be done in frequentist or Bayesian approaches.
However, if a "data generating mechanism" does exist in reality, then according to Shannon's source coding theorem it provides the MDL description of the data, on average and asymptotically. In minimizing description length (or descriptive complexity), MDL estimation is similar to maximum likelihood estimation and maximum a posteriori estimation (using maximum-entropy Bayesian priors).
However, MDL avoids assuming that the underlying probability model is
known; the MDL principle can also be applied without assumptions that
e.g. the data arose from independent sampling.
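A rough sketch of the compression idea compares two ways of encoding a binary sequence. It assumes a crude two-part code in which the data cost is the negative base-2 log-likelihood and each fitted parameter costs about $\tfrac{1}{2}\log_2 n$ bits. This is a common asymptotic approximation, not the full MDL machinery, and the function names and sequences are hypothetical.

```python
import math


def code_length_fixed(bits):
    """Bits needed when every symbol is coded with a fair-coin model
    (one bit per symbol, no parameter to transmit)."""
    return float(len(bits))


def code_length_fitted(bits):
    """Crude two-part code: data coded with the fitted frequency of ones,
    plus (1/2) * log2(n) bits to describe that one parameter."""
    n = len(bits)
    ones = sum(bits)
    p = ones / n
    data_bits = 0.0
    for count, prob in ((ones, p), (n - ones, 1 - p)):
        if count:  # skip symbols that never occur (avoids log2(0))
            data_bits -= count * math.log2(prob)
    return data_bits + 0.5 * math.log2(n)


# Hypothetical sequences, invented for illustration.
skewed = [1] * 90 + [0] * 10
balanced = [1, 0] * 50
```

For the skewed sequence the fitted model gives the shorter description and would be selected. For the balanced sequence the extra parameter buys no compression, so the simpler fixed model is preferred.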
Fiducial inference was an approach to statistical inference based on fiducial probability,
also known as a "fiducial distribution". In subsequent work, this
approach has been called ill-defined, extremely limited in
applicability, and even fallacious. However, this argument is the same as that which shows that a so-called confidence distribution is not a valid probability distribution and, since this has not invalidated the application of confidence intervals,
it does not necessarily invalidate conclusions drawn from fiducial
arguments. An attempt was made to reinterpret the early work of Fisher's
fiducial argument as a special case of an inference theory using upper and lower probabilities.
Structural inference
Developing ideas of Fisher and of Pitman from 1938 to 1939, George A. Barnard developed "structural inference" or "pivotal inference", an approach using invariant probabilities on group families.
Barnard reformulated the arguments behind fiducial inference on a
restricted class of models on which "fiducial" procedures would be
well-defined and useful. Donald A. S. Fraser developed a general theory for structural inference based on group theory and applied this to linear models. The theory formulated by Fraser has close links to decision theory and
Bayesian statistics and can provide optimal frequentist decision rules
if they exist.
Inference topics
The topics below are usually included in the area of statistical inference.
Predictive inference is an approach to statistical inference that emphasizes the prediction of future observations based on past observations.
Initially, predictive inference was based on observable parameters and it was the main purpose of studying probability, but it fell out of favor in the 20th century due to a new parametric approach pioneered by Bruno de Finetti. The approach modeled phenomena as a physical system observed with error (e.g., celestial mechanics). De Finetti's idea of exchangeability—that
future observations should behave like past observations—came to the
attention of the English-speaking world with the 1974 translation from
French of his 1937 paper, and has since been propounded by such statisticians as Seymour Geisser.