In computer science, a universal Turing machine (UTM) is a Turing machine capable of computing any computable sequence, as described by Alan Turing in his seminal paper "On Computable Numbers, with an Application to the Entscheidungsproblem". Common sense might say that a universal machine is impossible, but Turing proves that it is possible. He suggested that we may compare a human in the process of computing a
real number to a machine that is only capable of a finite number of
conditions; these will be called "m-configurations". He then described the operation of such a machine and argued:
It is my contention that these operations include all those which are used in the computation of a number.
Turing introduced the idea of such a machine in 1936–1937.
Martin Davis
makes a persuasive argument that Turing's conception of what is now
known as "the stored-program computer", of placing the "action
table"—the instructions for the machine—in the same "memory" as the
input data, strongly influenced John von Neumann's conception of the first American discrete-symbol (as opposed to analog) computer—the EDVAC. Davis quotes Time
magazine to this effect, that "everyone who taps at a keyboard ... is
working on an incarnation of a Turing machine", and that "John von
Neumann [built] on the work of Alan Turing".
Davis makes a case that Turing's Automatic Computing Engine (ACE) computer "anticipated" the notions of microprogramming (microcode) and RISC processors. Donald Knuth cites Turing's work on the ACE computer as designing "hardware to facilitate subroutine linkage"; Davis also references this work as Turing's use of a hardware "stack".
Just as the Turing machine encouraged the construction of computers, the UTM encouraged the development of the fledgling computer sciences. An early, if not the first, assembler was proposed "by a young hot-shot programmer" for the EDVAC. Von Neumann's "first serious program ... [was] to simply sort data efficiently". Knuth observes that the subroutine return embedded in the program
itself rather than in special registers is attributable to von Neumann
and Goldstine. Knuth furthermore states that
The first interpretive routine may
be said to be the "Universal Turing Machine" ... Interpretive routines
in the conventional sense were mentioned by John Mauchly in his lectures at the Moore School
in 1946 ... Turing took part in this development also; interpretive
systems for the Pilot ACE computer were written under his direction.
Davis briefly mentions operating systems and compilers as outcomes of the notion of program-as-data.
Mathematical theory
With this encoding of action tables as strings, it becomes possible,
in principle, for Turing machines to answer questions about the
behaviour of other Turing machines. Most of these questions, however,
are undecidable,
meaning that the function in question cannot be calculated
mechanically. For instance, the problem of determining whether an
arbitrary Turing machine will halt on a particular input, or on all
inputs, known as the Halting problem, was shown to be, in general, undecidable in Turing's original paper. Rice's theorem shows that any non-trivial question about the output of a Turing machine is undecidable.
A universal Turing machine can calculate any recursive function, decide any recursive language, and accept any recursively enumerable language. According to the Church–Turing thesis, the problems solvable by a universal Turing machine are exactly those problems solvable by an algorithm or an effective method of computation,
for any reasonable definition of those terms. For these reasons, a
universal Turing machine serves as a standard against which to compare
computational systems, and a system that can simulate a universal Turing
machine is called Turing complete.
An abstract version of the universal Turing machine is the universal function, a computable function that can be used to calculate any other computable function. The UTM theorem proves the existence of such a function.
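As an informal illustration (not Turing's construction), a universal machine is essentially an interpreter that treats another machine's action table as data. The following Python sketch, with illustrative names such as run_tm and delta, executes any transition table it is given:

```python
# A minimal sketch of a "universal" interpreter: one Python function
# that executes any Turing machine given its transition table as data.
def run_tm(delta, tape, state="q0", head=0, max_steps=10_000):
    """delta maps (state, symbol) -> (new_state, write_symbol, move);
    move is -1 (left) or +1 (right); a missing entry means halt."""
    tape = dict(enumerate(tape))          # sparse tape, blank = "_"
    for _ in range(max_steps):
        sym = tape.get(head, "_")
        if (state, sym) not in delta:     # no applicable rule: halt
            break
        state, write, move = delta[(state, sym)]
        tape[head] = write
        head += move
    else:
        raise RuntimeError("step limit exceeded")
    return state, "".join(tape[i] for i in sorted(tape))

# A one-state machine that overwrites every 0 with 1 until it sees a blank.
flip = {("q0", "0"): ("q0", "1", +1),
        ("q0", "1"): ("q0", "1", +1)}
print(run_tm(flip, "0010"))
```

Because the table `flip` is ordinary data, the same interpreter runs any other machine unchanged, which is the essence of universality.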
Efficiency
Without loss of generality, the input of a Turing machine can be
assumed to be in the alphabet {0, 1}; any other finite alphabet can be
encoded over {0, 1}. The behavior of a Turing machine M is
determined by its transition function. This function can be easily
encoded as a string over the alphabet {0, 1} as well. The size of the
alphabet of M, the number of tapes it has, and the size of the
state space can be deduced from the transition function's table. The
distinguished states and symbols can be identified by their position,
e.g. the first two states can by convention be the start and stop
states. Consequently, every Turing machine can be encoded as a string
over the alphabet {0, 1}. Additionally, we stipulate that every invalid
encoding maps to a trivial Turing machine that immediately halts, and
that every Turing machine has infinitely many encodings, obtained by
padding the encoding with an arbitrary number of (say) 1's at the end,
much as comments work in a programming language. It should be no
surprise that we can achieve this encoding given the existence of a Gödel number and computational equivalence between Turing machines and μ-recursive functions. Similarly, our construction associates to every binary string α, a Turing machine Mα.
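The encoding described above can be sketched in code. The following Python function uses one common textbook convention (the scheme and names are illustrative, one of many valid choices): each field of a transition is written in unary, fields are separated by single 0s, and transitions by 00.

```python
def encode_tm(delta):
    """Encode a transition table as a single string over {0, 1}:
    unary fields separated by single 0s, transitions by 00."""
    def unary(n):                 # n >= 0 encoded as n+1 ones
        return "1" * (n + 1)
    parts = []
    for (q, a), (r, b, move) in sorted(delta.items()):
        d = 0 if move < 0 else 1  # direction: 0 = left, 1 = right
        parts.append("0".join([unary(q), unary(a), unary(r),
                               unary(b), unary(d)]))
    return "00".join(parts)

# One transition: in state 0 reading symbol 1,
# go to state 1, write symbol 0, move right.
delta = {(0, 1): (1, 0, +1)}
print(encode_tm(delta))
```

Since every field is delimited unambiguously, the string can be decoded back to the table, and any string that fails to decode is treated as the trivial halting machine, as stipulated above.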
Starting from the above encoding, in 1966 F. C. Hennie and R. E. Stearns showed that, given a Turing machine Mα that halts on input x within N steps, there exists a multi-tape universal Turing machine that halts on inputs α, x (given on different tapes) within CN log N steps, where C is a machine-specific constant that does not depend on the length of the input x, but does depend on Mα's alphabet size, number of tapes, and number of states. Effectively this is an O(N log N) simulation, in big O notation. The corresponding result for space complexity rather than time complexity is that we can simulate in a way that uses at most CN cells at any stage of the computation: an O(N) simulation.
Smallest machines
When Alan Turing came up with the idea of a universal machine he had
in mind the simplest computing model powerful enough to calculate all
possible functions that can be calculated. Claude Shannon
first explicitly posed the question of finding the smallest possible
universal Turing machine in 1956. He showed that two symbols were
sufficient so long as enough states were used (or vice versa), and that
it was always possible to exchange states for symbols. He also showed
that no universal Turing machine of one state could exist.
Marvin Minsky discovered a 7-state 4-symbol universal Turing machine in 1962 using 2-tag systems.
Other small universal Turing machines have since been found by Yurii
Rogozhin and others by extending this approach of tag system simulation.
If we denote by (m, n) the class of UTMs with m states and n symbols, the following tuples have been found: (15, 2), (9, 3), (6, 4), (5, 5), (4, 6), (3, 9), and (2, 18). Rogozhin's (4, 6) machine uses only 22 instructions, and no standard UTM of lesser descriptional complexity is known.
However, generalizing the standard Turing machine model admits
even smaller UTMs. One such generalization is to allow an infinitely
repeated word on one or both sides of the Turing machine input, thus
extending the definition of universality; this is known as "semi-weak" or
"weak" universality, respectively. Small weakly universal Turing
machines that simulate the Rule 110 cellular automaton have been given for the (6, 2), (3, 3), and (2, 4) state–symbol pairs. The proof of universality for Wolfram's 2-state 3-symbol Turing machine
further extends the notion of weak universality by allowing certain
non-periodic initial configurations. Other variants on the standard
Turing machine model that yield small UTMs include machines with
multiple tapes or tapes of multiple dimension, and machines coupled with
a finite automaton.
Machines with no internal states
If multiple heads reading successive tape positions are allowed on a
Turing machine, then no internal states are required, as "states" can be
encoded in the tape. For example, consider a tape with 6 colours: 0, 1,
2, 0A, 1A, 2A. Consider a tape such as 0,0,1,2,2A,0,2,1 where a 3-headed
Turing machine is situated over the triple (2,2A,0). The rules then
convert any triple to another triple and move the 3-heads left or right.
For example, the rules might convert (2,2A,0) to (2,1,0) and move the
head left. Thus in this example, the machine acts like a 3-colour Turing
machine with internal states A and B (represented by no letter). The
case for a 2-headed Turing machine is very similar. Thus a 2-headed
Turing machine without internal states can be universal with 6 colours.
It is not known what the smallest number of colours needed for a
multi-headed Turing machine is or if a 2-colour universal Turing machine
without internal states is possible with multiple heads. It also means
that rewrite rules
are Turing complete since the triple rules are equivalent to rewrite
rules. Extending the tape to two dimensions with a head sampling a
letter and its 8 neighbours, only 2 colours are needed, as for example, a
colour can be encoded in a vertical triple pattern such as 110.
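One step of the stateless multi-headed machine described above can be sketched as follows (an illustrative Python fragment; the rule shown is the example from the text, converting (2, 2A, 0) to (2, 1, 0) and moving left):

```python
# One step of a stateless 3-headed machine: rules map the triple of
# colours under the heads to a new triple plus a head movement.
def step(tape, pos, rules):
    triple = tuple(tape[pos:pos + 3])
    if triple not in rules:
        return tape, pos, False            # no matching rule: halt
    new_triple, move = rules[triple]
    tape[pos:pos + 3] = list(new_triple)   # rewrite the three cells
    return tape, pos + move, True

tape = ["0", "0", "1", "2", "2A", "0", "2", "1"]
rules = {("2", "2A", "0"): (("2", "1", "0"), -1)}   # the example rule
tape, pos, moved = step(tape, 3, rules)
print(tape, pos)
```

Note that the machine itself carries no state variable: all "state" information lives in the lettered colours (here 2A) written on the tape.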
Also, if the distance between the two heads is variable (the tape has "slack" between the heads), then it can simulate any Post tag system, some of which are universal.
Example of coding
For those who would undertake the challenge of designing a UTM exactly as Turing specified, see the article by Davies in Copeland (2004).
Davies corrects the errors in the original and shows what a sample run
would look like. He successfully ran a (somewhat simplified) simulation.
Turing used seven symbols { A, C, D, R, L, N, ; } to encode each 5-tuple; as described in the article Turing machine, his 5-tuples are only of types N1, N2, and N3. The number of each "m‑configuration"
(instruction, state) is represented by "D" followed by a unary string
of A's, e.g. "q3" = DAAA. In a similar manner, he encodes the symbols
blank as "D", the symbol "0" as "DC", the symbol "1" as DCC, etc. The
symbols "R", "L", and "N" remain as is.
After encoding, each 5-tuple is then "assembled" into a string, in order, as shown in the following table:
Current m‑configuration | Tape symbol | Print-operation | Tape-motion | Final m‑configuration | Current code | Symbol code | Print-op code | Motion code | Final code | 5-tuple assembled code
q1 | blank | P0 | R | q2 | DA | D | DC | R | DAA | DADDCRDAA
q2 | blank | E | R | q3 | DAA | D | D | R | DAAA | DAADDRDAAA
q3 | blank | P1 | R | q4 | DAAA | D | DCC | R | DAAAA | DAAADDCCRDAAAA
q4 | blank | E | R | q1 | DAAAA | D | D | R | DA | DAAAADDRDA
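The encoding rules in this table can be reproduced programmatically. The following Python sketch (the helper names q, s, and encode are ours) assembles the four 5-tuples into Turing's standard description:

```python
def q(n):            # m-configuration q_n -> "D" followed by n A's
    return "D" + "A" * n

def s(k):            # symbol S_k -> "D" followed by k C's
    return "D" + "C" * k   # blank = S0, "0" = S1, "1" = S2

def encode(tuples):
    """Assemble 5-tuples (state, read, write, move, next) into the
    standard-description string, each code preceded by ";"."""
    return "".join(";" + q(a) + s(r) + s(w) + m + q(b)
                   for a, r, w, m, b in tuples)

# The four 5-tuples of the table above (P0 writes "0" = S1,
# P1 writes "1" = S2, E erases, i.e. writes blank = S0).
table = [(1, 0, 1, "R", 2),   # q1, blank, P0, R, q2
         (2, 0, 0, "R", 3),   # q2, blank, E,  R, q3
         (3, 0, 2, "R", 4),   # q3, blank, P1, R, q4
         (4, 0, 0, "R", 1)]   # q4, blank, E,  R, q1
print(encode(table))
```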
Finally, the codes for all four 5-tuples are strung together into a single string, each code preceded by the separator ";", i.e.:
;DADDCRDAA;DAADDRDAAA;DAAADDCCRDAAAA;DAAAADDRDA
This code he placed on alternate squares—the "F-squares" – leaving
the "E-squares" (those liable to erasure) empty. The final assembly of
the code on the tape for the U-machine consists of placing two special
symbols ("e") one after the other, then the code separated out on
alternate squares, and lastly the double-colon symbol "::".
The U-machine's action-table (state-transition table) is responsible
for decoding the symbols. Turing's action table keeps track of its place
with markers "u", "v", "x", "y", "z" by placing them in "E-squares" to
the right of "the marked symbol". For example, to mark the current
instruction, "z" is placed to the right of ";" while "x" keeps the
place with respect to the current "m‑configuration" DAA. The
U-machine's action table will shuttle these symbols around (erasing them
and placing them in different locations) as the computation progresses.
Turing's action-table for his U-machine is very involved.
Roger Penrose
provides examples of ways to encode instructions for the universal
machine using only binary symbols { 0, 1 }, or { blank, mark | }.
Penrose goes further and writes out his entire U-machine code. He
asserts that it truly is a U-machine code, an enormous number that spans
almost 2 full pages of 1's and 0's.
Asperti and Ricciotti described a multi-tape UTM defined by
composing elementary machines with very simple semantics, rather than
explicitly giving its full action table. This approach was sufficiently
modular to allow them to formally prove the correctness of the machine
in the Matita proof assistant.
Statistical inference
Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.
Inferential statistics can be contrasted with descriptive statistics.
Descriptive statistics is solely concerned with properties of the
observed data, and it does not rest on the assumption that the data come
from a larger population. In machine learning, the term inference is sometimes used instead to mean "make a prediction, by evaluating an already trained model"; in this context inferring properties of the model is referred to as training or learning (rather than inference), and using a model for prediction is referred to as inference (instead of prediction); see also predictive inference.
Introduction
Statistical inference makes propositions about a population, using data drawn from the population with some form of sampling. Given a hypothesis about a population, for which we wish to draw inferences, statistical inference consists of (first) selecting a statistical model of the process that generates the data and (second) deducing propositions from the model.
Konishi and Kitagawa state "The majority of the problems in
statistical inference can be considered to be problems related to
statistical modeling". Relatedly, Sir David Cox
has said, "How [the] translation from subject-matter problem to
statistical model is done is often the most critical part of an
analysis".
The conclusion of a statistical inference is a statistical proposition. Some common forms of statistical proposition are the following:
a point estimate, i.e. a particular value that best approximates some parameter of interest;
an interval estimate, e.g. a confidence interval (or set estimate). A confidence interval
is an interval constructed using data from a sample, such that if the
procedure were repeated over many independent samples (mathematically,
by taking the limit), a fixed proportion (e.g., 95% for a 95% confidence
interval) of the resulting intervals would contain the true value of
the parameter, i.e., the population parameter;
a credible interval, i.e. a set of values containing, for example, 95% of posterior belief;
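The repeated-sampling property of a confidence interval can be illustrated by simulation. The following Python sketch (assuming a normal population with known standard deviation, so the interval is mean ± 1.96·σ/√n) estimates the coverage of a nominal 95% interval:

```python
import random
import statistics

# Repeatedly draw samples and check how often the 95% interval
# around the sample mean contains the true population mean.
random.seed(0)
mu, sigma, n, trials = 10.0, 2.0, 30, 2000
half = 1.96 * sigma / n ** 0.5        # half-width of the interval
covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    m = statistics.fmean(sample)
    if m - half <= mu <= m + half:
        covered += 1
print(covered / trials)               # close to the nominal 0.95
```

The guarantee is about the procedure over repeated samples, not about any single interval: each interval either contains μ or it does not.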
Any statistical inference requires some assumptions. A statistical model
is a set of assumptions concerning the generation of the observed data
and similar data. Descriptions of statistical models usually emphasize
the role of population quantities of interest, about which we wish to
draw inference. Descriptive statistics are typically used as a preliminary step before more formal inferences are drawn.
Degree of models/assumptions
Statisticians distinguish between three levels of modeling assumptions:
Fully parametric:
The probability distributions describing the data-generation process
are assumed to be fully described by a family of probability
distributions involving only a finite number of unknown parameters. For example, one may assume that the distribution of population values
is truly Normal, with unknown mean and variance, and that datasets are
generated by 'simple' random sampling. The family of generalized linear models is a widely used and flexible class of parametric models.
Non-parametric: The assumptions made about the process generating the data are much less than in parametric statistics and may be minimal. For example, every continuous probability distribution has a median, which may be estimated using the sample median or the Hodges–Lehmann–Sen estimator, which has good properties when the data arise from simple random sampling.
Semi-parametric:
This term typically implies assumptions 'in between' fully and
non-parametric approaches. For example, one may assume that a population
distribution has a finite mean. Furthermore, one may assume that the
mean response level in the population depends in a truly linear manner
on some covariate (a parametric assumption) but not make any parametric
assumption describing the variance around that mean (i.e. about the
presence or possible form of any heteroscedasticity).
More generally, semi-parametric models can often be separated into
'structural' and 'random variation' components. One component is treated
parametrically and the other non-parametrically. The well-known Cox model is a set of semi-parametric assumptions.
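The non-parametric median estimators mentioned above can be sketched in a few lines of Python; the one-sample Hodges–Lehmann estimator is the median of all pairwise (Walsh) averages (x_i + x_j)/2 with i ≤ j. The data below are illustrative:

```python
import statistics
from itertools import combinations_with_replacement

def hodges_lehmann(xs):
    """One-sample Hodges-Lehmann estimator: median of all
    pairwise (Walsh) averages (x_i + x_j) / 2 with i <= j."""
    return statistics.median((a + b) / 2
                             for a, b in combinations_with_replacement(xs, 2))

data = [1, 2, 3, 4, 100]              # one gross outlier
print(statistics.fmean(data))          # the mean is dragged to 22.0
print(statistics.median(data))         # the median stays at 3
print(hodges_lehmann(data))            # Hodges-Lehmann also gives 3
```

Both location estimators ignore the outlier that pulls the mean far from the bulk of the data, which is the robustness property the text alludes to.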
The assumption of normality can be assessed with a histogram: an even spread beneath an overlaid bell curve supports the assumption.
Whatever level of assumption is made, correctly calibrated inference,
in general, requires these assumptions to be correct; i.e. that the
data-generating mechanisms really have been correctly specified.
Incorrect assumptions of 'simple' random sampling can invalidate statistical inference. More complex semi- and fully parametric assumptions are also cause for
concern. For example, incorrectly assuming the Cox model can in some
cases lead to faulty conclusions. Incorrect assumptions of Normality in the population also invalidates some forms of regression-based inference. The use of any
parametric model is viewed skeptically by most experts in sampling
human populations: "most sampling statisticians, when they deal with
confidence intervals at all, limit themselves to statements about
[estimators] based on very large samples, where the central limit
theorem ensures that these [estimators] will have distributions that are
nearly normal." In particular, a normal distribution "would be a totally unrealistic
and catastrophically unwise assumption to make if we were dealing with
any kind of economic population." Here, the central limit theorem states that the distribution of the
sample mean "for very large samples" is approximately normally
distributed, if the distribution is not heavy-tailed.
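The central limit theorem's effect can be illustrated by simulation. The following Python sketch compares the skewness of a heavily skewed (exponential) population with the skewness of means of samples of size 30 drawn from it:

```python
import random
import statistics

def skew(xs):
    """Sample skewness: mean cubed deviation over cubed std. dev."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean((x - m) ** 3 for x in xs) / s ** 3

random.seed(1)
# Raw draws from an exponential distribution (theoretical skewness 2).
draws = [random.expovariate(1.0) for _ in range(10_000)]
# Means of samples of size 30: much closer to symmetric.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(30))
         for _ in range(10_000)]
print(skew(draws), skew(means))
```

The sample means are far less skewed than the raw draws, consistent with the normal approximation improving as the sample size grows.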
Given the difficulty in specifying exact distributions of sample
statistics, many methods have been developed for approximating these.
With finite samples, approximation results measure how close a limiting distribution approaches the statistic's sample distribution: For example, with 10,000 independent samples the normal distribution approximates (to two digits of accuracy) the distribution of the sample mean for many population distributions, by the Berry–Esseen theorem. Yet for many practical purposes, the normal approximation provides a
good approximation to the sample-mean's distribution when there are 10
(or more) independent samples, according to simulation studies and
statisticians' experience. Following Kolmogorov's work in the 1950s, advanced statistics uses approximation theory and functional analysis to quantify the error of approximation. In this approach, the metric geometry of probability distributions is studied; this approach quantifies approximation error with, for example, the Kullback–Leibler divergence, Bregman divergence, and the Hellinger distance.
With indefinitely large samples, limiting results like the central limit theorem
describe the sample statistic's limiting distribution if one exists.
Limiting results are not statements about finite samples, and indeed are
irrelevant to finite samples. However, the asymptotic theory of limiting distributions is often
invoked for work with finite samples. For example, limiting results are
often invoked to justify the generalized method of moments and the use of generalized estimating equations, which are popular in econometrics and biostatistics.
The magnitude of the difference between the limiting distribution and
the true distribution (formally, the 'error' of the approximation) can
be assessed using simulation. The heuristic application of limiting results to finite samples is
common practice in many applications, especially with low-dimensional models with log-concave likelihoods (such as with one-parameter exponential families).
For a given dataset that was produced by a randomization design, the
randomization distribution of a statistic (under the null-hypothesis) is
defined by evaluating the test statistic
for all of the plans that could have been generated by the
randomization design. In frequentist inference, the randomization allows
inferences to be based on the randomization distribution rather than a
subjective model, and this is important especially in survey sampling
and design of experiments. Statistical inference from randomized studies is also more straightforward than many other situations. In Bayesian inference, randomization is also of importance: in survey sampling, use of sampling without replacement ensures the exchangeability of the sample with the population; in randomized experiments, randomization warrants a missing at random assumption for covariate information.
Objective randomization allows properly inductive procedures. Many statisticians prefer randomization-based analysis of data that was generated by well-defined randomization procedures. (However, it is true that in fields of science with developed
theoretical knowledge and experimental control, randomized experiments
may increase the costs of experimentation without improving the quality
of inferences.) Similarly, results from randomized experiments
are recommended by leading statistical authorities as allowing
inferences with greater reliability than do observational studies of the
same phenomena. However, a good observational study may be better than a bad randomized experiment.
The statistical analysis of a randomized experiment may be based
on the randomization scheme stated in the experimental protocol and does
not need a subjective model.
However, some hypotheses cannot be tested using objective statistical
models, which accurately describe randomized experiments or random
samples. In some cases, such randomized studies are uneconomical or
unethical.
Model-based analysis of randomized experiments
It is standard practice to refer to a statistical model, e.g., a
linear or logistic model, when analyzing data from randomized
experiments. However, the randomization scheme guides the choice of a statistical
model. It is not possible to choose an appropriate model without knowing
the randomization scheme. Seriously misleading results can be obtained analyzing data from
randomized experiments while ignoring the experimental protocol; common
mistakes include forgetting the blocking used in an experiment and
confusing repeated measurements on the same experimental unit with
independent replicates of the treatment applied to different
experimental units.
Model-free randomization inference
Model-free techniques provide a complement to model-based methods,
which employ reductionist strategies of reality-simplification.
Model-free techniques combine, evolve, ensemble, and train algorithms
that dynamically adapt to the contextual affinities of a process and
learn the intrinsic characteristics of the observations.
For example, model-free simple linear regression is based either on:
a random design, where the pairs of observations are independent and identically distributed (iid),
or a deterministic design, where the covariate values x_j are deterministic, but the corresponding response variables Y_j are random and independent with a common conditional distribution, i.e., P(Y_j ≤ y | X_j = x) = D_x(y), which is independent of the index j.
In either case, the model-free randomization inference for features of the common conditional distribution
relies on some regularity conditions, e.g. functional smoothness. For
instance, model-free randomization inference for the population feature conditional mean, μ(x) = E(Y | X = x), can be consistently estimated via local averaging or local polynomial fitting, under the assumption that
μ(x) is smooth. Also, relying on asymptotic normality or resampling, we can
construct confidence intervals for the population feature, in this case,
the conditional mean, μ(x).
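Local averaging can be sketched as follows (an illustrative Python fragment; the smooth conditional mean μ(x) = x², the noise level, and the bandwidth are assumptions of the example):

```python
import random
import statistics

def local_average(x0, data, h=0.5):
    """Model-free estimate of the conditional mean at x0: average the
    responses of observations whose covariate lies within h of x0."""
    near = [y for x, y in data if abs(x - x0) <= h]
    return statistics.fmean(near)

random.seed(2)
# Simulated (x, y) pairs with smooth conditional mean mu(x) = x**2.
data = [(x, x ** 2 + random.gauss(0, 0.1))
        for _ in range(2000) for x in [random.uniform(0, 2)]]
print(local_average(1.0, data, h=0.1))   # approximately mu(1) = 1
```

The only assumption used by the estimator is smoothness of μ; no parametric form for the relationship between x and y is specified.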
Paradigms for inference
Different schools of statistical inference have become established.
These schools—or "paradigms"—are not mutually exclusive, and methods
that work well under one paradigm often have attractive interpretations
under other paradigms.
Frequentist inference
This paradigm calibrates the plausibility of propositions by
considering (notional) repeated sampling of a population distribution to
produce datasets similar to the one at hand. By considering the
dataset's characteristics under repeated sampling, the frequentist
properties of a statistical proposition can be quantified—although in
practice this quantification may be challenging.
Frequentist inference, objectivity, and decision theory
One interpretation of frequentist inference (or classical inference) is that it is applicable only in terms of frequency probability; that is, in terms of repeated sampling from a population. However, the approach of Neyman develops these procedures in terms of pre-experiment probabilities.
That is, before undertaking an experiment, one decides on a rule for
coming to a conclusion such that the probability of being correct is
controlled in a suitable way: such a probability need not have a
frequentist or repeated sampling interpretation. In contrast, Bayesian
inference works in terms of conditional probabilities (i.e.
probabilities conditional on the observed data), compared to the
marginal (but conditioned on unknown parameters) probabilities used in
the frequentist approach.
The frequentist procedures of significance testing and confidence intervals can be constructed without regard to utility functions. However, some elements of frequentist statistics, such as statistical decision theory, do incorporate utility functions. In particular, frequentist developments of optimal inference (such as minimum-variance unbiased estimators, or uniformly most powerful testing) make use of loss functions,
which play the role of (negative) utility functions. Loss functions
need not be explicitly stated for statistical theorists to prove that a
statistical procedure has an optimality property. However, loss-functions are often useful for stating optimality
properties: for example, median-unbiased estimators are optimal under absolute value loss functions, in that they minimize expected loss, and least squares estimators are optimal under squared error loss functions, in that they minimize expected loss.
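The optimality claims above can be checked numerically: over a fine grid of candidate values, the sample mean minimizes total squared loss and the sample median minimizes total absolute loss (an illustrative Python sketch with made-up data):

```python
import statistics

data = [1, 2, 3, 7, 20]
sq = lambda c: sum((x - c) ** 2 for x in data)   # total squared loss
ab = lambda c: sum(abs(x - c) for x in data)     # total absolute loss

grid = [i / 100 for i in range(0, 2500)]         # candidates 0.00..24.99
best_sq = min(grid, key=sq)
best_ab = min(grid, key=ab)
print(best_sq, statistics.fmean(data))   # squared loss -> the mean
print(best_ab, statistics.median(data))  # absolute loss -> the median
```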
While statisticians using frequentist inference must choose for themselves the parameters of interest, and the estimators/test statistic
to be used, the absence of obviously explicit utilities and prior
distributions has helped frequentist procedures to become widely viewed
as 'objective'.
Bayesian inference
The Bayesian calculus describes degrees of belief using the
'language' of probability; beliefs are positive, integrate to one, and
obey probability axioms. Bayesian inference uses the available
posterior beliefs as the basis for making statistical propositions. There are several different justifications for using the Bayesian approach.
Bayesian inference, subjectivity and decision theory
Many informal Bayesian inferences are based on "intuitively
reasonable" summaries of the posterior. For example, the posterior mean,
median and mode, highest posterior density intervals, and Bayes Factors
can all be motivated in this way. While a user's utility function
need not be stated for this sort of inference, these summaries do all
depend (to some extent) on stated prior beliefs, and are generally
viewed as subjective conclusions. (Methods of prior construction which
do not require external input have been proposed but not yet fully developed.)
Formally, Bayesian inference is calibrated with reference to an
explicitly stated utility, or loss function; the 'Bayes rule' is the one
which maximizes expected utility, averaged over the posterior
uncertainty. Formal Bayesian inference therefore automatically provides optimal decisions in a decision theoretic
sense. Given assumptions, data and utility, Bayesian inference can be
made for essentially any problem, although not every statistical
inference need have a Bayesian interpretation. Analyses which are not
formally Bayesian can be (logically) incoherent; a feature of Bayesian procedures which use proper priors (i.e. those integrable to one) is that they are guaranteed to be coherent. Some advocates of Bayesian inference assert that inference must take place in this decision-theoretic framework, and that Bayesian inference should not conclude with the evaluation and summarization of posterior beliefs.
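As a concrete illustration of posterior summaries, the conjugate Beta–binomial model admits closed-form updating; the following Python sketch (an assumed uniform prior and illustrative coin-flip data) computes the posterior mean and mode:

```python
# Conjugate Bayesian updating: a Beta(a, b) prior on a coin's heads
# probability combined with binomial data gives a Beta posterior.
a, b = 1, 1                  # uniform Beta(1, 1) prior
heads, tails = 7, 3          # observed data (illustrative)
a_post, b_post = a + heads, b + tails          # Beta(8, 4) posterior

post_mean = a_post / (a_post + b_post)             # posterior mean
post_mode = (a_post - 1) / (a_post + b_post - 2)   # posterior mode (MAP)
print(post_mean, post_mode)
```

Both summaries depend on the stated prior: a different Beta(a, b) would shift them, which is why such conclusions are generally viewed as subjective.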
Likelihood-based inference
Likelihood-based inference is a paradigm used to estimate the parameters of a statistical model based on observed data. Likelihoodism approaches statistics by using the likelihood function, denoted L(θ | x), which quantifies the probability of observing the given data x under a specific set of parameter values θ.
In likelihood-based inference, the goal is to find the set of parameter
values that maximizes the likelihood function, or equivalently,
maximizes the probability of observing the given data.
The process of likelihood-based inference usually involves the following steps:
Formulating the statistical model: A statistical model is
defined based on the problem at hand, specifying the distributional
assumptions and the relationship between the observed data and the
unknown parameters. The model can be simple, such as a normal
distribution with known variance, or complex, such as a hierarchical
model with multiple levels of random effects.
Constructing the likelihood function: Given the statistical model,
the likelihood function is constructed by evaluating the joint
probability density or mass function of the observed data as a function
of the unknown parameters. This function represents the probability of
observing the data for different values of the parameters.
Maximizing the likelihood function: The next step is to find the set
of parameter values that maximizes the likelihood function. This can be
achieved using optimization techniques such as numerical optimization
algorithms. The estimated parameter values, often denoted θ̂, are the maximum likelihood estimates (MLEs).
Assessing uncertainty: Once the MLEs are obtained, it is crucial to
quantify the uncertainty associated with the parameter estimates. This
can be done by calculating standard errors, confidence intervals, or conducting hypothesis tests based on asymptotic theory or simulation techniques such as bootstrapping.
Model checking: After obtaining the parameter estimates and
assessing their uncertainty, it is important to assess the adequacy of
the statistical model. This involves checking the assumptions made in
the model and evaluating the fit of the model to the data using
goodness-of-fit tests, residual analysis, or graphical diagnostics.
Inference and interpretation: Finally, based on the estimated
parameters and model assessment, statistical inference can be performed.
This involves drawing conclusions about the population parameters,
making predictions, or testing hypotheses based on the estimated model.
The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models
for a given set of data. Given a collection of models for the data, AIC
estimates the quality of each model, relative to each of the other
models. Thus, AIC provides a means for model selection.
AIC is founded on information theory:
it offers an estimate of the relative information lost when a given
model is used to represent the process that generated the data. (In
doing so, it deals with the trade-off between the goodness of fit of the model and the simplicity of the model.)
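AIC's trade-off between fit and simplicity can be sketched with the formula AIC = 2k - 2 ln(L̂), where k is the number of estimated parameters and L̂ the maximized likelihood. The following Python fragment (simulated data; the two nested normal models are illustrative) compares a fixed-mean model (k = 1) with a free-mean model (k = 2):

```python
import math
import random
import statistics

random.seed(4)
x = [random.gauss(1.0, 1.0) for _ in range(200)]  # true mean is 1

def normal_loglik(mu, var):
    return sum(-0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
               for xi in x)

# Model 0: mean fixed at 0, variance estimated by ML (k = 1).
var0 = statistics.fmean(xi ** 2 for xi in x)
# Model 1: mean and variance both estimated by ML (k = 2).
mu1 = statistics.fmean(x)
var1 = statistics.fmean((xi - mu1) ** 2 for xi in x)

aic0 = 2 * 1 - 2 * normal_loglik(0.0, var0)
aic1 = 2 * 2 - 2 * normal_loglik(mu1, var1)
print(aic0, aic1)   # the model with the lower AIC is preferred
```

Here the extra parameter earns its 2-point penalty because the data really are centered away from 0, so the free-mean model attains the lower AIC.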
The minimum description length (MDL) principle has been developed from ideas in information theory and the theory of Kolmogorov complexity. The MDL principle selects statistical models that maximally compress
the data; inference proceeds without assuming counterfactual or
non-falsifiable "data-generating mechanisms" or probability models for the data, as might be done in frequentist or Bayesian approaches.
However, if a "data generating mechanism" does exist in reality, then according to Shannon's source coding theorem it provides the MDL description of the data, on average and asymptotically. In minimizing description length (or descriptive complexity), MDL estimation is similar to maximum likelihood estimation and maximum a posteriori estimation (using maximum-entropy Bayesian priors).
However, MDL avoids assuming that the underlying probability model is
known; the MDL principle can also be applied without assumptions that
e.g. the data arose from independent sampling.
Fiducial inference was an approach to statistical inference based on fiducial probability,
also known as a "fiducial distribution". In subsequent work, this
approach has been called ill-defined, extremely limited in
applicability, and even fallacious. However, this argument is the same as that which shows that a so-called confidence distribution is not a valid probability distribution and, since this has not invalidated the application of confidence intervals,
it does not necessarily invalidate conclusions drawn from fiducial
arguments. An attempt was made to reinterpret the early work of Fisher's
fiducial argument as a special case of an inference theory using upper and lower probabilities.
Structural inference
Developing ideas of Fisher and of Pitman from 1938 to 1939, George A. Barnard developed "structural inference" or "pivotal inference", an approach using invariant probabilities on group families.
Barnard reformulated the arguments behind fiducial inference on a
restricted class of models on which "fiducial" procedures would be
well-defined and useful. Donald A. S. Fraser developed a general theory for structural inference based on group theory and applied this to linear models. The theory formulated by Fraser has close links to decision theory and
Bayesian statistics and can provide optimal frequentist decision rules
if they exist.
Inference topics
The topics below are usually included in the area of statistical inference.
Predictive inference is an approach to statistical inference that emphasizes the prediction of future observations based on past observations.
Initially, predictive inference was based on observable parameters and it was the main purpose of studying probability, but it fell out of favor in the 20th century due to a new parametric approach pioneered by Bruno de Finetti. The approach modeled phenomena as a physical system observed with error (e.g., celestial mechanics). De Finetti's idea of exchangeability—that
future observations should behave like past observations—came to the
attention of the English-speaking world with the 1974 translation from
French of his 1937 paper, and has since been propounded by such statisticians as Seymour Geisser.
Epidemiology is the study and analysis of the distribution (who, when, and where), patterns and determinants of health and disease conditions in a defined population, and application of this knowledge to prevent diseases.
Epidemiology, literally meaning "the study of what is upon the people", is derived from Greek epi 'upon, among', demos 'people, district', and logos 'study, word, discourse',
suggesting that it applies only to human populations. However, the term
is widely used in studies of zoological populations (veterinary
epidemiology), although the term "epizoology" is available, and it has also been applied to studies of plant populations (botanical or plant disease epidemiology).
The distinction between "epidemic" and "endemic" was first drawn by Hippocrates. The term "epidemiology" appears to have first been used to describe the study of epidemics in 1802 by the Spanish physician Joaquín de Villalba [es] in Epidemiología Española. Epidemiologists also study the interaction of diseases in a population, a condition known as a syndemic.
The term epidemiology is now widely applied to cover the description and causation of not only epidemic, infectious disease, but of disease in general, including related conditions and, especially since the 20th century, chronic diseases
such as diabetes, cardiovascular disease, and cancer. Some examples of
topics examined through epidemiology include high blood pressure,
mental illness, and obesity. Epidemiology is thus concerned with how the pattern of disease changes the functioning of human beings.
History
The Greek physician Hippocrates, taught by Democritus, was known as the father of medicine and sought a logic to sickness; he is the first person known to have
examined the relationships between the occurrence of disease and
environmental influences. Hippocrates believed sickness of the human body to be caused by an imbalance of the four humors
(black bile, yellow bile, blood, and phlegm). The cure to the sickness
was to remove or add the humor in question to balance the body. This
belief led to the application of bloodletting and dieting in medicine. He coined the terms endemic (for diseases usually found in some places but not in others) and epidemic (for diseases that are seen at some times but not others).
In the middle of the 16th century, a doctor from Verona named Girolamo Fracastoro
was the first to propose a theory that the very small, unseeable,
particles that cause disease were alive. They were considered to be able
to spread by air, multiply by themselves and to be destroyable by fire.
In this way he refuted Galen's miasma theory (poison gas in sick people). In 1543 he wrote a book De contagione et contagiosis morbis, in which he was the first to promote personal and environmental hygiene to prevent disease. The development of a sufficiently powerful microscope by Antonie van Leeuwenhoek in 1675 provided visual evidence of living particles consistent with a germ theory of disease.
During the Ming dynasty, Wu Youke (1582–1652) developed the idea that some diseases were caused by transmissible agents, which he called Li Qi (戾气 or pestilential factors) when he observed various epidemics rage around him between 1641 and 1644. His book Wen Yi Lun
(瘟疫论, Treatise on Pestilence/Treatise of Epidemic Diseases) can be
regarded as the main etiological work that brought forward the concept. His concepts were still being considered in analysing the SARS outbreak by
WHO in 2004 in the context of traditional Chinese medicine.
Another pioneer, Thomas Sydenham
(1624–1689), was the first to distinguish the fevers of Londoners in
the later 1600s. His theories on cures of fevers met with much
resistance from traditional physicians at the time. He was not able to
find the initial cause of the smallpox fever he researched and treated.
John Graunt, a haberdasher and amateur statistician, published Natural and Political Observations ... upon the Bills of Mortality in 1662. In it, he analysed the mortality rolls in London before the Great Plague, presented one of the first life tables,
and reported time trends for many diseases, new and old. He provided
statistical evidence for many theories on disease, and also refuted some
widespread ideas on them.
John Snow is famous for his investigations into the causes of the 19th-century cholera epidemics, and is also known as the father of (modern) epidemiology. He began by noticing the significantly higher death rates in two
areas supplied by Southwark Company. His identification of the Broad Street
pump as the cause of the Soho epidemic is considered the classic
example of epidemiology. Snow used chlorine in an attempt to clean the
water and removed the handle; this ended the outbreak. This has been
perceived as a major event in the history of public health and regarded as the founding event of the science of epidemiology, having helped shape public health policies around the world. However, Snow's research and preventive measures to avoid further
outbreaks were not fully accepted or put into practice until after his
death due to the prevailing Miasma Theory
of the time, a model of disease in which poor air quality was blamed
for illness. This was used to rationalize high rates of infection in
impoverished areas instead of addressing the underlying issues of poor
nutrition and sanitation, and was proven false by his work.
Other pioneers include Danish physician Peter Anton Schleisner, who in 1849 related his work on the prevention of the epidemic of neonatal tetanus on the Vestmanna Islands in Iceland. Another important pioneer was Hungarian physician Ignaz Semmelweis,
who in 1847 brought down infant mortality at a Vienna hospital by
instituting a disinfection procedure. His findings were published in
1850, but his work was ill-received by his colleagues, who discontinued
the procedure. Disinfection did not become widely practiced until
British surgeon Joseph Lister, aided by his colleague, chemist Thomas Anderson, was able to "discover" antiseptics in 1865 based on the earlier work of Louis Pasteur.
In the early 20th century, mathematical methods were introduced into epidemiology by Ronald Ross, Janet Lane-Claypon, Anderson Gray McKendrick, and others. In a parallel development during the 1920s, German-Swiss pathologist Max Askanazy
and others founded the International Society for Geographical Pathology
to systematically investigate the geographical pathology of cancer and
other non-infectious diseases across populations in different regions.
After World War II, Richard Doll
and other non-pathologists joined the field and advanced methods to
study cancer, a disease with patterns and mode of occurrences that could
not be suitably studied with the methods developed for epidemics of
infectious diseases. Geographical pathology eventually combined with
infectious disease epidemiology to form the field that is epidemiology
today.
In the late 20th century, with the advancement of biomedical
sciences, a number of molecular markers in blood, other biospecimens and
environment were identified as predictors of development or risk of a
certain disease. Epidemiology research to examine the relationship
between these biomarkers analyzed at the molecular level and disease was broadly named "molecular epidemiology". Specifically, "genetic epidemiology"
has been used for epidemiology of germline genetic variation and
disease. Genetic variation is typically determined using DNA from
peripheral blood leukocytes.
21st century
Since the 2000s, genome-wide association studies (GWAS) have been commonly performed to identify genetic risk factors for many diseases and health conditions.
While most molecular epidemiology studies are still using conventional disease diagnosis
and classification systems, it is increasingly recognized that disease
progression represents inherently heterogeneous processes differing from
person to person. Conceptually, each individual has a unique disease
process different from any other individual ("the unique disease
principle"), considering uniqueness of the exposome
(a totality of endogenous and exogenous / environmental exposures) and
its unique influence on molecular pathologic process in each individual.
Studies to examine the relationship between an exposure and molecular
pathologic signature of disease (particularly cancer) became increasingly common throughout the 2000s. However, the use of molecular pathology in epidemiology posed unique challenges, including lack of research guidelines and standardized statistical methodologies, and paucity of interdisciplinary experts and training programs. Furthermore, the concept of disease heterogeneity appears to conflict
with the long-standing premise in epidemiology that individuals with the
same disease name have similar etiologies and disease processes. To
resolve these issues and advance population health science in the era of
molecular precision medicine, "molecular pathology" and "epidemiology" were integrated to create the new interdisciplinary field of "molecular pathological epidemiology" (MPE), defined as "epidemiology of molecular pathology and heterogeneity of
disease". In MPE, investigators analyze the relationships between (A)
environmental, dietary, lifestyle and genetic factors; (B) alterations
in cellular or extracellular molecules; and (C) evolution and
progression of disease. A better understanding of heterogeneity of
disease pathogenesis will further contribute to elucidating the etiologies of disease. The MPE approach can be applied not only to neoplastic diseases but also to non-neoplastic diseases. The concept and paradigm of MPE became widespread in the 2010s.
By 2012, it was recognized that many pathogens' evolution
is rapid enough to be highly relevant to epidemiology, and that
therefore much could be gained from an interdisciplinary approach to
infectious disease integrating epidemiology and molecular evolution to "inform control strategies, or even patient treatment." Modern epidemiological studies can use advanced statistics and machine learning to create predictive models as well as to define treatment effects. There is increasing recognition that a wide range of modern data
sources, many not originating from healthcare or epidemiology, can be
used for epidemiological study. Such digital epidemiology can include
data from internet searching, mobile phone records and retail sales of
drugs.
Epidemiologists employ a range of study designs from the
observational to experimental and generally categorized as descriptive
(involving the assessment of data covering time, place, and person),
analytic (aiming to further examine known associations or hypothesized
relationships), and experimental (a term often equated with clinical or
community trials of treatments and other interventions). In
observational studies, nature is allowed to "take its course", as
epidemiologists observe from the sidelines. Conversely, in experimental
studies, the epidemiologist is the one in control of all of the factors
entering a certain case study. Epidemiological studies are aimed, where possible, at revealing unbiased relationships between exposures such as alcohol or smoking, biological agents, stress, or chemicals and mortality or morbidity.
The identification of causal relationships between these exposures and
outcomes is an important aspect of epidemiology. Modern epidemiologists
use informatics and infodemiology as tools.
Observational studies have two components, descriptive and
analytical. Descriptive observations pertain to the "who, what, where
and when of health-related state occurrence". However, analytical
observations deal more with the 'how' of a health-related event. Experimental epidemiology
contains three case types: randomized controlled trials (often used for
a new medicine or drug testing), field trials (conducted on those at a
high risk of contracting a disease), and community trials (research on
socially originating diseases).
The term 'epidemiologic triad' is used to describe the intersection of Host, Agent, and Environment in analyzing an outbreak.
Case series
Case-series may refer to the qualitative study of the experience of a single patient, or a small group of patients with a similar diagnosis, or to a statistical factor with the potential to produce illness during periods when they are unexposed.
The former type of study is purely descriptive and cannot be used
to make inferences about the general population of patients with that
disease. These types of studies, in which an astute clinician identifies
an unusual feature of a disease or a patient's history, may lead to a
formulation of a new hypothesis. Using the data from the series,
analytic studies could be done to investigate possible causal factors.
These can include case-control studies or prospective studies. A
case-control study would involve matching comparable controls without
the disease to the cases in the series. A prospective study would
involve following the case series over time to evaluate the disease's
natural history.
The latter type, more formally described as self-controlled case-series
studies, divide individual patient follow-up time into exposed and
unexposed periods and use fixed-effects Poisson regression processes to
compare the incidence rate of a given outcome between exposed and
unexposed periods. This technique has been extensively used in the study
of adverse reactions to vaccination and has been shown in some
circumstances to provide statistical power comparable to that available
in cohort studies.
Case-control studies
Case-control studies
select subjects based on their disease status. It is a retrospective
study. A group of individuals that are disease positive (the "case"
group) is compared with a group of disease negative individuals (the
"control" group). The control group should ideally come from the same
population that gave rise to the cases. The case-control study looks
back through time at potential exposures that both groups (cases and
controls) may have encountered. A 2×2 table is constructed, displaying
exposed cases (A), exposed controls (B), unexposed cases (C) and
unexposed controls (D). The statistic generated to measure association
is the odds ratio (OR),[53] which is the ratio of the odds of exposure in the cases (A/C) to the odds of exposure in the controls (B/D), i.e. OR = (AD/BC).
            Cases   Controls
Exposed       A        B
Unexposed     C        D
If the OR is significantly greater than 1, then the conclusion is
"those with the disease are more likely to have been exposed", whereas
if it is close to 1 then the exposure and disease are not likely
associated. If the OR is far less than one, then this suggests that the
exposure is a protective factor in the causation of the disease.
Case-control studies are usually faster and more cost-effective than cohort studies but are sensitive to bias (such as recall bias and selection bias).
The main challenge is to identify the appropriate control group; the
distribution of exposure among the control group should be
representative of the distribution in the population that gave rise to
the cases. This can be achieved by drawing a random sample from the
original population at risk. As a consequence, the control group can contain people with the disease under study when the disease has a high attack rate in the population.
A major drawback for case control studies is that, in order to be
considered to be statistically significant, the minimum number of cases
required at the 95% confidence interval is related to the odds ratio by
the equation:
where N is the ratio of cases to controls.
As the odds ratio approaches 1, the number of cases required for
statistical significance grows towards infinity; rendering case-control
studies all but useless for low odds ratios. For instance, for an odds
ratio of 1.5 and cases = controls, the table shown above would look like
this:
            Cases   Controls
Exposed      103       84
Unexposed     84      103
For an odds ratio of 1.1:
            Cases   Controls
Exposed     1732     1652
Unexposed   1652     1732
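The worked tables above can be checked directly against the OR = AD/BC formula given earlier; the function below is a minimal sketch:

```python
def odds_ratio(a, b, c, d):
    """OR = (A*D)/(B*C) for exposed cases A, exposed controls B,
    unexposed cases C, unexposed controls D."""
    return (a * d) / (b * c)

# Counts taken from the tables in the text: they reproduce
# odds ratios of about 1.5 and 1.1 respectively.
or_15 = odds_ratio(103, 84, 84, 103)
or_11 = odds_ratio(1732, 1652, 1652, 1732)
assert round(or_15, 2) == 1.5
assert round(or_11, 2) == 1.1
```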
Cohort studies
Cohort studies
select subjects based on their exposure status. The study subjects
should be at risk of the outcome under investigation at the beginning of
the cohort study; this usually means that they should be disease free
when the cohort study starts. The cohort is followed through time to
assess their later outcome status. An example of a cohort study would be
the investigation of a cohort of smokers and non-smokers over time to
estimate the incidence of lung cancer. The same 2×2 table is constructed
as with the case control study. However, the point estimate generated
is the relative risk (RR), which is the probability of disease for a person in the exposed group, Pe = A / (A + B) over the probability of disease for a person in the unexposed group, Pu = C / (C + D), i.e. RR = Pe / Pu.
            Case   Non-case   Total
Exposed      A        B       (A + B)
Unexposed    C        D       (C + D)
As with the OR, a RR greater than 1 shows association, where the
conclusion can be read "those with the exposure were more likely to
develop the disease."
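The RR computation for the same 2×2 layout can be sketched as follows; the cohort counts are hypothetical, chosen only to illustrate the arithmetic:

```python
def relative_risk(a, b, c, d):
    """RR = [A/(A+B)] / [C/(C+D)]: risk in the exposed group over
    risk in the unexposed group."""
    p_exposed = a / (a + b)      # Pe
    p_unexposed = c / (c + d)    # Pu
    return p_exposed / p_unexposed

# Hypothetical cohort: among 1000 smokers 30 develop the disease,
# among 1000 non-smokers 10 do, giving a relative risk of 3.
rr = relative_risk(30, 970, 10, 990)
assert abs(rr - 3.0) < 1e-9
```

Note that the RR uses row totals (A + B) and (C + D), which are only meaningful when subjects are selected by exposure, as in a cohort study; this is why a case-control study can estimate only the OR.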
Prospective studies have many benefits over case control studies.
The RR is a more powerful effect measure than the OR, as the OR is just
an estimation of the RR, since true incidence cannot be calculated in a
case control study where subjects are selected based on disease status.
Temporality can be established in a prospective study, and confounders
are more easily controlled for. However, they are more costly, and there
is a greater chance of losing subjects to follow-up based on the long
time period over which the cohort is followed.
Cohort studies also are limited by the same equation for number
of cases as case-control studies, but, if the base incidence rate in the
study population is very low, the number of cases required is reduced
by 1⁄2.
Although epidemiology is sometimes viewed as a collection of
statistical tools used to elucidate the associations of exposures to
health outcomes, a deeper understanding of this science is that of
discovering causal relationships.
"Correlation does not imply causation" is a common theme for much of the epidemiological literature. For epidemiologists, the key is in the term inference.
Correlation, or at least association between two variables, is a
necessary but not sufficient criterion for the inference that one
variable causes the other. Epidemiologists use gathered data and a broad
range of biomedical and psychosocial theories in an iterative way to
generate or expand theory, to test hypotheses, and to make educated,
informed assertions about which relationships are causal, and about
exactly how they are causal.
Epidemiologists emphasize that the "one cause – one effect" understanding is a simplistic mis-belief. Most outcomes, whether disease or death, are caused by a chain or web consisting of many component causes. Causes can be distinguished as necessary, sufficient or probabilistic
conditions. If a necessary condition can be identified and controlled
(e.g., antibodies to a disease agent, energy in an injury), the harmful
outcome can be avoided (Robertson, 2015). One tool regularly used to
conceptualize the multicausality associated with disease is the causal pie model.
In 1965, Austin Bradford Hill proposed a series of considerations to help assess evidence of causation, which have come to be commonly known as the "Bradford Hill criteria".
In contrast to the explicit intentions of their author, Hill's
considerations are now sometimes taught as a checklist to be implemented
for assessing causality. Hill himself said "None of my nine viewpoints can bring indisputable
evidence for or against the cause-and-effect hypothesis and none can be
required sine qua non."
Strength of Association: A small association does not
mean that there is not a causal effect, though the larger the
association, the more likely that it is causal.
Consistency of Data: Consistent findings observed by
different persons in different places with different samples strengthens
the likelihood of an effect.
Specificity: Causation is likely if a very specific
population at a specific site suffers from a specific disease with no other likely
explanation. The more specific an association between a factor and an
effect is, the bigger the probability of a causal relationship.
Temporality: The effect has to occur after the cause (and if
there is an expected delay between the cause and expected effect, then
the effect must occur after that delay).
Biological gradient: Greater exposure should generally lead
to greater incidence of the effect. However, in some cases, the mere
presence of the factor can trigger the effect. In other cases, an
inverse proportion is observed: greater exposure leads to lower
incidence.
Plausibility: A plausible mechanism between cause and effect
is helpful (but Hill noted that knowledge of the mechanism is limited by
current knowledge).
Coherence: Coherence between epidemiological and laboratory
findings increases the likelihood of an effect. However, Hill noted that
"... lack of such [laboratory] evidence cannot nullify the
epidemiological effect on associations".
Experiment: "Occasionally it is possible to appeal to experimental evidence".
Analogy: The effect of similar factors may be considered.
Legal interpretation
Epidemiological studies can only go to prove that an agent could have caused, but not that it did cause, an effect in any particular case:
Epidemiology is concerned with the incidence
of disease in populations and does not address the question of the
cause of an individual's disease. This question, sometimes referred to
as specific causation, is beyond the domain of the science of
epidemiology. Epidemiology has its limits at the point where an
inference is made that the relationship between an agent and a disease
is causal (general causation) and where the magnitude of excess risk
attributed to the agent has been determined; that is, epidemiology
addresses whether an agent can cause disease, not whether an agent did
cause a specific plaintiff's disease.
In United States law, epidemiology alone cannot prove that a causal
association does not exist in general. Conversely, it can be (and is in
some circumstances) taken by US courts, in an individual case, to
justify an inference that a causal association does exist, based upon a
balance of probability.
The subdiscipline of forensic epidemiology is directed at the
investigation of specific causation of disease or injury in individuals
or groups of individuals in instances in which causation is disputed or
is unclear, for presentation in legal settings.
Population-based health management
Epidemiological practice and the results of epidemiological analysis
make a significant contribution to emerging population-based health
management frameworks.
Population-based health management encompasses the ability to:
Assess the health states and health needs of a target population;
Implement and evaluate interventions that are designed to improve the health of that population; and
Efficiently and effectively provide care for members of that
population in a way that is consistent with the community's cultural,
policy and health resource values.
Modern population-based health management is complex, requiring a
multidisciplinary set of skills (medical, political, technological, mathematical,
etc.) of which epidemiological practice and analysis is a core
component, that is unified with management science to provide efficient
and effective health care and health guidance to a population. This task
requires the forward-looking ability of modern risk management
approaches that transform health risk factors, incidence, prevalence and
mortality statistics (derived from epidemiological analysis) into
management metrics that not only guide how a health system responds to
current population health issues but also how a health system can be
managed to better respond to future potential population health issues.
Examples of organizations that use population-based health
management that leverage the work and results of epidemiological
practice include Canadian Strategy for Cancer Control, Health Canada
Tobacco Control Programs, Rick Hansen Foundation, Canadian Tobacco
Control Research Initiative.
Each of these organizations uses a population-based health
management framework called Life at Risk that combines epidemiological
quantitative analysis with demographics, health agency operational
research and economics to perform:
Population Life Impacts Simulations: Measurement of the
future potential impact of disease upon the population with respect to
new disease cases, prevalence, premature death as well as potential
years of life lost from disability and death;
Labour Force Life Impacts Simulations: Measurement of the
future potential impact of disease upon the labour force with respect to
new disease cases, prevalence, premature death and potential years of
life lost from disability and death;
Economic Impacts of Disease Simulations: Measurement of the
future potential impact of disease upon private sector disposable income
impacts (wages, corporate profits, private health care costs) and
public sector disposable income impacts (personal income tax, corporate
income tax, consumption taxes, publicly funded health care costs).
Applied field epidemiology
Applied epidemiology is the practice of using epidemiological methods
to protect or improve the health of a population. Applied field
epidemiology can include investigating communicable and non-communicable
disease outbreaks, mortality and morbidity rates, and nutritional
status, among other indicators of health, with the purpose of
communicating the results to those who can implement appropriate
policies or disease control measures.
Humanitarian context
As the surveillance and reporting of diseases and other health
factors become increasingly difficult in humanitarian crisis situations,
the methodologies used to report the data are compromised. One study
found that less than half (42.4%) of nutrition surveys sampled from
humanitarian contexts correctly calculated the prevalence of
malnutrition and only one-third (35.3%) of the surveys met the criteria
for quality. Among the mortality surveys, only 3.2% met the criteria for
quality. As nutritional status and mortality rates help indicate the
severity of a crisis, the tracking and reporting of these health factors
is crucial.
Vital registries are usually the most effective ways to collect
data, but in humanitarian contexts these registries can be non-existent,
unreliable, or inaccessible. As such, mortality is often inaccurately
measured using either prospective demographic surveillance or
retrospective mortality surveys. Prospective demographic surveillance
requires much manpower and is difficult to implement in a spread-out
population. Retrospective mortality surveys are prone to selection and
reporting biases. Other methods are being developed, but are not common
practice yet.
Characterization, validity, and bias
Epidemic wave
The concept of waves in epidemics has implications especially for communicable diseases.
A working definition for the term "epidemic wave" is based on two key
features: 1) it comprises periods of upward or downward trends, and 2)
these increases or decreases must be substantial and sustained over a
period of time, in order to distinguish them from minor fluctuations or
reporting errors. The purpose of a consistent scientific definition is to provide a
consistent language that can be used to communicate about and understand
the progression of the COVID-19 pandemic, which would aid healthcare
organizations and policymakers in resource planning and allocation.
Validities
Different fields in epidemiology have different levels of validity.
One way to assess the validity of findings is the ratio of
false-positives (claimed effects that are not correct) to
false-negatives (studies which fail to support a true effect). In genetic epidemiology,
candidate-gene studies may produce over 100 false-positive findings for
each false-negative. By contrast, genome-wide association studies appear close
to the reverse, with only one false positive for every 100 or more
false-negatives. This ratio has improved over time in genetic epidemiology, as the field
has adopted stringent criteria. By contrast, other epidemiological
fields have not required such rigorous reporting and are much less
reliable as a result.
Random error
Random error is the result of fluctuations around a true value
because of sampling variability. Random error is just that: random. It
can occur during data collection, coding, transfer, or analysis.
Examples of random errors include poorly worded questions, a
misunderstanding in interpreting an individual answer from a particular
respondent, or a typographical error during coding. Random error affects
measurement in a transient, inconsistent manner and it is impossible to
correct for random error. There is a random error in all sampling
procedures – sampling error.
Precision in epidemiological variables is a measure of random error.
Precision is inversely related to random error: reducing random error
increases precision. Confidence intervals are
computed to demonstrate the precision of relative risk estimates. The
narrower the confidence interval, the more precise the relative risk
estimate.
There are two basic ways to reduce random error in an epidemiological study.
The first is to increase the sample size of the study. In other words,
add more subjects to your study. The second is to reduce the variability
in measurement in the study. This might be accomplished by using a more
precise measuring device or by increasing the number of measurements.
Note that increasing the sample size or number of measurements, or
purchasing a more precise measuring tool, usually increases the cost of
the study. There is usually an uneasy balance
between the need for adequate precision and the practical issue of study
cost.
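The relationship between sample size and precision can be sketched numerically. The following is a minimal illustration, using hypothetical counts and the standard Katz log method for a relative-risk confidence interval (not a method prescribed by this text): the same underlying risks yield a much narrower interval when the study is ten times larger.

```python
import math

def rr_confidence_interval(a, n1, c, n0, z=1.96):
    """95% CI for relative risk via the Katz log method.
    a/n1 = cases/total among exposed, c/n0 = cases/total among unexposed."""
    rr = (a / n1) / (c / n0)
    se = math.sqrt(1/a - 1/n1 + 1/c - 1/n0)  # SE of ln(RR)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper

# Same underlying risks (20% exposed vs 10% unexposed), two sample sizes:
small = rr_confidence_interval(20, 100, 10, 100)       # n = 200
large = rr_confidence_interval(200, 1000, 100, 1000)   # n = 2000
```

Both studies estimate the same relative risk of 2.0, but the larger study's confidence interval is considerably narrower, i.e. its estimate is more precise.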
Systematic error
A systematic error or bias occurs when there is a difference between
the true value (in the population) and the observed value (in the study)
from any cause other than sampling variability. An example of
systematic error is if, unknown to you, the pulse oximeter
you are using is set incorrectly and adds two points to the true value
each time a measurement is taken. The measuring device could be precise but not accurate.
Because the error happens in every instance, it is systematic.
Conclusions you draw based on that data will still be incorrect. But the
error can be reproduced in the future (e.g., by using the same mis-set
instrument).
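The mis-set pulse oximeter can be simulated in a few lines. This is a toy sketch with made-up numbers (a hypothetical true value of 97, a constant +2 offset, and small Gaussian noise): the readings cluster tightly (precise) yet are centered two points above the true value (not accurate), and the offset is reproducible.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 97.0  # hypothetical true oxygen-saturation value

def mis_set_oximeter():
    """Precise but not accurate: small random noise, constant +2 offset."""
    return TRUE_VALUE + 2.0 + random.gauss(0, 0.1)

readings = [mis_set_oximeter() for _ in range(1000)]
spread = statistics.stdev(readings)            # small -> precise
bias = statistics.mean(readings) - TRUE_VALUE  # ~ +2 -> systematic error
```

The small spread reflects high precision; the stable +2 bias is the systematic error, which no amount of repeated measurement will average away.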
A mistake in coding that affects all responses for that particular question is another example of a systematic error.
The validity of a study is dependent on the degree of systematic error. Validity is usually separated into two components:
Internal validity
is dependent on the amount of error in measurements, including
exposure, disease, and the associations between these variables. Good
internal validity implies a lack of error in measurement and suggests
that inferences may be drawn at least as they pertain to the subjects
under study.
External validity
pertains to the process of generalizing the findings of the study to
the population from which the sample was drawn (or even beyond that
population to a more universal statement). This requires an
understanding of which conditions are relevant (or irrelevant) to the
generalization. Internal validity is clearly a prerequisite for external
validity.
Selection bias
Selection bias
occurs when study subjects are selected or become part of the study as a
result of a third, unmeasured variable which is associated with both
the exposure and outcome of interest. For instance, it has repeatedly been noted that cigarette smokers and
non-smokers tend to differ in their study participation rates. (Sackett D
cites the example of Seltzer et al., in which 85% of non-smokers and
67% of smokers returned mailed questionnaires.) Such a difference in response will not lead to bias if it is not also
associated with a systematic difference in outcome between the two
response groups.
Information bias
Information bias is bias arising from systematic error in the assessment of a variable. An example is recall bias, again illustrated
by Sackett in his discussion of a study examining the effect of specific
exposures on fetal health: "in questioning mothers whose recent
pregnancies had ended in fetal death or malformation (cases) and a
matched group of mothers whose pregnancies ended normally (controls) it
was found that 28% of the former, but only 20% of the latter, reported
exposure to drugs which could not be substantiated either in earlier
prospective interviews or in other health records". In this example, recall bias probably occurred as a result of women who
had had miscarriages having an apparent tendency to better recall and
therefore report previous exposures.
Design-related bias
Next to sample- and variable-related bias, bias can also arise from
an imperfect study design. One example is immortal time bias, where,
during the study period, there is some interval during which the outcome
event cannot occur (making these individuals "immortal").
Confounding
Confounding
has traditionally been defined as bias arising from the co-occurrence
or mixing of effects of extraneous factors, referred to as confounders,
with the main effect(s) of interest. A more recent definition of
confounding invokes the notion of counterfactual effects. According to
this view, when one observes an outcome of interest, say Y = 1 (as
opposed to Y = 0), in a given population A which is entirely exposed
(i.e. exposure X = 1 for every unit of the population), the risk of this
event will be R_A1. The counterfactual or unobserved risk R_A0
corresponds to the risk which would have been observed if these same
individuals had been unexposed (i.e. X = 0 for every unit of the
population). The true effect of exposure therefore is R_A1 − R_A0 (if
one is interested in risk differences) or R_A1/R_A0 (if one is
interested in relative risk). Since the counterfactual risk R_A0 is
unobservable, it is approximated using a second population B, and what
is actually measured is R_A1 − R_B0 or R_A1/R_B0. In this situation,
confounding occurs when R_A0 ≠ R_B0. (NB: this example assumes binary
outcome and exposure variables.)
Some epidemiologists prefer to think of confounding separately
from common categorizations of bias since, unlike selection and
information bias, confounding stems from real causal effects.
The profession
Few universities have offered epidemiology as a course of study at the undergraduate level. An undergraduate program exists at Johns Hopkins University
in which students who major in public health can take graduate-level
courses—including epidemiology—during their senior year at the Bloomberg School of Public Health. In addition to its master's and doctoral degrees in epidemiology, the University of Michigan School of Public Health has offered undergraduate degree programs since 2017 that include coursework in epidemiology.
Although epidemiologic research is conducted by individuals from
diverse disciplines, variable levels of training in epidemiologic
methods are provided during pharmacy, medical, veterinary, social work, podiatry, nursing, physical therapy, and clinical psychology doctoral programs in addition to the formal training master's and doctoral students in public health fields receive.
As public health practitioners, epidemiologists work in a number
of different settings. Some epidemiologists work "in the field" (i.e.,
in the community; commonly in a public health service), and are often at the forefront of investigating and combating disease outbreaks. Others work for non-profit organizations, universities, hospitals, or
larger government entities (e.g., state and local health departments in
the United States), ministries of health, Doctors without Borders, the Centers for Disease Control and Prevention (CDC), the Health Protection Agency, the World Health Organization (WHO), or the Public Health Agency of Canada.
Epidemiologists can also work in for-profit organizations (e.g.,
pharmaceutical and medical device companies) in groups such as market
research or clinical development.
COVID-19
An April 2020 University of Southern California article noted that, "The coronavirus epidemic...
thrust epidemiology – the study of the incidence, distribution and
control of disease in a population – to the forefront of scientific
disciplines across the globe and even made temporary celebrities out of
some of its practitioners."