In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower sampling probability than others. It results in a biased sample: a non-random sample of a population (or of non-human factors) in which not all individuals, or instances, were equally likely to have been selected.
If this is not accounted for, results can be erroneously attributed to
the phenomenon under study rather than to the method of sampling.
Medical sources sometimes refer to sampling bias as ascertainment bias. Ascertainment bias has basically the same definition, but is still sometimes classified as a separate type of bias.
Distinction from selection bias
Sampling bias is mostly classified as a subtype of selection bias, sometimes specifically termed sample selection bias, but some classify it as a separate type of bias.
A distinction of sampling bias, albeit not a universally accepted one, is that it undermines the external validity of a test (the ability of its results to be generalized to the entire population), while selection bias mainly addresses internal validity
for differences or similarities found in the sample at hand. In this
sense, errors occurring in the process of gathering the sample or cohort
cause sampling bias, while errors in any process thereafter cause
selection bias.
However, selection bias and sampling bias are often used synonymously.
Types
Selection from a specific real area.
For example, a survey of high school students to measure teenage use of
illegal drugs will be a biased sample because it does not include
home-schooled students or dropouts. A sample is also biased if certain
members are underrepresented or overrepresented relative to others in
the population. For example, a "man on the street" interview which
selects people who walk by a certain location is going to have an
overrepresentation of healthy individuals who are more likely to be out
of the home than individuals with a chronic illness. This may be an
extreme form of biased sampling, because certain members of the
population are totally excluded from the sample (that is, they have zero
probability of being selected).
Self-selection bias,
which is possible whenever the group of people being studied has any
form of control over whether to participate (as current standards of human-subject research ethics
require for many real-time and some longitudinal forms of study).
Participants' decision to participate may be correlated with traits that
affect the study, making the participants a non-representative sample.
For example, people who have strong opinions or substantial knowledge
may be more willing to spend time answering a survey than those who do
not. Another example is online and phone-in polls,
which are biased samples because the respondents are self-selected.
Those individuals who are highly motivated to respond, typically
individuals who have strong opinions, are overrepresented, and
individuals that are indifferent or apathetic are less likely to
respond. This often leads to a polarization of responses with extreme
perspectives being given a disproportionate weight in the summary. As a
result, these types of polls are regarded as unscientific.
Pre-screening of trial participants, or advertising
for volunteers within particular groups. For example, a study to "prove"
that smoking does not affect fitness might recruit at the local fitness
center, but advertise for smokers during the advanced aerobics class,
and for non-smokers during the weight loss sessions.
Exclusion bias results from exclusion of particular groups from the sample, e.g. exclusion of subjects who have recently migrated
into the study area (this may occur when newcomers are not available in
a register used to identify the source population). Excluding subjects
who move out of the study area during follow-up is roughly equivalent to
dropout or nonresponse, a selection bias in that it instead affects the internal validity of the study.
Healthy user bias,
when the study population is likely healthier than the general
population. For example, someone in poor health is unlikely to have a
job as a manual laborer.
Berkson's fallacy,
when the study population is selected from a hospital and so is less
healthy than the general population. This can result in a spurious
negative correlation between diseases: a hospital patient without
diabetes is more likely to have another given disease such as cholecystitis, since they must have had some reason to enter the hospital in the first place.
Overmatching, matching for an apparent confounder that actually is a result of the exposure. The control group becomes more similar to the cases in regard to exposure than does the general population.
Survivorship bias,
in which only "surviving" subjects are selected, ignoring those that
fell out of view. For example, using the record of current companies as
an indicator of business climate or economy ignores the businesses that
failed and no longer exist.
Malmquist bias, an effect in observational astronomy which leads to the preferential detection of intrinsically bright objects.
Symptom-based sampling
The
study of medical conditions begins with anecdotal reports. By their
nature, such reports only include those referred for diagnosis and
treatment. A child who can't function in school is more likely to be
diagnosed with dyslexia
than a child who struggles but passes. A child examined for one
condition is more likely to be tested for and diagnosed with other
conditions, skewing comorbidity statistics. As certain diagnoses become associated with behavior problems or intellectual disability,
parents try to prevent their children from being stigmatized with those
diagnoses, introducing further bias. Studies carefully selected from
whole populations are showing that many conditions are much more common
and usually much milder than formerly believed.
Truncate selection in pedigree studies
Simple pedigree example of sampling bias
Geneticists are limited in how they can obtain data from human
populations. As an example, consider a human characteristic. We are
interested in deciding if the characteristic is inherited as a simple Mendelian trait. Following the laws of Mendelian inheritance,
if the parents in a family do not have the characteristic, but carry
the allele for it, they are carriers (e.g. a non-expressive heterozygote).
In this case their children will each have a 25% chance of showing the
characteristic. The problem arises because we can't tell which families
have both parents as carriers (heterozygous) unless they have a child
who exhibits the characteristic. The description follows the textbook by
Sutton.
The figure shows the pedigrees of all the possible families with two children when the parents are carriers (Aa).
Nontruncate selection. In a perfect world we should be able to discover all such families carrying the gene, including those who are simply carriers. In this situation the analysis would be free from ascertainment bias and the pedigrees would be under "nontruncate selection". In practice, most studies identify, and include, families in a study based upon their having affected individuals.
Truncate selection. When afflicted individuals have an equal chance of being included in a study, this is called truncate selection, signifying the inadvertent exclusion (truncation) of families who are carriers for the gene but happen to have no affected children. Because selection is performed on the individual level, families with two or more affected children would have a higher probability of being included in the study.
Complete truncate selection is a special case where each family with an affected child has an equal chance of being selected for the study.
The probabilities of each of the families being selected are given in the figure, with the sample frequency of affected children also given. In this simple case, the researcher will look for a frequency of 4⁄7 or 5⁄8 for the characteristic, depending on the type of truncate selection used.
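These two figures can be checked with a short enumeration. The following sketch (in Python; the helper names are ours, not from the text) lists all two-child families of carrier parents and weights them by either ascertainment rule, reproducing 4⁄7 when every family with an affected child is equally likely to be ascertained (complete truncate selection) and 5⁄8 when ascertainment is proportional to the number of affected children (truncate selection).

```python
from fractions import Fraction
from itertools import product

# Each child of two Aa parents is affected with probability 1/4.
p_affected = Fraction(1, 4)

def family_distribution(n_children=2):
    """Probability of each possible number of affected children in one family."""
    dist = {}
    for outcome in product([0, 1], repeat=n_children):  # 1 = affected child
        prob = Fraction(1, 1)
        for child in outcome:
            prob *= p_affected if child else (1 - p_affected)
        k = sum(outcome)
        dist[k] = dist.get(k, Fraction(0)) + prob
    return dist  # {0: 9/16, 1: 6/16, 2: 1/16} for two children

def affected_fraction(weight):
    """Expected fraction of affected children among ascertained families,
    where weight(k) is the relative chance that a family with k affected
    children enters the study."""
    dist = family_distribution()
    total_children = sum(dist[k] * weight(k) * 2 for k in dist)
    affected = sum(dist[k] * weight(k) * k for k in dist)
    return affected / total_children

# Complete truncate selection: every family with >= 1 affected child
# is equally likely to be ascertained.
print(affected_fraction(lambda k: 1 if k > 0 else 0))  # 4/7

# Truncate (single) selection: ascertainment probability proportional
# to the number of affected children in the family.
print(affected_fraction(lambda k: k))                   # 5/8
```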
The caveman effect
An example of selection bias is called the "caveman effect". Much of our understanding of prehistoric peoples comes from caves, such as cave paintings
made nearly 40,000 years ago. If there had been contemporary paintings
on trees, animal skins or hillsides, they would have been washed away
long ago. Similarly, evidence of fire pits, middens, burial sites,
etc. is most likely to remain intact into the modern era in caves.
Prehistoric people are associated with caves because that is where the
data still exists, not necessarily because most of them lived in caves
for most of their lives.
Problems due to sampling bias
Sampling bias is problematic because it is possible that a statistic
computed from the sample is systematically erroneous. Sampling bias can
lead to a systematic over- or under-estimation of the corresponding parameter
in the population. Sampling bias occurs in practice as it is
practically impossible to ensure perfect randomness in sampling. If the
degree of misrepresentation is small, then the sample can be treated as a
reasonable approximation to a random sample. Also, if the sample does
not differ markedly from the population in the quantity being measured, then a biased sample can still yield a reasonable estimate of that quantity.
The word bias has a strong negative connotation. Indeed, biases sometimes come from deliberate intent to mislead or other scientific fraud.
In statistical usage, bias merely represents a mathematical property,
no matter if it is deliberate or unconscious or due to imperfections in
the instruments used for observation. While some individuals might
deliberately use a biased sample to produce misleading results, more
often, a biased sample is just a reflection of the difficulty in
obtaining a truly representative sample, or ignorance of the bias in
their process of measurement or analysis. An example of how ignorance
of a bias can exist is in the widespread use of a ratio (a.k.a. fold change)
as a measure of difference in biology. Because a given absolute difference produces a large ratio when the two numbers are small, while even a larger absolute difference produces only a modest ratio when the two numbers are large, large and significant differences may be missed when comparing relatively large numeric measurements. Some have
called this a 'demarcation bias' because the use of a ratio (division)
instead of a difference (subtraction) removes the results of the
analysis from science into pseudoscience.
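A minimal numeric illustration of this point, with made-up values rather than data from any particular study: a change from 2 to 6 units is a 3-fold change but an absolute difference of only 4, whereas a change from 100 to 150 units is only a 1.5-fold change despite a much larger absolute difference of 50, so a ratio-based cutoff flags the first and misses the second.

```python
# Hypothetical measurements illustrating how a fold-change cutoff can
# miss large absolute differences between large numbers.
pairs = [
    ("small baseline", 2.0, 6.0),      # difference 4,  fold change 3.0
    ("large baseline", 100.0, 150.0),  # difference 50, fold change 1.5
]

FOLD_CHANGE_CUTOFF = 2.0  # an arbitrary, commonly used threshold

for label, before, after in pairs:
    ratio = after / before
    diff = after - before
    flagged = ratio >= FOLD_CHANGE_CUTOFF
    print(f"{label}: difference={diff:g}, fold change={ratio:g}, "
          f"flagged by ratio cutoff: {flagged}")
```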
Some samples use a biased statistical design which nevertheless allows the estimation of parameters. The U.S. National Center for Health Statistics,
for example, deliberately oversamples from minority populations in many
of its nationwide surveys in order to gain sufficient precision for
estimates within these groups.
These surveys require the use of sample weights (see later on) to
produce proper estimates across all ethnic groups. Provided that certain
conditions are met (chiefly that the weights are calculated and used
correctly) these samples permit accurate estimation of population
parameters.
Historical examples
Example of a biased sample: as of June 2008, 55% of web browsers in use (Internet Explorer) did not pass the Acid2 test. Due to the nature of the test, the sample consisted mostly of web developers.
A classic example of a biased sample and the misleading results it
produced occurred in 1936. In the early days of opinion polling, the
American Literary Digest magazine collected over two million postal surveys and predicted that the Republican candidate in the U.S. presidential election, Alf Landon, would beat the incumbent president, Franklin Roosevelt,
by a large margin. The result was the exact opposite. The Literary
Digest survey represented a sample collected from readers of the
magazine, supplemented by records of registered automobile owners and
telephone users. This sample included an over-representation of
individuals who were rich, who, as a group, were more likely to vote for
the Republican candidate. In contrast, a poll of only 50 thousand
citizens selected by George Gallup's organization successfully predicted the result, leading to the popularity of the Gallup poll.
Another classic example occurred in the 1948 presidential election. On election night, the Chicago Tribune printed the headline DEWEY DEFEATS TRUMAN, which turned out to be mistaken. In the morning the grinning president-elect, Harry S. Truman,
was photographed holding a newspaper bearing this headline. The reason
the Tribune was mistaken is that their editor trusted the results of a phone survey.
Survey research was then in its infancy, and few academics realized
that a sample of telephone users was not representative of the general
population. Telephones were not yet widespread, and those who had them
tended to be prosperous and have stable addresses. (In many cities, the Bell System telephone directory contained the same names as the Social Register). In addition, the Gallup poll that the Tribune based its headline on was over two weeks old at the time of the printing.
Statistical corrections for a biased sample
If
entire segments of the population are excluded from a sample, then
there are no adjustments that can produce estimates that are
representative of the entire population. But if some groups are
underrepresented and the degree of underrepresentation can be
quantified, then sample weights can correct the bias. However, the
success of the correction is limited by the selection model chosen. If certain variables are missing, the methods used to correct the bias could be inaccurate.
For example, a hypothetical population might include 10 million
men and 10 million women. Suppose that a biased sample of 100 patients
included 20 men and 80 women. A researcher could correct for this
imbalance by attaching a weight of 2.5 for each male and 0.625 for each
female. This would adjust any estimates to achieve the same expected
value as a sample that included exactly 50 men and 50 women, unless men
and women differed in their likelihood of taking part in the survey.
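A minimal sketch of how such weights enter an estimate, using the hypothetical 20 men / 80 women sample above (the outcome values are invented purely for illustration): each observation is multiplied by its weight, and the weighted mean recovers the value a balanced 50/50 sample would target in expectation.

```python
# Hypothetical biased sample: 20 men and 80 women drawn from a population
# that is 50% men and 50% women. Weight = population share / sample share.
n_men, n_women = 20, 80
n_total = n_men + n_women

weight_men = 0.5 / (n_men / n_total)      # = 2.5
weight_women = 0.5 / (n_women / n_total)  # = 0.625

# Made-up group means of some measured quantity, for illustration only.
mean_outcome_men, mean_outcome_women = 10.0, 20.0

unweighted_mean = (n_men * mean_outcome_men + n_women * mean_outcome_women) / n_total
weighted_mean = (
    n_men * weight_men * mean_outcome_men
    + n_women * weight_women * mean_outcome_women
) / (n_men * weight_men + n_women * weight_women)

print(unweighted_mean)  # 18.0, pulled toward the overrepresented group
print(weighted_mean)    # 15.0, the value a balanced 50/50 sample targets
```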
Selection bias is the bias
introduced by the selection of individuals, groups or data for analysis
in such a way that proper randomization is not achieved, thereby
ensuring that the sample obtained is not representative of the
population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase "selection bias" most often refers to the distortion of a statistical
analysis, resulting from the method of collecting samples. If the
selection bias is not taken into account, then some conclusions of the
study may be false.
Types
There are many types of possible selection bias, including:
Sampling bias
Sampling bias is systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others, resulting in a biased sample, defined as a statistical sample of a population (or non-human factors) in which all participants are not equally balanced or objectively represented. It is mostly classified as a subtype of selection bias, sometimes specifically termed sample selection bias, but some classify it as a separate type of bias.
A distinction of sampling bias (albeit not a universally accepted one) is that it undermines the external validity of a test (the ability of its results to be generalized to the rest of the population), while selection bias mainly addresses internal validity
for differences or similarities found in the sample at hand. In this
sense, errors occurring in the process of gathering the sample or cohort
cause sampling bias, while errors in any process thereafter cause
selection bias.
Examples of sampling bias include self-selection,
pre-screening of trial participants, discounting trial subjects/tests
that did not run to completion and migration bias by excluding subjects
who have recently moved into or out of the study area.
Time interval
Early termination of a trial at a time when its results support the desired conclusion.
A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
Exposure
Susceptibility bias
Clinical susceptibility bias, when one disease
predisposes for a second disease, and the treatment for the first
disease erroneously appears to predispose to the second disease. For
example, postmenopausal syndrome gives a higher likelihood of also developing endometrial cancer, so estrogens given for the postmenopausal syndrome may receive more blame than they actually deserve for causing endometrial cancer.
Protopathic bias, when a treatment for the first symptoms of a disease or other outcome appears to cause the outcome. It is a potential bias when there is a lag time between the first symptoms and the start of treatment before the actual diagnosis. It can be mitigated by lagging, that is, exclusion of exposures that occurred in a certain time period before diagnosis.
Indication bias, a potential mixup between cause and effect
when exposure is dependent on indication, e.g. a treatment is given to
people in high risk of acquiring a disease, potentially causing a
preponderance of treated people among those acquiring the disease. This
may cause an erroneous appearance of the treatment being a cause of the
disease.
Data
Partitioning
(dividing) data with knowledge of the contents of the partitions, and
then analyzing them with tests designed for blindly chosen partitions.
Post hoc alteration of data inclusion based on arbitrary or subjective reasons, including:
Cherry picking, which actually is not selection bias, but confirmation bias,
when specific subsets of data are chosen to support a conclusion (e.g.
citing examples of plane crashes as evidence of airline flight being
unsafe, while ignoring the far more common example of flights that
complete safely).
Rejection of "bad" data (1) on arbitrary grounds, instead of according to previously stated or generally agreed criteria, or (2) by discarding "outliers" on statistical grounds that fail to take into account important information that could be derived from "wild" observations.
Performing repeated experiments and reporting only the most
favorable results, perhaps relabelling lab records of other experiments
as "calibration tests", "instrumentation errors" or "preliminary
surveys".
Presenting the most significant result of a data dredge as if it were a single experiment (which is logically the same as the previous item, but is seen as much less dishonest).
Attrition
Attrition bias is a kind of selection bias caused by attrition (loss of participants), discounting trial subjects/tests that did not run to completion. It is closely related to the survivorship bias, where only the subjects that "survived" a process are included in the analysis or the failure bias, where only the subjects that "failed" a process are included. It includes dropout, nonresponse (lower response rate), withdrawal and protocol deviators.
It gives biased results where it is unequal in regard to exposure
and/or outcome. For example, in a test of a dieting program, the
researcher may simply reject everyone who drops out of the trial, but
most of those who drop out are those for whom it was not working.
Differential loss of subjects in the intervention and comparison groups may change the characteristics of these groups and outcomes irrespective of the studied intervention.
Observer selection
Philosopher Nick Bostrom
has argued that data are filtered not only by study design and
measurement, but by the necessary precondition that there has to be
someone doing a study. In situations where the existence of the observer
or the study is correlated with the data, observation selection effects
occur, and anthropic reasoning is required.
An example is the past impact event
record of Earth: if large impacts cause mass extinctions and ecological
disruptions precluding the evolution of intelligent observers for long
periods, no one will observe any evidence of large impacts in the recent
past (since they would have prevented intelligent observers from
evolving). Hence there is a potential bias in the impact record of
Earth. Astronomical existential risks might similarly be underestimated due to selection bias, and an anthropic correction has to be introduced.
Mitigation
In the general case, selection biases cannot be overcome with statistical analysis of existing data alone, though Heckman correction may be used in special cases. An assessment of the degree of selection bias can be made by examining correlations between exogenous (background) variables and a treatment indicator. However, in regression models, it is correlation between unobserved determinants of the outcome and unobserved
determinants of selection into the sample which bias estimates, and
this correlation between unobservables cannot be directly assessed by
the observed determinants of treatment.
Related issues
Selection bias is closely related to:
publication bias or reporting bias, the distortion produced in community perception or meta-analyses
by not publishing uninteresting (usually negative) results, or results
which go against the experimenter's prejudices, a sponsor's interests,
or community expectations.
confirmation bias,
the general tendency of humans to give more attention to whatever
confirms our pre-existing perspective; or specifically in experimental
science, the distortion produced by experiments that are designed to
seek confirmatory evidence instead of trying to disprove the hypothesis.
exclusion bias, which results from applying different criteria to cases and controls in regard to participation eligibility for a study, or from different variables serving as the basis for exclusion.
In statistics, regression toward (or to) the mean is the phenomenon whereby a random variable that is extreme on its first measurement tends to be closer to the mean or average on its second measurement and, symmetrically, one that is extreme on its second measurement tends to have been closer to the average on its first. To avoid making incorrect inferences, regression toward the mean must be considered when designing scientific experiments and interpreting data. Historically, what is now called regression toward the mean has also been called reversion to the mean and reversion to mediocrity.
The conditions under which regression toward the mean occurs depend on
the way the term is mathematically defined. The British polymath Sir Francis Galton first observed the phenomenon in the context of simple linear regression of data points. Galton developed the following model: pellets fall through a quincunx to form a normal distribution
centered directly under their entrance point. These pellets might then
be released down into a second gallery corresponding to a second
measurement. Galton then asked the reverse question: "From where did
these pellets come?"
The answer was not 'on average directly above'. Rather it was 'on average, more towards the middle',
for the simple reason that there were more pellets above it towards the
middle that could wander left than there were in the left extreme that
could wander to the right, inwards.
As a less restrictive approach, regression towards the mean can be defined for any bivariate distribution with identical marginal distributions. Two such definitions exist.
One definition accords closely with the common usage of the term
"regression towards the mean". Not all such bivariate distributions show
regression towards the mean under this definition. However, all such
bivariate distributions show regression towards the mean under the other
definition.
Jeremy Siegel uses the term "return to the mean" to describe a financial time series in which "returns can be very unstable in the short run but very stable in the long run." More quantitatively, it is one in which the standard deviation of average annual returns declines faster than the inverse of the holding period, implying that the process is not a random walk,
but that periods of lower returns are systematically followed by
compensating periods of higher returns, as is the case in many seasonal
businesses, for example.
Conceptual background
Consider
a simple example: a class of students takes a 100-item true/false test
on a subject. Suppose that all students choose randomly on all
questions. Then, each student's score would be a realization of one of a
set of independent and identically distributed random variables, with an expected mean
of 50. Naturally, some students will score substantially above 50 and
some substantially below 50 just by chance. If one takes only the top
scoring 10% of the students and gives them a second test on which they
again choose randomly on all items, the mean score would again be
expected to be close to 50. Thus the mean of these students would
"regress" all the way back to the mean of all students who took the
original test. No matter what a student scores on the original test, the
best prediction of their score on the second test is 50.
If choosing answers to the test questions was not random – i.e.
if there were no luck (good or bad) or random guessing involved in the
answers supplied by the students – then all students would be expected
to score the same on the second test as they scored on the original
test, and there would be no regression toward the mean.
Most realistic situations fall between these two extremes: for example, one might consider exam scores as a combination of skill and luck.
In this case, the subset of students scoring above average would be
composed of those who were skilled and had not especially bad luck,
together with those who were unskilled, but were extremely lucky. On a
retest of this subset, the unskilled will be unlikely to repeat their
lucky break, while the skilled will have a second chance to have bad
luck. Hence, those who did well previously are unlikely to do quite as
well in the second test even if the original cannot be replicated.
The following is an example of this second kind of regression
toward the mean. A class of students takes two editions of the same test
on two successive days. It has frequently been observed that the worst
performers on the first day will tend to improve their scores on the
second day, and the best performers on the first day will tend to do
worse on the second day.
The phenomenon occurs because student scores are determined in part by
underlying ability and in part by chance. For the first test, some will
be lucky, and score more than their ability, and some will be unlucky
and score less than their ability. Some of the lucky students on the
first test will be lucky again on the second test, but more of them will
have (for them) average or below average scores. Therefore, a student
who was lucky on the first test is more likely to have a worse score on
the second test than a better score. Similarly, students who score less
than the mean on the first test will tend to see their scores increase
on the second test.
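A short simulation makes this ability-plus-chance picture concrete. The score model and parameters below are illustrative assumptions, not taken from the text: each student's score is an underlying ability plus independent noise on each test, and the students who scored highest on the first test score lower, on average, on the second.

```python
import random

random.seed(0)
N_STUDENTS = 10_000

# Illustrative model: score = ability (fixed per student) + luck (fresh each test).
abilities = [random.gauss(50, 10) for _ in range(N_STUDENTS)]
test1 = [a + random.gauss(0, 10) for a in abilities]
test2 = [a + random.gauss(0, 10) for a in abilities]

# Take the top-scoring 10% on the first test and compare their means.
cutoff = sorted(test1, reverse=True)[N_STUDENTS // 10]
top = [i for i in range(N_STUDENTS) if test1[i] > cutoff]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

print(mean(test1[i] for i in top))  # well above 50 on the first test
print(mean(test2[i] for i in top))  # noticeably lower: regression toward the mean
```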
Other examples
If
your favorite sports team won the championship last year, what does
that mean for their chances for winning next season? To the extent this
result is due to skill (the team is in good condition, with a top coach,
etc.), their win signals that it is more likely they will win again
next year. But the greater the extent this is due to luck (other teams
embroiled in a drug scandal, favorable draw, draft picks turned out to
be productive, etc.), the less likely it is they will win again next
year.
If one medical trial suggests that a particular drug or treatment
is outperforming all other treatments for a condition, then in a second
trial it is more likely that the outperforming drug or treatment will
perform closer to the mean.
If a business organisation has a highly profitable quarter,
despite the underlying reasons for its performance being unchanged, it
is likely to do less well the next quarter.
If the country's GDP jumps in one quarter it is likely not to do as well in the next.
Baseball players who hit well in their rookie season are likely to do worse in their second; the "sophomore slump".
History
The concept of regression comes from genetics and was popularized by Sir Francis Galton during the late 19th century with the publication of Regression towards mediocrity in hereditary stature.
Galton observed that extreme characteristics (e.g., height) in parents
are not passed on completely to their offspring. Rather, the
characteristics in the offspring regress towards a mediocre
point (a point which has since been identified as the mean). By
measuring the heights of hundreds of people, he was able to quantify
regression to the mean, and estimate the size of the effect. Galton
wrote that, "the average regression of the offspring is a constant
fraction of their respective mid-parental
deviations". This means that the difference between a child and its
parents for some characteristic is proportional to its parents'
deviation from typical people in the population. If its parents are each
two inches taller than the averages for men and women, then, on
average, the offspring will be shorter than its parents by some factor
(which, today, we would call one minus the regression coefficient)
times two inches. For height, Galton estimated this coefficient to be
about 2/3: the height of an individual will measure around a midpoint
that is two thirds of the parents' deviation from the population
average.
Galton coined the term "regression" to describe an observable fact in the inheritance of multi-factorial quantitative genetic
traits: namely that the offspring of parents who lie at the tails of
the distribution will tend to lie closer to the centre, the mean, of the
distribution. He quantified this trend, and in doing so invented linear regression
analysis, thus laying the groundwork for much of modern statistical
modelling. Since then, the term "regression" has taken on a variety of
meanings, and it may be used by modern statisticians to describe
phenomena of sampling bias which have little to do with Galton's original observations in the field of genetics.
Though his mathematical analysis was correct, Galton's biological
explanation for the regression phenomenon he observed is now known to
be incorrect. He stated: "A child inherits partly from his parents,
partly from his ancestors. Speaking generally, the further his genealogy
goes back, the more numerous and varied will his ancestry become, until
they cease to differ from any equally numerous sample taken at
haphazard from the race at large."
This is incorrect, since a child receives its genetic make-up
exclusively from its parents. There is no generation-skipping in genetic
material: any genetic material from earlier ancestors must have passed
through the parents (though it may not have been expressed
in them). The phenomenon is better understood if we assume that the
inherited trait (e.g., height) is controlled by a large number of recessive genes. Exceptionally tall individuals must be homozygous for increased height mutations at a large proportion of these loci.
But the loci which carry these mutations are not necessarily shared
between two tall individuals, and if these individuals mate, their
offspring will be on average homozygous for "tall" mutations on fewer
loci than either of their parents. In addition, height is not entirely
genetically determined, but also subject to environmental influences
during development, which make offspring of exceptional parents even
more likely to be closer to the average than their parents.
This population genetic
phenomenon of regression to the mean is best thought of as a
combination of a binomially distributed process of inheritance plus
normally distributed environmental influences. In contrast, the term
"regression to the mean" is now often used to describe the phenomenon by
which an initial sampling bias may disappear as new, repeated, or larger samples display sample means that are closer to the true underlying population mean.
Importance
Regression toward the mean is a significant consideration in the design of experiments.
Take a hypothetical example of 1,000 individuals of a similar age
who were examined and scored on the risk of experiencing a heart
attack. Statistics could be used to measure the success of an
intervention on the 50 who were rated at the greatest risk. The
intervention could be a change in diet, exercise, or a drug treatment.
Even if the interventions are worthless, the test group would be
expected to show an improvement on their next physical exam, because of
regression toward the mean. The best way to combat this effect is to
divide the group randomly into a treatment group that receives the
treatment, and a control
group that does not. The treatment would then be judged effective only
if the treatment group improves more than the control group.
Alternatively, a group of disadvantaged
children could be tested to identify the ones with most college
potential. The top 1% could be identified and supplied with special
enrichment courses, tutoring, counseling and computers. Even if the
program is effective, their average scores may well be less when the
test is repeated a year later. However, in these circumstances it may be
considered unethical to have a control group of disadvantaged children
whose special needs are ignored. A mathematical calculation for shrinkage can adjust for this effect, although it will not be as reliable as the control group method.
The effect can also be exploited for general inference and
estimation. The hottest place in the country today is more likely to be
cooler tomorrow than hotter, as compared to today. The best performing
mutual fund over the last three years is more likely to see relative
performance decline than improve over the next three years. The most
successful Hollywood actor of this year is likely to have less gross
than more gross for his or her next movie. The baseball player with the
greatest batting average by the All-Star break is more likely to have a
lower average than a higher average over the second half of the season.
Misunderstandings
The concept of regression toward the mean can be misused very easily.
In the student test example above, it was assumed implicitly that
what was being measured did not change between the two measurements.
Suppose, however, that the course was pass/fail and students were
required to score above 70 on both tests to pass. Then the students who
scored under 70 the first time would have no incentive to do well, and
might score worse on average the second time. The students just over 70,
on the other hand, would have a strong incentive to study and
concentrate while taking the test. In that case one might see movement away
from 70, scores below it getting lower and scores above it getting
higher. It is possible for changes between the measurement times to
augment, offset or reverse the statistical tendency to regress toward
the mean.
Statistical regression toward the mean is not a causal
phenomenon. A student with the worst score on the test on the first day
will not necessarily increase his score substantially on the second day
due to the effect. On average, the worst scorers improve, but that is
only true because the worst scorers are more likely to have been unlucky
than lucky. To the extent that a score is determined randomly, or that a
score has random variation or error, as opposed to being determined by
the student's academic ability or being a "true value", the phenomenon
will have an effect. A classic mistake in this regard was in education.
The students that received praise for good work were noticed to do more
poorly on the next measure, and the students who were punished for poor
work were noticed to do better on the next measure. The educators
decided to stop praising and keep punishing on this basis.
Such a decision was a mistake, because regression toward the mean is
not based on cause and effect, but rather on random error in a natural
distribution around a mean.
Although extreme individual measurements regress toward the mean, the second sample
of measurements will be no closer to the mean than the first. Consider
the students again. Suppose the tendency of extreme individuals is to
regress 10% of the way toward the mean of 80, so a student who scored 100 the first day is expected
to score 98 the second day, and a student who scored 70 the first day
is expected to score 71 the second day. Those expectations are closer to
the mean than the first day scores. But the second day scores will vary
around their expectations; some will be higher and some will be lower.
In addition, individuals that measure very close to the mean should
expect to move away from the mean. The effect is the exact reverse of
regression toward the mean, and exactly offsets it. So for extreme
individuals, we expect the second score to be closer to the mean than
the first score, but for all individuals, we expect the distribution of distances from the mean to be the same on both sets of measurements.
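This symmetry can be checked numerically. In the sketch below, which assumes an ability-plus-noise score model not taken from the text, the second-day scores of the extreme first-day scorers are closer to the mean on average, yet the overall spread of scores about the mean is the same on both days.

```python
import random
import statistics

random.seed(1)
N = 100_000

# Assumed model: each score is a fixed ability plus fresh measurement noise.
ability = [random.gauss(80, 8) for _ in range(N)]
day1 = [a + random.gauss(0, 6) for a in ability]
day2 = [a + random.gauss(0, 6) for a in ability]

# Extreme scorers on day 1 are closer to the mean on day 2, on average...
top = [i for i in range(N) if day1[i] > 95]
print(statistics.mean(day1[i] for i in top))  # far above 80
print(statistics.mean(day2[i] for i in top))  # closer to 80

# ...but the distribution of distances from the mean is the same on both days.
print(statistics.pstdev(day1))  # about 10
print(statistics.pstdev(day2))  # about 10 as well
```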
Related to the point above, regression toward the mean works
equally well in both directions. We expect the student with the highest
test score on the second day to have done worse on the first day. And if
we compare the best student on the first day to the best student on the
second day, regardless of whether it is the same individual or not,
there is a tendency to regress toward the mean going in either
direction. We expect the best scores on both days to be equally far from
the mean.
Regression fallacies
Many phenomena tend to be attributed to the wrong causes when regression to the mean is not taken into account.
An extreme example is Horace Secrist's 1933 book The Triumph of Mediocrity in Business,
in which the statistics professor collected mountains of data to prove
that the profit rates of competitive businesses tend toward the average
over time. In fact, there is no such effect; the variability of profit
rates is almost constant over time. Secrist had only described the
common regression toward the mean. One exasperated reviewer, Harold Hotelling,
likened the book to "proving the multiplication table by arranging
elephants in rows and columns, and then doing the same for numerous
other kinds of animals".
The calculation and interpretation of "improvement scores" on
standardized educational tests in Massachusetts probably provides
another example of the regression fallacy.
In 1999, schools were given improvement goals. For each school, the
Department of Education tabulated the difference in the average score
achieved by students in 1999 and in 2000. It was quickly noted that most
of the worst-performing schools had met their goals, which the
Department of Education took as confirmation of the soundness of their
policies. However, it was also noted that many of the supposedly best
schools in the Commonwealth, such as Brookline High School (with 18
National Merit Scholarship finalists) were declared to have failed. As
in many cases involving statistics and public policy, the issue is
debated, but "improvement scores" were not announced in subsequent years
and the findings appear to be a case of regression to the mean.
The psychologist Daniel Kahneman, winner of the 2002 Nobel Memorial Prize in Economic Sciences,
pointed out that regression to the mean might explain why rebukes can
seem to improve performance, while praise seems to backfire.
“I had
the most satisfying Eureka experience of my career while attempting to
teach flight instructors that praise is more effective than punishment
for promoting skill-learning. When I had finished my enthusiastic
speech, one of the most seasoned instructors in the audience raised his
hand and made his own short speech, which began by conceding that
positive reinforcement might be good for the birds, but went on to deny
that it was optimal for flight cadets. He said, "On many occasions I
have praised flight cadets for clean execution of some aerobatic
maneuver, and in general when they try it again, they do worse. On the
other hand, I have often screamed at cadets for bad execution, and in
general they do better the next time. So please don't tell us that
reinforcement works and punishment does not, because the opposite is the
case." This was a joyous moment, in which I understood an important
truth about the world: because we tend to reward others when they do
well and punish them when they do badly, and because there is regression
to the mean, it is part of the human condition that we are
statistically punished for rewarding others and rewarded for punishing
them. I immediately arranged a demonstration in which each participant
tossed two coins at a target behind his back, without any feedback. We
measured the distances from the target and could see that those who had
done best the first time had mostly deteriorated on their second try,
and vice versa. But I knew that this demonstration would not undo the
effects of lifelong exposure to a perverse contingency.”
To put Kahneman's story in simple terms, when one makes a severe mistake, one's performance will later usually return to its average level anyway. This will seem like an improvement and as "proof" of a belief that it is better to criticize than to praise (held especially by anyone who is willing to criticize at that "low" moment). In the contrary situation, when one happens to perform far above average, one's performance will also tend to return to the average level later on; the change will be perceived as a deterioration, and any initial praise following the first performance as a cause of that deterioration. Because criticizing or praising merely precedes the regression toward the mean, the act of criticizing or of praising is falsely credited with causing it.
The regression fallacy is also explained in Rolf Dobelli's The Art of Thinking Clearly.
UK law enforcement policies have encouraged the visible siting of static or mobile speed cameras at accident blackspots. This policy was justified by a perception that there is a corresponding reduction in serious road traffic accidents
after a camera is set up. However, statisticians have pointed out that,
although there is a net benefit in lives saved, failure to take into
account the effects of regression to the mean results in the beneficial
effects being overstated.
Statistical analysts have long recognized the effect of
regression to the mean in sports; they even have a special name for it:
the "sophomore slump". For example, Carmelo Anthony of the NBA's Denver Nuggets
had an outstanding rookie season in 2004. It was so outstanding, in
fact, that he could not possibly be expected to repeat it: in 2005,
Anthony's numbers had dropped from his rookie season. The reasons for
the "sophomore slump" abound, as sports are all about adjustment and
counter-adjustment, but luck-based excellence as a rookie is as good a
reason as any. Regression to the mean in sports performance may also be
the reason for the apparent "Sports Illustrated cover jinx" and the "Madden Curse". John Hollinger has an alternate name for the phenomenon of regression to the mean: the "fluke rule", while Bill James calls it the "Plexiglas Principle".
Because popular lore has focused on regression toward the mean as
an account of declining performance of athletes from one season to the
next, it has usually overlooked the fact that such regression can also
account for improved performance. For example, if one looks at the batting average of Major League Baseball
players in one season, those whose batting average was above the league
mean tend to regress downward toward the mean the following year, while
those whose batting average was below the mean tend to progress upward
toward the mean the following year.
Other statistical phenomena
Regression
toward the mean simply says that, following an extreme random event,
the next random event is likely to be less extreme. In no sense does the
future event "compensate for" or "even out" the previous event, though
this is assumed in the gambler's fallacy (and the variant law of averages). Similarly, the law of large numbers
states that in the long term, the average will tend towards the
expected value, but makes no statement about individual trials. For
example, following a run of 10 heads on a flip of a fair coin (a rare,
extreme event), regression to the mean states that the next run of heads
will likely be less than 10, while the law of large numbers states that
in the long term, this event will likely average out, and the average
fraction of heads will tend to 1/2. By contrast, the gambler's fallacy
incorrectly assumes that the coin is now "due" for a run of tails to
balance out.
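The distinction can be illustrated with a small coin-flip simulation (a sketch with arbitrary parameters): after a run of 10 heads, the next flips of a fair coin still average about one-half heads, neither "compensating" with extra tails nor continuing the streak, while the running fraction of heads over many flips drifts toward 1/2 simply because the streak is diluted.

```python
import random

random.seed(2)

def flips(n):
    return [random.random() < 0.5 for _ in range(n)]  # True = heads

# Condition on having just seen 10 heads in a row, then look at what follows.
next_counts = []
while len(next_counts) < 500:
    if all(flips(10)):                      # a run of 10 heads (rare)
        next_counts.append(sum(flips(10)))  # heads in the next 10 flips

print(sum(next_counts) / len(next_counts))  # about 5: no compensation, no streak

# Law of large numbers: the running average still tends to 1/2,
# because the initial streak is diluted rather than balanced out.
streak_then_more = [True] * 10 + flips(100_000)
print(sum(streak_then_more) / len(streak_then_more))  # close to 0.5
```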
Definition for simple linear regression of data points
This is the definition of regression toward the mean that closely follows Sir Francis Galton's original usage.
Suppose there are n data points {yi, xi}, where i = 1, 2, …, n. We want to find the equation of the regression line, i.e. the straight line
which would provide a "best" fit for the data points. (Note that a
straight line may not be the appropriate regression curve for the given
data points.) Here the "best" will be understood as in the least-squares approach: such a line that minimizes the sum of squared residuals of the linear regression model. In other words, numbers α and β solve the following minimization problem:
Find the values of α and β that minimize Q(α, β), where
Q(α, β) = Σᵢ (yᵢ − α − β xᵢ)², summed over i = 1, 2, …, n.
Using calculus it can be shown that the values of α and β that minimize the objective function Q are
β = rxy (sy / sx) and α = ȳ − β x̄,
where rxy is the sample correlation coefficient between x and y, sx is the standard deviation of x, and sy is correspondingly the standard deviation of y. A horizontal bar over a variable means the sample average of that variable; for example, x̄ = (1/n) Σᵢ xᵢ.
Substituting the above expressions for α and β into ŷ = α + β x yields fitted values
ŷᵢ = ȳ + rxy (sy / sx)(xᵢ − x̄),
which yields
(ŷᵢ − ȳ) / sy = rxy (xᵢ − x̄) / sx.
This shows the role rxy plays in the regression line of standardized data points.
If −1 < rxy < 1, then we say that
the data points exhibit regression toward the mean. In other words, if
linear regression is the appropriate model for a set of data points
whose sample correlation coefficient is not perfect, then there is
regression toward the mean. The predicted (or fitted) standardized value
of y is closer to its mean than the standardized value of x is to its mean.
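A short numeric check of this statement, using arbitrary simulated data rather than anything from the text: after fitting the least-squares line and standardizing, each fitted value of y sits a factor of rxy closer to its mean (in standard-deviation units) than the corresponding x does.

```python
import random
import statistics

random.seed(3)
n = 5_000

# Arbitrary simulated data with an imperfect linear relationship.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + random.gauss(0, 1) for xi in x]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.pstdev(x), statistics.pstdev(y)
r_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n * s_x * s_y)

beta = r_xy * s_y / s_x
alpha = y_bar - beta * x_bar

# The standardized fitted y equals r_xy times the standardized x, so it is
# always closer to its mean than x is to its mean whenever |r_xy| < 1.
xi = x[0]
y_hat = alpha + beta * xi
print((xi - x_bar) / s_x)     # standardized x
print((y_hat - y_bar) / s_y)  # r_xy times as large
print(r_xy)
```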
Definitions for bivariate distribution with identical marginal distributions
Restrictive definition
Let X1, X2 be random variables with identical marginal distributions with mean μ. In this formalization, the bivariate distribution of X1 and X2 is said to exhibit regression toward the mean if, for every number c > μ, we have
μ ≤ E[X2 | X1 = c] < c,
with the reverse inequalities holding for c < μ.
The following is an informal description of the above definition. Consider a population of widgets. Each widget has two numbers, X1 and X2 (say, its left span (X1) and right span (X2)). Suppose that the probability distributions of X1 and X2 in the population are identical, and that the means of X1 and X2 are both μ. We now take a random widget from the population, and denote its X1 value by c. (Note that c may be greater than, equal to, or smaller than μ.) We have no access to the value of this widget's X2 yet. Let d denote the expected value of X2 of this particular widget. (i.e., let d denote the average value of X2 of all widgets in the population with X1 = c.) If the following condition is true:
Whatever the value c is, d lies between μ and c (i.e., d is closer to μ than c is),
then we say that X1 and X2 show regression toward the mean.
This definition accords closely with the current common usage,
evolved from Galton's original usage, of the term "regression toward the
mean." It is "restrictive" in the sense that not every bivariate
distribution with identical marginal distributions exhibits regression
toward the mean (under this definition).
Theorem
If a pair (X, Y) of random variables follows a bivariate normal distribution, then the conditional mean E(Y|X) is a linear function of X. The correlation coefficient r between X and Y, along with the marginal means and variances of X and Y, determines this linear relationship:
E[Y | X] = E[Y] + r (σy / σx)(X − E[X]),
where E[X] and E[Y] are the expected values of X and Y, respectively, and σx and σy are the standard deviations of X and Y, respectively.
Hence the conditional expected value of Y, given that X is t standard deviations above its mean (and that includes the case where it is below its mean, when t < 0), is rt standard deviations above the mean of Y. Since |r| ≤ 1, Y is no farther from the mean than X is, as measured in the number of standard deviations.
Hence, if 0 ≤ r < 1, then (X, Y) shows regression toward the mean (by this definition).
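This conditional-mean relationship can be verified by simulation. The sketch below uses arbitrary parameters: it draws from a bivariate normal, conditions on X falling near a value c above its mean, and shows that the average of Y in that slice lands near E[Y] + r (σy/σx)(c − E[X]), i.e. less extreme, in standard-deviation units, than c is.

```python
import random
import statistics

random.seed(4)
n = 200_000

# Arbitrary illustrative parameters.
mu_x, mu_y = 10.0, 20.0
sigma_x, sigma_y = 2.0, 5.0
r = 0.6

samples = []
for _ in range(n):
    z1 = random.gauss(0, 1)
    z2 = random.gauss(0, 1)
    x = mu_x + sigma_x * z1
    y = mu_y + sigma_y * (r * z1 + (1 - r**2) ** 0.5 * z2)  # correlated draw
    samples.append((x, y))

c = 13.0  # 1.5 standard deviations above E[X]
slice_y = [y for x, y in samples if abs(x - c) < 0.05]

predicted = mu_y + r * sigma_y * (c - mu_x) / sigma_x  # = 24.5
print(statistics.mean(slice_y))  # close to 24.5, i.e. 0.9 sd above E[Y],
print(predicted)                 # less extreme than the 1.5 sd of X
```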
General definition
The following definition of reversion toward the mean has been proposed by Samuels as an alternative to the more restrictive definition of regression toward the mean above.
Let X1, X2 be random variables with identical marginal distributions with mean μ. In this formalization, the bivariate distribution of X1 and X2 is said to exhibit reversion toward the mean if, for every number c, we have
μ ≤ E[X2 | X1 > c] < E[X1 | X1 > c], and
μ ≥ E[X2 | X1 < c] > E[X1 | X1 < c]
This definition is "general" in the sense that every bivariate distribution with identical marginal distributions exhibits reversion toward the mean.