Selection bias is the bias
introduced by the selection of individuals, groups or data for analysis
in such a way that proper randomization is not achieved, thereby
ensuring that the sample obtained is not representative of the
population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase "selection bias" most often refers to the distortion of a statistical
analysis, resulting from the method of collecting samples. If the
selection bias is not taken into account, then some conclusions of the
study may be false.
Types
There are many types of possible selection bias, including:
Sampling bias
Sampling bias is systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others. The result is a biased sample, defined as a statistical sample of a population (or non-human factors) in which not all participants are equally balanced or objectively represented. It is mostly classified as a subtype of selection bias, sometimes specifically termed sample selection bias, but some classify it as a separate type of bias.
A distinction of sampling bias (albeit not a universally accepted one) is that it undermines the external validity of a test (the ability of its results to be generalized to the rest of the population), while selection bias mainly addresses internal validity
for differences or similarities found in the sample at hand. In this
sense, errors occurring in the process of gathering the sample or cohort
cause sampling bias, while errors in any process thereafter cause
selection bias.
Examples of sampling bias include self-selection,
pre-screening of trial participants, discounting trial subjects/tests
that did not run to completion, and migration bias, which arises from
excluding subjects who have recently moved into or out of the study area.
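The effect of self-selection can be illustrated with a small simulation. The sketch below is only an assumption-laden toy: the "satisfaction score", the population size and the rule that the chance of volunteering equals the score are all hypothetical, chosen to show how a non-random inclusion mechanism shifts the sample mean.

```python
import random
import statistics

def simulate_self_selection(n_population=100_000, sample_size=2_000, seed=0):
    """Compare a simple random sample with a self-selected one.

    Hypothetical setup: each person has a satisfaction score in [0, 1],
    and the chance of volunteering for a survey equals that score.
    """
    rng = random.Random(seed)
    population = [rng.random() for _ in range(n_population)]

    # Simple random sample: every member is equally likely to be included.
    random_sample = rng.sample(population, sample_size)

    # Self-selected sample: high scorers volunteer more often, so they
    # are over-represented among the respondents.
    volunteers = [x for x in population if rng.random() < x]
    self_selected = volunteers[:sample_size]

    return (statistics.mean(population),
            statistics.mean(random_sample),
            statistics.mean(self_selected))

population_mean, random_mean, self_selected_mean = simulate_self_selection()
print(f"population mean:      {population_mean:.3f}")
print(f"random sample mean:   {random_mean:.3f}")
print(f"self-selected mean:   {self_selected_mean:.3f}")
```

Under these assumptions the random sample tracks the population mean, while the self-selected sample overstates it, because inclusion depends on the quantity being measured.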
Time interval
- Early termination of a trial at a time when its results support the desired conclusion.
- A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
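Early termination at a favorable moment can be sketched as repeated "peeking" at accumulating data. The simulation below assumes pure noise (no real effect), a z-test with known unit variance, and a hypothetical schedule of one look every 10 observations; none of these details come from a specific trial design.

```python
import math
import random

def trial_rejects(rng, max_n=200, check_every=10, z_crit=1.96, peek=True):
    """Run one trial on pure noise; return True if it declares an effect."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)  # true mean is 0: no real effect
        at_checkpoint = peek and n % check_every == 0
        if at_checkpoint or n == max_n:
            z = total / math.sqrt(n)  # z-statistic, known unit variance
            if abs(z) > z_crit:
                return True  # stop here and report a "significant" result
    return False

def false_positive_rate(peek, n_trials=2_000, seed=1):
    """Fraction of null trials that end with a 'significant' finding."""
    rng = random.Random(seed)
    hits = sum(trial_rejects(rng, peek=peek) for _ in range(n_trials))
    return hits / n_trials

rate_with_peeking = false_positive_rate(peek=True)
rate_single_look = false_positive_rate(peek=False)
print(f"false positives with early stopping: {rate_with_peeking:.2%}")
print(f"false positives with one final look: {rate_single_look:.2%}")
```

With a single planned analysis the false positive rate stays near the nominal 5%; stopping at the first look that crosses the threshold pushes it well above that level.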
Exposure
- Susceptibility bias
- Clinical susceptibility bias, when one disease predisposes for a second disease, and the treatment for the first disease erroneously appears to predispose to the second disease. For example, postmenopausal syndrome gives a higher likelihood of also developing endometrial cancer, so estrogens given for the postmenopausal syndrome may receive a higher than actual blame for causing endometrial cancer.
- Protopathic bias, when a treatment for the first symptoms of a disease or other outcome appears to cause the outcome. It is a potential bias when there is a lag time between the first symptoms and start of treatment and the actual diagnosis. It can be mitigated by lagging, that is, exclusion of exposures that occurred in a certain time period before diagnosis.
- Indication bias, a potential mix-up between cause and effect when exposure is dependent on indication, e.g. a treatment is given to people at high risk of acquiring a disease, potentially causing a preponderance of treated people among those acquiring the disease. This may cause an erroneous appearance of the treatment being a cause of the disease.
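The lagging approach mentioned for protopathic bias can be sketched as a simple date filter. The function name `lagged_exposures`, the one-year lag and the example dates are illustrative assumptions; an appropriate lag window depends on the disease and is not specified here.

```python
from datetime import date, timedelta

def lagged_exposures(exposure_dates, diagnosis_date, lag_days=365):
    """Keep only exposures at least `lag_days` before the diagnosis.

    Exposures inside the lag window are dropped, since they may have
    been prompted by early symptoms of the not-yet-diagnosed disease.
    """
    cutoff = diagnosis_date - timedelta(days=lag_days)
    return [d for d in exposure_dates if d <= cutoff]

# Hypothetical patient record: three prescriptions before a diagnosis.
prescriptions = [date(2017, 3, 10), date(2019, 11, 20), date(2020, 5, 1)]
diagnosis = date(2020, 6, 1)

print(lagged_exposures(prescriptions, diagnosis))
```

Only the 2017 prescription survives the one-year lag in this example; the two prescriptions written in the year before diagnosis are excluded from the exposure history.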
Data
- Partitioning (dividing) data with knowledge of the contents of the partitions, and then analyzing them with tests designed for blindly chosen partitions.
- Post hoc alteration of data inclusion based on arbitrary or subjective reasons, including:
  - Cherry picking (strictly speaking a form of confirmation bias rather than selection bias), in which specific subsets of data are chosen to support a conclusion (e.g. citing examples of plane crashes as evidence of airline flight being unsafe, while ignoring the far more common example of flights that complete safely).
  - Rejection of "bad" data, either (1) on arbitrary grounds, instead of according to previously stated or generally agreed criteria, or (2) by discarding "outliers" on statistical grounds that fail to take into account important information that could be derived from "wild" observations.
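The first item in this list, partitioning with knowledge of the contents, can be illustrated with a toy comparison. The sketch assumes 40 draws of pure noise from a single population and uses a hand-written Welch t statistic (the helper `welch_t` is defined here, not taken from a library); the group sizes are arbitrary.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

rng = random.Random(2)
data = [rng.gauss(0.0, 1.0) for _ in range(40)]  # one population, no effect

# Blind partition: split by position, without looking at the values.
blind_t = welch_t(data[:20], data[20:])

# Informed partition: sort first, then treat the top half as a "group".
ordered = sorted(data)
informed_t = welch_t(ordered[20:], ordered[:20])

print(f"t statistic, blind split:    {blind_t:.2f}")
print(f"t statistic, informed split: {informed_t:.2f}")
```

The test assumes the groups were formed without reference to the outcome. Splitting on the observed values manufactures a large t statistic from data that contain no group difference at all.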
Studies
- Selection of which studies to include in a meta-analysis (see also combinatorial meta-analysis).
- Performing repeated experiments and reporting only the most favorable results, perhaps relabelling lab records of other experiments as "calibration tests", "instrumentation errors" or "preliminary surveys".
- Presenting the most significant result of a data dredge as if it were a single experiment (which is logically the same as the previous item, but is seen as much less dishonest).
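The cost of reporting only the best of many results can be quantified under a simplifying assumption: if the tests are independent and every null hypothesis is true, each p-value is uniform on [0, 1]. The sketch below compares the closed-form probability of at least one spurious "significant" result with a simulation; the function names are illustrative.

```python
import random

def chance_of_spurious_hit(n_tests, alpha=0.05):
    """P(at least one of n independent null tests has p < alpha)."""
    return 1.0 - (1.0 - alpha) ** n_tests

def simulate_best_p_value(n_tests, n_repeats=10_000, alpha=0.05, seed=3):
    """Fraction of repeats in which the smallest p-value is below alpha.

    Each null test is modelled as a single uniform draw, which is the
    distribution of a p-value when there is no real effect.
    """
    rng = random.Random(seed)
    hits = sum(
        min(rng.random() for _ in range(n_tests)) < alpha
        for _ in range(n_repeats)
    )
    return hits / n_repeats

for k in (1, 5, 20):
    print(k, round(chance_of_spurious_hit(k), 3),
          round(simulate_best_p_value(k), 3))
```

Under these assumptions, 20 experiments on a non-existent effect yield at least one result with p < 0.05 roughly 64% of the time, so the "most favorable" result carries far less evidential weight than a single pre-planned test.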
Attrition
Attrition bias is a kind of selection bias caused by attrition (loss of participants), that is, by discounting trial subjects/tests that did not run to completion. It is closely related to survivorship bias, where only the subjects that "survived" a process are included in the analysis, and to failure bias, where only the subjects that "failed" a process are included. It includes dropout, nonresponse (lower response rate), withdrawal and protocol deviators.
It gives biased results where attrition is unequal with regard to exposure
and/or outcome. For example, in a test of a dieting program, the
researcher may simply reject everyone who drops out of the trial, but
most of those who drop out are those for whom it was not working.
Differential loss of subjects in the intervention and comparison groups
may change the characteristics of these groups and outcomes irrespective
of the studied intervention.
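The dieting example can be turned into a small simulation. The numbers are hypothetical: the program is assumed to have no true effect on average, and participants who did not lose weight are assumed to drop out 70% of the time against 10% for the rest.

```python
import random
import statistics

def simulate_diet_trial(n=5_000, seed=4):
    """Return mean kg lost for all enrolled and for completers only."""
    rng = random.Random(seed)
    # Hypothetical outcome: kg lost per participant, true mean effect 0.
    kg_lost = [rng.gauss(0.0, 3.0) for _ in range(n)]

    def drops_out(x):
        # Outcome-dependent attrition: people for whom the diet is not
        # working are far more likely to leave the trial.
        return rng.random() < (0.7 if x <= 0 else 0.1)

    completers = [x for x in kg_lost if not drops_out(x)]
    return statistics.mean(kg_lost), statistics.mean(completers)

all_enrolled_mean, completers_mean = simulate_diet_trial()
print(f"mean kg lost, everyone enrolled: {all_enrolled_mean:.2f}")
print(f"mean kg lost, completers only:   {completers_mean:.2f}")
```

Analyzing only the completers makes an ineffective program look beneficial, because the dropouts removed from the analysis are mostly the participants with poor outcomes.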
Observer selection
Philosopher Nick Bostrom
has argued that data are filtered not only by study design and
measurement, but by the necessary precondition that there has to be
someone doing a study. In situations where the existence of the observer
or the study is correlated with the data, observation selection effects
occur, and anthropic reasoning is required.
An example is the past impact event
record of Earth: if large impacts cause mass extinctions and ecological
disruptions precluding the evolution of intelligent observers for long
periods, no one will observe any evidence of large impacts in the recent
past (since they would have prevented intelligent observers from
evolving). Hence there is a potential bias in the impact record of
Earth. Astronomical existential risks might similarly be underestimated due to selection bias, and an anthropic correction has to be introduced.
Mitigation
In the general case, selection biases cannot be overcome with statistical analysis of existing data alone, though Heckman correction may be used in special cases. An assessment of the degree of selection bias can be made by examining correlations between exogenous (background) variables and a treatment indicator. However, in regression models, it is the correlation between unobserved determinants of the outcome and unobserved determinants of selection into the sample that biases estimates, and this correlation between unobservables cannot be directly assessed by the observed determinants of treatment.
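The diagnostic described above, correlating a background variable with the treatment indicator, can be sketched as follows. The data are simulated and the variable names (`age`, `randomized`, `selected`) are hypothetical; the `pearson` helper is written out by hand so the example needs only the standard library.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

rng = random.Random(5)
age = [rng.uniform(20, 80) for _ in range(2_000)]  # background variable

# Randomized assignment: treatment ignores age.
randomized = [1 if rng.random() < 0.5 else 0 for _ in age]

# Selected assignment: older participants are more likely to be treated.
selected = [1 if rng.random() < (a - 20) / 60 else 0 for a in age]

r_randomized = pearson(age, randomized)
r_selected = pearson(age, selected)
print(f"age vs. treatment, randomized: {r_randomized:+.3f}")
print(f"age vs. treatment, selected:   {r_selected:+.3f}")
```

A clearly non-zero correlation flags selection on an observed variable. A correlation near zero is weaker evidence: as noted above, it says nothing about selection on unobserved determinants.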
Related issues
Selection bias is closely related to:
- publication bias or reporting bias, the distortion produced in community perception or meta-analyses by not publishing uninteresting (usually negative) results, or results which go against the experimenter's prejudices, a sponsor's interests, or community expectations.
- confirmation bias, the general tendency of humans to give more attention to whatever confirms our pre-existing perspective; or specifically in experimental science, the distortion produced by experiments that are designed to seek confirmatory evidence instead of trying to disprove the hypothesis.
- exclusion bias, which results from applying different criteria to cases and controls with regard to eligibility for participation in a study, or from different variables serving as the basis for exclusion.