Old
age is, we know, a gauntlet of chronic illness that almost no one gets
through without some deep unpleasantness. Most people who reach the
upper end of the average human lifespan begin, at some point, to
accumulate diseases. For the most lethal maladies of the elderly — heart
disease and cancer — the relationship between age and disease is
exponential. As we grow older, our risk of contracting a chronic disease
doesn’t just increase—it accelerates.
Michael
Cantor would like to avoid this fate. He’s not a fanatic—not the type
to haunt biohacking subreddits for self-quantification tips or take
dozens of unproven anti-aging treatments on the off chance one will buy
him some yardage. Cantor is a patent lawyer with a prominent practice in
West Hartford, Connecticut, where his wife is the mayor. “I don’t even
like to take aspirin,” he says. “I’m very nervous to do anything with
respect to any other kinds of drugs.” But still, if there were a
reasonable way to stave off death — he’d like to try it.
A
decade ago, on a cycling trip in Bordeaux, Cantor met a man named Nir
Barzilai, and the two became friends. A former Israeli army medic, now
director of the Institute for Aging Research at the Albert Einstein
College of Medicine in New York, Barzilai has become a globetrotting
evangelist for what is known as the “geroscience hypothesis”: the idea
that if you use drugs to target the underlying biological mechanisms
that drive aging, you can also delay, or even altogether prevent, the
cascade of disease that accompanies the end of the typical human life.
For
the past several years, an effort has been underway in the field of
gerontology to get a drug targeting aging approved by the FDA, with
Barzilai leading the charge. The drug in question is no new wonder pill,
but a diabetes medication called metformin: an ordinary, generic,
typically chalky-white pill that costs a few pennies apiece.
Cantor
was already familiar with metformin, but not as an anti-aging remedy;
he had been prescribed it by a local weight-loss clinic. A few years
ago, intrigued by his friend’s work on aging, he broached the subject of
longevity.
“Over
dinner one night, I said to Nir, ‘You know, I’m leaving this
weight-loss clinic, which means they’re going to stop prescribing
metformin. But I think I’d like to stay on it,’” Cantor says. Barzilai
wrote the prescription himself, and Cantor now takes metformin off-label
in hopes that it will grant him a few extra years of life, or at least
of health.
“I’ve
bought into this idea — it’s not just an idea, it’s a fact — that you
don’t really die of old age,” Cantor says. “Most people die of
age-related disease.”
Although
metformin’s age-extending powers are unproven, Cantor doesn’t mind
being a guinea pig. He thinks the drug is subtly improving his health.
“My experience is only positive,” he says. “It’s changed my metabolism.”
In
the world of anti-aging research, there are any number of exciting
potions in the pharmaceutical pipeline with the purported potential to
extend the average human lifespan by several years or more. There are
senolytics, which act like snipers, snuffing out old cells that have
stopped dividing and have begun to secrete destructive inflammatory
cytokines. There’s rapamycin, an immunosuppressant derived from an Easter Island bacterium, which has lab mice living 25 percent longer than their unmedicated peers.
Metformin,
by contrast, is decidedly unsexy. Currently the eighth most-prescribed
drug in the United States, metformin has been plodding along in the
medical world since the 1950s, when French doctors began using it to
help diabetics keep their blood sugar under control. Scientists are now
looking at it in a new light, ever since diabetes researchers observed
that it appears to extend the lifespan of medicated patients slightly
beyond that of ordinary nondiabetics—just enough to notice an effect,
but nothing radical.
Soon,
Barzilai and a team of gerontologists from 14 top aging research
centers will begin a clinical trial to study the effects of metformin on
aging. The trial will be an ambitious six-year, $77 million effort,
involving 3,000 patients and many of the brightest lights in the field
of gerontology. Metformin, Barzilai and his team believe, will be the
first drug ever to be officially approved to treat aging in the 112-year
history of the FDA. Along the way, they hope to achieve nothing short
of radical transformation in the way we think of health care for the
aging.
Metformin
as an old-age remedy is a high-stakes bet, and its supporters are
putting everything on the table. If the clinical trial proves to be a
bust, the effort to develop age-treating drugs will be set back years,
if not longer.
For
all its importance to the emerging science on aging, metformin is a
very boring drug. Even the name of the upcoming clinical trial has all
the inherent hype of a wet blanket: TAME, an acronym that stands for
“Targeting Aging with Metformin.” On its face, it seems unlikely that
such a plain little pharmaceutical would be the focus of such a critical
trial. Since the patent is expired, no drug company stands to make
billions off it, and it’s certainly not going to make anyone live
forever. Even Barzilai, who has essentially staked his career on the
lifespan-extending prospects of metformin, has a hard time mustering a
lot of enthusiasm about the drug itself. “Metformin is basically the
first and the weakest drug that will delay aging,” he says.
But
it is, in fact, metformin’s stodginess that makes it such a good
candidate for the first anti-aging clinical trial. Using drugs to treat
age is a venture into uncharted waters for the FDA; in order to convince
federal regulators to take that step, TAME’s backers had to choose a
drug that held no surprises. A safe drug, a known drug, a cheap drug, a
drug that doctors prescribe to millions of patients every year without a
qualm — that kind of drug could be the tip of the spear for aging
research, opening a new path at the FDA to the approval of more daring
drugs and emboldening pharmaceutical companies to join the hunt for
metformin’s successors.
The
TAME trial has been in the works for years. With about a third of its
funding secured from the nonprofit American Federation of Aging Research
and a major grant in the works from the National Institutes of Health,
the trial is nearly ready to begin. To its main sponsors, some of whom
are approaching their golden years themselves, progress has felt
agonizingly slow. Before pursuing the funding needed to pull this off,
the TAME researchers had to sell federal regulators, and indeed the
wider medical research field, not just on metformin, but on the very
concept of testing a drug for the malady of age.
Scientists
involved with TAME say the main obstacle in their path is that the
regulatory world just hasn’t caught up to the emerging science. In
recent years, they say, science has learned quite a lot about how aging
works and how drugs might be used to target it. The reason we don’t have
a drug for aging yet isn’t that we haven’t come up with any good
candidates. It’s that aging isn’t technically a disease.
“The
FDA, its mandate says very clearly, is to regulate medications and
devices used in the diagnosis and treatment of diseases,” says scientist
George Kuchel, director of the University of Connecticut Center on
Aging and deputy editor of the Journal of the American Geriatrics Society.
If
aging isn’t a disease, the FDA has no regulatory authority to approve
treatment for it. And if the FDA won’t approve treatments,
pharmaceutical companies won’t make them.
Absent
an act of Congress, the FDA’s mandate isn’t likely to change to
accommodate the goals of gerontologists. The TAME project seeks to make
an end run around the whole logical conundrum by seeking approval to
target a “composite of age-related diseases,” Barzilai says.
In
other words: If they can show that by targeting the underlying
biological pathways that make an aging body more susceptible to disease,
metformin can stave off a host of chronic ills, the TAME researchers
can get the FDA’s blessing.
Once
metformin has the official labeling language to back up scientists’
anti-aging claims for the drug, the theory goes, pharmaceutical
companies will be inspired to invest in more daring ventures — and then
the truly miraculous drugs, the ones that could have us living healthily
to 115 and beyond, might begin to flow through the pipeline.
The
prospect of radically altering the human lifespan raises all sorts of
bioethical questions. Will longer-lived people consume too many
resources? Will policy be able to catch up with our changing lifespans?
Will the best medical technology be unevenly distributed, like pretty
much everything else is?
Barzilai has heard the arguments for years. He doesn’t think much of them.
“You know, I get kind of frustrated when people are asking me if it’s ethical that people will be healthier,” he says.
For
a scientist, Barzilai is a good talker. People who’ve spent time with
him in person talk about his puckish sense of humor, his boyish
ebullience; Barzilai “could be the older brother to Mike Myers’ Austin
Powers character,” as author Bill Gifford puts it.
On
the phone, Barzilai is charming and enthusiastic, ever ready with the
elevator pitch. I warn him that I have a provocative question: What is
aging?
“Yeah, it is provocative,” he says. “The true answer will take much longer than your deadline.”
Defining
aging sounds simple, until you have to do it. Pare away numerical age
and the diseases that are the result of the aging process, and what is
left? What is it that a tremulous, bedridden 70-year-old has that a spry
90-year-old doesn’t? Is it a loss of some measurable function? A
molecule you can scan for in the blood? Certain genes turning off or on
over time?
“It’s
like, ‘What is porn?’ Right? You know it when you see it,” Barzilai
says. “We all see what is aging, okay, without understanding what it
is.”
If you’re looking at a person’s age in years, age is literally just a number, Kuchel says.
“It’s basically
the number of times that the earth has turned around the sun during
your life. That’s easy to measure. What’s harder to measure is what we
call physiological aging or biological aging,” he says.
The
TAME trial will look at biomarkers, or measurable indicators that a
process is happening in the body that are known to be associated with
aging. But the focus of the study is broader: TAME mainly seeks to
answer the question of whether metformin will help lengthen life and
stave off disease.
To
dig deeper into what, exactly, metformin is doing in the body,
researchers have been working on a series of smaller, faster studies
designed to home in on how it interacts with pathways known to be
associated with aging. The first of these, a study dubbed the “Metformin
in Longevity Study,” or MILES, was completed and published in 2017.
The
MILES study looked at just 14 patients and ran for only six weeks, but
it gathered a wealth of precious data thanks to the patients themselves,
who had to undergo grueling muscle and fat biopsies. The invasive
protocols of the MILES study have yielded valuable results. Using each
participant as their own control — comparing biomarkers in the same
patient before and after metformin treatment — allowed the MILES
researchers to get more statistical power than they would otherwise with
a cohort this small.
“We
have some idea of what metformin might be doing at the physiological
level, but at the molecular level, we don’t really know what to look at.
And this might be some of the first evidence of what we can see,” says
Ameya Kulkarni, a PhD student in Barzilai’s lab and the main author on
the MILES study. His work on MILES, Kulkarni says, will help the TAME
researchers identify biomarkers to focus on in the broader clinical trial
and tease out exactly how the drug is interacting with the aging
process.
Based
on what scientists have already observed with diabetic and prediabetic
patients taking metformin, Barzilai is confident the drug has a
beneficial effect, however modest, on the biological mechanisms
underlying aging. But it’s still important to do the work. All this
careful evidence building for metformin will serve as a road map for how
to build the case for the next anti-aging drug, and the next, and every
one that follows after.
“Once
the pharmaceuticals are going to come in, we’re going to get many
drugs, combinations of drugs. We can really move the needle
substantially. It will be unbelievable,” he says.
At that, Barzilai pauses to curb his own enthusiasm. “I didn’t like what I just said—it sounded like the president.”
The future gerontologists want is tantalizingly close.
In
2015, when they first began cautiously feeling out FDA regulators about
the prospect of getting a drug approved to treat aging, Barzilai and
fellow believers in the geroscience hypothesis feared that the
regulatory hurdles in their path might be too high. Now, with three
years of negotiations under their belts, the TAME scientists finally
have enough confidence in the agency to proceed.
“The
bottom line is that the FDA has bought into the idea,” Kuchel says.
“That is, these geroscience-guided therapies — they very much consider
that to be within their mandate. That’s kind of changed the whole landscape.”
The
FDA hasn’t been the only obstacle in TAME’s path. Paradoxically, the
study’s sheer ambition has worked against it in the search for federal
funding: TAME involves so many people working at the forefront of
gerontology research that it has been difficult to find peers to
properly peer-review the grant proposals. The first TAME proposal to the
National Institutes of Health was soundly rejected; the group has
submitted a second proposal and is hopeful that the federal agency will
ultimately fund roughly two-thirds of the study.
Many
of the top scientists in gerontology are involved in TAME, and their
colleagues at their home institutions are excluded from reviewing the
study, on grounds of conflict of interest. It’s been a problem for them
in seeking grants, says Jamie Justice, an assistant professor of
gerontology at Wake Forest School of Medicine.
Peer
review is technically anonymous, but by process of elimination, Justice
says, most of the people sitting in judgement on TAME’s funding
prospects aren’t gerontologists; they’re experts in specific diseases.
That’s the exact research approach TAME’s backers are hoping to upend.
“It’s
hard to find peers that aren’t in conflict with the trial and get the
concept,” Justice says, adding that for many reviewers, TAME is
“challenging the way they’ve run their entire career.”
The
silos that separate disease experts from one another permeate the
entire health care system, from academic and pharmaceutical research all
the way down to patient care. Gerontologists hope that studies like
TAME, which focus on health outcomes for a person rather than the
incidence of a specific disease, will begin to help break down those
barriers.
“This
is a huge issue that we confront in geriatric care,” Kuchel says. “How
to provide health care that’s really based on the whole person, rather
than the bits and parts of them.”
Scientists
are often far more cautious about using hyped-up language than the
reporters who write about them are. But the researchers involved with
TAME talk about the study in grandiose terms; they use words like revolutionary and groundbreaking. Some talk of it as a “paradigm shift,” invoking the specter of philosopher Thomas Kuhn.
“We’re
using metformin as a tool to develop a whole new science of how to
study aging in a way that is going to be transformational,” Kuchel says.
Kuchel
is fond of that famous aphorism by 19th-century philosopher Arthur
Schopenhauer: “All truth passes through three stages. First, it is
ridiculed. Second, it is violently opposed. Third, it is accepted as
self-evident.”
The idea that age can be treated with drugs, Kuchel says, is currently somewhere between stages two and three.
The
world may finally be ready for an old-age pill, but the gears of drug
development turn slowly. Too slowly for Barzilai, who’s anxious to get
beyond TAME and get on with the process of real discovery.
“It’s
happening. It’s happening, but it’s frustrating how long it takes,” he
says. “This is possible, but we need to start it everywhere we can and
not let another generation be affected by aging without health.”
Statistical hypothesis testing
A statistical hypothesis, sometimes called confirmatory data analysis, is an hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables. A statistical hypothesis test is a method of statistical inference.
Commonly, two statistical data sets are compared, or a data set
obtained by sampling is compared against a synthetic data set from an
idealized model. A hypothesis is proposed for the statistical
relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis that proposes no relationship between two data sets. The comparison is deemed statistically significant if the relationship between the data sets would be an unlikely realization of the null hypothesis
according to a threshold probability—the significance level. Hypothesis
tests are used in determining what outcomes of a study would lead to a
rejection of the null hypothesis for a pre-specified level of
significance. The process of distinguishing between the null hypothesis
and the alternative hypothesis is aided by identifying two conceptual types of errors, type 1 and type 2, and by specifying parametric limits on e.g. how much type 1 error will be permitted.
An alternative framework for statistical hypothesis testing is to specify a set of statistical models, one for each candidate hypothesis, and then use model selection techniques to choose the most appropriate model. The most common selection techniques are based on either Akaike information criterion or Bayes factor.
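As a minimal sketch of the model-selection alternative — assuming Python with numpy and scipy, and using made-up gamma-distributed data rather than anything from the text — one can fit two candidate models by maximum likelihood and compare their AIC values; the lower AIC is preferred.

```python
# Sketch: model selection by AIC on hypothetical data (not from the text).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=2.0, size=200)   # invented positive-valued data

def aic(log_likelihood, n_params):
    # Akaike information criterion: 2k - 2 ln L.
    return 2 * n_params - 2 * log_likelihood

# Candidate model 1: normal distribution, parameters fit by maximum likelihood.
mu, sigma = data.mean(), data.std()
ll_normal = stats.norm.logpdf(data, loc=mu, scale=sigma).sum()

# Candidate model 2: exponential distribution (MLE of the scale is the mean).
ll_expon = stats.expon.logpdf(data, scale=data.mean()).sum()

print("AIC, normal model:     ", aic(ll_normal, 2))
print("AIC, exponential model:", aic(ll_expon, 1))
# The candidate with the lower AIC would be selected.
```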
Confirmatory data analysis can be contrasted with exploratory data analysis, which may not have pre-specified hypotheses.
Variations and sub-classes
Statistical hypothesis testing is a key technique of both frequentist inference and Bayesian inference,
although the two types of inference have notable differences.
Statistical hypothesis tests define a procedure that controls (fixes)
the probability of incorrectly deciding that a default position (null hypothesis)
is incorrect. The procedure is based on how likely it would be for a
set of observations to occur if the null hypothesis were true. Note that
this probability of making an incorrect decision is not the
probability that the null hypothesis is true, nor whether any specific
alternative hypothesis is true. This contrasts with other possible
techniques of decision theory in which the null and alternative hypothesis are treated on a more equal basis.
One naïve Bayesian approach to hypothesis testing is to base decisions on the posterior probability, but this fails when comparing point and continuous hypotheses. Other approaches to decision making, such as Bayesian decision theory,
attempt to balance the consequences of incorrect decisions across all
possibilities, rather than concentrating on a single null hypothesis. A
number of other approaches to reaching a decision based on data are
available via decision theory and optimal decisions,
some of which have desirable properties. Hypothesis testing, though, is
a dominant approach to data analysis in many fields of science.
Extensions to the theory of hypothesis testing include the study of the power
of tests, i.e. the probability of correctly rejecting the null
hypothesis given that it is false. Such considerations can be used for
the purpose of sample size determination prior to the collection of data.
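As an illustration of using power for sample size determination, here is a minimal sketch (assuming Python with scipy) for a one-sided one-sample z-test; the effect size, significance level, and target power are invented for the example.

```python
# Sketch: power of a one-sided z-test and the smallest sample size reaching
# a target power. All numbers are hypothetical.
import math
from scipy.stats import norm

alpha = 0.05            # significance level
effect = 0.5            # assumed true effect, in standard-deviation units
target_power = 0.80

def power(n):
    z_crit = norm.ppf(1 - alpha)              # critical value of the z statistic
    # Under the alternative, the z statistic is shifted by effect * sqrt(n).
    return norm.sf(z_crit - effect * math.sqrt(n))

n = 1
while power(n) < target_power:
    n += 1
print("n =", n, "gives power", round(power(n), 3))
```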
The testing process
In the statistics literature, statistical hypothesis testing plays a fundamental role. The usual line of reasoning is as follows:
There is an initial research hypothesis of which the truth is unknown;
The first step is to state the relevant null and alternative hypotheses. This is important, as mis-stating the hypotheses will muddy the rest of the process;
The second step is to consider the statistical assumptions being made about the sample in doing the test; for example, assumptions about the statistical independence
or about the form of the distributions of the observations. This is
equally important as invalid assumptions will mean that the results of
the test are invalid;
Decide which test is appropriate, and state the relevant test statistic T;
Derive the distribution of the test statistic under the null
hypothesis from the assumptions. In standard cases this will be a
well-known result. For example, the test statistic might follow a Student's t distribution or a normal distribution;
Select a significance level (α), a probability threshold below which the null hypothesis will be rejected. Common values are 5% and 1%;
The distribution of the test statistic under the null hypothesis partitions the possible values of T into those for which the null hypothesis is rejected—the so-called critical region—and those for which it is not. The probability of the critical region is α;
Compute from the observations the observed value t_obs of the test statistic T;
Decide to either reject the null hypothesis in favor of the
alternative or not reject it. The decision rule is to reject the null
hypothesis H0 if the observed value t_obs is in the critical region, and to accept or "fail to reject" the hypothesis otherwise.
An alternative process is commonly used:
Compute from the observations the observed value t_obs of the test statistic T;
Calculate the p-value.
This is the probability, under the null hypothesis, of sampling a test
statistic at least as extreme as that which was observed;
Reject the null hypothesis, in favor of the alternative hypothesis,
if and only if the p-value is less than the significance level (the
selected probability) threshold.
The two processes are equivalent.
The former process was advantageous in the past when only tables of
test statistics at common probability thresholds were available. It
allowed a decision to be made without the calculation of a probability.
It was adequate for classwork and for operational use, but it was
deficient for reporting results.
The latter process relied on extensive tables or on computational support not always available. The explicit calculation of a
probability is useful for reporting. The calculations are now trivially performed with appropriate software.
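A minimal sketch of the two processes side by side, assuming Python with numpy and scipy and a small invented sample, for a one-sided one-sample t-test of H0: mu = 0; both routes give the same decision.

```python
# Sketch: critical-region decision vs p-value decision on the same made-up data.
import numpy as np
from scipy import stats

x = np.array([0.3, 1.1, 0.8, -0.2, 0.9, 1.4, 0.5, 0.7])   # hypothetical sample
alpha = 0.05
n = len(x)
t_obs = x.mean() / (x.std(ddof=1) / np.sqrt(n))            # one-sample t statistic

# Former process: compare the observed statistic with the critical value.
t_crit = stats.t.ppf(1 - alpha, df=n - 1)
print("reject via critical region:", t_obs > t_crit)

# Latter process: compare the p-value with the significance level.
p_value = stats.t.sf(t_obs, df=n - 1)
print("p-value:", round(p_value, 4), "-> reject:", p_value < alpha)
```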
The difference in the two processes applied to the Radioactive suitcase example (below):
"The Geiger-counter reading is 10. The limit is 9. Check the suitcase;"
"The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase."
The former report is adequate, the latter gives a more detailed
explanation of the data and the reason why the suitcase is being
checked.
It is important to note the difference between accepting the null
hypothesis and simply failing to reject it. The "fail to reject"
terminology highlights the fact that the null hypothesis is assumed to
be true from the start of the test; if there is a lack of evidence
against it, it simply continues to be assumed true. The phrase "accept
the null hypothesis" may suggest it has been proved simply because it
has not been disproved, a logical fallacy known as the argument from ignorance. Unless a test with particularly high power
is used, the idea of "accepting" the null hypothesis may be dangerous. Nonetheless the terminology is prevalent throughout statistics, where
the meaning actually intended is well understood.
The processes described here are perfectly adequate for computation. They seriously neglect the design of experiments considerations.
It is particularly critical that appropriate sample sizes be estimated before conducting the experiment.
The phrase "test of significance" was coined by statistician Ronald Fisher.
Interpretation
The p-value
is the probability that a given result (or a more significant result)
would occur under the null hypothesis. For example, say that a fair coin
is tested for fairness (the null hypothesis). At a significance level of 0.05, the test would be expected to (incorrectly) reject the null hypothesis in about 1 out of every 20 repetitions. The p-value does not provide the probability that either hypothesis is correct (a common source of confusion).
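As a sketch of the coin example in code, assuming Python with scipy 1.7 or later and an invented count of 61 heads in 100 tosses:

```python
# Sketch: two-sided binomial test of H0: p = 0.5 on made-up counts.
from scipy.stats import binomtest

result = binomtest(k=61, n=100, p=0.5, alternative='two-sided')
print("p-value:", result.pvalue)
# A p-value below 0.05 would reject fairness at the 5% level; it is not
# the probability that the coin is fair.
```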
If the p-value is less than the chosen significance
threshold (equivalently, if the observed test statistic is in the
critical region), then we say the null hypothesis is rejected at the
chosen level of significance. Rejection of the null hypothesis is a
conclusion. This is like a "guilty" verdict in a criminal trial: the
evidence is sufficient to reject innocence, thus proving guilt. We might
accept the alternative hypothesis (and the research hypothesis).
If the p-value is not less than the chosen
significance threshold (equivalently, if the observed test statistic is
outside the critical region), then the evidence is insufficient to
support a conclusion. (This is similar to a "not guilty" verdict.) The
researcher typically gives extra consideration to those cases where the p-value is close to the significance level.
Some people find it helpful to think of the hypothesis testing framework as analogous to a mathematical proof by contradiction.
In the Lady tasting tea example (below), Fisher required the Lady
to properly categorize all of the cups of tea to justify the conclusion
that the result was unlikely to result from chance. His test revealed
that if the lady was effectively guessing at random (the null
hypothesis), there was a 1.4% chance that the observed results
(perfectly ordered tea) would occur.
Whether rejection of the null hypothesis truly justifies
acceptance of the research hypothesis depends on the structure of the
hypotheses. Rejecting the hypothesis that a large paw print originated
from a bear does not immediately prove the existence of Bigfoot.
Hypothesis testing emphasizes the rejection, which is based on a
probability, rather than the acceptance, which requires extra steps of
logic.
"The probability of rejecting the null hypothesis is a function
of five factors: whether the test is one- or two tailed, the level of
significance, the standard deviation, the amount of deviation from the
null hypothesis, and the number of observations."
These factors are a source of criticism; factors under the control of
the experimenter/analyst give the results an appearance of subjectivity.
Use and importance
Statistics
are helpful in analyzing most collections of data. This is equally true
of hypothesis testing which can justify conclusions even when no
scientific theory exists. In the Lady tasting tea example, it was
"obvious" that no difference existed between (milk poured into tea) and
(tea poured into milk). The data contradicted the "obvious".
Real world applications of hypothesis testing include:
Testing whether more men than women suffer from nightmares;
Establishing authorship of documents;
Evaluating the effect of the full moon on behavior;
Determining the range at which a bat can detect an insect by echo;
Deciding whether hospital carpeting results in more infections;
Selecting the best means to stop smoking;
Checking whether bumper stickers reflect car owner behavior;
Testing the claims of handwriting analysts.
Statistical hypothesis testing plays an important role in the whole of statistics and in statistical inference.
For example, Lehmann (1992) in a review of the fundamental paper by
Neyman and Pearson (1933) says: "Nevertheless, despite their
shortcomings, the new paradigm formulated in the 1933 paper, and the
many developments carried out within its framework continue to play a
central role in both the theory and practice of statistics and can be
expected to do so in the foreseeable future".
Significance testing has been the favored statistical tool in some experimental social sciences (over 90% of articles in the Journal of Applied Psychology during the early 1990s). Other fields have favored the estimation of parameters (e.g., effect size).
Significance testing is used as a substitute for the traditional
comparison of predicted value and experimental result at the core of the
scientific method.
When theory is only capable of predicting the sign of a relationship, a
directional (one-sided) hypothesis test can be configured so that only a
statistically significant result supports theory. This form of theory
appraisal is the most heavily criticized application of hypothesis
testing.
Cautions
"If
the government required statistical procedures to carry warning labels
like those on drugs, most inference methods would have long labels
indeed." This caution applies to hypothesis tests and alternatives to them.
The successful hypothesis test is associated with a probability and a type-I error rate. The conclusion might be wrong.
The conclusion of the test is only as solid as the sample upon
which it is based. The design of the experiment is critical. A number of
unexpected effects have been observed including:
The clever Hans effect. A horse appeared to be capable of doing simple arithmetic;
The Hawthorne effect. Industrial workers were more productive in better illumination, and most productive in worse;
The placebo effect. Pills with no medically active ingredients were remarkably effective.
A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In forecasting
for example, there is no agreement on a measure of forecast accuracy.
In the absence of a consensus measurement, no decision based on
measurements will be without controversy.
The book How to Lie with Statistics is the most popular book on statistics ever published.
It does not much consider hypothesis
testing, but its cautions are applicable, including: Many claims are
made on the basis of samples too small to convince. If a report does not
mention sample size, be doubtful.
Hypothesis testing acts as a filter of statistical conclusions;
only those results meeting a probability threshold are publishable.
Economics also acts as a publication filter; only those results
favorable to the author and funding source may be submitted for
publication. The impact of filtering on publication is termed publication bias. A related problem is that of multiple testing (sometimes linked to data mining),
in which a variety of tests for a variety of possible effects are
applied to a single data set and only those yielding a significant
result are reported. These are often dealt with by using multiplicity
correction procedures that control the family wise error rate (FWER) or the false discovery rate (FDR).
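A minimal sketch of such corrections, assuming Python with statsmodels available and a made-up list of p-values from several tests on one data set:

```python
# Sketch: family-wise error rate (Bonferroni) and false discovery rate
# (Benjamini-Hochberg) corrections applied to invented p-values.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.020, 0.041, 0.30, 0.62]

reject_fwer, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
reject_fdr, p_bh, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')

print("FWER (Bonferroni) rejections:", list(reject_fwer))
print("FDR (Benjamini-Hochberg) rejections:", list(reject_fdr))
```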
Those making critical decisions based on the results of a
hypothesis test are prudent to look at the details rather than the
conclusion alone. In the physical sciences most results are fully
accepted only when independently confirmed. The general advice
concerning statistics is, "Figures never lie, but liars figure"
(anonymous).
Examples
Human sex ratio
The earliest use of statistical hypothesis testing is generally
credited to the question of whether male and female births are equally
likely (null hypothesis), which was addressed in the 1700s by John Arbuthnot (1710), and later by Pierre-Simon Laplace (1770s).
Arbuthnot examined birth records in London for each of the 82 years from 1629 to 1710, and applied the sign test, a simple non-parametric test.
In every year, the number of males born in London exceeded the number
of females. Considering more male or more female births as equally
likely, the probability of the observed outcome is 0.5^82, or about 1 in 4,836,000,000,000,000,000,000,000; in modern terms, this is the p-value.
This is vanishingly small, leading Arbuthnot to conclude that this was not due to chance, but to divine providence: "From whence it follows, that it is Art, not Chance, that governs." In modern terms, he rejected the null hypothesis of equally likely male and female births at the p = 1/2^82 significance level.
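Arbuthnot's argument maps directly onto a modern one-sided sign test; a sketch, assuming Python with scipy 1.7 or later:

```python
# Sketch: 82 of 82 years with more male births, under H0 of equal likelihood.
from scipy.stats import binomtest

result = binomtest(k=82, n=82, p=0.5, alternative='greater')
print(result.pvalue)       # (1/2)^82, roughly 2.1e-25
```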
Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect.
Lady tasting tea
In a famous example of hypothesis testing, known as the Lady tasting tea, Dr. Muriel Bristol,
a female colleague of Fisher, claimed to be able to tell whether the tea
or the milk was added first to a cup. Fisher proposed to give her eight
cups, four of each variety, in random order. One could then ask what
the probability was for her getting the number she got correct, but just
by chance. The null hypothesis was that the Lady had no such ability.
The test statistic was a simple count of the number of successes in
selecting the 4 cups. The critical region was the single case of 4
successes of 4 possible based on a conventional probability criterion
(< 5%; 1 of 70 ≈ 1.4%). Fisher asserted that no alternative
hypothesis was (ever) required. The lady correctly identified every cup, which would be considered a statistically significant result.
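The 1.4% figure comes from simple counting: only one of the C(8,4) = 70 equally likely ways of picking four cups gets all four right. A sketch, assuming Python with scipy:

```python
# Sketch: probability of identifying all four "milk first" cups by chance.
from math import comb
from scipy.stats import hypergeom

print(1 / comb(8, 4))               # 1/70, about 0.0143

# The same value from the hypergeometric distribution underlying
# Fisher's exact test: 4 successes in 4 draws, 4 of 8 cups being successes.
print(hypergeom.pmf(4, 8, 4, 4))    # also 1/70
```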
Courtroom trial
A statistical test procedure is comparable to a criminal trial;
a defendant is considered not guilty as long as his or her guilt is not
proven. The prosecutor tries to prove the guilt of the defendant. Only
when there is enough evidence for the prosecution is the defendant
convicted.
At the start of the procedure, there are two hypotheses: H0, "the defendant is not guilty", and H1, "the defendant is guilty". The first one, H0, is called the null hypothesis and is for the time being accepted. The second one, H1, is called the alternative hypothesis. It is the alternative hypothesis that one hopes to support.
The hypothesis of innocence is only rejected when an error is
very unlikely, because one doesn't want to convict an innocent
defendant. Such an error is called error of the first kind
(i.e., the conviction of an innocent person), and the occurrence of
this error is controlled to be rare. As a consequence of this asymmetric
behaviour, an error of the second kind (acquitting a person who committed the crime), is more common.
The four possible outcomes, in courtroom terms:
Accept null hypothesis (acquittal): a right decision if the defendant is truly not guilty (H0 true); a wrong decision, a Type II error, if the defendant is truly guilty (H1 true).
Reject null hypothesis (conviction): a wrong decision, a Type I error, if the defendant is truly not guilty (H0 true); a right decision if the defendant is truly guilty (H1 true).
A criminal trial can be regarded as either or both of two decision
processes: guilty vs not guilty or evidence vs a threshold ("beyond a
reasonable doubt"). In one view, the defendant is judged; in the other
view the performance of the prosecution (which bears the burden of
proof) is judged. A hypothesis test can be regarded as either a judgment
of a hypothesis or as a judgment of evidence.
Philosopher's beans
The following example was produced by a philosopher describing scientific methods generations before hypothesis testing was
formalized and popularized.
Few beans of this handful are white.
Most beans in this bag are white.
Therefore: Probably, these beans were taken from another bag.
This is an hypothetical inference.
The beans in the bag are the population. The handful are the sample.
The null hypothesis is that the sample originated from the population.
The criterion for rejecting the null-hypothesis is the "obvious"
difference in appearance (an informal difference in the mean). The
interesting result is that consideration of a real population and a real
sample produced an imaginary bag. The philosopher was considering logic
rather than probability. To be a real statistical hypothesis test, this
example requires the formalities of a probability calculation and a
comparison of that probability to a standard.
A simple generalization of the example considers a mixed bag of
beans and a handful that contain either very few or very many white
beans. The generalization considers both extremes. It requires more
calculations and more comparisons to arrive at a formal answer, but the
core philosophy is unchanged: if the composition of the handful is
greatly different from that of the bag, then the sample probably
originated from another bag. The original example is termed a one-sided
or a one-tailed test while the generalization is termed a two-sided or
two-tailed test.
The statement also relies on the inference that the sampling was
random. If someone had been picking through the bag to find white
beans, then it would explain why the handful had so many white beans,
and also explain why the number of white beans in the bag was depleted
(although the bag is probably intended to be assumed much larger than
one's hand).
Clairvoyant card game
A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called X.
As we try to find evidence of his clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant. The alternative is: the person is (more or less) clairvoyant.
If the null hypothesis is valid, the only thing the test person
can do is guess. For every card, the probability (relative frequency) of
any single suit appearing is 1/4. If the alternative is valid, the test
subject will predict the suit correctly with probability greater than
1/4. We will call the probability of guessing correctly p. The hypotheses, then, are:
null hypothesis: p = 1/4 (just guessing)
and
alternative hypothesis: p > 1/4 (true clairvoyant).
When the test subject correctly predicts all 25 cards, we will
consider him clairvoyant, and reject the null hypothesis. Thus also with
24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no
cause to consider him so. But what about 12 hits, or 17 hits? What is
the critical number, c, of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value c? With the choice c=25 (i.e. we only accept clairvoyance when all cards are predicted correctly) we're more critical than with c=10.
In the first case almost no test subjects will be recognized to be
clairvoyant, in the second case, a certain number will pass the test. In
practice, one decides how critical one will be. That is, one decides
how often one accepts an error of the first kind – a false positive, or Type I error. With c = 25 the probability of such an error is:
P(reject H0 | H0 is valid) = P(X = 25 | p = 1/4) = (1/4)^25 ≈ 10^-15,
and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times.
Being less critical, with c = 10, gives:
P(reject H0 | H0 is valid) = P(X ≥ 10 | p = 1/4) ≈ 0.07.
Thus, c = 10 yields a much greater probability of false positive.
Before the test is actually performed, the maximum acceptable probability of a Type I error (α)
is determined. Typically, values in the range of 1% to 5% are selected.
(If the maximum acceptable error rate is zero, an infinite number of
correct guesses is required.) Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c is calculated thus:
P(reject H0 | H0 is valid) = P(X ≥ c | p = 1/4) ≤ 0.01.
From all the numbers c with this property, we choose the smallest, in order to minimize the probability of a Type II error, a false negative. For the above example, we select c = 13.
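The same calculations in code, assuming Python with scipy; under the null hypothesis X follows a Binomial(25, 1/4) distribution:

```python
# Sketch: false-positive rates for different critical values c, and the
# smallest c whose false-positive rate is at most 1%.
from scipy.stats import binom

n, p0 = 25, 0.25

print(binom.pmf(25, n, p0))    # P(X = 25) ~ 1e-15, the rate for c = 25
print(binom.sf(9, n, p0))      # P(X >= 10) ~ 0.07, the rate for c = 10

c = next(c for c in range(n + 1) if binom.sf(c - 1, n, p0) <= 0.01)
print("smallest c with P(X >= c) <= 0.01:", c)    # 13
```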
Radioactive suitcase
As an example, consider determining whether a suitcase contains some radioactive material. Placed under a Geiger counter,
it produces 10 counts per minute. The null hypothesis is that no
radioactive material is in the suitcase and that all measured counts are
due to ambient radioactivity typical of the surrounding air and
harmless objects. We can then calculate how likely it is that we would
observe 10 counts per minute if the null hypothesis were true. If the
null hypothesis predicts (say) on average 9 counts per minute, then
according to the Poisson distribution typical for radioactive decay
there is about 41% chance of recording 10 or more counts. Thus we can
say that the suitcase is compatible with the null hypothesis (this does
not guarantee that there is no radioactive material, just that we don't
have enough evidence to suggest there is). On the other hand, if the
null hypothesis predicts 3 counts per minute (for which the Poisson
distribution predicts only 0.1% chance of recording 10 or more counts)
then the suitcase is not compatible with the null hypothesis, and there
are likely other factors responsible to produce the measurements.
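The two tail probabilities quoted above can be checked with a Poisson model; a sketch assuming Python with scipy:

```python
# Sketch: chance of reading 10 or more counts under two null means.
from scipy.stats import poisson

print(poisson.sf(9, mu=9))   # P(X >= 10 | mean 9), about 0.41
print(poisson.sf(9, mu=3))   # P(X >= 10 | mean 3), about 0.001
```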
The test does not directly assert the presence of radioactive material. A successful
test asserts that the claim of no radioactive material present is
unlikely given the reading (and therefore ...). The double negative
(disproving the null hypothesis) of the method is confusing, but using a
counter-example to disprove is standard mathematical practice. The
attraction of the method is its practicality. We know (from experience)
the expected range of counts with only ambient radioactivity present, so
we can say that a measurement is unusually large. Statistics
just formalizes the intuitive by using numbers instead of adjectives. We
probably do not know the characteristics of the radioactive suitcases; we just assume that they produce larger readings.
To slightly formalize intuition: radioactivity is suspected if
the Geiger-count with the suitcase is among or exceeds the greatest (5%
or 1%) of the Geiger-counts made with ambient radiation alone. This
makes no assumptions about the distribution of counts. Many ambient
radiation observations are required to obtain good probability estimates
for rare events.
The test described here is more fully the null-hypothesis
statistical significance test. The null hypothesis represents what we
would believe by default, before seeing any evidence. Statistical significance is a possible finding of the test, declared when the observed sample
is unlikely to have occurred by chance if the null hypothesis were
true. The name of the test describes its formulation and its possible
outcome. One characteristic of the test is its crisp decision: to reject
or not reject the null hypothesis. A calculated value is compared to a
threshold, which is determined from the tolerable risk of error.
Definition of terms
The following definitions are mainly based on the exposition in the book by Lehmann and Romano:
Statistical hypothesis
A statement about the parameters describing a population (not a sample).
Statistic
A value calculated from a sample, often to summarize the sample for comparison purposes.
Simple hypothesis
Any hypothesis which specifies the population distribution completely.
Composite hypothesis
Any hypothesis which does not specify the population distribution completely.
Power of a test (1 − β)
The test's probability of correctly rejecting the null hypothesis. The complement of the false negative rate, β. Power is termed sensitivity in biostatistics. ("This is a sensitive test. Because the result is negative, we can confidently say that the patient does not have the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.
Size of a test
For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis. The false positive rate. For composite hypotheses this is the supremum of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate is termed specificity in biostatistics. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.") See sensitivity and specificity and Type I and type II errors for exhaustive definitions.
Significance level of a test (α)
It is the upper bound imposed on the size of a test. Its value is
chosen by the statistician prior to looking at the data or choosing any
particular test to be used. It is the maximum exposure to erroneously
rejecting H0 he/she is ready to accept. Testing H0 at significance level α means testing H0 with a test whose size does not exceed α. In most cases, one uses tests whose size is equal to the significance level.
Test of statistical significance
A predecessor to the statistical hypothesis test (see the History
section). An experimental result was said to be statistically
significant if a sample was sufficiently inconsistent with the (null)
hypothesis. This was variously considered common sense, a pragmatic
heuristic for identifying meaningful experimental results, a convention
establishing a threshold of statistical evidence or a method for drawing
conclusions from data. The statistical hypothesis test added
mathematical rigor and philosophical consistency to the concept by
making the alternative hypothesis explicit. The term is loosely used to
describe the modern version which is now part of statistical hypothesis
testing.
Conservative test
A test is conservative if, when constructed for a given nominal significance level, the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.
Exact test
A test in which the significance level or critical value can be computed exactly, i.e., without any approximation. In some contexts this term is restricted to tests applied to categorical data and to permutation tests, in which computations are carried out by complete enumeration of all possible outcomes and their probabilities.
A statistical hypothesis test compares a test statistic (z or t, for example) to a threshold. The test statistic (see the common test statistics below) is based on optimality. For a fixed level of Type I
error rate, use of these statistics minimizes Type II error rates
(equivalent to maximizing power). The following terms describe tests in
terms of such optimality:
Most powerful test
For a given size or significance level, the test with
the greatest power (probability of rejection) for a given value of the
parameter(s) being tested, contained in the alternative hypothesis.
Uniformly most powerful test (UMP)
A test with the greatest power for all values of the parameter(s) being tested, contained in the alternative hypothesis.
Common test statistics
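For example, two of the most familiar test statistics are the one-sample z statistic, used when the population standard deviation σ is known, and the one-sample t statistic, which replaces σ with the sample standard deviation s and has n − 1 degrees of freedom:

```latex
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}},
\qquad
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \quad (df = n - 1)
```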
History
Early use
While
hypothesis testing was popularized early in the 20th century, early
forms were used in the 1700s. The first use is credited to John Arbuthnot (1710), followed by Pierre-Simon Laplace (1770s), in analyzing the human sex ratio at birth.
Fisher was an agricultural statistician who emphasized rigorous
experimental design and methods to extract a result from few samples
assuming Gaussian distributions. Neyman (who teamed with the younger
Pearson) emphasized mathematical rigor and methods to obtain more
results from many samples and a wider range of distributions. Modern
hypothesis testing is an inconsistent hybrid of the Fisher vs
Neyman/Pearson formulation, methods and terminology developed in the
early 20th century.
Fisher popularized the "significance test". He required a
null-hypothesis (corresponding to a population frequency distribution)
and a sample. His (now familiar) calculations determined whether to
reject the null-hypothesis or not. Significance testing did not utilize
an alternative hypothesis so there was no concept of a Type II error.
The p-value was devised as an informal, but objective, index
meant to help a researcher determine (based on other knowledge) whether
to modify future experiments or strengthen one's faith in the null hypothesis.
Hypothesis testing (and Type I/II errors) was devised by Neyman and
Pearson as a more objective alternative to Fisher's p-value, also meant
to determine researcher behaviour, but without requiring any inductive
inference by the researcher.
Neyman & Pearson considered a different problem (which they
called "hypothesis testing"). They initially considered two simple
hypotheses (both with frequency distributions). They calculated two
probabilities and typically selected the hypothesis associated with the
higher probability (the hypothesis more likely to have generated the
sample). Their method always selected a hypothesis. It also allowed the
calculation of both types of error probabilities.
Fisher and Neyman/Pearson clashed bitterly. Neyman/Pearson
considered their formulation to be an improved generalization of
significance testing. (The defining paper was abstract. Mathematicians have generalized and refined the theory for decades.)
Fisher thought that it was not applicable to scientific research
because often, during the course of the experiment, it is discovered
that the initial assumptions about the null hypothesis are questionable
due to unexpected sources of error. He believed that the use of rigid
reject/accept decisions based on models formulated before data is
collected was incompatible with this common scenario faced by scientists
and attempts to apply this method to scientific research would lead to
mass confusion.
The dispute between Fisher and Neyman–Pearson was waged on
philosophical grounds, characterized by a philosopher as a dispute over
the proper role of models in statistical inference.
Events intervened: Neyman accepted a position in the western
hemisphere, breaking his partnership with Pearson and separating
disputants (who had occupied the same building) by much of the planetary
diameter. World War II provided an intermission in the debate. The
dispute between Fisher and Neyman terminated (unresolved after 27 years)
with Fisher's death in 1962. Neyman wrote a well-regarded eulogy. Some of Neyman's later publications reported p-values and significance levels.
The modern version of hypothesis testing is a hybrid of the two
approaches that resulted from confusion by writers of statistical
textbooks (as predicted by Fisher) beginning in the 1940s. (But signal detection,
for example, still uses the Neyman/Pearson formulation.) Great
conceptual differences and many caveats in addition to those mentioned
above were ignored. Neyman and Pearson provided the stronger
terminology, the more rigorous mathematics and the more consistent
philosophy, but the subject taught today in introductory statistics has
more similarities with Fisher's method than theirs.
This history explains the inconsistent terminology (example: the null
hypothesis is never accepted, but there is a region of acceptance).
Sometime around 1940, in an apparent effort to provide researchers with a "non-controversial" way to have their cake and eat it too, the authors of statistical text books began anonymously combining these two strategies by using the p-value in place of the test statistic (or data) to test against the Neyman–Pearson "significance level". Thus, researchers were encouraged to infer the strength of their data against some null hypothesis using p-values, while also thinking they are retaining the post-data collection objectivity
provided by hypothesis testing. It then became customary for the null
hypothesis, which was originally some realistic research hypothesis, to
be used almost solely as a strawman "nil" hypothesis (one where a treatment has no effect, regardless of the context).
A comparison between Fisherian significance testing and Neyman–Pearson decision theory:

Fisher's null hypothesis testing
1. Set up a statistical null hypothesis. The null need not be a nil hypothesis (i.e., zero difference).
2. Report the exact level of significance (e.g., p = 0.051 or p = 0.049). Do not use a conventional 5% level, and do not talk about accepting or rejecting hypotheses. If the result is "not significant", draw no conclusions and make no decisions, but suspend judgement until further data is available.
3. Use this procedure only if little is known about the problem at hand, and only to draw provisional conclusions in the context of an attempt to understand the experimental situation.

Neyman–Pearson decision theory
1. Set up two statistical hypotheses, H1 and H2, and decide about α, β, and sample size before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis.
2. If the data falls into the rejection region of H1, accept H2; otherwise accept H1. Note that accepting a hypothesis does not mean that you believe in it, but only that you act as if it were true.
3. The usefulness of the procedure is limited among others to situations where you have a disjunction of hypotheses (e.g., either μ1 = 8 or μ2 = 10 is true) and where you can make meaningful cost-benefit trade-offs for choosing alpha and beta.
Early choices of null hypothesis
Paul Meehl has argued that the epistemological
importance of the choice of null hypothesis has gone largely
unacknowledged. When the null hypothesis is predicted by theory, a more
precise experiment will be a more severe test of the underlying theory.
When the null hypothesis defaults to "no difference" or "no effect", a
more precise experiment is a less severe test of the theory that
motivated performing the experiment. An examination of the origins of the latter practice may therefore be useful:
1778: Pierre Laplace compares the birthrates of boys and girls in multiple European cities. He states: "it is natural to conclude that these possibilities are very nearly in the same ratio". Thus Laplace's null hypothesis was that the birthrates of boys and girls should be equal, given "conventional wisdom".
1900: Karl Pearson develops the chi squared test to determine "whether a given form of frequency curve will effectively describe the samples drawn from a given population." Thus the null hypothesis is that a population is described by some distribution predicted by theory. He uses as an example the numbers of fives and sixes in the Weldon dice throw data.
1904: Karl Pearson develops the concept of "contingency" in order to determine whether outcomes are independent
of a given categorical factor. Here the null hypothesis is by default
that two things are unrelated (e.g. scar formation and death rates from
smallpox). The null hypothesis in this case is no longer predicted by theory or conventional wisdom, but is instead the principle of indifference that led Fisher and others to dismiss the use of "inverse probabilities".
Null hypothesis statistical significance testing
An
example of Neyman–Pearson hypothesis testing can be made by a change to
the radioactive suitcase example. If the "suitcase" is actually a
shielded container for the transportation of radioactive material, then a
test might be used to select among three hypotheses: no radioactive
source present, one present, two (all) present. The test could be
required for safety, with actions required in each case. The Neyman–Pearson lemma of hypothesis testing says that a good criterion for the selection of hypotheses is the ratio of their probabilities (a likelihood ratio).
A simple method of solution is to select the hypothesis with the
highest probability for the Geiger counts observed. The typical result
matches intuition: few counts imply no source, many counts imply two
sources and intermediate counts imply one source. Notice also that
usually there are problems with proving a negative. Null hypotheses should be at least falsifiable.
Neyman–Pearson theory can accommodate both prior probabilities and the costs of actions resulting from decisions.
The former allows each test to consider the results of earlier tests
(unlike Fisher's significance tests). The latter allows the
consideration of economic issues (for example) as well as probabilities.
A likelihood ratio remains a good criterion for selecting among
hypotheses.
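A minimal sketch of that selection rule for the three-hypothesis suitcase, assuming Python with scipy and invented Poisson means (9 counts per minute of background, plus 6 per source) and an invented reading of 16:

```python
# Sketch: pick the hypothesis giving the observed count the highest probability.
from scipy.stats import poisson

observed = 16                                                     # hypothetical reading
means = {"no source": 9, "one source": 15, "two sources": 21}     # assumed means

likelihoods = {h: poisson.pmf(observed, mu) for h, mu in means.items()}
print(likelihoods)
print("selected:", max(likelihoods, key=likelihoods.get))
```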
The two forms of hypothesis testing are based on different
problem formulations. The original test is analogous to a true/false
question; the Neyman–Pearson test is more like multiple choice. In the
view of Tukey
the former produces a conclusion on the basis of only strong evidence
while the latter produces a decision on the basis of available evidence.
While the two tests seem quite different both mathematically and
philosophically, later developments lead to the opposite claim. Consider
many tiny radioactive sources. The hypotheses become 0,1,2,3... grains
of radioactive sand. There is little distinction between none or some
radiation (Fisher) and 0 grains of radioactive sand versus all of the
alternatives (Neyman–Pearson). The major Neyman–Pearson paper of 1933
also considered composite hypotheses (ones whose distribution includes
an unknown parameter). An example proved the optimality of the
(Student's) t-test, "there can be no better test for the
hypothesis under consideration" (p 321). Neyman–Pearson theory was
proving the optimality of Fisherian methods from its inception.
Fisher's significance testing has proven a popular, flexible statistical tool in application, though one with little mathematical growth potential. Neyman–Pearson hypothesis testing is claimed as a pillar of
mathematical statistics, creating a new paradigm for the field. It also stimulated new applications in statistical process control, detection theory, decision theory and game theory. Both formulations have been successful, but the successes have been of a different character.
The dispute over formulations is unresolved. Science primarily
uses Fisher's (slightly modified) formulation as taught in introductory
statistics. Statisticians study Neyman–Pearson theory in graduate
school. Mathematicians are proud of uniting the formulations.
Philosophers consider them separately. Learned opinions deem the
formulations variously competitive (Fisher vs Neyman), incompatible or complementary. The dispute has become more complex since Bayesian inference has achieved respectability.
The terminology is inconsistent. Hypothesis testing can mean any
mixture of two formulations that both changed with time. Any discussion
of significance testing vs hypothesis testing is doubly vulnerable to
confusion.
Fisher thought that hypothesis testing was a useful strategy for performing industrial quality control; however, he strongly disagreed that it could be useful for scientists.
Hypothesis testing provides a means of finding test statistics used in significance testing. The concept of power is useful in explaining the consequences of adjusting the significance level and is heavily used in sample size determination. The two methods remain philosophically distinct. They usually (but not always) produce the same mathematical answer. The preferred answer is context dependent.
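To make the role of power in sample size determination concrete, here is a small sketch using statsmodels; the effect size, significance level and target power are arbitrary choices for illustration.

```python
# Sketch of how power enters sample size determination: how many
# observations per group are needed to detect an assumed effect size?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,   # assumed Cohen's d
                                   alpha=0.05,        # significance level
                                   power=0.80,        # desired power
                                   alternative='two-sided')
print(f"required sample size per group: {n_per_group:.1f}")  # about 64

# Tightening alpha or demanding more power drives the required n up:
print(analysis.solve_power(effect_size=0.5, alpha=0.01, power=0.90))
```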
While the existing merger of Fisher and Neyman–Pearson theories has
been heavily criticized, modifying the merger to achieve Bayesian goals
has been considered.
Criticism
Criticism of statistical hypothesis testing fills volumes, citing 300–400 primary references. Much of the criticism can be summarized by the following issues:
The interpretation of a p-value is dependent upon the stopping rule and the definition of multiple comparisons. The former often changes during the course of a study and the latter is unavoidably ambiguous (i.e. "p values depend on both the (data) observed and on the other possible (data) that might have been observed but weren't");
Confusion resulting (in part) from combining the methods of Fisher and Neyman–Pearson which are conceptually distinct;
Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments;
Rigidly requiring statistical significance as a criterion for publication, resulting in publication bias.
Most of the criticism is indirect. Rather than being wrong, statistical
hypothesis testing is misunderstood, overused and misused;
When used to detect whether a difference exists between groups, a paradox arises. As improvements are made to experimental design (e.g., increased precision of measurement and sample size), the test becomes more lenient. Unless one accepts the absurd assumption that all sources of noise in the data cancel out completely, the chance of finding statistical significance in either direction approaches 100% (a simulation sketch after this list illustrates the point);
Layers of philosophical concerns. The probability of statistical significance is a function of decisions made by experimenters/analysts. If the decisions are based on convention they are termed arbitrary or mindless, while those not so based may be termed subjective. To minimize type II errors, large samples are recommended. In psychology practically all null hypotheses are claimed to be false for sufficiently large samples, so "...it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis". "Statistically significant findings are often misleading" in psychology. Statistical significance does not imply practical significance, and correlation does not imply causation. Casting doubt on the null hypothesis is thus far from directly supporting the research hypothesis;
"[I]t does not tell us what we want to know". Lists of dozens of complaints are available.
Critics and supporters are largely in factual agreement regarding the
characteristics of null hypothesis significance testing (NHST): While
it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis.
The continuing controversy concerns the selection of the best
statistical practices for the near-term future given the (often poor)
existing practices. Critics would prefer to ban NHST completely, forcing
a complete departure from those practices, while supporters suggest a
less absolute change.
Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review; medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias; and a journal (the Journal of Articles in Support of the Null Hypothesis) has been created to publish such results exclusively. Textbooks have added some cautions
and increased coverage of the tools necessary to estimate the size of
the sample required to produce significant results. Major organizations
have not abandoned use of significance tests although some have
discussed doing so.
Alternatives
The numerous criticisms of significance testing do not lead to a single alternative. A unifying position of critics is that statistics should not lead to an accept-reject decision regarding a particular hypothesis, but to a probability or to an estimated value with a confidence interval. It is unlikely that the controversy surrounding significance
testing will be resolved in the near future. Its supposed flaws and
unpopularity do not eliminate the need for an objective and transparent
means of reaching conclusions regarding studies that produce statistical
results. Critics have not unified around an alternative. Other forms of
reporting confidence or uncertainty could probably grow in popularity.
One strong critic of significance testing suggested a list of reporting
alternatives:
effect sizes for importance, prediction intervals for confidence,
replications and extensions for replicability, meta-analyses for
generality. None of these suggested alternatives produces a
conclusion/decision. Lehmann said that hypothesis testing theory can be
presented in terms of conclusions/decisions, probabilities, or
confidence intervals. "The distinction between the ... approaches is
largely one of reporting and interpretation."
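As a small sketch of the first two reporting alternatives listed above, the following computes an effect size (Cohen's d) together with an interval conveying uncertainty (here a confidence interval for the mean difference rather than a prediction interval). The data are simulated purely for illustration.

```python
# Report an effect size and an interval instead of a bare accept/reject decision.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, 40)     # hypothetical treatment group
b = rng.normal(9.0, 2.0, 40)      # hypothetical control group

diff = a.mean() - b.mean()
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)   # equal group sizes
cohens_d = diff / pooled_sd

# 95% confidence interval for the raw mean difference (pooled-variance t interval)
se = pooled_sd * np.sqrt(1 / len(a) + 1 / len(b))
df = len(a) + len(b) - 2
lo, hi = stats.t.interval(0.95, df, loc=diff, scale=se)

print(f"mean difference = {diff:.2f}, Cohen's d = {cohens_d:.2f}")
print(f"95% CI for the difference: ({lo:.2f}, {hi:.2f})")
```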
On one "alternative" there is no disagreement: Fisher himself said,
"In relation to the test of significance, we may say that a phenomenon
is experimentally demonstrable when we know how to conduct an experiment
which will rarely fail to give us a statistically significant result."
Cohen, an influential critic of significance testing, concurred, "... don't look for a magic alternative to NHST [null hypothesis significance testing]
... It doesn't exist." "... given the problems of statistical
induction, we must finally rely, as have the older sciences, on
replication." The "alternative" to significance testing is repeated
testing. The easiest way to decrease statistical uncertainty is by
obtaining more data, whether by increased sample size or by repeated
tests. Nickerson claimed to have never seen the publication of a
literally replicated experiment in psychology. An indirect approach to replication is meta-analysis.
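As a sketch of the meta-analytic route to replication mentioned above, the following pools several hypothetical study estimates of the same effect with fixed-effect, inverse-variance weighting.

```python
# Fixed-effect, inverse-variance meta-analysis of invented study results.
import numpy as np

estimates = np.array([0.30, 0.12, 0.25, 0.18])   # effect estimates from four studies
std_errors = np.array([0.15, 0.10, 0.20, 0.08])  # their standard errors

weights = 1.0 / std_errors**2                    # inverse-variance weights
pooled = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled effect = {pooled:.3f} +/- {1.96 * pooled_se:.3f} (95% CI half-width)")
# Pooling replications shrinks the standard error relative to any single study.
```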
Bayesian inference is one proposed alternative to significance testing. (Nickerson cited 10 sources suggesting it, including Rozeboom (1960)). For example, Bayesian parameter estimation can provide rich information about the data from which researchers can draw inferences, while using uncertain priors
that exert only minimal influence on the results when enough data is
available. Psychologist John K. Kruschke has suggested Bayesian estimation as an alternative to the t-test. Alternatively, two competing models/hypotheses can be compared using Bayes factors.
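A minimal sketch of Bayesian parameter estimation with a weak prior, using a conjugate Beta-Binomial model; the data and the flat Beta(1, 1) prior are chosen purely for illustration.

```python
# Bayesian estimation of a proportion with a weak (uncertain) prior.
from scipy.stats import beta

successes, failures = 27, 13          # hypothetical observed outcomes
prior_a, prior_b = 1, 1               # flat Beta(1, 1) prior

posterior = beta(prior_a + successes, prior_b + failures)
lo, hi = posterior.interval(0.95)     # 95% credible interval

print(f"posterior mean = {posterior.mean():.3f}")
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")
# With enough data the likelihood dominates, so the weak prior has little
# influence on the posterior -- the point made in the paragraph above.
```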
Bayesian methods could be criticized for requiring information that is
seldom available in the cases where significance testing is most heavily
used. In the social sciences, neither the prior probabilities nor the probability distribution of the test statistic under the alternative hypothesis is typically available.
Advocates of a Bayesian approach sometimes claim that the goal of a researcher is most often to objectively assess the probability that a hypothesis is true based on the data they have collected. Neither Fisher's significance testing, nor Neyman–Pearson
hypothesis testing can provide this information, and do not claim to.
The probability that a hypothesis is true can only be derived from use of Bayes' theorem, which was unsatisfactory to both the Fisher and Neyman–Pearson camps due to the explicit use of subjectivity in the form of the prior probability. Fisher's strategy was to sidestep this with the p-value (an objective index based on the data alone) followed by inductive inference, while Neyman and Pearson devised their approach of inductive behaviour.
Philosophy
Hypothesis testing and philosophy intersect. Inferential statistics,
which includes hypothesis testing, is applied probability. Both
probability and its application are intertwined with philosophy. Philosopher David Hume wrote, "All knowledge degenerates into probability." Competing practical definitions of probability
reflect philosophical differences. The most common application of
hypothesis testing is in the scientific interpretation of experimental
data, which is naturally studied by the philosophy of science.
Fisher and Neyman opposed the subjectivity of probability. Their
views contributed to the objective definitions. The core of their
historical disagreement was philosophical.
Many of the philosophical criticisms of hypothesis testing are discussed by statisticians in other contexts, particularly correlation does not imply causation and the design of experiments.
Hypothesis testing is of continuing interest to philosophers.
Education
Statistics is increasingly being taught in schools, with hypothesis testing among the elements taught. Many conclusions reported in the popular press (from political opinion polls to medical studies) are based on statistics. Some writers have stated
that statistical analysis of this kind allows for thinking clearly about
problems involving mass data, as well as the effective reporting of
trends and inferences from said data, but caution that writers for a
broad public should have a solid understanding of the field in order to
use the terms and concepts correctly.
An introductory college statistics class places much emphasis on
hypothesis testing – perhaps half of the course. Such fields as
literature and divinity now include findings based on statistical
analysis.
An introductory statistics class teaches hypothesis testing as a
cookbook process. Hypothesis testing is also taught at the postgraduate
level. Statisticians learn how to create good statistical test
procedures (like z, Student's t, F and chi-squared). Statistical hypothesis testing is considered a mature area within statistics, but a limited amount of development continues.
An academic study states that the cookbook method of teaching
introductory statistics leaves no time for history, philosophy or
controversy. Hypothesis testing has been taught as a received, unified method. Surveys showed that graduates of the class were filled with
philosophical misconceptions (on all aspects of statistical inference)
that persisted among instructors. While the problem was addressed more than a decade ago, and calls for educational reform continue, students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing.
Ideas for improving the teaching of hypothesis testing include
encouraging students to search for statistical errors in published
papers, teaching the history of statistics and emphasizing the
controversy in a generally dry subject.