Invalid science consists of scientific claims based on
experiments that cannot be reproduced or that are contradicted by
experiments that can be reproduced. Recent analyses indicate that the
proportion of retracted claims in the scientific literature is steadily
increasing.
The number of retractions has grown tenfold over the past decade, but
they still make up approximately 0.2% of the 1.4m papers published
annually in scholarly journals.
Science magazine ranked first for the number of articles retracted at 70, just edging out PNAS,
which retracted 69. Thirty-two of Science's retractions were due to
fraud or suspected fraud, and 37 to error. A subsequent "retraction
index" indicated that journals with relatively high impact factors, such
as Science, Nature and Cell, had a higher rate of retractions. Fewer than 0.1% of the more than 25 million papers in PubMed, going back to the 1940s, have been retracted.
The fraction of retracted papers due to scientific misconduct was
estimated at two-thirds, according to studies of 2,047 papers published
since 1977. Misconduct included fraud and plagiarism. Another
one-fifth were retracted because of mistakes, and the rest were pulled
for unknown or other reasons.
A separate study analyzed 432 claims of genetic links for various
health risks that vary between men and women. Only one of these claims
proved to be consistently reproducible. Another meta-review found that
of the 49 most-cited clinical research studies published between 1990
and 2003, more than 40 percent of them were later shown to be either
totally wrong or significantly incorrect.
Biological sciences
In 2012 biotech firm Amgen was able to reproduce just six of 53 important studies in cancer research. Earlier, a group at Bayer,
a drug company, successfully repeated only one-fourth of 67 important
papers. In 2000-10 roughly 80,000 patients took part in clinical trials
based on research that was later retracted because of mistakes or
improprieties.
Paleontology
Nathan Myhrvold
failed repeatedly to replicate the findings of several papers on
dinosaur growth. Dinosaurs added a layer to their bones each year. Tyrannosaurus rex
was thought to have increased in size by more than 700 kg a year, until
Myhrvold showed that this was a factor of 2 too large. In 4 of 12
papers he examined, the original data had been lost. In three, the
statistics were correct, while three had serious errors that invalidated
their conclusions. Two papers mistakenly relied on data from these
three. He discovered that some of the papers' graphs did not reflect the
data. In one case, he found that only four of nine points on the graph
came from data cited in the paper.
Major retractions
Torcetrapib was originally hyped as a drug that could block a protein that converts HDL cholesterol into LDL with the potential to "redefine cardiovascular treatment". One clinical trial showed that the drug could increase HDL and decrease LDL. Two days after Pfizer
announced its plans for the drug, it ended the Phase III clinical trial
due to higher rates of chest pain and heart failure and a 60 percent
increase in overall mortality. Pfizer had invested more than $1 billion
in developing the drug.
An in-depth review of the most highly cited biomarkers (whose
presence is used to infer illness and measure treatment effects)
claimed that 83 percent of supposed correlations became significantly
weaker in subsequent studies. Homocysteine
is an amino acid whose levels correlate with heart disease. However, a
2010 study showed that lowering homocysteine by nearly 30 percent had
no effect on heart attack or stroke.
Priming
Priming
studies claim that decisions can be influenced by apparently irrelevant
events that a subject witnesses just before making a choice. Nobel
Prize-winner Daniel Kahneman alleges that much of this research is poorly founded.
Researchers have been unable to replicate some of the more widely cited
examples. A paper in PLoS ONE
reported that nine separate experiments could not reproduce a study
purporting to show that thinking about a professor before taking an
intelligence test leads to a higher score than imagining a football
hooligan. A further systematic replication involving 40 different labs around the world did not replicate the main finding.
However, this latter systematic replication showed that participants
who did not think there was a relation between thinking about a hooligan
or a professor were significantly more susceptible to the priming
manipulation.
Potential causes
Competition
In the 1950s, when academic research accelerated during the cold war,
the total number of scientists was a few hundred thousand. In the new
century 6m-7m researchers are active. The number of research jobs has
not matched this increase. Every year six new PhDs compete for every
academic post. Replicating other researchers’ results is not perceived
to be valuable. The struggle to compete encourages exaggeration of
findings and biased data selection. A recent survey found that one in
three researchers knows of a colleague who has at least somewhat
distorted their results.
Publication bias
Major
journals reject in excess of 90% of submitted manuscripts and tend to
favor the most dramatic claims. The statistical measures that
researchers use to test their claims allow a fraction of false claims to
appear valid. Invalid claims are more likely to be dramatic (because
they are false). Without replication, such errors are less likely to be
caught.
Conversely, failures to prove a hypothesis are rarely even
offered for publication. “Negative results” now account for only 14% of
published papers, down from 30% in 1990. Knowledge of what is not true
is as important as knowledge of what is true.
Peer review
Peer review
is the primary validation technique employed by scientific
publications. However, a prominent medical journal tested the system and
found major failings. It supplied reviewers with research papers containing
deliberately introduced errors and found that most reviewers failed to
spot the mistakes, even after being told of the tests.
A pseudonymous fabricated paper on the effects of a chemical
derived from lichen on cancer cells was submitted to 304 journals for
peer review. The paper was filled with errors of study design, analysis
and interpretation. 157 lower-rated journals accepted it. Another study
sent an article containing eight deliberate mistakes in study design,
analysis and interpretation to more than 200 of the British Medical Journal’s regular reviewers. On average, they reported fewer than two of the problems.
Peer reviewers typically do not re-analyse data from scratch, checking only that the authors’ analysis is properly conceived.
Statistics
Type I and type II errors
Scientists
divide errors into type I, incorrectly asserting the truth of a
hypothesis (false positive) and type II, rejecting a correct hypothesis
(false negative). Statistical checks assess the probability that data
which seem to support a hypothesis come about simply by chance. If the
probability is less than 5%, the evidence is rated “statistically
significant”. One definitional consequence is a type I error rate of
one in 20.
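The one-in-20 figure can be checked with a quick simulation. The sketch below (in Python, using only the standard library) repeatedly tests a true null hypothesis, modelling each test statistic as a standard normal draw and declaring significance when it exceeds the two-sided 5% cutoff; the observed rejection rate hovers around 5%:

```python
import random

random.seed(42)

def false_positive_rate(trials=10_000, alpha_z=1.959964):
    """Simulate experiments in which the null hypothesis is true.

    Each 'test statistic' is a standard normal draw; we declare
    significance when |z| exceeds the two-sided 5% cutoff (1.96).
    """
    rejections = sum(1 for _ in range(trials)
                     if abs(random.gauss(0, 1)) > alpha_z)
    return rejections / trials

rate = false_positive_rate()
print(f"observed false-positive rate: {rate:.3f}")  # close to 0.05
```

Every rejection here is a type I error by construction, which is what "a type I error rate of one in 20" means.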
Statistical power
In 2005 Stanford epidemiologist
John Ioannidis showed that the idea that only one paper in 20 gives a
false-positive result was incorrect. He claimed, “most published
research findings are probably false.” He found three categories of
problems: insufficient “statistical power” (avoiding type II errors); the unlikeliness of the hypothesis; and publication bias favoring novel claims.
A statistically powerful study can identify factors that have only small
effects on the data. In general, studies that run the experiment more
times on more subjects have greater power. A power of
0.8 means that of ten true hypotheses tested, the effects of two are
missed. Ioannidis found that in neuroscience the typical statistical
power is 0.21; another study found that psychology studies average 0.35.
Unlikeliness is a measure of the degree of surprise in a result.
Scientists prefer surprising results, leading them to test hypotheses
that range from unlikely to very unlikely. Ioannidis claimed that in
epidemiology, some one in ten hypotheses should be true. In exploratory
disciplines like genomics, which rely on examining voluminous data about
genes and proteins, only one in a thousand should prove correct.
In a discipline in which 100 out of 1,000 hypotheses are true,
studies with a power of 0.8 will find 80 and miss 20. Of the 900
incorrect hypotheses, 5% or 45 will be accepted because of type I
errors. Adding the 45 false positives to the 80 true positives gives 125
positive results, of which 36% are specious. Dropping statistical power
to 0.4, optimistic for many fields, would still produce 45 false
positives but only 40 true positives, meaning fewer than half of the
positive results would be true.
Negative results are more reliable. Statistical power of 0.8
produces 875 negative results of which only 20 are false, giving an
accuracy of over 97%. Negative results however account for a minority of
published results, varying by discipline. A study of 4,600 papers found
that the proportion of published negative results dropped from 30% to
14% between 1990 and 2007.
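The arithmetic in the worked example above can be reproduced directly. The following sketch (in Python; the figures 1,000 hypotheses, 100 true, alpha of 0.05 are taken from the text) tallies the four outcomes for a given statistical power:

```python
def tally(n_hypotheses=1000, n_true=100, power=0.8, alpha=0.05):
    """Tally expected outcomes for the worked example in the text."""
    n_false = n_hypotheses - n_true
    true_pos = power * n_true          # correctly detected effects
    false_neg = (1 - power) * n_true   # type II errors (missed effects)
    false_pos = alpha * n_false        # type I errors
    true_neg = n_false - false_pos
    return {
        "specious_share_of_positives": false_pos / (true_pos + false_pos),
        "accuracy_of_negatives": true_neg / (true_neg + false_neg),
    }

print(tally(power=0.8))  # 45 of 125 positives specious (36%); negatives ~97.7% accurate
print(tally(power=0.4))  # false positives (45) now outnumber true positives (40)
```

The same tally confirms the reliability of negative results: at power 0.8, only 20 of the 875 negatives are wrong.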
Subatomic physics sets an acceptable false-positive rate of one in 3.5m (known as the five-sigma standard). However, even this does not provide perfect protection. According to one review, the problem invalidates some three-quarters of machine-learning studies.
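The quoted odds follow directly from the normal tail probability at five standard deviations, as this short Python sketch shows:

```python
import math

def one_sided_tail(sigma):
    """P(Z > sigma) for a standard normal variable Z."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

p = one_sided_tail(5)
print(f"P(Z > 5) = {p:.3e}, i.e. about 1 in {1 / p:,.0f}")  # about 1 in 3.5 million
```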
Statistical significance
Statistical significance is a measure for testing statistical correlation. It was developed by British statistician Ronald Fisher
in the 1920s. It defines a “significant” result as any data point that
would be produced by chance less than 5 (or more stringently, 1) percent
of the time. A significant result is widely seen as an important
indicator that the correlation is not random.
While correlations track the relationship between truly
independent measurements, such as smoking and cancer, they are much less
effective when variables cannot be isolated, a common circumstance in
biological systems. For example, statistics found a high correlation
between lower back pain and abnormalities in spinal discs, although it
was later discovered that serious abnormalities were present in
two-thirds of pain-free patients.
Minimum threshold publishers
Journals
such as PLoS One use a “minimal-threshold” standard, seeking to publish
as much science as possible, rather than to pick out the best work.
Their peer reviewers assess only whether a paper is methodologically
sound. Almost half of their submissions are still rejected on that
basis.
Unpublished research
Only 22% of the clinical trials financed by the National Institutes of Health
(NIH) released summary results within one year of completion, even
though the NIH requires it. Fewer than half published within 30 months; a
third remained unpublished after 51 months.
When other scientists rely on invalid research, they may waste time on
lines of research that are themselves invalid. The failure to report
failures means that researchers waste money and effort exploring blind
alleys already investigated by other scientists.
Fraud
In 21
surveys of academics (mostly in the biomedical sciences but also in
civil engineering, chemistry and economics) carried out between 1987 and
2008, 2% admitted fabricating data, but 28% claimed to know of
colleagues who engaged in questionable research practices.
Lack of access to data and software
Clinical
trials are generally too costly to rerun. Access to trial data is the
only practical approach to reassessment. A campaign to persuade
pharmaceutical firms to make all trial data available won its first
convert in February 2013 when GlaxoSmithKline became the first to agree.
Software used in a trial is generally considered to be
proprietary intellectual property and is not available to replicators,
further complicating matters. Journals that insist on data-sharing tend
not to do the same for software.
Even well-written papers may not include sufficient detail and/or
tacit knowledge (subtle skills and extemporisations not considered
notable) for the replication to succeed. One cause of replication
failure is insufficient control of the protocol, which can cause
disputes between the original and replicating researchers.
Reform
Statistics training
Geneticists
have begun more careful reviews, particularly of the use of statistical
techniques. The effect was to stop a flood of specious results from genome sequencing.
Protocol registration
Registering
research protocols in advance and monitoring them over the course of a
study can prevent researchers from modifying the protocol midstream to
highlight preferred results. Providing raw data for other researchers to
inspect and test can also better hold researchers to account.
Post-publication review
Replacing
peer review with post-publication evaluations can encourage researchers
to think more about the long-term consequences of excessive or
unsubstantiated claims. That system was adopted in physics and
mathematics with good results.
Replication
Few
researchers, especially junior workers, seek opportunities to replicate
others' work, partly to protect relationships with senior researchers.
Reproduction benefits from access to the original study's methods
and data. More than half of 238 biomedical papers published in 84
journals failed to identify all the resources (such as chemical
reagents) necessary to reproduce the results. In 2008 some 60% of
researchers said they would share raw data; by 2013 just 45% did.
Journals have begun to demand that at least some raw data be made
available, although only 143 of 351 randomly selected papers covered by
some data-sharing policy actually complied.
The Reproducibility Initiative is a service allowing life
scientists to pay to have their work validated by an independent lab. In
October 2013 the initiative received funding to review 50 of the
highest-impact cancer findings published between 2010 and 2012. Blog Syn is a website run by graduate students that is dedicated to reproducing chemical reactions reported in papers.
In 2013 replication efforts received greater attention. In May, Nature and related publications introduced an 18-point checklist for life-science authors,
in an effort to ensure that published research can be reproduced.
Expanded "methods" sections and all data were to be available online.
The Centre for Open Science opened as an independent laboratory focused
on replication. The journal Perspectives on Psychological Science
announced a section devoted to replications. Another project announced
plans to replicate 100 studies published in the first three months of
2008 in three leading psychology journals.
Publication bias is a type of bias that occurs in published academic research.
It occurs when the outcome of an experiment or research study
influences the decision whether to publish or otherwise distribute it.
Publishing only results that show a significant finding disturbs the balance of findings and inserts bias in favor of positive results. The study of publication bias is an important topic in metascience.
Studies with significant results can be of the same standard as studies with a null result with respect to quality of execution and design. However, statistically significant results are three times more likely to be published than papers with null results.
A consequence of this is that researchers are unduly motivated to
manipulate their practices to ensure that a statistically significant
result is reported.
Multiple factors contribute to publication bias.
For instance, once a scientific finding is well established, it may
become newsworthy to publish reliable papers that fail to reject the
null hypothesis.
It has been found that the most common reason for non-publication is
simply that investigators decline to submit results, leading to non-response bias.
Factors cited as underlying this effect include investigators assuming
they must have made a mistake, failure to support a known finding, loss
of interest in the topic, or anticipation that others will be
uninterested in the null results.
The nature of these issues and the problems that have been triggered,
have been referred to as the 5 diseases that threaten science, which
include: "significosis, an inordinate focus on statistically significant results; neophilia, an excessive appreciation for novelty; theorrhea, a mania for new theory; arigorium, a deficiency of rigor in theoretical and empirical work; and finally, disjunctivitis, a proclivity to produce large quantities of redundant, trivial, and incoherent works."
Attempts to identify unpublished studies often prove difficult or are unsatisfactory. In an effort to combat this problem, some journals require that studies submitted for publication are pre-registered (registering a study prior to collection of data and analysis) with organizations like the Center for Open Science.
Other proposed strategies to detect and control for publication bias include p-curve analysis and disfavoring small and non-randomised studies because of their demonstrated high susceptibility to error and bias.
Definition
Publication
bias occurs when the publication of research results depends not just
on the quality of the research but also on the hypothesis tested, and
the significance and direction of effects detected.
The subject was first discussed in 1959 by statistician Theodore
Sterling to refer to fields in which "successful" research is more
likely to be published. As a result, "the literature of such a field
consists in substantial part of false conclusions resulting from errors
of the first kind in statistical tests of significance". In the worst case, false conclusions could become canonized as true if the publication rate of negative results is too low.
Publication bias is sometimes called the file-drawer effect, or file-drawer problem.
This term suggests that results not supporting the hypotheses of
researchers often go no further than the researchers' file drawers,
leading to a bias in published research. The term "file drawer problem" was coined by psychologist Robert Rosenthal in 1979.
Positive-results bias, a type of publication bias, occurs when
authors are more likely to submit, or editors are more likely to accept,
positive results than negative or inconclusive results.
Outcome reporting bias occurs when multiple outcomes are measured and
analyzed, but the reporting of these outcomes is dependent on the
strength and direction of their results. A generic term coined to describe
these post-hoc choices is HARKing ("Hypothesizing After the Results are Known").
Evidence
Meta-analysis of stereotype threat on girls' math scores showing asymmetry typical of publication bias. From Flore, P. C., & Wicherts, J. M. (2015)
There is extensive meta-research
on publication bias in the biomedical field. Investigators following
clinical trials from the submission of their protocols to ethics
committees (or regulatory authorities) until the publication of their
results observed that those with positive results are more likely to be
published.
In addition, studies often fail to report negative results when
published, as demonstrated by research comparing study protocols with
published articles.
The presence of publication bias was investigated in meta-analyses. The largest such analysis investigated the presence of publication bias in systematic reviews of medical treatments from the Cochrane Library.
The study showed that statistically significant positive findings are
27% more likely to be included in meta-analyses of efficacy than other
findings. Results showing no evidence of adverse effects have a 78%
greater probability of inclusion in safety studies than statistically
significant results showing adverse effects. Evidence of publication
bias was found in meta-analyses published in prominent medical journals.
Impact on meta-analysis
Where
publication bias is present, published studies are no longer a
representative sample of the available evidence. This bias distorts the
results of meta-analyses and systematic reviews. For example, evidence-based medicine is increasingly reliant on meta-analysis to assess evidence.
Meta-analyses and systematic reviews can account for publication
bias by including evidence from unpublished studies and the grey
literature. The presence of publication bias can also be explored by
constructing a funnel plot
in which the estimate of the reported effect size is plotted against a
measure of precision or sample size. The premise is that the scatter of
points should reflect a funnel shape, indicating that the reporting of
effect sizes is not related to their statistical significance.
However, when small studies are predominately in one direction (usually
the direction of larger effect sizes), asymmetry will ensue and this
may be indicative of publication bias.
Because an inevitable degree of subjectivity exists in the
interpretation of funnel plots, several tests have been proposed for
detecting funnel plot asymmetry.
These are often based on linear regression, and may adopt a
multiplicative or additive dispersion parameter to adjust for the
presence of between-study heterogeneity. Some approaches may even
attempt to compensate for the (potential) presence of publication bias, which is particularly useful to explore the potential impact on meta-analysis results.
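The regression-based tests mentioned above can be sketched in miniature. One common variant (Egger's test) regresses each study's standardized effect (effect divided by its standard error) on its precision (the reciprocal of the standard error); an intercept far from zero suggests funnel-plot asymmetry. The Python below is a simplified, unweighted illustration of the idea rather than a full implementation of any published method, and the function names are illustrative:

```python
def ols(xs, ys):
    """Ordinary least squares on one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def egger_intercept(effects, std_errors):
    """Egger-style regression: standardized effect against precision."""
    precisions = [1 / se for se in std_errors]
    standardized = [e / se for e, se in zip(effects, std_errors)]
    intercept, _slope = ols(precisions, standardized)
    return intercept

# Unbiased toy data: every study estimates the same effect (0.5),
# so standardized effects lie on a line through the origin.
print(egger_intercept([0.5, 0.5, 0.5], [1.0, 0.5, 0.25]))  # ~0.0

# Biased toy data: the small (high-SE) studies report inflated effects,
# as when only their significant results were published.
print(egger_intercept([1.2, 0.8, 0.5], [1.0, 0.5, 0.25]))  # positive, flags asymmetry
```

In practice the intercept would be accompanied by a significance test, and weighted regression is normally used; the sketch only shows why asymmetric small-study effects push the intercept away from zero.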
Compensation examples
Two meta-analyses of the efficacy of reboxetine as an antidepressant
demonstrated attempts to detect publication bias in clinical trials.
Based on positive trial data, reboxetine was originally passed as a
treatment for depression in many countries in Europe and the UK in 2001
(though in practice it is rarely used for this indication). A 2010
meta-analysis concluded that reboxetine was ineffective and that the
preponderance of positive-outcome trials reflected publication bias,
mostly due to trials published by the drug manufacturer Pfizer.
A subsequent meta-analysis published in 2011, based on the original
data, found flaws in the 2010 analyses and suggested that the data
indicated reboxetine was effective in severe depression. Examples of publication bias are given by Ben Goldacre and Peter Wilmshurst.
In the social sciences, a study of published papers exploring the
relationship between corporate social and financial performance found
that "in economics, finance, and accounting journals, the average
correlations were only about half the magnitude of the findings
published in Social Issues Management, Business Ethics, or Business and
Society journals".
One example cited as an instance of publication bias is the refusal of
The Journal of Personality and Social Psychology (the original publisher of
Bem's article) to publish attempted replications of Bem's work claiming
evidence for precognition.
An analysis
comparing studies of gene-disease associations originating in China to
those originating outside China found that those conducted within the
country reported a stronger association and a more statistically
significant result.
Risks
John Ioannidis argues that "claimed research findings may often be simply accurate measures of the prevailing bias."
He lists the following factors as those that make a paper with a
positive result more likely to enter the literature and suppress
negative-result papers:
The studies conducted in a field have small sample sizes.
Publication
bias can be contained through better-powered studies, enhanced research
standards, and careful consideration of true and non-true
relationships.
Better-powered studies refer to large studies that deliver definitive
results or test major concepts and lead to low-bias meta-analysis.
Enhanced research standards such as the pre-registration of protocols,
the registration of data collections and adherence to established
protocols are other techniques. To avoid false-positive results, the
experimenter must consider the chances that they are testing a true or
non-true relationship. This can be undertaken by properly assessing the
false positive report probability based on the statistical power of the
test and reconfirming (whenever ethically acceptable) established findings of prior studies known to have minimal bias.
The World Health Organization
(WHO) agreed that basic information about all clinical trials should be
registered at the study's inception, and that this information should
be publicly accessible through the WHO International Clinical Trials Registry Platform.
Additionally, public availability of complete study protocols,
alongside reports of trials, is becoming more common for studies.
Grok/ˈɡrɒk/ is a neologism coined by American writer Robert A. Heinlein for his 1961 science fiction novel Stranger in a Strange Land. While the Oxford English Dictionary summarizes the meaning of grok
as "to understand intuitively or by empathy, to establish rapport with"
and "to empathize or communicate sympathetically (with); also, to
experience enjoyment",
Heinlein's concept is far more nuanced, with critic Istvan
Csicsery-Ronay Jr. observing that "the book's major theme can be seen as
an extended definition of the term." The concept of grok
garnered significant critical scrutiny in the years after the book's
initial publication. The term and aspects of the underlying concept have
become part of communities such as computer science.
Descriptions of grok in Stranger in a Strange Land
Critic David E. Wright Sr. points out that in the 1991 "uncut" edition of Stranger, the word grok "was used first without any explicit definition on page 22" and continued to be used without being explicitly defined until page 253 (emphasis in original). He notes that this first intensional definition is simply "to drink", but that this is only a metaphor "much as English 'I see' often means the same as 'I understand'". Critics have bridged this absence of explicit definition by citing passages from Stranger that illustrate the term. A selection of these passages follows:
Grok means "to understand",
of course, but Dr. Mahmoud, who might be termed the leading Terran
expert on Martians, explains that it also means, "to drink" and "a
hundred other English words, words which we think of as antithetical
concepts. 'Grok' means all of these. It means 'fear', it means
'love', it means 'hate' – proper hate, for by the Martian 'map' you
cannot hate anything unless you grok it, understand it so thoroughly
that you merge with it and it merges with you – then you can hate it. By
hating yourself. But this implies that you love it, too, and cherish it
and would not have it otherwise. Then you can hate – and (I think) Martian hate is an emotion so black that the nearest human equivalent could only be called mild distaste.
Grok means "identically
equal". The human cliché "This hurts me worse than it does you" has a
distinctly Martian flavor. The Martian seems to know instinctively what
we learned painfully from modern physics, that observer acts with
observed through the process of observation. Grok means to
understand so thoroughly that the observer becomes a part of the
observed – to merge, blend, intermarry, lose identity in group
experience. It means almost everything that we mean by religion,
philosophy, and science and it means as little to us as color does to a
blind man.
The Martian Race had encountered
the people of the fifth planet, grokked them completely, and had taken
action; asteroid ruins were all that remained, save that the Martians
continued to praise and cherish the people they had destroyed.
All that groks is God.
Etymology
Robert A. Heinlein originally coined the term grok in his 1961 novel Stranger in a Strange Land as a Martian
word that could not be defined in Earthling terms, but can be
associated with various literal meanings such as "water", "to drink",
"life", or "to live", and had a much more profound figurative meaning
that is hard for terrestrial culture to understand because of its
assumption of a singular reality.
According to the book, drinking water is a central focus on Mars, where it is scarce.
Martians use the merging of their bodies with water as a simple example
or symbol of how two entities can combine to create a new reality
greater than the sum of its parts. The water becomes part of the
drinker, and the drinker part of the water. Both grok each other.
Things that once had separate realities become entangled in the same
experiences, goals, history, and purpose. Within the book, the statement
of divine immanence verbalized among the main characters, "thou art God", is logically derived from the concept inherent in the term grok.
Heinlein describes Martian words as "guttural" and "jarring".
Martian speech is described as sounding "like a bullfrog fighting a
cat". Accordingly, grok is generally pronounced as a guttural gr terminated by a sharp k with very little or no vowel sound (a narrow IPA transcription might be [ɡɹ̩kʰ]). William Tenn suggests Heinlein in creating the word might have been influenced by Tenn's very similar concept of griggo, earlier introduced in Tenn's story "Venus and the Seven Sexes" (published in 1949). In his later afterword to the story, Tenn says Heinlein considered such influence "very possible".
Adoption and modern usage
In computer programmer culture
Uses of the word in the decades after the 1960s are more concentrated in computer culture, such as a 1984 appearance in InfoWorld:
"There isn't any software! Only different internal states of hardware.
It's all hardware! It's a shame programmers don't grok that better."
The Jargon File, which describes itself as a "Hacker's Dictionary" and has been published under that name three times, puts grok in a programming context:
When you claim to "grok" some
knowledge or technique, you are asserting that you have not merely
learned it in a detached instrumental way but that it has become part of
you, part of your identity. For example, to say that you "know" Lisp
is simply to assert that you can code in it if necessary – but to say
you "grok" Lisp is to claim that you have deeply entered the world-view
and spirit of the language, with the implication that it has transformed
your view of programming. Contrast zen, which is a similar supernatural understanding experienced as a single brief flash.
The entry existed in the very earliest forms of the Jargon File, dating from the early 1980s. A typical tech usage, from the Linux Bible (2005), characterizes the Unix software development philosophy as "one that can make your life a lot simpler once you grok the idea".
The book Perl Best Practices defines grok as understanding a portion of computer code in a profound way. It goes on to suggest that to re-grok
code is to reload the intricacies of that portion of code into one's
memory after some time has passed and all the details of it are no
longer remembered. In that sense, to grok means to load everything into memory for immediate use. It is analogous to the way a processor caches
memory for short term use, but the only implication by this reference
was that it was something a human (or perhaps a Martian) would do.
The main web page for cURL, an open source tool and programming library, describes the function of cURL as "cURL groks URLs".
The book Cyberia covers its use in this subculture extensively:
This is all latter day usage, the
original derivation was from an early text processing utility from so
long ago that no one remembers but, grok was the output when it
understood the file. K&R would remember.
The keystroke logging software used by the NSA for its remote intelligence gathering operations is named GROK.
One of the most powerful parsing filters used in ElasticSearch software's logstash component is named grok.
A reference book by Carey Bunks on the use of the GNU Image Manipulation Program is titled Grokking the GIMP.
In counterculture
Tom Wolfe, in his book The Electric Kool-Aid Acid Test (1968), describes a character's thoughts during an acid trip: "He looks down, two bare legs, a torso rising up at him and like he is just noticing them for the first time... he has never seen any of this flesh before, this stranger. He groks over that..."
In his counterculture Volkswagen repair manual, How to Keep Your Volkswagen Alive: A Manual of Step-by-Step Procedures for the Compleat Idiot (1969), dropout aerospace engineer John Muir instructs prospective used VW buyers to "grok the car" before buying.
A neologism (/niːˈɒlədʒɪzəm/; from Greek νέο- néo-, "new" and λόγος lógos,
"speech, utterance") is a relatively recent or isolated term, word, or
phrase that may be in the process of entering common use, but that has
not yet been fully accepted into mainstream language. Neologisms are often driven by changes in culture and technology. In the process of language formation, neologisms are more mature than protologisms. A word whose development stage is between that of the protologism (freshly coined) and neologism (new word) is a prelogism.
Popular examples of neologisms can be found in science, fiction (notably science fiction), films and television, branding, literature, jargon, cant, linguistics, and popular culture.
Neologisms are often formed by combining existing words (see compound noun and adjective) or by giving words new and unique suffixes or prefixes. Neologisms can also be formed by blending words, for example, "brunch" is a blend of the words "breakfast" and "lunch", or through abbreviation or acronym, by intentionally rhyming with existing words or simply through playing with sounds.
Neologisms can become popular through memetics, through mass media, the Internet, and word of mouth, including academic discourse in many fields renowned for their use of distinctive jargon,
and often become accepted parts of the language. Other times, they
disappear from common use just as readily as they appeared. Whether a
neologism continues as part of the language depends on many factors,
probably the most important of which is acceptance by the public. It is
unusual for a word to gain popularity if it does not clearly resemble
other words.
History and meaning
The term neologism is first attested in English in 1772, borrowed from French néologisme (1734). In an academic sense, there is no professional neologist, because the study of such things (cultural or
ethnic vernacular, for example) is interdisciplinary. Anyone such as a lexicographer or an etymologist
might study neologisms, how their uses span the scope of human
expression, and how, due to science and technology, they spread more
rapidly than ever before.
The term neologism has a broader meaning which also includes "a word which has gained a new meaning". Sometimes, the latter process is called semantic shifting, or semantic extension. Neologisms are distinct from a person's idiolect, one's unique patterns of vocabulary, grammar, and pronunciation.
Neologisms are usually introduced when it is found that a
specific notion is lacking a term, or when the existing vocabulary lacks
detail, or when a speaker is unaware of the existing vocabulary. The law, governmental bodies, and technology have a relatively high frequency of acquiring neologisms.
Another trigger that motivates the coining of a neologism is to
disambiguate a term which may be unclear due to having many meanings.
The title of a book may become a neologism, for instance, Catch-22 (from the title of Joseph Heller's
novel). Alternatively, the author's name may give rise to the
neologism, although the term is sometimes based on only one work of that
author. This includes such words as "Orwellian" (from George Orwell, referring to his dystopian novel Nineteen Eighty-Four) and "Kafkaesque" (from Franz Kafka), which refers to arbitrary, complex bureaucratic systems.
Polari is a cant used by some actors, circus performers, and the gay subculture
to communicate without outsiders understanding. Some Polari terms have
crossed over into mainstream slang, in part through their usage in pop
song lyrics and other works. Examples include: acdc, barney, blag, butch, camp, khazi, cottaging, hoofer, mince, ogle, scarper, slap, strides, tod, and [rough] trade.
Verlan (French pronunciation: [vɛʁlɑ̃]), (verlan is the reverse of the expression "l'envers") is a type of argot in the French language, featuring inversion of syllables in a word, and is common in slang and youth language. It rests on a long French tradition of transposing syllables of individual words to create slang words.[20]:50 Some verlan words, such as meuf ("femme", which means "woman" roughly backwards), have become so commonplace that they have been included in the Petit Larousse. Like any slang, the purpose of verlan
is to create a somewhat secret language that only its speakers can
understand. A word that becomes mainstream defeats that purpose. As a result,
such newly common words are re-verlanised: reversed a second time. The
common meuf became feumeu.
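The core move described above, reversing the order of a word's syllables, can be sketched in a few lines. This is a toy illustration only: real verlan also adjusts spelling and pronunciation, and the syllable split is supplied as input here rather than computed.

```python
def verlanise(syllables):
    """Toy sketch of verlan's core move: reverse the order of a word's
    syllables. Real verlan also respells and rephoneticizes the result,
    which this deliberately ignores."""
    return "".join(reversed(syllables))

# "métro" (mé-tro) becomes the attested verlan form "tromé".
print(verlanise(["mé", "tro"]))        # tromé
# "l'envers" (l'en-vers) inverts to roughly "verlan" itself.
print(verlanise(["l'en", "vers"]))     # versl'en (≈ "verlan" after respelling)
```

The second example shows why the name of the argot is itself a verlanisation of "l'envers" ("the reverse").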
Popular culture
Neologism
development may be spurred, or at least spread, by popular culture.
Examples of pop-culture neologisms include the American Alt-right (2010s), the Canadian portmanteau "Snowmageddon" (2009), the Russian parody "Monstration" (ca. 2004), Santorum (c. 2003).
Neologisms spread mainly through their exposure in mass media. The genericizing of brand names, such as "coke" for Coca-Cola, "kleenex" for Kleenex facial tissue, and "xerox" for Xerox photocopying, all spread through their popular use being enhanced by mass media.
However, in some limited cases, words break out of their original communities and spread through social media. "Doggo-Lingo", a term still below the threshold of a neologism according to Merriam-Webster, is an example of the latter which has specifically spread primarily through Facebook group and Twitter account use.
The suspected origin of this way of referring to dogs stems from a
Facebook group founded in 2008 and gaining popularity in 2014 in
Australia. In Australian English it is common to use diminutives, often ending in –o, which could be where Doggo-Lingo was first used.
The term has grown so that Merriam-Webster has acknowledged its use but
notes the term needs to be found in published, edited work for a longer
period of time before it can be deemed a new word, making it the
perfect example of a neologism.
Translations
Because neologisms originate in one language, translations between languages can be difficult.
In the scientific community, where English is the predominant
language for published research and studies, like-sounding translations
(referred to as 'naturalization') are sometimes used. Alternatively, the English word is used along with a brief explanation of meaning.
Four translation methods are emphasized for translating neologisms: transliteration, transcription, the use of analogues, and calque (loan translation).
When translating from English to other languages, the naturalization method is most often used. The most common way that professional translators translate neologisms is through the Think aloud protocol (TAP), wherein translators find the most appropriate and natural sounding word through speech.
As such, translators can use potential translations in sentences and
test them with different structures and syntax. Accurate translation of English terms for specific purposes into other languages is crucial in various industries and legal systems. Inaccurate translations can lead to 'translation asymmetry' or misunderstandings and miscommunication.
Many technical glossaries of English translations exist to combat this
issue in the medical, judicial, and technological fields.
Other uses
In psychiatry and neuroscience, the term neologism is used to describe words that have meaning only to the person who uses them, independent of their common meaning. This can be seen in schizophrenia,
where a person may replace a word with a nonsensical one of their own
invention, e.g. “I got so angry I picked up a dish and threw it at the
geshinker.” The use of neologisms may also be due to aphasia acquired after brain damage resulting from a stroke or head injury.
The replication crisis (also called the replicability crisis and the reproducibility crisis) is an ongoing methodological crisis in which it has been found that many scientific studies are difficult or impossible to replicate or reproduce. The replication crisis most severely affects the social sciences and medicine. The phrase was coined in the early 2010s as part of a growing awareness of the problem. The replication crisis represents an important body of research in the field of metascience.
Because the reproducibility of experimental results is an essential part of the scientific method,
an inability to replicate the studies of others has potentially grave
consequences for many fields of science in which significant theories
are grounded on unreproducible experimental work. The replication crisis
has been particularly widely discussed in the fields of medicine, where
a number of efforts have been made to re-investigate classic results,
to determine both the reliability of the results and, if found to be
unreliable, the reasons for the failure of replication.
Scope
Overall
A
2016 poll of 1,500 scientists reported that 70% of them had failed to
reproduce at least one other scientist's experiment (50% had failed to
reproduce one of their own experiments).
In 2009, 2% of scientists admitted to falsifying studies at least once
and 14% admitted to personally knowing someone who did. Misconduct was reported more frequently by medical researchers than by others.
In psychology
Several factors have combined to put psychology at the center of the controversy. According to a 2018 survey of 200 meta-analyses, "psychological research is, on average, afflicted with low statistical power". Much of the focus has been on the area of social psychology, although other areas of psychology such as clinical psychology, developmental psychology, and educational research have also been implicated.
Firstly, questionable research practices (QRPs) have been identified as common in the field.
Such practices, while not intentionally fraudulent, involve
capitalizing on the gray area of acceptable scientific practices or
exploiting flexibility in data collection, analysis, and reporting,
often in an effort to obtain a desired outcome. Examples of QRPs include
selective reporting
or partial publication of data (reporting only some of the study
conditions or collected dependent measures in a publication), optional
stopping (choosing when to stop data collection,
often based on statistical significance of tests), post-hoc
storytelling (framing exploratory analyses as confirmatory analyses),
and manipulation of outliers (either removing outliers or leaving outliers in a dataset to cause a statistical test to be significant). A survey of over 2,000 psychologists indicated that a majority of respondents admitted to using at least one QRP.
The publication bias (see Section "Causes" below) leads to an elevated number of false positive results. It is augmented by the pressure to publish as well as the author's own confirmation bias and is an inherent hazard in the field, requiring a certain degree of skepticism on the part of readers.
Secondly, psychology, and social psychology in particular, has found itself at the center of several scandals involving outright fraudulent research, most notably the admitted data fabrication by Diederik Stapel as well as allegations against others. However, most scholars acknowledge that fraud is, perhaps, the lesser contribution to the replication crisis.
Thirdly, several effects in psychological science have been found
to be difficult to replicate even before the current replication
crisis. For example, the scientific journal Judgment and Decision Making has published several studies over the years that fail to provide support for the unconscious thought theory.
Replications appear particularly difficult when research trials are
pre-registered and conducted by research groups not highly invested in
the theory under questioning.
These three elements together have resulted in renewed attention for replication, supported by the psychologist Daniel Kahneman. Scrutiny of many effects has shown that several core beliefs are hard to replicate. A 2014 special edition of the journal Social Psychology focused on replication studies, and a number of previously held beliefs were found to be difficult to replicate. A 2012 special edition of the journal Perspectives on Psychological Science also focused on issues ranging from publication bias to null-aversion that contribute to the replication crisis in psychology. In 2015, the first open empirical study of reproducibility in psychology was published, called the Reproducibility Project.
Researchers from around the world collaborated to replicate 100
empirical studies from three top psychology journals. Fewer than half of
the attempted replications were successful at producing statistically
significant results in the expected directions, though most of the
attempted replications did produce trends in the expected directions.
Many research trials and meta-analyses are compromised by poor quality and conflicts of interest that involve both authors and professional advocacy organizations, resulting in many false positives regarding the effectiveness of certain types of psychotherapy.
Although the British newspaper The Independent wrote that the results of the reproducibility project show that much of the published research is just "psycho-babble", the replication crisis does not necessarily mean that psychology is unscientific.
Rather this process is part of the scientific process in which old
ideas or those that cannot withstand careful scrutiny are pruned, although this pruning process is not always effective. The consequence is that some areas of psychology once considered solid, such as social priming, have come under increased scrutiny due to failed replications.
Nobel laureate and professor emeritus in psychology Daniel Kahneman
argued that the original authors should be involved in the replication
effort because the published methods are often too vague. Others such as Dr. Andrew Wilson disagree and argue that the methods should be written down in detail.
An investigation of replication rates in psychology in 2012 indicated
higher success rates of replication in replication studies when there
was author overlap with the original authors of a study
(91.7% successful replication rates in studies with author overlap
compared to 64.6% successful replication rates without author overlap).
Focus on the replication crisis has led to other renewed efforts in the discipline to re-test important findings.
In response to concerns about publication bias and p-hacking, more than 140 psychology journals have adopted result-blind peer review, in which studies are accepted not on the basis of their findings after the studies are completed, but before the studies are conducted, on the basis of the methodological rigor of their experimental designs and the theoretical justifications for their statistical analysis techniques.
Early analysis of this procedure has estimated that 61 percent of result-blind studies have led to null results, in contrast to an estimated 5 to 20 percent in earlier research.
In addition, large-scale collaborations between researchers working in
multiple labs in different countries and that regularly make their data
openly available for different researchers to assess have become much
more common in the field.
Psychology replication rates
A report by the Open Science Collaboration in August 2015 that was coordinated by Brian Nosek estimated the reproducibility of 100 studies in psychological science from three high-ranking psychology journals. Overall, 36% of the replications yielded significant findings (p value below 0.05) compared to 97% of the original studies that had significant effects. The mean effect size in the replications was approximately half the magnitude of the effects reported in the original studies.
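The consequence of such shrunken effects can be illustrated with a back-of-the-envelope power calculation. The z-values below are hypothetical choices for illustration, not figures from the report: if an original finding had a z-score of about 2.8 and the true effect is only half the original estimate, an equally sized replication reaches significance far less than half the time.

```python
from statistics import NormalDist

norm = NormalDist()          # standard normal distribution
z_crit = 1.96                # two-sided 5% significance threshold
z_orig = 2.8                 # hypothetical z-score of the original finding

# If the true effect is half the original estimate, the expected z-score
# in an equally sized replication is also halved.
z_true = z_orig / 2
power = 1 - norm.cdf(z_crit - z_true)
print(f"chance a replication reaches p < 0.05: {power:.0%}")
```

Under these assumptions the replication succeeds only about 29% of the time, broadly consistent with the low replication rates the project observed.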
An analysis of the publication history in the top 100 psychology
journals between 1900 and 2012 indicated that approximately 1.6% of all
psychology publications were replication attempts.
Articles were considered a replication attempt if the term
"replication" appeared in the text. A subset of those studies (500
studies) was randomly selected for further examination and yielded a
lower replication rate of 1.07% (342 of the 500 studies [68.4%] were
actually replications). In the subset of 500 studies, analysis indicated
that 78.9% of published replication attempts were successful.
A study published in 2018 in Nature Human Behaviour sought to replicate 21 social and behavioral science papers from Nature and Science, finding that only 13 could be successfully replicated. Similarly, in a study conducted under the auspices of the Center for Open Science,
a team of 186 researchers from 60 different laboratories (representing
36 different nationalities from 6 different continents) conducted
replications of 28 classic and contemporary findings in psychology.
The focus of the study was not only on whether or not the findings from
the original papers replicated, but also on the extent to which
findings varied as a function of variations in samples and contexts.
Overall, 14 of the 28 findings failed to replicate despite massive
sample sizes. However, if a finding replicated, it replicated in most
samples, while if a finding was not replicated, it failed to replicate
with little variation across samples and contexts. This evidence is
inconsistent with a popular explanation that failures to replicate in
psychology are likely due to changes in the sample between the original
and replication study.
A disciplinary social dilemma
Highlighting
the social structure that discourages replication in psychology, Brian
D. Earp and Jim A. C. Everett enumerated five points as to why
replication attempts are uncommon:
"Independent, direct replications of others' findings can be time-consuming for the replicating researcher"
"[Replications] are likely to take energy and resources directly
away from other projects that reflect one's own original thinking"
"[Replications] are generally harder to publish (in large part because they are viewed as being unoriginal)"
"Even if [replications] are published, they are likely to be seen as
'bricklaying' exercises, rather than as major contributions to the
field"
"[Replications] bring less recognition and reward, and even basic career security, to their authors"
For these reasons the authors advocated that psychology is facing a
disciplinary social dilemma, where the interests of the discipline are
at odds with the interests of the individual researcher.
"Methodological terrorism" controversy
With the replication crisis of psychology earning attention, Princeton University psychologist Susan Fiske drew controversy for calling out critics of psychology.
She labeled these unidentified "adversaries" with names such as
"methodological terrorist" and "self-appointed data police", and said
that criticism of psychology should only be expressed in private or
through contacting the journals. Columbia University statistician and political scientist Andrew Gelman,
responded to Fiske, saying that she had found herself willing to
tolerate the "dead paradigm" of faulty statistics and had refused to
retract publications even when errors were pointed out.
He added that her tenure as editor had been abysmal and that a number
of published papers edited by her were found to be based on extremely
weak statistics; one of Fiske's own published papers had a major
statistical error and "impossible" conclusions.
In medicine
Out
of 49 medical studies from 1990–2003 with more than 1000 citations, 45
claimed that the studied therapy was effective. Out of these studies,
16% were contradicted by subsequent studies, 16% had found stronger
effects than did subsequent studies, 44% were replicated, and 24%
remained largely unchallenged. The US Food and Drug Administration in 1977–1990 found flaws in 10–20% of medical studies. In a paper published in 2012, Glenn Begley, a biotech consultant working at Amgen, and Lee Ellis, at the University of Texas, found that only 11% of 53 pre-clinical cancer studies could be replicated.
The irreproducible studies had a number of features in common, including
that studies were not performed by investigators blinded to the
experimental versus the control arms, there was a failure to repeat
experiments, a lack of positive and negative controls, failure to show
all the data, inappropriate use of statistical tests and use of reagents
that were not appropriately validated.
A survey on cancer researchers found that half of them had been unable to reproduce a published result.
A similar survey by Nature on 1,576 researchers who took a brief
online questionnaire on reproducibility showed that more than 70% of
researchers have tried and failed to reproduce another scientist's
experiments, and more than half have failed to reproduce their own
experiments. "Although 52% of those surveyed agree there is a
significant 'crisis' of reproducibility, less than 31% think failure to
reproduce published results means the result is probably wrong, and most
say they still trust the published literature."
A 2016 article by John Ioannidis,
Professor of Medicine and of Health Research and Policy at Stanford
University School of Medicine and a Professor of Statistics at Stanford
University School of Humanities and Sciences, elaborated on "Why Most
Clinical Research Is Not Useful".
In the article Ioannidis laid out some of the problems and called for
reform, characterizing certain points for medical research to be useful
again; one example he made was the need for medicine to be "patient
centered" (e.g. in the form of the Patient-Centered Outcomes Research Institute) instead of the current practice to mainly take care of "the needs of physicians, investigators, or sponsors".
In marketing
Marketing is another discipline with a "desperate need" for replication. Many famous marketing studies fail to be repeated upon replication, a notable example being the "too-many-choices" effect, in which a high number of choices of product makes a consumer less likely to purchase.
In addition to the previously mentioned arguments, replication studies
in marketing are needed to examine the applicability of theories and
models across countries and cultures, which is especially important
because of possible influences of globalization.
In economics
A 2016 study in the journal Science found that one-third of 18 experimental studies from two top-tier economics journals (American Economic Review and the Quarterly Journal of Economics) failed to successfully replicate. A 2017 study in the Economic Journal
suggested that "the majority of the average effects in the empirical
economics literature are exaggerated by a factor of at least 2 and at
least one-third are exaggerated by a factor of 4 or more".
In sports science
A 2018 study took the field of exercise and sports science
to task for insufficient replication studies, limited reporting of both
null and trivial results, and insufficient research transparency. Statisticians have criticized sports science for common use of a controversial statistical method called "magnitude-based inference"
which has allowed sports scientists to extract apparently significant
results from noisy data where ordinary hypothesis testing would have
found none.
In water resource management
A 2019 study in Scientific Data suggested that only a small number of articles in water resources and management
journals could be reproduced, while the majority of articles were not
replicable due to data unavailability. The study estimated with 95%
confidence that "results might be reproduced for only 0.6% to 6.8% of
all 1,989 articles".
Political repercussions
In
the US, science's reproducibility crisis has become a topic of
political contention, linked to the attempt to diminish regulations –
e.g. of emissions of pollutants, with the argument that these
regulations are based on non-reproducible science. Previous attempts with the same aim accused studies used by regulators of being non-transparent.
Public awareness and perceptions
Concerns
have been expressed within the scientific community that the general
public may consider science less credible due to failed replications.
Research supporting this concern is sparse, but a nationally
representative survey in Germany showed that more than 75% of Germans
have not heard of replication failures in science.
The study also found that most Germans have positive perceptions of
replication efforts: Only 18% think that non-replicability shows that
science cannot be trusted, while 65% think that replication research
shows that science applies quality control, and 80% agree that errors
and corrections are part of science.
Causes
A major cause of low reproducibility is publication bias and selection bias, caused in turn by the fact that statistically insignificant results are rarely published or discussed in publications on multiple potential effects. Among potential effects that are nonexistent (or tiny), statistical tests show significance (at the usual 5% level) with 5% probability. If a large number of such effects are screened in a chase for significant results, the erroneously significant ones swamp the appropriately found ones, and each such false positive goes on to replicate "successfully" with only 5% probability. A growing proportion of such studies thus progressively lowers the overall replication rate relative to studies of genuinely relevant effects. Erroneously significant results may also come from questionable practices in data analysis such as data dredging (p-hacking), HARKing, and exploiting researcher degrees of freedom.
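The screening argument above can be checked with a short simulation. This is a sketch under simple assumptions (unit-variance data, a two-sided z-test, and purely nonexistent effects): about 5% of null effects come out "significant", and only about 5% of those "discoveries" survive a single replication attempt.

```python
import math
import random

random.seed(42)

def study(true_effect, n=30, z_crit=1.96):
    """One simulated study: n unit-variance observations centered on
    true_effect; returns True if the sample mean is 'significant'
    in a two-sided z-test at the 5% level."""
    mean = sum(random.gauss(true_effect, 1.0) for _ in range(n)) / n
    return abs(mean * math.sqrt(n)) > z_crit

# Screen 10,000 nonexistent effects: roughly 5% are false positives.
discoveries = sum(study(0.0) for _ in range(10_000))
print(f"'significant' null effects: {discoveries} / 10000")

# Each false 'discovery' then replicates with only ~5% probability.
replicated = sum(study(0.0) for _ in range(discoveries))
print(f"successful 'replications':  {replicated} / {discoveries}")
```

When such false positives are mixed in among genuine effects, the aggregate replication rate of the literature falls, which is the mechanism the paragraph describes.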
Several further factors have been identified as contributing to the crisis: the generation of new data and publications at an unprecedented rate; the likelihood that the majority of these discoveries will not stand the test of time; failure to adhere to good scientific practice, combined with the desperation to publish or perish; and the involvement of multiple, varied stakeholders. No single party is solely responsible, and no single solution will suffice.
These issues may lead to the canonization of false facts.
In fact, some predictions of an impending crisis in the quality
control mechanism of science can be traced back several decades,
especially among scholars in science and technology studies (STS). Derek de Solla Price – considered the father of scientometrics – predicted that science could reach 'senility' as a result of its own exponential growth. Some present day literature seems to vindicate this 'overflow' prophecy, lamenting the decay in both attention and quality.
Philosopher and historian of science Jerome R. Ravetz predicted in his 1971 book Scientific Knowledge and Its Social Problems
that science – in its progression from "little" science composed of
isolated communities of researchers, to "big" science or
"techno-science" – would suffer major problems in its internal system of
quality control. Ravetz recognized that the incentive structure for
modern scientists could become dysfunctional, now known as the present
'publish or perish' challenge, creating perverse incentives
to publish any findings, however dubious. According to Ravetz, quality
in science is maintained only when there is a community of scholars
linked by a set of shared norms and standards, all of whom are willing
and able to hold one another accountable.
Historian Philip Mirowski offered a similar diagnosis in his 2011 book Science Mart.
In the title, the word 'Mart' is in reference to the retail giant
'Walmart', used by Mirowski as a metaphor for the commodification of
science. In Mirowski's analysis, the quality of science collapses when
it becomes a commodity being traded in a market. Mirowski argues his
case by tracing the decay of science to the decision of major
corporations to close their in-house laboratories. They outsourced their
work to universities in an effort to reduce costs and increase profits.
The corporations subsequently moved their research away from
universities to an even cheaper option – Contract Research Organizations
(CRO).
The crisis of science's quality control system is affecting the
use of science for policy. This is the thesis of a recent work by a
group of STS scholars, who identify in 'evidence based (or informed)
policy' a point of present tension.
Economist Noah Smith suggests that a factor in the crisis has been the
overvaluing of research in academia and undervaluing of teaching
ability, especially in fields with few major recent discoveries.
Social systems theory, due to the German sociologist Niklas Luhmann, offers another reading of the crisis. According to this theory, each of the systems such as 'economy', 'science', 'religion' and 'media' communicates using its own code: true/false for science, profit/loss for the economy, news/no-news for the media. According to some sociologists, science's mediatization, its commodification and its politicization – as a result of the structural coupling among systems – have led to a confusion of the original system codes. If science's code of true/false is substituted by those of the other systems, such as profit/loss or news/no-news, science's operation enters into an internal crisis.
Response
Replication has been referred to as "the cornerstone of science".
Replication studies attempt to evaluate whether published results
reflect true findings or false positives. The integrity of scientific
findings and reproducibility of research are important as they form the
knowledge foundation on which future studies are built.
Metascience
Metascience is the use of scientific methodology to study science itself. Metascience seeks to increase the quality of scientific research while reducing waste. It is also known as "research on research" and "the science of science", as it uses research methods to study how research
is done and where improvements can be made. Metascience concerns itself
with all fields of research and has been described as "a bird's eye
view of science." In the words of John Ioannidis, "Science is the best thing that has happened to human beings ... but we can do it better."
Meta-research continues to be conducted to identify the roots of
the crisis and to address them. Methods of addressing the crisis include
pre-registration of scientific studies and clinical trials as well as the founding of organizations such as CONSORT and the EQUATOR Network
that issue guidelines for methodology and reporting. There are
continuing efforts to reform the system of academic incentives, to
improve the peer review process, to reduce the misuse of statistics, to combat bias in scientific literature, and to increase the overall quality and efficiency of the scientific process.
Tackling publication bias with pre-registration of studies
A recent innovation in scientific publishing to address the replication crisis is the use of registered reports.
The registered report format requires authors to submit a description
of the study methods and analyses prior to data collection. Once the
method and analysis plan is vetted through peer-review, publication of
the findings is provisionally guaranteed, based on whether the authors
follow the proposed protocol. One goal of registered reports is to
circumvent the publication bias
toward significant findings that can lead to implementation of
questionable research practices and to encourage publication of studies
with rigorous methods.
The journal Psychological Science has encouraged the preregistration of studies and the reporting of effect sizes and confidence intervals.
The editor in chief also noted that the editorial staff will be asking
for replication of studies with surprising findings from examinations
using small sample sizes before allowing the manuscripts to be
published.
Moreover, only a very small proportion of academic journals in
psychology and neuroscience explicitly state that they welcome
submissions of replication studies in their aims and scope or
instructions to authors. This does little to encourage the reporting of, or even attempts at, replication studies.
Shift to a complex systems paradigm
It has been argued that research endeavours working within the
conventional linear paradigm necessarily end up in replication
difficulties.
Problems arise if the causal processes in the system under study are
"interaction-dominant" rather than "component-dominant", multiplicative
instead of additive, and if many small non-linear interactions
produce macro-level phenomena that are not reducible to their
micro-level components. In the context of such complex systems,
conventional linear models produce answers that are not reasonable,
because it is not in principle possible to decompose the variance as
suggested by the General Linear Model
(GLM) framework – aiming to reproduce such a result is hence evidently
problematic. The same questions are currently being asked in many fields
of science, where researchers are starting to question assumptions
underlying classical statistical methods.
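This point can be made concrete with a toy simulation (my own sketch, not from the original text): when an outcome is generated multiplicatively from its components, an additive General Linear Model recovers essentially none of the variance, so there is no component-wise decomposition to reproduce.

```python
import numpy as np

# Hypothetical illustration: an "interaction-dominant" toy system in
# which the outcome is the *product* of two independent components.
# A purely additive linear model cannot decompose its variance.
rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 * x2                      # multiplicative, not additive

# Fit the additive GLM  y ~ b0 + b1*x1 + b2*x2  by least squares.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r_squared = 1 - resid.var() / y.var()

print(f"R^2 of additive model: {r_squared:.4f}")  # close to zero
```

Despite the outcome being fully determined by the two components, the additive model's R² is near zero, which is the sense in which variance decomposition fails in interaction-dominant systems.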
Emphasizing replication attempts in teaching
Based on coursework in experimental methods at MIT, Stanford, and the University of Washington,
it has been suggested that methods courses in psychology and other
fields emphasize replication attempts rather than original studies.
Such an approach would help students learn scientific methodology and
provide numerous independent replications of meaningful scientific
findings that would test the replicability of scientific findings. Some
have recommended that graduate students should be required to publish a
high-quality replication attempt on a topic related to their doctoral
research prior to graduation.
Reducing the p-value required for claiming significance of new results
Many publications require a p-value of p < 0.05 to claim statistical significance. The paper "Redefine statistical significance",
signed by a large number of scientists and mathematicians, proposes
that in "fields where the threshold for defining statistical
significance for new discoveries is p < 0.05, we propose a change to p < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields."
Their rationale is that "a leading cause of non-reproducibility
(is that the) statistical standards of evidence for claiming new
discoveries in many fields of science are simply too low. Associating
'statistically significant' findings with p < 0.05 results in a
high rate of false positives even in the absence of other experimental,
procedural and reporting problems."
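That rationale can be illustrated with a small simulation (my own sketch, with arbitrary assumed numbers: 10% of tested hypotheses real, a modest true effect, small groups). When most tested hypotheses are null, a large share of "discoveries" at p < 0.05 are false positives, and the stricter 0.005 threshold cuts that share substantially.

```python
import numpy as np
from scipy import stats

# Assumed for illustration: 10% of hypotheses are real, true effect
# size d = 0.5, two groups of 30 observations per test.
rng = np.random.default_rng(1)
n_tests, n_per_group, true_effect = 5000, 30, 0.5
is_real = rng.random(n_tests) < 0.10

false_pos = {0.05: 0, 0.005: 0}
true_pos = {0.05: 0, 0.005: 0}
for real in is_real:
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_effect if real else 0.0, 1.0, n_per_group)
    p = stats.ttest_ind(a, b).pvalue
    for alpha in (0.05, 0.005):
        if p < alpha:
            if real:
                true_pos[alpha] += 1
            else:
                false_pos[alpha] += 1

fdr = {}
for alpha in (0.05, 0.005):
    fdr[alpha] = false_pos[alpha] / (false_pos[alpha] + true_pos[alpha])
    print(f"alpha={alpha}: share of 'discoveries' that are false = {fdr[alpha]:.2f}")
```

The exact numbers depend on the assumed prior and power, but the qualitative pattern, a high false-discovery share at 0.05 that shrinks at 0.005, is what the proposal's authors point to.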
This call was subsequently criticised by another large group, who
argued that "redefining" the threshold would not fix current problems,
would lead to some new ones, and that in the end, all thresholds needed
to be justified case-by-case instead of following general conventions.
Addressing the misinterpretation of p-values
Although statisticians are unanimous that p < 0.05 provides
weaker evidence than is generally appreciated, there is a
lack of unanimity about what should be done about it. Some have
advocated that Bayesian methods should replace p-values. This has
not happened on a wide scale, partly because it is complicated, and
partly because many users distrust the specification of prior
distributions in the absence of hard data. A simplified version of the
Bayesian argument, based on testing a point null hypothesis was
suggested by Colquhoun (2014, 2017). The logical problems of inductive inference were discussed in "The problem with p-values" (2016).
The hazards of relying on p-values were emphasized by pointing out that even an observation of p = 0.001 is not necessarily strong evidence against the null hypothesis.
Although the likelihood ratio in favour of the alternative
hypothesis over the null is then close to 100, if the hypothesis were
implausible, with a prior probability of a real effect of 0.1, even
the observation of p = 0.001 would carry a false positive risk of 8 percent. It would not even reach the 5 percent level.
It was recommended that the terms "significant" and "non-significant" should not be used. p-values
and confidence intervals should still be specified, but they should be
accompanied by an indication of the false positive risk. It was
suggested that the best way to do this is to calculate the prior
probability that one would need to believe in order to achieve a
false positive risk of, say, 5%. The calculations can be done with R scripts that are provided, or, more simply, with a web calculator. This so-called reverse Bayesian approach, suggested by Matthews (2001), is one way to avoid the problem that the prior probability is rarely known.
Encouraging larger sample sizes
To improve the quality of replications, larger sample sizes than those used in the original study are often needed. Larger sample sizes are needed because estimates of effect sizes
in published work are often exaggerated due to publication bias and
large sampling variability associated with small sample sizes in an
original study. Further, using significance thresholds
usually leads to inflated effect estimates because, particularly with small
sample sizes, only the largest effects will reach significance.
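This "winner's curse" can be demonstrated with a short simulation (my own illustration, with assumed parameters: true effect d = 0.3, 20 observations per group): if only significant results are "published", the surviving effect-size estimates systematically overshoot the truth, which is why a faithful replication needs a larger sample than the original study's estimate suggests.

```python
import numpy as np
from scipy import stats

# Assumed for illustration: a modest true effect studied repeatedly
# with small samples, where only significant results get reported.
rng = np.random.default_rng(2)
true_d, n_small = 0.3, 20            # true effect size; per-group n
estimates = []
for _ in range(4000):
    a = rng.normal(0.0, 1.0, n_small)
    b = rng.normal(true_d, 1.0, n_small)
    if stats.ttest_ind(a, b).pvalue < 0.05:          # "published"
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        estimates.append((b.mean() - a.mean()) / pooled_sd)

print(f"true d = {true_d}, mean published estimate = {np.mean(estimates):.2f}")
```

The published estimates cluster well above the true effect, so a replication powered to detect the published effect size would be underpowered for the real one.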
Sharing raw data in online repositories
Online
repositories where data, protocols, and findings can be stored and
evaluated by the public seek to improve the integrity and
reproducibility of research. Examples of such repositories include the Open Science Framework, Registry of Research Data Repositories,
and Psychfiledrawer.org. Sites like Open Science Framework offer badges
for using open science practices in an effort to incentivize
scientists. However, there has been concern that the researchers
most likely to provide their data and code for analysis are also
the most sophisticated.
John Ioannidis at Stanford University suggested that "the paradox may
arise that the most meticulous and sophisticated and method-savvy and
careful researchers may become more susceptible to criticism and
reputation attacks by reanalyzers who hunt for errors, no matter how
negligible these errors are".
Funding for replication studies
In July 2016 the Netherlands Organisation for Scientific Research
made €3 million available for replication studies. The funding is for
replication based on reanalysis of existing data and replication by
collecting and analysing new data. Funding is available in the areas of
social sciences, health research and healthcare innovation.
Marcus R. Munafò and George Davey Smith argue, in a piece published by Nature, that research should emphasize triangulation, not just replication. They claim that:
replication alone will get us only
so far (and) might actually make matters worse ... We believe that an
essential protection against flawed ideas is triangulation. This is the
strategic use of multiple approaches to address one question. Each
approach has its own unrelated assumptions, strengths and weaknesses.
Results that agree across different methodologies are less likely to be artefacts. ...
Maybe one reason replication has captured so much interest is the
often-repeated idea that falsification is at the heart of the scientific
enterprise. This idea was popularized by Karl Popper's 1950s maxim that theories can never be proved, only falsified.
Yet an overemphasis on repeating experiments could provide an unfounded
sense of certainty about findings that rely on a single approach. ...
philosophers of science have moved on since Popper. Better descriptions
of how scientists actually work include what epistemologist Peter Lipton called in 1991 "inference to the best explanation".
Raise the overall standards of methods presentation
Some
authors have argued that the insufficient communication of experimental
methods is a major contributor to the reproducibility crisis and that
improving the quality of how experimental design and statistical
analyses are reported would help improve the situation.
These authors tend to plead both for a broad cultural change in how
the scientific community regards statistics and for a more
coercive push from scientific journals and funding bodies.
Implications for the pharmaceutical industry
Pharmaceutical
companies and venture capitalists maintain research laboratories or
contract with private research service providers (e.g. Envigo
and Smart Assays Biotechnologies) whose job is to replicate academic
studies, in order to test if they are accurate prior to investing or
trying to develop a new drug based on that research. The financial
stakes are high for the company and investors, so it is cost effective
for them to invest in exact replications.
Execution of replication studies consumes resources. Further, doing an
expert replication requires not only generic expertise in research
methodology, but specific expertise in the often narrow topic of
interest. Sometimes research requires specific technical skills and
knowledge, and only researchers dedicated to a narrow area of research
might have those skills. Right now, funding agencies are rarely
interested in bankrolling replication studies, and most scientific
journals are not interested in publishing such results.
Amgen Oncology's cancer researchers were only able to replicate 11
percent of the innovative studies they selected to pursue over a 10-year
period;
a 2011 analysis by researchers with pharmaceutical company Bayer found
that the company's in-house findings agreed with the original results
only a quarter of the time, at the most.
The analysis also revealed that when Bayer scientists were able to
reproduce a result in a direct replication experiment, it tended to
translate well into clinical applications, meaning that reproducibility
is a useful marker of clinical potential.