The replication crisis (or replicability crisis or reproducibility crisis) refers to a methodological crisis in science in which scientists have found that the results of many scientific studies are difficult or impossible to replicate/reproduce on subsequent investigation, either by independent researchers or by the original researchers themselves.[1][2] The crisis has long-standing roots; the phrase was coined in the early 2010s[3] as part of a growing awareness of the problem.
Because the reproducibility of experiments is an essential part of the scientific method,[4] the inability to replicate the studies of others has potentially grave consequences for many fields of science in which significant theories are grounded on unreproducible experimental work.
The replication crisis has been particularly widely discussed in the field of psychology (and in particular, social psychology) and in medicine, where a number of efforts have been made to re-investigate classic results, and to attempt to determine both the reliability of the results, and, if found to be unreliable, the reasons for the failure of replication.[5][6]
Scope of the crisis
Overall
According to a 2016 poll of 1,500 scientists reported in the journal Nature, 70% of them had failed to reproduce at least one other scientist's experiment (50% had failed to reproduce one of their own experiments). These figures vary by discipline (share failing to reproduce their own experiments in parentheses):
- chemistry: 90% (60%),
- biology: 80% (60%),
- physics and engineering: 70% (50%),
- medicine: 70% (60%),
- Earth and environment science: 60% (40%).
In medicine
Out of 49 medical studies from 1990–2003 with more than 1000 citations, 45 claimed that the studied therapy was effective. Of these studies, 16% were contradicted by subsequent studies, 16% had found stronger effects than did subsequent studies, 44% were replicated, and 24% remained largely unchallenged.[9] The U.S. Food and Drug Administration found flaws in 10–20% of medical studies in 1977–90.[10] In a paper published in 2012, Glenn Begley, a biotech consultant working at Amgen, and Lee Ellis, at the University of Texas, argued that only 11% of the pre-clinical cancer studies could be replicated.[11][12]
A 2016 article by John Ioannidis, Professor of Medicine and of Health Research and Policy at Stanford University School of Medicine and Professor of Statistics at Stanford University School of Humanities and Sciences, elaborated on "Why Most Clinical Research Is Not Useful".[13] In the article, Ioannidis laid out some of the problems and called for reform, setting out criteria that medical research should meet to be useful again. One example he gave was the need for medicine to be "Patient Centered" (e.g. in the form of the Patient-Centered Outcomes Research Institute) instead of the current practice of mainly serving "the needs of physicians, investigators, or sponsors". Ioannidis has been known for his research focus on science itself since his 2005 paper "Why Most Published Research Findings Are False".[14]
In psychology
Replication failures are not unique to psychology and are found in all fields of science.[15] However, several factors have combined to put psychology at the center of controversy. Much of the focus has been on the area of social psychology,[16] although other areas of psychology such as clinical psychology have also been implicated.
Firstly, questionable research practices (QRPs) have been identified as common in the field.[17] Such practices, while not intentionally fraudulent, involve capitalizing on the gray area of acceptable scientific practices or exploiting flexibility in data collection, analysis, and reporting, often in an effort to obtain a desired outcome. Examples of QRPs include selective reporting or partial publication of data (reporting only some of the study conditions or collected dependent measures in a publication), optional stopping (choosing when to stop data collection, often based on statistical significance of tests), p-value rounding (rounding p-values down to .05 to suggest statistical significance), file drawer effect (nonpublication of data), post-hoc storytelling (framing exploratory analyses as confirmatory analyses), and manipulation of outliers (either removing outliers or leaving outliers in a dataset to cause a statistical test to be significant).[17][18][19][20] A survey of over 2,000 psychologists indicated that a majority of respondents admitted to using at least one QRP.[17] False positive conclusions, often resulting from the pressure to publish or the author's own confirmation bias, are an inherent hazard in the field, requiring a certain degree of skepticism on the part of readers.[21]
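To illustrate why optional stopping is counted among the QRPs, the following sketch simulates experiments with no real effect and compares testing once at a fixed sample size with testing repeatedly and stopping at the first significant result. All specifics here (the function names, a z-test with known unit variance, checks every 10 observations up to 100, and 2,000 simulated experiments) are assumptions chosen for illustration, not a procedure taken from the cited studies.

```python
import math
import random


def p_value_two_sided(sample):
    """Two-sided z-test p-value for mean zero, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2.0))


def false_positive_rate(peek, n_max=100, step=10, trials=2000, seed=0):
    """Fraction of null experiments declared significant at p < 0.05.

    peek=True tests after every `step` observations and stops at the first
    significant result; peek=False tests once, after n_max observations.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Data generated under the null hypothesis: there is no real effect.
        data = [rng.gauss(0.0, 1.0) for _ in range(n_max)]
        checkpoints = range(step, n_max + 1, step) if peek else [n_max]
        if any(p_value_two_sided(data[:n]) < 0.05 for n in checkpoints):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("fixed sample size:", false_positive_rate(peek=False))
    print("optional stopping:", false_positive_rate(peek=True))
```

Because the peeking schedule includes the final fixed-sample test, its false positive rate can only be equal or higher; in simulations of this kind it typically comes out well above the nominal 5%.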
Secondly, psychology, and social psychology in particular, has found itself at the center of several scandals involving outright fraudulent research, most notably the admitted data fabrication by Diederik Stapel,[22] as well as allegations against others. However, most scholars acknowledge that fraud is perhaps the lesser contributor to the replication crisis.
Thirdly, several effects in psychological science were found to be difficult to replicate even before the current replication crisis. For example, the scientific journal Judgment and Decision Making has published several studies over the years that fail to provide support for the unconscious thought theory. Replications appear particularly difficult when research trials are pre-registered and conducted by research groups not highly invested in the theory in question.
These three elements together have resulted in renewed attention to replication, supported by Daniel Kahneman.[23] Scrutiny of many effects has shown that several core beliefs rest on findings that are hard to replicate. A recent special edition of the journal Social Psychology focused on replication studies, and a number of previously held beliefs were found to be difficult to replicate.[24] A 2012 special edition of the journal Perspectives on Psychological Science also focused on issues ranging from publication bias to null-aversion that contribute to the replication crisis in psychology.[25] In 2015, the first open empirical study of reproducibility in psychology, called the Reproducibility Project, was published. Researchers from around the world collaborated to replicate 100 empirical studies from three top psychology journals. Fewer than half of the attempted replications were successful at producing statistically significant results in the expected directions, though most of the attempted replications did produce trends in the expected directions.[26]
Scholar James Coyne has recently written that many research trials and meta-analyses are compromised by poor quality and conflicts of interest that involve both authors and professional advocacy organizations, resulting in many false positives regarding the effectiveness of certain types of psychotherapy.[27]
The replication crisis does not necessarily mean that psychology is unscientific.[28][29][30] Rather, this process is a healthy, if sometimes acrimonious, part of the scientific process in which old ideas or those that cannot withstand careful scrutiny are pruned,[31][32] although this pruning process is not always effective.[33][34] The consequence is that some areas of psychology once considered solid, such as social priming, have come under increased scrutiny due to failed replications.[35] The British Independent newspaper wrote that the results of the reproducibility project show that much of the published research is just "psycho-babble".[36]
Nobel laureate and professor emeritus in psychology Daniel Kahneman argued that the original authors should be involved in the replication effort because the published methods are often too vague.[37] Some other scientists, such as Andrew Wilson, disagree and argue that the methods should be written down in detail. An investigation of replication rates in psychology in 2012 indicated higher success rates in replication studies when there was author overlap with the original authors of a study[38] (91.7% successful replication rates in studies with author overlap compared to 64.6% successful replication rates without author overlap).
Psychology replication rates
A report by the Open Science Collaboration in August 2015, coordinated by Brian Nosek, estimated the reproducibility of 100 studies in psychological science from three high-ranking psychology journals.[39] Overall, 36% of the replications yielded significant findings (p-value below .05) compared to 97% of the original studies that had significant effects. The mean effect size in the replications was approximately half the magnitude of the effects reported in the original studies.
The same paper examined the reproducibility rates and effect sizes by journal (Journal of Personality and Social Psychology [JPSP], Journal of Experimental Psychology: Learning, Memory, and Cognition [JEP:LMC], Psychological Science [PSCI]) and discipline (social psychology, cognitive psychology). Study replication rates were 23% for JPSP, 48% for JEP:LMC, and 38% for PSCI. Studies in the field of cognitive psychology had a higher replication rate (50%) than studies in the field of social psychology (25%).
An analysis of the publication history in the top 100 psychology journals between 1900 and 2012 indicated that approximately 1.6% of all psychology publications were replication attempts.[38] Articles were considered a replication attempt if the term "replication" appeared in the text. A subset of those studies (500 studies) was randomly selected for further examination and yielded a lower replication rate of 1.07% (342 of the 500 studies [68.4%] were actually replications). In the subset of 500 studies, analysis indicated that 78.9% of published replication attempts were successful. The rate of successful replication was significantly higher when at least one author of the original study was part of the replication attempt (91.7% relative to 64.6%).
A disciplinary social dilemma
Highlighting the social structure that discourages replication in psychology, Brian D. Earp and Jim A. C. Everett enumerated five points as to why replication attempts are uncommon:[40][41]
- "Independent, direct replications of others’ findings can be time-consuming for the replicating researcher
- "[Replications] are likely to take energy and resources directly away from other projects that reflect one’s own original thinking
- "[Replications] are generally harder to publish (in large part because they are viewed as being unoriginal)
- "Even if [replications] are published, they are likely to be seen as 'bricklaying' exercises, rather than as major contributions to the field
- "[Replications] bring less recognition and reward, and even basic career security, to their authors"[42]
For these reasons, the authors argued that psychology is facing a disciplinary social dilemma, in which the interests of the discipline are at odds with the interests of the individual researcher.
"Methodological terrorism" controversy
With the replication crisis in psychology earning attention, Princeton University psychologist Susan Fiske drew controversy for calling out critics of psychology.[43][44][45][46] She called these unnamed "adversaries" names such as "methodological terrorist" and "self-appointed data police", and said that criticism of psychology should only be expressed in private or through contacting the journals.[43] Columbia University statistician and political scientist Andrew Gelman, "well-respected among the researchers driving the replication debate", responded to Fiske, saying that she had found herself willing to tolerate the "dead paradigm" of faulty statistics and had refused to retract publications even when errors were pointed out.[43][47] He added that her tenure as editor had been abysmal and that a number of published papers edited by her were found to be based on extremely weak statistics; one of Fiske's own published papers had a major statistical error and "impossible" conclusions.[43]
In marketing
Marketing is another discipline with a "desperate need" for replication.[48] Many famous marketing studies fail to replicate, a notable example being the "too-many-choices" effect, in which a high number of product choices makes a consumer less likely to purchase.[49] In addition to the previously mentioned arguments, replication studies in marketing are needed to examine the applicability of theories and models across countries and cultures, which is especially important because of possible influences of globalization.[50]
In economics
A 2016 study in the journal Science found that two-thirds of 18 experimental studies from two top-tier economics journals (American Economic Review and the Quarterly Journal of Economics) successfully replicated.[51][52] A 2017 study in the Economic Journal suggested that "the majority of the average effects in the empirical economics literature are exaggerated by a factor of at least 2 and at least one-third are exaggerated by a factor of 4 or more".[53]
In sports science
A 2018 study took the field of exercise and sports science to task for insufficient replication studies, limited reporting of null results and trivial results, and insufficient research transparency.[54] Statisticians have criticized sports science for common use of an invalid statistical method called "magnitude-based inference" that has allowed sports scientists to extract spurious results that appear to be meaningful from noisy data.[55]
Causes of the crisis
In a work published in 2015, Glenn Begley and John Ioannidis offer five bullet points summarizing the present predicament:[56]
- Generation of new data/publications at an unprecedented rate.
- Compelling evidence that the majority of these discoveries will not stand the test of time.
- Causes: failure to adhere to good scientific practice & the desperation to publish or perish.
- This is a multifaceted, multistakeholder problem.
- No single party is solely responsible, and no single solution will suffice.
Philosopher and historian of science Jerome R. Ravetz predicted in his 1971 book Scientific Knowledge and Its Social Problems that science, in moving from the little science made of restricted communities of scientists to big science or techno-science, would suffer major problems in its internal system of quality control. Ravetz anticipated that modern science's system of rewarding scientists for research might become dysfunctional (the present 'publish or perish' challenge), creating perverse incentives to publish any findings, however dubious. For Ravetz, quality in science is maintained when there is a community of scholars linked by norms and standards, and a willingness to stand by these.
Historian Philip Mirowski more recently offered a similar diagnosis in his 2011 book Science Mart.[60] 'Mart' is here a reference to the retail giant 'Walmart' and an allusion to the commodification of science. In Mirowski's analysis, when science becomes a commodity traded in a market, its quality collapses. Mirowski argues his case by tracing the decay of science to the decision of major corporations to close their in-house laboratories in order to outsource their work to universities, and subsequently to move their research away from universities to even cheaper contract research organizations (CROs).
The crisis of science's quality control system is affecting the use of science for policy. This is the thesis of a recent work by a group of science and technology studies (STS) scholars, who identify in 'evidence based (or informed) policy' a point of present tension.[61][62] Economist Noah Smith suggests that a factor in the crisis has been the overvaluing of research in academia and undervaluing of teaching ability, especially in fields with few major recent discoveries.[63]
Addressing the replication crisis
Replication has been referred to as "the cornerstone of science".[64][65] Replication studies attempt to evaluate whether published results reflect true findings or false positives. The integrity of scientific findings and reproducibility of research are important as they form the knowledge foundation on which future studies are built.
Tackling publication bias with pre-registration of studies
A recent innovation in scientific publishing to address the replication crisis is the use of registered reports.[66][67] The registered report format requires authors to submit a description of the study methods and analyses prior to data collection. Once the method and analysis plan is vetted through peer review, publication of the findings is provisionally guaranteed, provided the authors follow the proposed protocol. One goal of registered reports is to circumvent the publication bias toward significant findings that can lead to the use of questionable research practices, and to encourage publication of studies with rigorous methods.
The journal Psychological Science has encouraged the preregistration of studies and the reporting of effect sizes and confidence intervals.[68] The editor in chief also noted that the editorial staff will be asking for replication of studies with surprising findings from examinations using small sample sizes before allowing the manuscripts to be published.
Moreover, only a very small proportion of academic journals in psychology and neurosciences explicitly state in their aims and scope or instructions to authors that they welcome submissions of replication studies.[69][70] This does not encourage the reporting, or even the attempting, of replication studies.
Emphasizing replication attempts in teaching
Based on coursework in experimental methods at MIT and Stanford, it has been suggested that methods courses in psychology emphasize replication attempts rather than original studies.[71][72] Such an approach would help students learn scientific methodology and provide numerous independent replications of meaningful scientific findings that would test the replicability of scientific findings. Some have recommended that graduate students should be required to publish a high-quality replication attempt on a topic related to their doctoral research prior to graduation.[41]
Reducing the p-value required for claiming significance of new results
Many publications require a p-value of p < 0.05 to claim statistical significance. The paper "Redefine statistical significance",[73] signed by a large number of scientists and mathematicians, proposes that in "fields where the threshold for defining statistical significance for new discoveries is P < 0.05, we propose a change to P < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields." Their rationale is that "a leading cause of non-reproducibility (is that the) statistical standards of evidence for claiming new discoveries in many fields of science are simply too low. Associating 'statistically significant' findings with P < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural and reporting problems."
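A rough way to see why a stricter threshold could matter is to compute what share of "significant" findings would be false positives under assumed conditions. The sketch below is illustrative only: the function name `false_positive_share`, the prior probability of a real effect (0.1), and the power (0.8) are assumptions chosen for the example, not figures from the paper, and power is held fixed across thresholds for simplicity.

```python
def false_positive_share(alpha, power, prior):
    """Share of significant results that are false positives.

    alpha: significance threshold (false positive rate when the null is true)
    power: probability of a significant result when the effect is real
    prior: assumed fraction of tested hypotheses that are real effects
    """
    false_pos = alpha * (1.0 - prior)  # true nulls that reach significance
    true_pos = power * prior           # real effects that reach significance
    return false_pos / (false_pos + true_pos)


if __name__ == "__main__":
    # Assumed values for illustration only.
    for alpha in (0.05, 0.005):
        share = false_positive_share(alpha, power=0.8, prior=0.1)
        print(f"alpha = {alpha}: {share:.1%} of significant results are false positives")
```

Under these assumed inputs the share falls from about 36% to about 5%. In practice a stricter threshold also lowers power unless sample sizes grow, so the actual gain would depend on study design.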
Addressing the misinterpretation of p-values
Although statisticians are unanimous that the use of p < 0.05 provides weaker evidence than is generally appreciated, there is a lack of unanimity about what should be done about it. Some have advocated that Bayesian methods should replace p-values. This has not happened on a wide scale, partly because it is complicated, and partly because many users distrust the specification of prior distributions in the absence of hard data. A simplified version of the Bayesian argument, based on testing a point null hypothesis, was suggested by Colquhoun (2014, 2017).[74][75] The logical problems of inductive inference were discussed in "The problem with p-values" (2016).[76]
The hazards of reliance on p-values were emphasized by pointing out that even an observation of p = 0.001 was not necessarily strong evidence against the null hypothesis.[75] Despite the fact that the likelihood ratio in favour of the alternative hypothesis over the null is close to 100, if the hypothesis was implausible, with a prior probability of a real effect of 0.1, even the observation of p = 0.001 would have a false positive risk of 8 percent. It would not even reach the 5 percent level.
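The arithmetic behind the 8 percent figure can be sketched with Bayes' rule in odds form. This is a minimal illustration, assuming the likelihood ratio of 100 and the prior probability of 0.1 quoted above; the function name `false_positive_risk` is hypothetical and is not taken from the cited R scripts.

```python
def false_positive_risk(likelihood_ratio, prior):
    """Probability that a 'positive' result is a false positive.

    likelihood_ratio: evidence for the alternative relative to the null
    prior: prior probability that there is a real effect
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds  # Bayes' rule, odds form
    return 1.0 / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Values quoted in the text: likelihood ratio about 100, prior 0.1.
    print(round(false_positive_risk(100, 0.1), 3))
```

With these inputs the risk comes out at roughly 0.08, whereas the same likelihood ratio with a prior of 0.5 would give a risk of about 0.01, which shows how strongly the conclusion depends on the plausibility of the hypothesis.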
It was recommended[75] that the terms "significant" and "non-significant" should not be used. p-values and confidence intervals should still be specified, but they should be accompanied by an indication of the false positive risk. It was suggested that the best way to do this is to calculate the prior probability that would be necessary to believe in order to achieve a false positive risk of, say, 5%. The calculations can be done with R scripts that are provided,[75] or, more simply, with a web calculator.[77] This so-called reverse Bayesian approach, which was suggested by Matthews (2001),[78] is one way to avoid the problem that the prior probability is rarely known.
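The reverse Bayesian calculation described above can be sketched by fixing the target false positive risk and solving Bayes' rule in odds form for the prior. The function name `required_prior` and the likelihood ratio of 100 used in the example are assumptions for illustration; this is not the code from the cited R scripts or web calculator.

```python
def required_prior(likelihood_ratio, target_fpr=0.05):
    """Prior probability of a real effect needed to reach a target
    false positive risk, given the likelihood ratio of the evidence."""
    # Posterior odds of a real effect implied by the target risk.
    posterior_odds = (1.0 - target_fpr) / target_fpr
    # Undo Bayes' rule in odds form to recover the prior odds.
    prior_odds = posterior_odds / likelihood_ratio
    return prior_odds / (1.0 + prior_odds)


if __name__ == "__main__":
    # Assumed likelihood ratio of 100 (illustrative).
    print(round(required_prior(100), 3))
```

With a likelihood ratio of 100, the prior would need to be about 0.16 for the false positive risk to be 5%; a reader can then judge whether a prior that high is plausible for the hypothesis at hand.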
Encouraging use of larger sample sizes
To improve the quality of replications, larger sample sizes than those used in the original study are often needed.[79] Larger sample sizes are needed because estimates of effect sizes in published work are often exaggerated due to publication bias and the large sampling variability associated with small sample sizes in an original study.[80][81][82] Further, using significance thresholds usually leads to inflated effects because, particularly with small sample sizes, only the largest effects will become significant.[83]
Sharing raw data in online repositories
Online repositories where data, protocols, and findings can be stored and evaluated by the public seek to improve the integrity and reproducibility of research. Examples of such repositories include the Open Science Framework, Registry of Research Data Repositories, and Psychfiledrawer.org. Sites like Open Science Framework offer badges for using open science practices in an effort to incentivize scientists. However, there has been concern that those who are most likely to provide their data and code for analyses are the researchers that are likely the most sophisticated.[84] John Ioannidis at Stanford University suggested that "the paradox may arise that the most meticulous and sophisticated and method-savvy and careful researchers may become more susceptible to criticism and reputation attacks by reanalyzers who hunt for errors, no matter how negligible these errors are".[84]
Funding for replication studies
In July 2016, the Netherlands Organisation for Scientific Research made 3 million euros available for replication studies. The funding is for replication based on reanalysis of existing data and replication by collecting and analysing new data. Funding is available in the areas of social sciences, health research and healthcare innovation.[85]
In 2013, the Laura and John Arnold Foundation funded the launch of the Center for Open Science with a $5.25 million grant, and by 2017 it had provided an additional $10 million in funding.[86] It also funded the launch of the Meta-Research Innovation Center at Stanford, run by John Ioannidis and Steven Goodman, to study ways to improve scientific research.[86] It also provided funding for the AllTrials initiative led in part by Ben Goldacre.[86]
Emphasize triangulation, not just replication
Marcus R. Munafò and George Davey Smith argue, in a piece published by Nature, that research should emphasize triangulation, not just replication. They claim that "replication alone will get us only so far (and) might actually make matters worse... We believe that an essential protection against flawed ideas is triangulation. This is the strategic use of multiple approaches to address one question. Each approach has its own unrelated assumptions, strengths and weaknesses. Results that agree across different methodologies are less likely to be artifacts.... Maybe one reason replication has captured so much interest is the often-repeated idea that falsification is at the heart of the scientific enterprise. This idea was popularized by Karl Popper's 1950s maxim that theories can never be proved, only falsified. (Yet) an overemphasis on repeating experiments could provide an unfounded sense of certainty about findings that rely on a single approach.... philosophers of science have moved on since Popper. Better descriptions of how scientists actually work include what epistemologist Peter Lipton called in 1991 'inference to the best explanation' (instead.)"[87]