Is science in the throes of a “reproducibility crisis”? If someone followed the same methods, techniques, and reagents as your experiment, would they have the same results? This describes the core concept of reproducibility, also known as replication. Integrity and trust in scientific research findings depend on research results being reproducible. But for a discipline that prides itself on the “unbiased pursuit of knowledge and truth”, the fact that up to 65% of researchers have tried and failed to reproduce their own research is astonishing, to say the least.
Scientists have tried to tackle this problem from every angle, addressing wide-ranging issues from publication bias and inflated false-positive rates to over-confidence in statistical tests. The extensive news coverage, review articles, and meta-analysis studies addressing this issue have made clear the need for widespread reform – and yet nothing has really changed. The systems and institutions that fund, support and publish scientific research continue to not just turn a blind eye to bad science; in some cases, their systemic flaws actively exacerbate the problem. The way we do science stays the same, and as research builds upon previous research, the problem is spiralling out of control. Not to mention the cost: in the US alone, research that cannot be reproduced is estimated to waste 28 billion USD in research funding every year.
In this blog post, we will consider the causes of this crisis, including a “publish or perish” culture that incentivizes bad practices, p-hacking, HARKing and other methods by which researchers can over-inflate positive results, as well as the lack of transparency, detail and accessibility that makes attempting to reproduce results impossible.
We will highlight cancer research and Alzheimer’s research as two fields which have had huge investments followed by huge failures, and how problems with reproducibility have contributed to these failures. We will also address how the emergence and development of new technologies such as AI and machine learning aren’t necessarily helping the problem; they may in fact be making matters worse. We also give voice to the other perspective, from those who think the “reproducibility crisis” is an exaggeration or a misnomer. We continue by outlining initiatives and projects designed to fix this crisis. Finally, we ask ourselves the question: How much of our current science is built on fundamental concepts and ideas that are flawed?
Publish or perish: A culture of perverse incentives
In academia, in industry, and across all different kinds of institutions, the pressure to publish is immense. Publish to get more funding, publish to secure your job, publish to climb the ranks, and if you want to get published, the work needs to be original, novel, impactful and most importantly, statistically significant.
The “positive-results bias” is a well-documented phenomenon that describes the bias toward positive or statistically significant results that is endemic in science publishing. First described in 1959, this problem has worsened over the past decade: A study analysing over 4500 papers from over 70 countries showed that positive support for a hypothesis has increased by around 6% every year. According to another study, the proportion of positive results in published literature is close to 85%, despite the fact that the average statistical power is low, estimated to be 8-35%. In other words, the average probability that a study will correctly reject the null hypothesis is low – and if there were no publication bias, the proportion of significant results in literature would roughly match the average statistical power of the field itself. This distortion in literature is also dangerous for another reason. It paints a false picture of the field and science in general as a story of constant success – when many a frustrated researcher could tell you that is not the case.
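The gap between low power and an 85% positive literature can be played out in a back-of-the-envelope simulation. In the sketch below, every parameter is an illustrative assumption rather than a figure from the studies cited above: half of tested hypotheses are true, power sits at 25% (within the 8–35% range), and a negative result escapes the file drawer only one time in ten.

```python
import random

random.seed(0)

ALPHA = 0.05         # false-positive rate when the null is true
POWER = 0.25         # chance of detecting a real effect (illustrative)
P_TRUE = 0.5         # assumed share of tested hypotheses that are true
P_PUBLISH_NEG = 0.1  # assumed odds a negative result gets published
N_STUDIES = 100_000

# Simulate which studies come back statistically significant.
significant = []
for _ in range(N_STUDIES):
    effect_real = random.random() < P_TRUE
    significant.append(random.random() < (POWER if effect_real else ALPHA))

# No publication bias: the literature mirrors the studies actually run.
unbiased_share = sum(significant) / N_STUDIES

# File-drawer effect: negatives are published only one time in ten.
published = [s for s in significant if s or random.random() < P_PUBLISH_NEG]
biased_share = sum(published) / len(published)

print(f"positive share without publication bias: {unbiased_share:.2f}")
print(f"positive share with the file drawer:     {biased_share:.2f}")
```

Under these toy assumptions, only around 15% of the studies actually run are positive – roughly matching the field's power, as the paragraph above predicts – yet the published record ends up dominated by positive results.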
Moreover, withholding negative results, which are then relegated to the “file drawer”, never to be seen again, causes many other problems. Think of the wasted resources as labs undertake research in futile directions – that someone has already proven futile – all because that information wasn’t deemed important enough to be made public. One can imagine that many individuals have separately come to the same conclusion many times – and wasted time and money in doing so.
Another responsibility, often overlooked, is a researcher’s ethical duty to research participants. Research participants, who in some cases are suffering from debilitating conditions and taking on considerable risk, participate under the impression that they are contributing to advances in scientific knowledge. If that study happens to find that its result doesn’t reach the p value threshold to be deemed significant, and is consequently relegated to the file drawer, this raises a moral objection. The same goes for animal studies: animals dying for experiments whose results never see the light of day is morally objectionable – and the aforementioned wasted resources take on a darker note.
Lies, damned lies, and statistics: Over-confidence and inflated results
The pressure to publish and difficulties in securing funding means that many researchers cannot afford for their studies to be relegated to the file drawer. As a result, they may engage in certain practices which boost the likelihood of exciting and most importantly, significant results. However, these methods have knock-on effects, causing inflated false-positive results, and diminishing the quality and integrity of scientific research.
“Why most published research findings are false” was the bold (and perhaps deliberately provocative) title of Stanford epidemiologist John Ioannidis’ 2005 essay – and one of the major culprits he identified was “p-hacking”. This was reaffirmed by a 2015 study, which used text-mining technology to show that p-hacking is widespread in scientific research. So, what exactly is “p-hacking”?
P-hacking refers to “hacking” the p value – a measure of significance scientists have used for almost a century. Researchers can do this by being selective about the data they collect and analyse, as well as the statistical tests they use, manipulating both until nonsignificant results become significant. Common tactics include checking the statistical significance of results before deciding whether to collect or include more data, adjusting statistical models by excluding “outliers” based on the resulting analysis, and rounding p values down to meet the statistical significance threshold (e.g., presenting 0.052 as P<0.05).
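The first of those tactics – peeking at the data and stopping as soon as the test comes back significant – can be simulated directly. The sketch below is a toy Monte Carlo with illustrative parameters (sample sizes, a simple two-sided z-test on pure-noise data); it compares a fixed-sample-size analysis against a “test, peek, collect more” loop, where every dataset is drawn from the null, so any significant result is a false positive.

```python
import math
import random

random.seed(1)

def significant(sample):
    """Two-sided z-test of mean == 0 with known sigma = 1 (p < 0.05)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return abs(z) > 1.96

def fixed_n_trial(n=100):
    """Honest analysis: collect all the data, test once."""
    data = [random.gauss(0, 1) for _ in range(n)]
    return significant(data)

def peeking_trial(start=20, step=10, max_n=100):
    """Optional stopping: test after every batch, stop on significance."""
    data = [random.gauss(0, 1) for _ in range(start)]
    while True:
        if significant(data):
            return True          # "success" – but the null is true
        if len(data) >= max_n:
            return False
        data += [random.gauss(0, 1) for _ in range(step)]

TRIALS = 5_000
fixed_rate = sum(fixed_n_trial() for _ in range(TRIALS)) / TRIALS
peek_rate = sum(peeking_trial() for _ in range(TRIALS)) / TRIALS

print(f"false-positive rate, fixed n:       {fixed_rate:.3f}")  # near 0.05
print(f"false-positive rate, peek-and-stop: {peek_rate:.3f}")   # well above 0.05
```

With nine chances to “get lucky” per experiment, the peeking strategy pushes the false-positive rate far above the nominal 5% – without the researcher ever fabricating a single data point.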
A 2017 meta-analysis found that 43% of researchers have been guilty of HARKing their results at least once in their career. HARKing – an acronym for Hypothesising After Results are Known – involves presenting ad hoc and/or unexpected findings as though they had been predicted all along. In other words, using results to develop hypotheses. The problem with HARKing is best described by Kevin Murphy, a psychologist at the University of Limerick who explains, “HARKing creates the real possibility that you will make up a story to ‘explain’ what is essentially a sampling error. This will mislead you and your readers into thinking that you have gained some new understanding of a meaningful phenomenon.”
“Cherry picking” is perhaps the most well-known method of manipulating results and occurs when researchers only publish and include results that best support their hypothesis. Cherry picking is particularly widespread in pre-clinical cancer research, and has gained notoriety recently thanks to the politics of COVID-19, with certain politicians using cherry picked statistics to sell a certain narrative. But this happens across many fields in science research too, and this practice can make unsuccessful experiments look convincingly successful and lead researchers to make wide-reaching generalisations that don’t hold up under scrutiny.
It’s worth noting the recent movement against the p value. In 2016, the American Statistical Association released a statement warning researchers against over-confidence in and misuse of p values. In 2019, more than 800 researchers called for the entire concept of statistical significance to be abandoned altogether: “calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis”. This movement highlights how sorting findings into significant and non-significant categories creates a false dichotomy – just because the p value is less than 0.05, it doesn’t automatically make those findings legitimate.
As psychologists Rosnow and Rosenthal decried in 1989, “surely, God loves the .06 nearly as much as the .05”. And whilst the p value isn’t a completely arbitrary measure, such a set-in-stone approach doesn’t allow for complexity in a field where things are, more often than not, shades of grey rather than black and white. As we’ve discussed already, this threshold doesn’t help with publication bias. In fact, having such a definitive bar that needs to be reached makes the situation all the worse.
Tell us the whole story! Transparency, detail, and accessibility
Other reproducibility issues include the lack of transparency, detail, and accessibility, each of which have been observed in studies across many different fields and disciplines. If a study does not rigorously detail study design, reagent lists, reference materials and laboratory protocols, it becomes very hard to reproduce. Simply put, if we don’t know how you got these results, how are we supposed to check they’re right?
When it comes to reagents, antibodies are a notable source of irreproducibility, particularly in preclinical studies. To blame? Inconsistent quality assurance, manufacturing practices, and variation in storage, use, and age of the antibodies – strains in funding mean researchers sometimes stretch the use of antibodies well beyond their “sell-by date”. The problem? Antibodies are used ubiquitously in scientific research, for western blotting, immunohisto/cytochemistry, immunoprecipitation, ELISAs, and the list goes on. Picture just how many studies are implicated by this one class of reagent alone. Methods-wise, variation in collection protocols, storage conditions, and lack of strict adherence to standard operating practices (SOPs) all affect reproducibility.
In our interview with Paul Agapow, Health Informatics Director at AstraZeneca, he speaks about how the full value of scientific data can only be realised if it is “‘FAIR’, or Findable, Accessible, Interoperable and Reusable.” If the data used for analysis isn’t easily accessible, it becomes very hard to ensure that the results are valid. As the saying goes, “garbage in, garbage out” – if the data used isn’t diverse, representative, and reliably collected, the analysis is going to be unreliable at best.
In 2020, Nature published an article in response to a study from Google Health, that had appeared in the journal earlier that year. This study claimed that they developed an AI system capable of surpassing human experts in breast cancer prediction – a strong claim to be sure. The response was swift, and damning. Scientists criticised the lack of information about the methods and data used in the study, characterising it as little more than promotion for Google’s new proprietary tech. Benjamin Haibe-Kains, a computational biology expert from the University of Toronto, summed up the growing frustration in the field, saying “We couldn’t take it anymore […] When we saw that paper from Google, we realized that it was yet another example of a very high-profile journal publishing a very exciting study that has nothing to do with science. It’s more an advertisement for cool technology. We can’t really do anything with it.”
Cancer research: The concerning irreproducibility of preclinical studies
In 2021, a study of high-impact papers, part of the “Reproducibility Project: Cancer Biology”, found that fewer than half the experiments assessed were reproducible. Furthermore, the study initially aimed to replicate 193 experiments but could only repeat 50, due to a lack of adequate information about the methods, reagents, and data, despite efforts to contact the authors of the original papers. The project took 8 years to complete and cost 2 million USD: on average, the team needed 197 weeks to replicate a study, highlighting the time, funding, and resources required to ensure findings are reproducible. The low reproducibility rate is, as oncologist Glenn Begley puts it, “frankly, outrageous.”
In 2012, Glenn Begley also co-authored a paper urging cancer researchers to raise standards for preclinical research; his team of researchers at biotech firm Amgen found that they could only reproduce results from 6 out of 53 landmark cancer research papers – that’s just 11%. As we’ve already discussed, this low rate of reproducibility calls into question the credibility of many of these highly cited, impactful papers.
Perhaps that’s not surprising – in 2019, it came to light that almost 97% of clinical cancer trials were unsuccessful. This failure has often been attributed to issues translating pre-clinical studies, unexpected problems that arise when we move from mice to men. However, perhaps these failures can be attributed to issues with reproducibility instead. As Claire Wilson writes in the New Scientist, “many people understand that promising results in mice may not translate to people. Before this [reproducibility crisis], I didn’t realise that results in mice may not even translate to other mice”.
Lost in translation: Why do Alzheimer’s drugs keep failing?
Research into Alzheimer’s disease has also seen high rates of failure in clinical studies, with 99% of all trials showing no differences between drugs and placebos. This is despite huge investment, nearly 3.1 billion USD every year in the United States alone. Much like cancer research, dependence on animal models for preclinical studies has been the major focus in articles attempting to explain this failure.
Again, like with cancer, irreproducibility may be an underappreciated driving force behind this lack of success. Projects such as AlzPED, the Alzheimer’s Disease Preclinical Efficacy Database, are working to fight that. AlzPED is a publicly available resource that aims to “increase the transparency, reproducibility and translatability of preclinical efficacy studies of candidate therapeutics for Alzheimer’s disease.”
However, you may have heard in the news of something far more nefarious. A six-month investigation by Science claimed to have uncovered evidence that one of the most cited Alzheimer’s studies in history may have been fabricated. Research by Sylvain Lesné, published in 2006 in Nature, was foundational to the amyloid beta hypothesis – until recently the widely accepted and de facto reigning theory of Alzheimer’s disease. The investigation, which involved leading image analysts and Alzheimer’s researchers, has cast doubt on hundreds of images, 70 of which are in Lesné’s papers. Donna Wilcock, an Alzheimer’s expert at the University of Kentucky, characterized some of the images as “shockingly blatant” examples of image tampering.
Now, the question we should ask ourselves here is this: If reproducibility studies were standard, would we have raised questions about this research sooner? Clearly, the research environment needs to be more rigorously examined, especially if a high-profile study such as this was accepted as fact for over 15 years. Another blow to the trustworthiness and integrity of scientific research.
Machine Learning and AI: Harmful or helpful?
AI and machine learning tools are being increasingly used in research, and it doesn’t look like that’s slowing down. We’ve already briefly discussed one instance where lack of transparency and bold claims by Google Health’s AI tech weren’t reproducible. But this problem extends beyond just one instance – and a lack of checks and balances is fuelling a data leakage and reproducibility crisis.
AI and machine learning researchers have been ringing alarm bells for the past few years. In 2019, computational biology expert Genevera Allen warned the scientific community about the irreproducibility of many machine-learning studies at the American Association for the Advancement of Science (AAAS) annual meeting. On the topic of cancer research, she said, “There are cases where discoveries aren’t reproducible; the clusters discovered in one study are completely different than the clusters found in another. Why? Because most machine-learning techniques today always say, ‘I found a group.’ Sometimes, it would be far more useful if they said, ‘I think some of these are really grouped together, but I’m uncertain about these others.’”
That same year, AI researcher Joelle Pineau led an effort to encourage AI researchers to open up their code so studies could be replicated. In an article for Nature, she discusses why algorithms, unlike natural phenomena, should always come up with the same results – and yet many studies using AI are irreproducible: “It’s true that with code, you press start and, for the most part, it should do the same thing every time. The challenge can be trying to reproduce a precise set of instructions in machine code from a paper. […] papers don’t always give all the detail, or give misleading detail. Sometimes it’s unintentional and perhaps sometimes it’s towards making the results look more favourable. That’s a big issue.”
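An omitted detail can be as small as a random seed. The toy sketch below illustrates the point; the `train_accuracy` function is a made-up stand-in for a stochastic training run, not any real training loop, with run-to-run noise simulated as a random offset.

```python
import random

def train_accuracy(seed=None):
    """Toy stand-in for a stochastic training run: the score depends on
    random initialisation and shuffling, simulated here as noise."""
    rng = random.Random(seed)  # seed=None -> fresh entropy on every call
    return round(0.80 + rng.uniform(-0.05, 0.05), 6)

# "The same code" without a reported seed: a different number every run.
unseeded_scores = {train_accuracy() for _ in range(5)}

# The same code with the seed reported: identical results, every time.
seeded_scores = {train_accuracy(seed=42) for _ in range(5)}

print(f"distinct scores across 5 unseeded runs: {len(unseeded_scores)}")
print(f"distinct scores across 5 seeded runs:   {len(seeded_scores)}")
```

A paper that reports only the best of the unseeded runs is, in effect, publishing a number no one else can reproduce – which is exactly the kind of missing detail Pineau describes.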
A recent paper corroborates these concerns, as researchers Kapoor and Narayanan discuss in their interview with Nature. We briefly covered the findings of the study in our news article. In the study, they assessed 20 reviews across 17 different fields, implicating over 300 studies as not having replicable results. They identified several areas of concern, including over-confidence and a lack of expert knowledge, as machine-learning tools are often marketed as off-the-shelf products that can be used with only a few hours of training. However, the main culprit was a lack of regulations and checks to ensure that studies are reproducible – so they created their own checklist, a model info sheet. The model info sheet is essentially a comprehensive taxonomy of data leakage errors, summarised into a set of guidelines which, if consistently applied, will help root out potential reproducibility problems.
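To see why data leakage is so insidious, consider one of its most common forms: letting test-set information influence a modelling choice such as feature selection. The sketch below is a deliberately contrived setup (pure-noise binary features, a one-feature “classifier”, parameters chosen for illustration – it is not taken from Kapoor and Narayanan’s paper). It picks the “best” feature either from all the data or from the training half only, then scores it on the held-out half.

```python
import random

random.seed(0)

def agreement(feature, labels):
    """Fraction of samples where the feature value matches the label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)

def one_run(n=200, n_features=300):
    # Pure-noise data: no feature carries any real signal about the label.
    labels = [random.randint(0, 1) for _ in range(n)]
    features = [[random.randint(0, 1) for _ in range(n)]
                for _ in range(n_features)]
    train, test = slice(0, n // 2), slice(n // 2, n)

    # Leaky pipeline: pick the "best" feature using ALL the data
    # (test labels included), then score it on the test set.
    leaky = max(features, key=lambda f: agreement(f, labels))
    leaky_acc = agreement(leaky[test], labels[test])

    # Honest pipeline: pick the feature using training data only.
    honest = max(features, key=lambda f: agreement(f[train], labels[train]))
    honest_acc = agreement(honest[test], labels[test])
    return leaky_acc, honest_acc

runs = [one_run() for _ in range(30)]
leaky_mean = sum(r[0] for r in runs) / len(runs)
honest_mean = sum(r[1] for r in runs) / len(runs)

print(f"mean test accuracy, leaky selection:  {leaky_mean:.3f}")
print(f"mean test accuracy, honest selection: {honest_mean:.3f}")
```

The honest pipeline scores at chance, as it should on noise; the leaky pipeline reports genuinely-better-than-chance test accuracy on data with no signal at all – an irreproducible “discovery” produced by nothing more than the order of two preprocessing steps.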
Another perspective: Arguments against the “so-called crisis”
However, not everyone is on board with concerns about the “reproducibility crisis” – in fact, some scientists reject the concept altogether. Alexander Bird, a philosopher at King’s College London, argues in his paper that the “high rate of failed replications is consistent with high-quality science”, and that the real driver of irreproducibility is the base rate fallacy. This is a cognitive problem where we give more weight to event-specific information than we should, sometimes ignoring base rates entirely – we prioritise some information as important and dismiss the rest as irrelevant. Applied to science: if only a small fraction of the hypotheses we test are actually true, then even rigorous, well-powered studies will produce a large share of false positives – and those false positives are precisely the results that later fail to replicate.
However, there have been responses to Bird’s paper, some of which highlight how his argument is incomplete, and how such a simple explanation cannot explain how widespread a problem reproducibility has become. Other scientists have argued that the term “reproducibility crisis” isn’t descriptive enough, or is too negative, opting to call this an “innovation opportunity” instead. But these are mostly arguments around semantics.
Where do we go from here? Initiatives and projects fighting the problem
So, what’s the solution? We can all imagine how these intense pressures incentivise scientists to do the very things they shouldn’t – become biased, and make sure their expensive, resource-heavy, time-consuming experiments come up roses. Even the subtlest forms of bias can cripple the reproducibility and consequent credibility of a study, but it’s short-sighted and wrong to blame it all on the individual. Sometimes these biases aren’t even conscious decisions, or are just a momentary lapse in judgement.
That’s the problem with a lot of the discussion and suggestions for tackling this crisis – they are all centred around the individual researcher. How can you make your research more replicable? What can you do to tackle this crisis? Whilst I agree that researchers all hold personal responsibility, I believe the only way to establish a standard for reproducibility is structural and systemic change to reform the way we do science. Global institutions such as the International Organization for Standardization could leverage their muscle to put concrete regulations in place. Some scientists have started projects that award badges to mark and promote studies which have been reproduced successfully. Pre-registration and working towards open-sourcing data are other solutions.
Another issue that is not discussed often in the context of reproducibility is the barrier to accessing certain studies. Whilst those at specific institutions and companies can afford to pay for access to papers, this is not always the case, especially when the bill can be 11 million USD a year. Core to the ethos of scientific research is advancing knowledge. But how can we advance knowledge, when only a select few can access it? Elitism aside, this does spell problems for reproducibility; to ensure reproducibility, everyone should be able to access all research, so all research can be exposed to the same scrutiny, checks, and balances.
But if journals are going to continue hiding findings behind a paywall, do they not have a responsibility to ensure the results they publish are reproducible? The peer-review system is inadequate for the kind of comprehensive investigation needed to replicate a study, and time and funding are required for such an endeavour. After all, journals are the primary way research findings and insights are disseminated.
But perhaps it’s not in their monetary interest to enforce such rigorous checks – and the pace of publishing research has increased enormously in the past 10 years. Journals argue that publishing research already comes with steep costs; that being said, in 2018 alone, Elsevier’s revenue totalled 3.2 billion USD, with a net profit of 19%. I’ll leave you to draw your own conclusions.
Back to the fundamentals
A final note to close this blog post off. We discussed Alzheimer’s research, which is currently in the midst of a long hard look in the mirror. But one lesson we can take away from this troubling story, is that widely accepted theories and dogmas need to be challenged. We need to constantly revisit “the fundamentals” – the foundational theories and reasonings that drive the rationale and direction of current research.
We are undervaluing the negative result. Negative results can be meaningful – precisely because they challenge those widely accepted, seemingly irrefutable dogmas. And negative results don’t just call accepted theories into question; they also help in the interpretation of other results, reduce wasted resources, and help other scientists better design their studies.
We’ve discussed the role publication bias and other pressures have to play in our neglect of the negative result. But another culprit, not often discussed, are the funding bodies and their obsession with therapeutic targets. This linear, target-driven, focussed form of research doesn’t allow for wiggle room. It doesn’t allow for exploratory research. It doesn’t allow for negative results. And more often than not you have to build on other research, citing widely accepted “truths” to get that funding. We need the space to question, revisit, and reassess. But if the goal is to constantly push forward, we don’t allow for that opportunity.
As Paul Kalanithi, neurosurgeon and author, put it back in 2016, “Science, I had come to learn, is as political, competitive, and fierce a career as you can find, full of the temptation to find easy paths.” But science is a field where we deal with real diseases, and real patients. We need scientific research, knowledge and understanding to be concrete, thorough and rigorous, and to ensure that, we must resist those very temptations.