Written by Miyako Rogers, Science Writer
Machine learning is increasingly being used in research, but a recent article describes how a lack of checks and balances is fuelling a data-leakage and reproducibility crisis. Reproducibility is the ability to replicate the results of a study given the full details of its methodology. Data leakage occurs when information from outside the training data, such as the test set, ‘leaks’ into the model-building process, inflating performance estimates and making results irreproducible. Researchers Sayash Kapoor and Arvind Narayanan discuss their work in a recent article for Nature: they assessed 20 reviews spanning 17 different fields and implicated over 300 studies as having results that cannot be replicated.
Addressing the hype: Over-optimism and a lack of checks and balances
In recent years, we have seen an explosion in the use of machine-learning tools in scientific research. Much of the appeal of machine learning is that off-the-shelf tools can be used, after only a few hours of training, without requiring expert knowledge. However, this lack of expertise means that errors and data-leakage problems often go unnoticed, particularly during peer review. Furthermore, whilst most scientific papers thoroughly detail their experimental methods, machine-learning models are often not spelt out as comprehensively. The result is that scientific studies built on machine-learning models are not being checked rigorously enough, and the positive-results bias already endemic in research publishing has not helped.

As with the problem of over-confidence in statistical tests within the scientific community, excessive optimism and hype around machine learning can lead researchers to overstate the results of their studies. The publication bias towards positive results and the buzzword status of machine learning have exacerbated the problem. Without a shared set of checks and balances for critiquing and analysing these studies, such conclusions are difficult to challenge. That is what the article lays out: a model info sheet that researchers can use to check their machine-learning models before submitting them for publication.
Ensuring reproducibility and tackling data leakage
The model info sheet is, in essence, a comprehensive taxonomy of data-leakage errors, summarised into a set of guidelines which, if applied consistently, will help root out potential reproducibility problems. The first, and most commonly overlooked, issue identified in the study is leakage between the training and test sets caused by a lack of clean separation. If a model is evaluated on the same data it was trained on, it has essentially already seen the “answers”, so its predictions look far better than they really are, producing apparently positive results that do not hold up on genuinely unseen data.
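As a minimal sketch of how this happens in practice (an illustrative example with synthetic data, not code from the article), leakage often creeps in when a preprocessing step such as feature selection is fitted on the full dataset before the train/test split. Splitting first and fitting everything only on the training data gives an honest estimate:

```python
# Illustrative sketch: feature selection fitted before vs. after the split.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data: many noisy features, only a few informative ones.
X, y = make_classification(n_samples=200, n_features=500, n_informative=5,
                           random_state=0)

# Leaky: feature selection sees every label, including future test labels,
# so the held-out score is optimistically biased.
X_reduced = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_reduced, y, random_state=0)
leaky_score = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Clean: split first, then fit selection and model on the training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean_model = make_pipeline(SelectKBest(f_classif, k=20),
                            LogisticRegression(max_iter=1000))
clean_score = clean_model.fit(X_tr, y_tr).score(X_te, y_te)

print(f"leaky estimate: {leaky_score:.2f}, clean estimate: {clean_score:.2f}")
```

The leaky estimate typically comes out well above the clean one, even though the clean pipeline is the only score that reflects performance on genuinely unseen data.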
Another issue identified is that many models use features that are not legitimate, i.e., variables that act as proxies for the outcome or that would not be available at the time a prediction is made. The researchers also found that the test set is often not drawn from the distribution of scientific interest. For example, a model trained and evaluated only on people over 60 cannot support claims about the whole population, because the data it has been tested on are narrower than the population the conclusions are meant to cover.
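To make the illegitimate-feature problem concrete, here is a small hypothetical sketch (the dataset and column names are invented for illustration): a variable recorded only after the outcome is known, such as a treatment prescribed because of a diagnosis, effectively encodes the label and should be excluded from the predictors.

```python
# Hypothetical example of an illegitimate feature (invented data and names).
import pandas as pd

records = pd.DataFrame({
    "age":                    [72, 65, 45, 58, 80, 33],
    "blood_pressure":         [150, 145, 118, 130, 160, 110],
    "antihypertensive_given": [1,   1,   0,   0,   1,   0],  # prescribed after diagnosis
    "has_hypertension":       [1,   1,   0,   0,   1,   0],  # outcome to predict
})

# "antihypertensive_given" is not a legitimate predictor: it is only known once
# the outcome is known, so a model using it simply reads off the answer.
legitimate_features = records.drop(columns=["antihypertensive_given",
                                            "has_hypertension"])
labels = records["has_hypertension"]
```

A model fitted with the proxy column would score almost perfectly in evaluation yet be useless for predicting hypertension in patients who have not yet been diagnosed.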
One check the article does not mention is blind testing. Much as in clinical trials, blinding the person processing the data to the different treatment groups would address problems that arise from human bias.
Other researchers and institutions are also helping to tackle the problem of reproducibility in the computational biosciences. The NeurIPS Reproducibility Program, led by researchers such as Joelle Pineau, encourages authors to publish their code openly so that study results can be replicated. Nevertheless, reproducibility remains a huge problem. Without a set of guidelines and regulations, this rapidly emerging field of research risks a crash, and the legitimacy of machine-learning tools will face a reckoning.