
FAIR data applied to AI & machine learning

The FAIR principles should be applied to both human- and machine-driven activities. Humans can understand the meaning of a digital object because we are able to interpret a variety of contextual cues, such as structural and visual prompts on a web page or the content of narrative notes. However, we cannot operate at the scale and speed demanded by the vastness and complexity of modern scientific data. We have therefore become increasingly reliant on machines to carry out data discovery and integration tasks.


FAIR data and AI applications

A computational agent is one that can decide how to act within its environment. Artificial intelligence (AI) is the study of computational agents that act ‘intelligently’. To act intelligently on data, such agents need to be capable of autonomously exploring global data, regardless of type, format or access protocols. It is therefore unsurprising that poor data management and governance are often major barriers to the adoption of AI in organizations.

A digital object can be described as ‘machine actionable’ when it provides enough detailed information for an autonomous computational data explorer to determine its usefulness and take appropriate action, in a similar way to a human. Essentially, this means that a machine can make a useful decision about data that it has not come across before. Implementing the FAIR principles can improve the machine actionability of data and help to address data quality challenges.
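To make the idea of machine actionability more concrete, the sketch below shows how an agent might decide what to do with a dataset it has never seen, using only the metadata that accompanies it. The metadata keys, the supported formats and protocols, and the `decide` function are all hypothetical and chosen purely for illustration; they are not taken from any FAIR specification.

```python
# Illustrative assumptions: what this hypothetical agent can read and reuse.
SUPPORTED_FORMATS = {"csv", "parquet"}
SUPPORTED_PROTOCOLS = {"https", "s3"}
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0"}

# Hypothetical minimum set of metadata keys the agent needs to act at all.
REQUIRED_KEYS = ("identifier", "format", "access_protocol", "license")


def decide(metadata):
    """Return the action an agent would take for a previously unseen dataset."""
    missing = [key for key in REQUIRED_KEYS if not metadata.get(key)]
    if missing:
        # Without enough metadata, the machine cannot make a useful decision.
        return "skip: metadata incomplete (" + ", ".join(missing) + ")"
    if metadata["format"] not in SUPPORTED_FORMATS:
        return "skip: unsupported format"
    if metadata["access_protocol"] not in SUPPORTED_PROTOCOLS:
        return "skip: unsupported access protocol"
    if metadata["license"] not in OPEN_LICENSES:
        return "request access"
    return "retrieve"


well_described = {
    "identifier": "example-001",  # placeholder identifier
    "format": "csv",
    "access_protocol": "https",
    "license": "CC-BY-4.0",
}
poorly_described = {"identifier": "example-002", "format": "csv"}

print(decide(well_described))
print(decide(poorly_described))
```

The second record is skipped not because the data is poor, but because the agent has no way of knowing how to access or reuse it. This is the sense in which incomplete metadata limits machine actionability.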

Challenges of applying FAIR to AI

Science workflows now often involve AI at several stages of the data lifecycle, from simulations that generate data to drawing conclusions from it. However, making science data FAIR for AI raises several challenges that continue to be overlooked.

Data search

The AI community may struggle to search for data. Advancements in AI depend largely on well-characterized training datasets, so researchers need to understand which metadata are relevant to a search. However, current metadata standards only include attributes such as discipline domain, source and author, whereas AI researchers also need information about the structure, sparseness and multimodality of the data, along with the types of models trained on it.

Solution: Metadata, provenance and annotations are all crucial for making data FAIR for AI. On a small-scale dataset, these could be improved through a feedback loop in which discoveries from the AI community lead to updates to the metadata. This may make it possible to associate learned information with the dataset. For the feedback loop to be successful, full visibility of the workflow and provenance of the datasets must be provided.
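As a rough illustration of this solution, the sketch below extends a conventional metadata record (domain, source, author) with the AI-oriented attributes mentioned above, and adds a simple feedback step that writes what was learned back to the record. The `DatasetRecord` class, its field names, and the `search` and `record_feedback` helpers are hypothetical; they do not correspond to any existing metadata standard.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: str
    domain: str
    source: str
    author: str
    # AI-oriented attributes that conventional standards tend to omit.
    structure: str       # e.g. "tabular", "image", "graph"
    sparseness: float    # fraction of missing values, from 0.0 to 1.0
    modalities: list[str] = field(default_factory=list)
    models_trained: list[str] = field(default_factory=list)
    provenance: list[str] = field(default_factory=list)


def search(records, structure=None, max_sparseness=1.0, modality=None):
    """Return the records that match the AI-oriented search criteria."""
    hits = []
    for record in records:
        if structure is not None and record.structure != structure:
            continue
        if record.sparseness > max_sparseness:
            continue
        if modality is not None and modality not in record.modalities:
            continue
        hits.append(record)
    return hits


def record_feedback(record, model_name, note):
    """Feedback loop: associate learned information with the dataset."""
    if model_name not in record.models_trained:
        record.models_trained.append(model_name)
    record.provenance.append(f"feedback: {note}")


# Placeholder catalogue entries, invented for this example.
catalogue = [
    DatasetRecord("ds-001", "oncology", "lab-a", "A. Author",
                  "tabular", 0.05, ["omics"]),
    DatasetRecord("ds-002", "oncology", "lab-b", "B. Author",
                  "image", 0.40, ["imaging", "text"]),
]

hits = search(catalogue, structure="tabular", max_sparseness=0.1)
record_feedback(hits[0], "random-forest", "column X was uninformative")
```

After the feedback step, the next researcher who finds `ds-001` can see which type of model has already been trained on it and what was learned, without repeating the work.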

Data accessibility

AI applications may present problems in terms of data accessibility, as algorithms may need to be trained across several data repositories that are geographically dispersed. Giving certain AI algorithms access to particular data stores at scale is often impossible. These datasets must therefore be restored to a file system to enable access, but this can limit the ability of AI to interpret data from different repositories.

Solution: The automated collection of metadata, provenance and annotations at scale would reduce researcher burden and improve the quality of data. The FAIRness of the collected data for AI would increase significantly if metadata, provenance and annotations were machine actionable. ‘Smart’ data collection would eliminate problems surrounding the storage of data in different formats with varying access policies, and would lower the barriers to data collection.
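A minimal sketch of what automated metadata collection could look like for a small tabular dataset is shown below. It derives column types and sparseness directly from the data, so that a researcher does not have to enter them by hand. The `collect_metadata` and `infer_type` functions, the output keys and the sample rows are assumptions made for this example only.

```python
def infer_type(values):
    """Infer a coarse column type from the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in present):
        return "numeric"
    if all(isinstance(v, str) for v in present):
        return "text"
    return "mixed"


def collect_metadata(rows, source):
    """Derive machine-readable metadata from a list of row dictionaries."""
    columns = sorted({key for row in rows for key in row})
    total_cells = len(rows) * len(columns)
    # A cell counts as missing if the key is absent or the value is None.
    missing = sum(1 for row in rows for c in columns if row.get(c) is None)
    return {
        "source": source,
        "n_rows": len(rows),
        "columns": {c: infer_type([row.get(c) for row in rows])
                    for c in columns},
        "sparseness": missing / total_cells if total_cells else 0.0,
    }


# Invented sample rows; the values carry no real-world meaning.
rows = [
    {"patient": "p1", "age": 54, "dose_mg": 10.0},
    {"patient": "p2", "age": None, "dose_mg": 5.0},
    {"patient": "p3", "age": 61},
]
meta = collect_metadata(rows, source="trial-x")
print(meta)
```

In practice, a step like this would run wherever the data is generated or deposited, so that the resulting metadata travels with the dataset regardless of its format or access policy.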

Lack of standardization

A lack of standardization can also present problems, whether in terms of ontologies, vocabularies or types of data. In particular, different sources of AI applications can introduce hidden biases, even if the same type of data is used. These biases may be reduced if different types of data are integrated, but standardizing the metadata would remain a difficult problem, and there is no FAIR way to do this yet.

Solution: Teams producing FAIR data need to be linked with data science and data management experts to ensure that the data is ready for AI. Collaboration on best AI practices, metadata standards, ontologies and data sharing is necessary to ensure optimal alignment. Researcher engagement with AI experts is therefore critical for applying the FAIR principles throughout projects, from experimental design to final publication.


FAIR data and AI in biopharma

It is estimated that big data and machine learning in biopharma generate up to $100 billion per year. This is because the technology enables better decision-making, more efficient clinical trials and the creation of new tools for regulators. Some specific applications of machine learning in pharma and medicine include:

  • Disease diagnosis: Many companies are using AI to research and develop diagnostics for disease. For example, IBM Watson Genomics is driving innovation in precision medicine by integrating computing with tumour sequencing, and Berg is using AI to develop oncology treatments.
  • Personalized treatment: Supervised learning can help physicians to select from a more individualized set of diagnoses or to estimate patient risk based on genetic information.
  • Drug discovery: Machine learning is used in various stages of drug discovery, from the initial screening of drug compounds to predicting the success rate of a therapy. For example, the MIT Clinical Machine Learning Group is focussed on using algorithms to design more effective treatments for diseases such as Type 2 diabetes.
  • Clinical trials: Machine learning can help to identify candidates for trials, enhance predictive analysis and increase the use of electronic medical records to reduce data errors.
  • Electronic health records: Technologies using machine learning can help to advance the collection of health information. MATLAB’s machine learning handwriting recognition helps with diagnostics, clinical decisions and personalized treatment suggestions.
  • Epidemic outbreak prediction: The monitoring of epidemic outbreaks is benefitting from machine learning technologies that use data from satellites, historical information on the web and real-time social media updates.

In recent years, there has been a huge growth of interest in the deployment of AI within the pharmaceutical industry, perhaps most notably in the fields of image analysis and disease prediction. But the success of AI has been cumulative, with better results originating from better input data. The Driving FAIR in Biopharma report includes advice from FAIR pioneers about the success of AI in biopharma, such as:

“Structure the data first – get the foundational data into the model but be wary of the labelling applied becoming out of date. For instance, with drug indications and adverse events, an important consideration is applying living labels to keep track of evolving information”.


Hero Image credit: B Violino
