There is a constant growth in the volume, complexity and creation speed of data. Therefore, humans are increasingly relying on computational systems to find, access, interpret and re-use data, requiring urgent improvement of the data landscape infrastructure.
The biopharma industry has been, and continues to be, transformed by emerging technologies that unlock data-rich workflows. Artificial intelligence (AI) now enables problems to be solved in a multi-dimensional fashion, so diseases can be understood in a more accurate and realistic way. Of course, this presents a new challenge – the rise of huge amounts of data in widely dispersed sources.
Essentially, if data meets the principles of Findability, Accessibility, Interoperability and Reusability, it can be referred to as FAIR. In 2016, a paper called ‘The FAIR Guiding Principles for scientific data management and stewardship’ was published by Mark Wilkinson et al. It was the first publication to formally discuss the FAIR guidelines, which ultimately emphasize the capacity of machines to re-use data with little or no human intervention.
Front Line Genomics have compiled this page using insights from experts in the data discovery field, who are leading the transformative FAIRification efforts within biopharma. These FAIR pioneers have shed light on the progression towards a data-centric perspective and on the challenges that organizations may face when using a FAIR approach to data management. For more information from the contributors themselves, check out the Driving FAIR in Biopharma Report:
| Section | Summary |
| --- | --- |
| What is FAIR and Why Does it Matter? | An introduction to the FAIR principles and a summary of why their implementation is important. |
| How to Make Data FAIR | Explanations of several FAIRification aspects, a step-by-step guide to FAIR data and examples of FAIRness in biopharma. |
| What are the challenges of FAIR data? | A summary of some of the most common challenges that are faced when transitioning to FAIR data. |
| A Data-Centric Future | An overview of how to drive a FAIR data change in biopharma. |
| Advice for starting a FAIR journey | Top tips from experts in the biopharma data discovery field about FAIRifying data and committing to those FAIR changes. |
What are the FAIR Principles and Why do they Matter?
Well-curated and deeply integrated repositories are used for certain types of important digital objects, such as GenBank (the NIH genetic sequence database) and the Space Physics Data Facility (NASA’s archive for non-solar data). These are critical resources that are constantly capturing and fine-tuning important datasets. They enhance scholarly output and provide support for both human and mechanical users. However, many data types that are generated by typical bench science do not fit into such repositories – but are still just as important. So, how and where should this data be stored?
Traditional data types are usually accepted by general-purpose repositories, which consist of a range of datasets in various formats – examples include DataHub and FigShare. They do not integrate the deposited data and have few restrictions, usually resulting in a hugely diverse repository. This can present problems for the discovery and re-use of data.
How can these types of data be searched for? Once the data is discovered, can it be downloaded? Can the data format be easily integrated with other data publications? Does the researcher have permission to use the data, and if so, who should be cited?
These questions highlight just some of the challenges faced by those involved in data discovery, and demonstrate the need for an improved infrastructure for scholarly data.
FAIR data principles
A diverse group of stakeholders came together at a workshop in 2014 called ‘Jointly Designing a Data Fairport’, which took place in Leiden, the Netherlands. The participants represented academia, various industries, funding agencies and scholarly publishers. Each person had the intent of creating guidelines for improving the reusability of data. The aim was to overcome some of the barriers facing data discovery, to allow publications to be accurately found, re-used and cited over time, by both human and mechanical stakeholders. In the end, the FAIR data concept and principles were designed and endorsed.
A summary of the FAIR Data Principles that were defined by Mark Wilkinson, et al. (2016). Image credit: Elsevier, 2019
The FAIR principles provide guidance for scientific data management and are relevant to all stakeholders in the current digital world. The guidelines can not only be applied to conventional data, but also to the algorithms and tools that led to the data generation. The principles are related, but also independent.
The Group of Twenty (G20) is a forum comprising the governments of 19 countries and the European Union. The consortium addresses major issues related to the global economy, including financial stability, climate change mitigation and sustainable development. At the 2016 G20 summit in Hangzhou, China, leaders issued a statement endorsing the application of FAIR principles to research.
Why is FAIR data important?
Historically, flaws in the digital ecosystem surrounding scholarly data publication have prevented the maximum benefit being gained from research. The FAIR principles promote good data management and integration. This is crucial for knowledge discovery and innovation, and for data to remain useful through re-use after publication.
Unlock productivity through internal data sharing
FAIR data that is shared and integrated will enable scientific queries to be answered more rapidly and in a more flexible manner. Therefore, data stewardship should not only consist of proper collection, annotation and archiving, but also include the idea of long-term care, leading to greater collaborations and more efficient workflows. Specifically for biopharma, harnessing the power of AI on FAIR data will accelerate the creation of valuable datasets, increasing the productivity of drug pipelines. Also, the availability of high-quality real-world data will lead to rapid innovation in personalized medicine.
Researchers gain credit for their data
Advocates of FAIR data report that adoption of the FAIR concept has been growing across institutes, enabling researchers to gain credit for their data and interpretations. By using the principles as a guide, producers and publishers can navigate the obstacles surrounding scholarly publishing and add value to their own digital research projects through effective annotation and citation.
Enable ‘Machine actionable’ data
Computational agents are becoming increasingly relevant, and the FAIR principles provide steps for increasing the ‘machine actionability’ of a digital object. This means that the data provides enough information for a computational explorer to identify the type of object, determine its usefulness and take appropriate action, much as a human would. FAIRification uses machine actionability to reduce the costs and risks of data discovery. The absence of FAIR data has been estimated to cost the European economy more than €10 billion per year.
How to Make Data FAIR
Step #1: Retrieve and analyze non-FAIR data
- This data needs to be fully accessed and then examined in terms of structure and differences between data elements, such as identification methodologies and provenance.
Step #2: Define semantic model
- Choose community- and domain-specific ontologies, along with controlled vocabularies to describe the dataset entities. This should be done in an unambiguous way and in a machine actionable format.
Step #3: Make data linkable
- The semantic model is applied to the data to make it linkable. This can be done using Semantic Web or Linked Data technologies.
Step #4: Assign a license and metadata
- A data license is needed so that users can determine the terms under which the data may be re-used. The data also needs to be described by rich metadata to ensure the FAIR principles are supported.
Step #5: Publish FAIR data
- The FAIRified data can be published alongside the relevant license and metadata. The data can now be indexed by search engines and accessed by users, even if authentication and authorization are required.
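The output of steps 2 to 5 can be sketched in code. The following is a minimal, hypothetical example of a published metadata record: the field names, DOI and ontology term are illustrative, loosely following JSON-LD conventions rather than any specific FAIR tooling.

```python
import json

def build_fair_record(dataset_id, title, license_url, ontology_terms):
    """Build a minimal JSON-LD-style metadata record for a FAIRified dataset.

    Field names are hypothetical; a real deployment would follow a community
    schema such as DCAT or schema.org/Dataset.
    """
    return {
        "@id": dataset_id,              # globally unique, persistent identifier
        "@type": "Dataset",
        "title": title,
        "license": license_url,         # explicit re-use terms (step 4)
        "conformsTo": ontology_terms,   # controlled vocabularies (step 2)
    }

record = build_fair_record(
    "https://doi.org/10.1234/example",  # invented DOI, for illustration only
    "Historical assay results, 2015-2020",
    "https://creativecommons.org/licenses/by/4.0/",
    ["http://purl.obolibrary.org/obo/CHEBI_23888"],  # a ChEBI ontology term
)
print(json.dumps(record, indent=2))
```

A record like this is what search engines and computational agents index in step 5: the identifier makes it findable, the license and ontology terms make it re-usable and interoperable.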
Considerations for Starting a FAIR Data Library
What is metadata?
Metadata describes a dataset’s context, quality, condition and characteristics. It provides information about other data to make working with particular datasets easier. It can be created manually to be more accurate, or automatically to provide a larger amount of basic information. Applying FAIR guidelines to metadata can facilitate the discovery of data, even if the FAIR publication for the data itself is absent. Additionally, FAIR principles can also be applied to the publication of non-data assets, such as analytical workflows, to enhance their chances of being discovered and re-used.
Dr Isabella Feierberg, at AstraZeneca, explained: “FAIR metadata is really important to us in drug discovery. It allows us to make sense of that data that we have and to make reliable models. We have been on a journey in AstraZeneca lately to make our metadata FAIR – this initiative particularly focused on the FAIRification of historical assay data, including their protocols.”
Metadata is commonly divided into five types:
- Descriptive: Information for identification and discovery.
- Structural: How the data is organized.
- Administrative: Technical source, data type and creation process.
- Reference: Content and quality of statistical data.
- Statistical: Information about the data, such as distribution and outliers.
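As a rough illustration, the five categories might be populated like this for a single tabular dataset. Every field name and value below is invented.

```python
# Hypothetical metadata record for a tabular assay dataset, organized by the
# five metadata types listed above. All names and values are illustrative.
dataset_metadata = {
    "descriptive": {"title": "Kinase assay panel", "keywords": ["assay", "kinase"]},
    "structural": {"format": "CSV", "columns": ["compound_id", "ic50_nm"]},
    "administrative": {"source": "plate-reader-v2", "created": "2024-01-15"},
    "reference": {"method": "dose-response fit", "quality": "passed QC"},
    "statistical": {"rows": 960, "ic50_nm_median": 120.0, "outliers": 3},
}
```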
What are persistent identifiers?
Links to online research often go ‘dead’ a few years after being cited, but using a persistent identifier can delay this process. Persistent identifiers are long-lasting references to a digital object. They usually have two components: a unique identifier that ensures the provenance of the resource, and a service that locates the digital resource even when its location changes. Persistent identifiers help to overcome problems such as journals being transferred to new publishers, companies reorganizing their websites, or interest in older content being lost.
Most repositories assign a persistent identifier when archiving a dataset, so using a well-suited repository is crucial. Within a repository, persistent identifiers are used to manage the cataloguing and description of, and access to, digital materials. The persistent identifier is also added to the metadata.
Here are some examples of persistent identifier schemes:
- Digital object identifiers (DOIs): A unique string of numbers, letters and symbols used to permanently identify a piece of research and link to it on the web. Every DOI begins with ‘10.’, and its prefix is a number of four or more digits that identifies the registering organization.
- Uniform resource locators (URLs): A reference to a resource that specifies its location. They occur most commonly to reference web pages (http) and consist of multiple parts, including a protocol and domain name.
- Persistent uniform resource locators (PURLs): PURLs are URLs that redirect to another web resource using standard HTTP status codes. If a web resource changes its location (and therefore its URL), the PURL stays the same and only its redirect target is updated. PURLs therefore maintain hyperlink integrity.
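The mechanism behind a PURL can be sketched with a toy resolver: the persistent identifier never changes, while the lookup behind it can be updated. Real PURL services answer with an HTTP 302/303 redirect; here a dictionary stands in for that lookup table, and all URLs are invented.

```python
# Toy PURL-style resolver. The published identifier is stable; only the
# mapping to the resource's current location ever changes.
purl_table = {
    "https://purl.example.org/dataset/42": "https://old-host.example.com/d/42",
}

def resolve(purl):
    """Return the current location for a persistent URL, or None if unknown."""
    return purl_table.get(purl)

# The resource moves: only the table entry changes, so every published link
# that cites the PURL keeps working.
purl_table["https://purl.example.org/dataset/42"] = "https://new-host.example.com/42"
```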
What are authentication and authorization procedures?
The data needs to be accessible and persistent identifiers allow it to be retrieved. Accessibility is defined by a machine being able to automatically understand the requirements – it doesn’t always mean the data is open or free. So, if necessary, the access procedure includes authentication and authorization steps.
Authentication is the security practice of confirming that a user is who they claim to be. Authorization is the process of determining what level of access each user is granted. Login details and passwords are typically used to authenticate users online, with some applications requiring much stricter authentication, such as a face ID scan. Once a user has been authenticated, they can only see the information that they are authorized to access.
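The distinction can be sketched as two separate checks. This is a toy model, not a production access-control scheme, and all usernames, passwords and dataset names are invented.

```python
# Authentication answers "who are you?"; authorization answers "what may you
# access?". Both are modelled here with simple in-memory tables.
USERS = {"alice": "s3cret"}               # credential store (toy example)
GRANTS = {"alice": {"assay_results"}}     # datasets each user may access

def authenticate(user, password):
    """Confirm the user is who they claim to be."""
    return USERS.get(user) == password

def authorize(user, dataset):
    """Check whether an authenticated user may access this dataset."""
    return dataset in GRANTS.get(user, set())

def fetch(user, password, dataset):
    if not authenticate(user, password):
        return "401 Unauthorized"          # identity not confirmed
    if not authorize(user, dataset):
        return "403 Forbidden"             # identity confirmed, access not granted
    return f"contents of {dataset}"
```

Note the ordering: authorization is only meaningful after authentication has succeeded, which mirrors the access steps described above.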
How to make data interoperable
Research data should be easily combined with other datasets and workflows, by humans and computational systems, to speed up its discovery. To make data interoperable:
- Well-known formats and relevant standards for metadata should be used.
- Controlled vocabularies, keywords and ontologies should also be prioritized. Ontologies are structural frameworks that define the entities in a domain and the relationships between them; they are used in AI, software engineering and library science. Controlled vocabularies are organized arrangements of terms that allow relevant information to be retrieved through searching.
- A README file helps to ensure that data can be correctly interpreted and re-analyzed by others. It is a text file that introduces and explains a project – it contains information that is typically needed to understand what the research is about. It should have clarity, be structured and remain up to date. Repositories often offer to create a default README file for publications.
Furthermore, to ensure data is re-usable, documentation should be provided at project-, file- and item-level so that the digital object can be interpreted properly. Also, data usage licenses must be clear for others to know what type of re-use is permitted.
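The value of a controlled vocabulary can be sketched with a small normalization step: free-text entries from different teams collapse to one shared identifier. The synonym list below is illustrative; CHEBI:15365 is the ChEBI identifier for acetylsalicylic acid (aspirin).

```python
# Map free-text terms to a controlled-vocabulary ID so that datasets from
# different teams can be combined. The synonym table is a toy example.
SYNONYMS = {
    "acetylsalicylic acid": "CHEBI:15365",
    "aspirin": "CHEBI:15365",
    "asa": "CHEBI:15365",
}

def normalize(term):
    """Return the controlled-vocabulary ID for a term, or None if unmapped."""
    return SYNONYMS.get(term.strip().lower())

raw_entries = ["Aspirin", "acetylsalicylic acid", "ASA"]
ids = {normalize(entry) for entry in raw_entries}
# All three spellings collapse to a single identifier.
```

This is the kind of mapping that resolves the "35 different names for the same item" problem described elsewhere on this page.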
Examples of biopharma FAIR projects
ISA: This is a community-driven metadata tracking framework. ISA provides FAIR structured metadata to Nature Scientific Data’s Data Descriptor articles, GigaScience data papers and the EBI MetaboLights database.
Open PHACTS: This is a data integration platform for drug discovery information. Access to Open PHACTS is both machine- and human-readable and it allows multiple URLs to be used to access the information. All the data is described using standardized descriptors and ontologies with rich provenance.
Worldwide Protein Data Bank: This is a special-purpose curated data archive that contains information about 3D structures of proteins and nucleic acids. All entries are machine actionable, conform to a data standard called Macromolecular Information Framework, represented by a DOI and are linked with metadata that is searchable on a number of affiliated websites.
UniProt: This is a comprehensive resource for protein sequences and their annotated data. All data in UniProt are uniquely identified with a stable URL and a record containing rich metadata, which is machine- and human-readable. The format uses standardized vocabularies and ontologies, and links to over 150 other databases.
More examples of life science companies that are making data FAIR can be found at the FAIR Toolkit website.
What are the Challenges of FAIR Data in Biopharma?
Despite the fact that creating and using FAIR data has the potential to greatly enhance drug discovery and development, adoption of the principles has been slow in the biopharma industry. It is recognized that FAIRifying data is a progressive process. Although universal FAIR implementation promises very high value, a number of significant cultural and technical barriers remain. These are likely to prevent many organizations from fully transitioning to better data management.
Common challenges faced when implementing FAIR:
Poorly tagged data
Data is sometimes not tagged, contains incorrect identifiers or lacks common terminology. Dr Lawrence Callahan, at the FDA, explains: “We have different problems to many other pharma companies because we are at the back-end of the drug discovery process. We deal with so much different data of a wide variety of formats – now, it comes in mostly as PDFs, which is not much better than paper when trying to organize data. We then have to organize this data based on what the substance is.”
Trapped historical data
Technologies that were used in previous research may no longer be supported, or individuals who were responsible for creating the original datasets may have moved organization.
Lack of ontology standardization
There are often multiple possible ontologies that could be used – little standardization exists across industries, or even within organizations. Janet Cheetham, at Amgen, explained: “We recognized that we were calling the same item by 35 different names because we could, which really limited our ability to do higher-level analytics. This is why we need to apply formatting and ontology standards to specific processes or experiments”.
Cultural resistance
A cultural shift is required, from a one-dimensional data management approach to a holistic data sharing effort. This requires persistence from both senior management and data scientists.
Cost of implementation
Resource allocation and training in FAIR programs are necessary, and many organizations may not be able to cover the large costs. Dr Ellen Berg, at Eurofins, stated: “We have got very positive customer feedback and we are using our data to launch new services – but it’s still unclear whether we will get a return on investment. Nevertheless, it is important that we support standardized metadata and ontologies in our area (in vitro pre-clinical phenotypic assays) so that the whole industry can gain, not just ourselves”.
Implementation of FAIR in biopharma has been particularly slow due to the lack of a predetermined path. The industry presents numerous obstacles because new knowledge is always reshaping the information landscape. Moreover, the culture shift from owning and using data once to re-using data is often counterintuitive for many scientists in the biopharma industry. Many researchers have been effectively conditioned to avoid sharing information, presenting an uphill battle for the transition to FAIR data.
Why is implementing FAIR principles harder at scale?
Whilst a common base set of FAIR metrics may be applicable globally, most research communities need to refine them further for their own discipline. Companies also still seem to struggle with FAIR metrics because of their technical nature, which makes them inaccessible to employees without data-related expertise. As a result, the positive impacts of being a truly data-centric, FAIR-driven organization have not yet been fully recognized.
This makes measuring FAIRness challenging. Metrics, by definition, are quantitative measurements: they should provide a clear indication of what is being measured and define a reproducible process for obtaining that measurement. The implementation of FAIR data may not suit such a rigid metric system and could instead benefit from a model that rewards different degrees of FAIRness, encouraging genuine progress towards all of the FAIR principles.
In 2019, the FAIR Data Maturity Model Working Group at the Research Data Alliance (RDA) brought together industry and academic stakeholders to establish unambiguous goals for evaluating FAIR digital objects. In the end, the team recommended a common set of core assessment criteria for FAIRness – the FAIR metrics. Essentially, a list of indicators was assigned to each principle and evaluation methods for each were outlined. They can be used for self-assessment or can be scaled up and adapted to meet the specific needs of an organization.
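A self-assessment in the spirit of the RDA maturity model can be sketched as per-facet checklists, rewarding degrees of FAIRness rather than a pass/fail verdict. The specific checks below are illustrative and are not the RDA's actual indicator list.

```python
# Toy FAIRness self-assessment: each facet has yes/no checks, and the score
# is the fraction passed. Check names and results are invented examples.
CHECKS = {
    "Findable": {"has_persistent_id": True, "has_rich_metadata": True},
    "Accessible": {"retrievable_by_id": True, "metadata_outlives_data": False},
    "Interoperable": {"uses_controlled_vocabulary": False, "links_to_other_data": False},
    "Reusable": {"has_license": True, "has_provenance": False},
}

def maturity(checks):
    """Per-facet fraction of checks passed, rewarding partial FAIRness."""
    return {facet: sum(c.values()) / len(c) for facet, c in checks.items()}

scores = maturity(CHECKS)
```

A scoring scheme like this surfaces where effort is most needed (here, interoperability) instead of delivering a single unhelpful "not FAIR" verdict.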
A Data-Centric Future
The FAIR principles provide a set of targets for data producers and publishers, guiding the implementation of good data management schemes. These should include a strategy focused on data-centric information systems that extract maximum value from the data while respecting the FAIR guiding principles. This could take a number of different forms, such as a data lake or a data warehouse, depending on the specific needs of an organization.
Driving a FAIR data change in biopharma
In order to support better data management and FAIRness in biopharma in the coming years, there are a number of questions that need to be asked:
What technical elements are needed to apply FAIR?
- Master Data Management (MDM): Needs to be involved in the modelling of URLs, DOIs and ontologies. MDM is a technological discipline in which business and IT work together to ensure that the organization’s official shared master data assets are uniform, accurate and semantically sound.
- IT partners: Must be involved for authorization and authentication steps.
- Data producers: Required for the standardization of the data itself.
- Data security: Need to be involved for governance on data usage and data privacy.
How FAIR should the data be?
It is important to recognize that FAIRification is an expensive and lengthy process because of the stewardship and curation resources it requires. Also, not all legacy data will be re-used – remember that no organization can be 100% FAIR. This means it is beneficial to invest wisely to maximize return on investment. To make informed decisions about which data to prioritize, it is useful to carry out a full assessment of the business value of each data source. Generally, for internal biopharma data, interoperability and re-usability are the FAIR principles with the greatest impact on data management.
Dr Feierberg discussed how the FAIRification of data at AstraZeneca was progressing: “I wish I could say that we are 100% FAIR, but that’s not the case, of course. FAIRifying data is an ongoing process, and once you do make the data FAIR, new business questions will be asked meaning that additional FAIRification needs to happen.”
How are FAIR gains going to be made?
A number of initiatives are now available for the FAIRification of data. The suitability of each depends on the organization and type of data being managed.
- Incentives: To increase the sharing of FAIR data, incentives could be offered for data being uploaded into public repositories – these need to be accessible enough for regular users, not just for expert data scientists.
- Pre-competitive collaboration: This involves companies within the biopharma industry working together to address the data management problem. For example, a public repository for procurement and supplier data is needed, as individual organizations currently carry out this work separately.
- Clear standards: Regulators need to define what data is considered sufficiently FAIR and provide clear targets on how data should be delivered. This would push organizations to be more consistent in their data outputs (FAIR metrics are discussed above).
- Demonstrate value of investment: The value of FAIRification investment should be widely demonstrated. Investing in the right people and the right approach is crucial for the implementation of FAIR principles: hardware and software for onsite data storage, an underlying conceptual framework, and a dedicated team with expertise in data stewardship and quality control are all necessary investments.
Advice for Implementing the FAIR Principles
The Driving FAIR in Biopharma report is based on anonymous excerpts from expert contributors. These include Chief Scientists and Directors from a variety of organizations, such as AstraZeneca, Eurofins and the FDA. For more information and top tips for starting a FAIR journey, check it out:
Here is a summary of their advice for starting a FAIR journey in the biopharma industry:
- Use the most commonly accepted Uniform Resource Identifiers (URIs): A URI is a type of persistent identifier. Identifiers.org is a useful resource for providing resolvable URIs for life science data.
- Use controlled vocabularies: This will enhance the findability of data. Consider the current and future use of the data before selecting which vocabularies to use, involving both end users and domain experts.
- Re-use existing ontologies: Re-using ontologies reduces the number of staff librarians required and decreases maintenance costs. Ontologies can be adapted with individual terms: search for the closest fit and adapt it to your specific needs.
- Do not over-engineer ontologies: If creating new ontologies is absolutely essential, start small. Over-engineering tends to lead to failure at this stage, so begin modestly. Determining the best ontology for a biopharma company and analyzing the data it generates requires expertise in the relevant science domains.
- Focus on interoperability: Interoperability should be considered in the design phase, not just later down the line.
- Connect with existing FAIR-aligned repositories: Resources that are connected to existing FAIR-aligned repositories allow integration with related datasets and support the development of standards. Also, many domain repositories offer specialized tools for extraction, visualization and documentation. Familiarize yourself with standard data models for health data, such as the OMOP Common Data Model.
Committing to the FAIR principles in biopharma
Although numerous challenges face the implementation of FAIR principles in biopharma, a commitment to change the traditional ways of data management in the industry will lead to a huge positive shift. For managers of these organizations, the concept of sharing information and re-using data is alien and defies decades of business acumen. For researchers, writing and publishing unique and novel papers has always been the goal, with data always being generated for a single purpose.
However, a shift in mentality is now being asked for. Scientists need to view their data as having uses beyond its original purpose, and managers need to let go of historical assumptions about data confidentiality. By incentivizing these efforts, FAIR data will ultimately allow better strategies to be developed and enable a far more productive drug development pipeline.
Open Resources for FAIR
FAIRification is an evolving process, particularly in the biopharma industry – it may seem complex at first, but it is critical to start the transition sooner rather than later. A growing number of resources are available to offer assistance; these can help to transform unstructured data into structured, actionable data and reduce the obstacles to implementing FAIR principles. The following papers are a good starting point:
- Implementation and relevance of FAIR data principles in biopharmaceutical R&D
- A design framework and exemplar metrics for FAIRness
- Interoperability and FAIRness through a novel combination of Web technologies
What does the FAIR Future Hold?
The FAIR principles act as a guide to assist data publishers in making their digital research artifacts findable, accessible, interoperable and re-usable. Adhering to the guidelines as much as possible enables a broad scope of data exploration and allows valuable research to reach its full potential. FAIR data principles present the opportunity to reduce the time and money that scientists spend on manual, laborious data preparation tasks. Instead, effort can be purely devoted to data analysis – who knows what that could make possible?
Don’t forget to follow Front Line Genomics for more information about how genomics is being used to benefit patients.
Image credit: 123RF