The use of big data in drug research has reached an all-time high in recent years. The demand for faster and more efficient data access has encouraged the growth and innovation of data repositories.
Historically, data management systems for biopharma companies were based heavily on structured paradigms that assumed data was being captured with a specific question in mind. Biopharma companies have invested heavily in attempts to resolve issues linked to the big data boom, many of which have been unsuccessful. Now, however, the innovative use of emerging technologies is starting to allow data to be captured first, before knowing what it will be used for, and then stored efficiently. The FAIR principles are helping to guide this transition.
A Guide to the FAIR Principles in Biopharma
Below, databases, data lakes and data warehouses are described, along with an example of how Merck used the FAIR principles to cleanse its data lake.
Databases
Databases are organized collections of data on a computerized system, enabling the search, selection and storage of information. A database stores real-time information, is usually accessible in various ways and is controlled by a database management system (DBMS). Since their inception in the early 1960s, they have evolved dramatically. Today, cloud databases and self-driving databases are shifting the way that data is collected, stored, managed and utilized.
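As a minimal illustration of that search-select-store cycle, the sketch below uses Python's built-in sqlite3 module as the DBMS; the compound table and its columns are invented for the example and are not drawn from any specific biopharma system.

```python
import sqlite3

# Store: create a table and insert records via the DBMS
conn = sqlite3.connect(":memory:")  # in-memory database, purely for demonstration
conn.execute("CREATE TABLE compounds (id INTEGER PRIMARY KEY, name TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO compounds (name, target) VALUES (?, ?)",
    [("aspirin", "COX-1"), ("imatinib", "BCR-ABL")],
)

# Search and select: retrieve only the rows matching a condition
for (name,) in conn.execute("SELECT name FROM compounds WHERE target = ?", ("COX-1",)):
    print(name)  # -> aspirin

conn.close()
```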
Differences between data lakes and data warehouses
There are several key differences between a data lake and a data warehouse, making each model unique. Depending on a company’s needs, both may be necessary.
| | Data lakes | Data warehouses |
| --- | --- | --- |
| Data structure | Store raw, unprocessed data, which requires larger storage capacity, more maintenance and higher costs. | Store processed and defined data. |
| Purpose of data | Not yet determined, so the data is less organized. | Data is currently in use. |
| Users | Hard to navigate due to the unstructured data; data scientists are usually required to translate it for a specific use. | The processed data is stored as spreadsheets and tables, which all employees can understand. |
| Accessibility | The architecture is easy to access and change, with few limitations. | More difficult to manipulate due to their structure, but this makes the data itself easier to explore. |
Data warehouses
Data warehouses are central repositories that pull information together from many different data sources within an organization. They store historical data, which is typically processed and defined. Once the data has been through formatting and import procedures to match what is already in the warehouse, the systems can be used to reveal business intelligence and enhance data-driven decisions.

Data warehouse implementation: the data source is determined, followed by Extract, Transform, Load (ETL), which optimizes the data loading process without sacrificing data quality. The Enterprise Data Warehouse (DWH) is then implemented and deployed; how this happens varies, depending on the number of end users and how they will access the data warehouse system. Finally, Business Intelligence (BI) tools are improved by the collection and extraction of data from any source at all stages of the business. Image credit: E. Lisowski, 2021
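A minimal sketch of that Extract, Transform, Load sequence is shown below, with a CSV source and a SQLite table standing in for the warehouse; real pipelines use dedicated ETL tooling, and all names here are illustrative assumptions.

```python
import csv
import io
import sqlite3

# Extract: pull raw rows from a source system (an in-memory CSV here)
raw_source = io.StringIO("patient_id,weight_kg\n001,70\n002,\n003,81\n")
rows = list(csv.DictReader(raw_source))

# Transform: validate and conform records to the warehouse schema
clean = [
    (r["patient_id"], float(r["weight_kg"]))
    for r in rows
    if r["weight_kg"]  # drop records that fail validation, preserving quality
]

# Load: write the conformed records into the warehouse table
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE measurements (patient_id TEXT, weight_kg REAL)")
warehouse.executemany("INSERT INTO measurements VALUES (?, ?)", clean)
print(warehouse.execute("SELECT COUNT(*) FROM measurements").fetchone()[0])  # -> 2
```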
Data warehouses are actively being implemented in the healthcare industry for predicting results, reporting on treatment and sharing data with research institutions. In particular, cloud data warehouses are increasingly being used to offer collaboration and accessibility from anywhere, features that weren't available just a decade ago. For example, a cloud data warehouse for clinical trials enables unprecedented data gathering and access. Identifying participants, implementing standardized protocols, integrating genomic data for novel biomarkers and building an interoperable network have all been made possible with this new data technology.
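To make participant identification concrete, the sketch below runs a hypothetical eligibility query joining clinical and genomic tables; the schema, the age range and the EGFR_T790M biomarker are all invented for illustration, and a real deployment would issue similar SQL against the cloud warehouse itself.

```python
import sqlite3

# A toy stand-in for the warehouse, with hypothetical clinical and genomic tables
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE patients (patient_id TEXT, age INTEGER);
CREATE TABLE genomic_markers (patient_id TEXT, marker TEXT);
INSERT INTO patients VALUES ('P1', 54), ('P2', 71);
INSERT INTO genomic_markers VALUES ('P1', 'EGFR_T790M'), ('P2', 'EGFR_T790M');
""")

# Identify eligible participants by joining clinical and genomic data
eligible = db.execute("""
    SELECT p.patient_id
    FROM patients p
    JOIN genomic_markers g ON g.patient_id = p.patient_id
    WHERE p.age BETWEEN 18 AND 65 AND g.marker = 'EGFR_T790M'
""").fetchall()
print(eligible)  # -> [('P1',)]
```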
Data lakes
Data lakes are centralized repositories that store structured and unstructured data at any scale. Since the first data lake was developed in 2010, the number of companies using them has increased exponentially. Organizations that implement a data lake have been shown to outperform companies that don’t by 9% in organic revenue growth. This is because they enable new types of analytics to identify opportunities for business growth, increase productivity and enhance informed decision making.
However, data lakes are susceptible to a few common difficulties, one of which is the ‘small file problem’. This arises when many files each hold only a small amount of data, making them inefficient to run computations over or to keep metadata statistics on. Regular maintenance that compacts the data into files of an ideal size is a viable solution, allowing efficient analysis. Also, without proper schema enforcement and thoughtful workflows, data lakes can degrade into ‘data swamps’, which have little organization and no curation, making them largely unusable. Nevertheless, data lakes that incorporate FAIR principles avoid this fate and tend to be beneficial, particularly for healthcare companies, because they can store a combination of structured and unstructured data.
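The sketch below illustrates one form that compaction maintenance can take, assuming a lake of many small Parquet files; the paths, row counts and the use of the pyarrow library are assumptions for the example, not a prescribed method.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Simulate the small file problem: write 100 one-row Parquet files
for i in range(100):
    ds.write_dataset(
        pa.table({"event_id": [i], "value": [i * 0.5]}),
        "lake/events/",
        format="parquet",
        basename_template=f"part-{i}-{{i}}.parquet",
        existing_data_behavior="overwrite_or_ignore",
    )

# Compact: read the small files as one logical dataset and rewrite
# them as fewer, larger files that are efficient to scan
ds.write_dataset(
    ds.dataset("lake/events/", format="parquet"),
    "lake/events_compacted/",
    format="parquet",
    max_rows_per_file=1_000_000,  # target file size expressed as a row count
    max_rows_per_group=100_000,
)
```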
Phillip Ross, a director of data science at Bristol-Myers Squibb, explained: “Data lakes are hard to do right at pharmaceutical companies. Nevertheless, converting disparate data storage systems into large repositories is explosive right now. We want to make sure that we are at the forefront of this evolution.”
Amgen, a biotechnology company, is currently coordinating two data lakes – one covering process development, quality control and manufacturing, and the other covering clinical trials. Janet Cheetham, leader of the digital and data transformation program, said: “When considering data lake architecture, there are two main things to consider: (1) what sits beneath the lake – the storage system – and (2) what’s on top of the lake – the analytics. Cloud storage provides infinite capacity at a very low cost, allowing information from drug development all the way to commercialization to be gathered in one place. However, the cloud does not address the challenge of maintaining data integrity and driving its value up – these problems can’t simply be solved with software either.” This is why FAIR principles are necessary.

Primary components of a data lake – storage, format, compute and metadata. Storage is where the data physically resides. Format influences the available options and compute engine compatibility. Compute is responsible for the crunching of data. Metadata maintains data schema and can be enhanced to increase computational efficiency. Image credit: P. Singman, 2021
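The toy example below maps those four components onto concrete pieces, assuming pyarrow and Parquet: a local path for the storage, Parquet as the format, a pyarrow filter as the compute, and the table schema as the metadata, which is checked before writing to help keep the lake from becoming a swamp. All names are illustrative.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Metadata: the schema every incoming batch must satisfy
EXPECTED = pa.schema([("sample_id", pa.string()), ("purity_pct", pa.float64())])

incoming = pa.table({"sample_id": ["S1", "S2"], "purity_pct": [99.1, 97.4]})
assert incoming.schema.equals(EXPECTED), "reject writes that would break the schema"

# Storage + format: persist the batch as a Parquet file on disk
pq.write_table(incoming, "samples.parquet")

# Compute: crunch the data, here a simple filter on purity
table = pq.read_table("samples.parquet")  # the schema travels with the file
high_purity = table.filter(pc.greater(table["purity_pct"], 98.0))
print(high_purity.num_rows)  # -> 1
```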
To find out more about data lakes, check out the Driving FAIR in Biopharma report, which includes thoughts from Ben Gardener (AstraZeneca) on whether they hinder or benefit data management.
Cleansing a biopharma data lake
Dr Haydn Boehm from Merck recently explained how the organization evaluated its biopharma R&D datasets using the FAIR principles. Merck’s data lake held over 50 years of fragmented, unstructured and untrustworthy data, rendering much of it unusable. A huge amount of time and effort was needed to source any information, and the poor data quality was limiting the ability to train AI models.
To ‘cleanse’ their data lake, the Analytical Information Mark-up Language (AnIML) data standard was used. It uses a format that is human- and machine-readable and supports data from multiple analytical techniques. It also focuses on accessibility and easy adoption. BSSN software was used to visualize, analyse and optimize the entire process.
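As an illustration of that human- and machine-readability, the fragment below is a heavily simplified, AnIML-style XML document (not a complete or schema-valid AnIML file), read with Python’s standard library.

```python
import xml.etree.ElementTree as ET

# A simplified, illustrative AnIML-style fragment; real AnIML documents
# are richer and are validated against the official schema.
DOC = """
<AnIML version="0.90">
  <SampleSet>
    <Sample name="Buffer A" sampleID="S-001"/>
    <Sample name="Lysate" sampleID="S-002"/>
  </SampleSet>
</AnIML>
"""

root = ET.fromstring(DOC)
# Machine-readable: samples can be enumerated programmatically,
# while the same markup stays legible to a human reader
for sample in root.iter("Sample"):
    print(sample.get("sampleID"), sample.get("name"))
```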
As research continues to move towards an increasingly data-driven future, modern technologies, such as AI, will play a pivotal role in enhancing efficiency and productivity in biopharma. Powerful analytic tools require access to high-quality, properly annotated and managed data. Therefore, harmonization of data will accelerate the digital transformation of the industry. Standards such as AnIML, and the tools that support them, can help to drive data intelligence at scale, providing biopharma organizations with more efficient access to scientific data.