Making Space for Biomarker Data
Clinical research, particularly cancer diagnostics, has seen an explosion in the collection of biomarker data in recent years. With the advancements in personalised medicine, new treatment strategies have been developed. These involve profiling an individual’s clinical features, molecular profile, and details of the tumour microenvironment. While therapeutics and diagnostic technology are rapidly evolving, only a subset of cancer patients have benefited from these developments.
To address this, there is a strong push for the identification of biomarkers – molecular or genetic indicators used to predict disease. Biomarkers are derived from a broad range of data, including genomic, proteomic and imaging data, and together they provide a holistic approach to disease diagnosis. In the case of cancer, this helps to combat issues of intra- and inter-tumour heterogeneity, as well as clonal evolution.
Cancer biomarker research is an extremely data-intensive discipline. The massive increase in biomarker data provides an unprecedented opportunity to shift toward a data-driven paradigm of research. A recent paper in Cell suggests this could be achieved through adaptation of data collection techniques from the field of planetary science.
Lessons Learned from Planetary Science
Planetary science utilises petascale, heterogeneous archives. This diverse range of data is collected, processed through data-generation pipelines and transported to international archives. Artificial intelligence and machine learning technology can then be used to identify key details missed by the human eye. Adapting these data-driven methods for cancer research provides an opportunity to automate and accelerate the discovery of biomarkers. This will promote the identification of key features and anomalies, and enable novel scientific insights.
In planetary science, all publicly funded data must be entered into the National Aeronautics and Space Administration (NASA) data archive and must follow specific data standards. However, the same deposition and sharing of data remains a challenge for biomarker research. For efficient data sharing, national access to large amounts of well-labelled data is needed. This data must also follow a set of standards which allow it to be linked to other databases and used by algorithms.
Applications to Biomarker Research
In 2000, the National Institutes of Health met with NASA to explore data science solutions to biomarker data challenges. The Early Detection Research Network (EDRN) of the National Cancer Institute was chosen to test these applications. Data science technologies developed at NASA’s Jet Propulsion Laboratory were applied, resulting in the production of the EDRN knowledge system. This system has since been applied to a range of data types, from genomics to imaging, which has facilitated multi-omics data analysis. The EDRN has successfully developed large databases sharing well-characterised imaging and biomarker data for a variety of cancers.
The success of the Human Genome Project has prompted the onset of new large-scale projects, including The Cancer Genome Atlas. This project successfully coordinates analyses among centres, whilst retaining sample availability and promoting broad and rapid data sharing. Further advances are needed to coordinate the analysis of multi-omic data. Specifically, the combination of imaging, risk factor and outcome data is needed to fully characterise neoplasia development and outcomes.
The Future of Biomarkers
Artificial intelligence and machine learning can only truly benefit the medical sphere if high-quality, well-labelled data is available. A national knowledge system of cancer biomarker data could produce substantial opportunities to advance cancer research. Furthermore, visualisation is extremely important for big data analysis. Virtual reality (VR) tools are beginning to aid in the visualisation of complex 3D datasets. For example, 3D MRI scans allow much deeper analysis than 2D images. Machine learning, when combined with human visualisation, also shows promise. For example, early lesions which often lead to lung cancer can be identified by semantic features, but these analyses often lack consistency. Combining crowdsourcing tools from astronomy with traditional visualisation methods could improve the generalisability and availability of this data.
Overall, the authors recommend that the cancer biomarker community collaborate to develop a national strategy for data-driven research. This involves building a national biomarker database, with specific data standards, coordinated algorithms and tools. Expansion and development of visualisation techniques for imaging is also advised. Additionally, retrospective analysis of biomarker data is recommended to broaden the database. Another fundamental requirement is that all data acquired by funded research must be deposited in the database.
Having successfully built such capabilities through the EDRN and other institutions, the cancer biomarker community is well placed to usher in this new era of data science.