At the Festival of Genomics and Biodata, we are lucky enough to be joined by some of the biggest names in the business. In January, we sat down with a few of our esteemed speakers to chat about their backgrounds, roles and the work of their organisations. In this interview, we speak with Henning Hermjakob, Head of Molecular Systems at EMBL-EBI, about the future of multi-omics, big data and his reasons for attending the Festival of Genomics and Biodata.
Please note the transcript has been edited for brevity and clarity.
Interview originally conducted by Miyako Rogers.
FLG: Can you first tell us a little about yourself, your background and how you came to lead the Molecular Systems team at EMBL-EBI?
By training, I am a bioinformatician – I was kind of in the ‘Year Zero’, when you could start to study bioinformatics. I studied in Bielefeld in Germany. After working for a year in Germany, I joined the European Bioinformatics Institute in Hinxton, Cambridge, UK. I have been there ever since. I planned for the usual ‘three academic years and then move on.’ And that was about 25 years ago!
FLG: You spoke in your talk at the Festival about Reactome. Can you give us an overview of the Reactome database and explain how it is a valuable resource for researchers studying biomolecular pathways?
Reactome is a manually curated database of biomolecular pathways. And the key point is that it allows you to better zoom in on the analysis of your high-throughput data, into areas which might be most interesting for further research.
As with most databases, it does not give you the perfect answer. But if you must look for a needle in a haystack, it tells you which corner of the haystack is best to look at. It might even give you a magnet. In more practical terms, you can have a set of a few hundred over-represented, or otherwise selected, genes of interest. And in the simplest analysis, you can cut and paste them into Reactome. Within three seconds, you get the result of a gene enrichment analysis. It’s so fast that you can easily explore multiple gene sets.
We also offer more advanced analyses which can take longer but consider quantitative results. This also allows us to, in one run, analyse the results from multiple experimental modalities. For example, from transcriptomics and proteomics in the same analysis.
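As an editorial illustration of the simple enrichment idea described above, here is a minimal sketch of an over-representation test using the hypergeometric distribution. It is not Reactome's implementation or API; the function names, the toy pathways and the background size are all assumptions made up for this example, and a real analysis would also correct for multiple testing.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are in the pathway."""
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible draws contribute nothing to the sum.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def enrich(gene_list, pathways, background_size):
    """Rank pathways by over-representation of the submitted genes.

    Assumes every submitted gene is part of the background set.
    """
    genes = set(gene_list)
    results = []
    for name, members in pathways.items():
        hits = genes & members
        p = hypergeom_sf(len(hits), background_size, len(members), len(genes))
        results.append((name, len(hits), p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy data: two pathways in a background of 20 genes.
toy_pathways = {
    "pathway_A": {"g1", "g2", "g3", "g4", "g5"},
    "pathway_B": {"g6", "g7", "g8", "g9", "g10"},
}
ranked = enrich(["g1", "g2", "g3", "g6"], toy_pathways, background_size=20)
for name, hits, p_value in ranked:
    print(f"{name}: {hits} hits, p = {p_value:.4f}")
```

Because the test is cheap to compute, running it on several gene sets in a row is practical, which matches the exploratory use described in the answer.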
FLG: Could you also talk about some of the other resources that your team provide for systems biology and how they are used by researchers?
My team provides a set of resources which have increasing levels of abstraction.
The IntAct database of protein-protein interactions is a primary database, which records the data from the literature either extracted by expert curators or submitted by users. It provides a large-scale network of molecular interactions.
At the next level of abstraction, we have the Reactome database. When you add context to molecular interactions and define their directionality, you get into the pathway space, and that’s what we record in Reactome. Reactome is a secondary database in the sense that it contains the conclusions scientists have drawn from the analysis of primary data.
Finally, if you take a pathway, and you add a mathematical framework around it – rate constants, concentrations and differential equations – that describes what is happening in the transformations along the network, then you ideally get a predictive systems biology model. This is the data we collect in the BioModels database.
Of course, there are interconnections between these three resources, so we aim to provide network context for the analysis of large-scale data.
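To illustrate what "adding a mathematical framework" to a pathway can look like, here is a minimal sketch of a two-step pathway A → B → C with mass-action kinetics, integrated with a hand-written fourth-order Runge-Kutta step. The pathway, the rate constants and all function names are hypothetical and chosen only for this example; models in BioModels are typically far richer and are not stored in this form.

```python
import math


def derivatives(state, k1, k2):
    """Mass-action rate equations for A -> B -> C."""
    a, b, c = state
    return (-k1 * a, k1 * a - k2 * b, k2 * b)


def rk4_step(state, dt, k1, k2):
    """Advance the concentrations by one Runge-Kutta (RK4) step."""
    def shifted(base, slope, h):
        return tuple(x + h * s for x, s in zip(base, slope))

    s1 = derivatives(state, k1, k2)
    s2 = derivatives(shifted(state, s1, dt / 2), k1, k2)
    s3 = derivatives(shifted(state, s2, dt / 2), k1, k2)
    s4 = derivatives(shifted(state, s3, dt), k1, k2)
    return tuple(
        x + dt / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
        for x, d1, d2, d3, d4 in zip(state, s1, s2, s3, s4)
    )


def simulate(initial, k1, k2, t_end, dt=0.01):
    """Return the concentrations (A, B, C) at time t_end."""
    state = tuple(initial)
    for _ in range(round(t_end / dt)):
        state = rk4_step(state, dt, k1, k2)
    return state


# Hypothetical rate constants and starting concentrations.
final = simulate((1.0, 0.0, 0.0), k1=0.5, k2=0.2, t_end=10.0)
analytic_a = math.exp(-0.5 * 10.0)  # closed-form A(t) = A0 * exp(-k1 * t)
print(f"A = {final[0]:.5f} (analytic {analytic_a:.5f}), B = {final[1]:.5f}, C = {final[2]:.5f}")
```

The model is predictive in the sense used above: once rate constants and starting concentrations are fixed, the equations determine the concentrations at any later time.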
FLG: Could you explain how a multi-omics approach helps better enhance our understanding of biological systems?
It simply helps you to look at the same experimental system from more than one perspective. Whilst there is a big discussion in the literature about how much the proteome is predetermined by the genome and by the transcriptome, it’s clear that it’s not a complete dependency.
If you can analyse your data based on both transcriptomics and proteomics, you get a broader view. Providing a resource that allows you to analyse both modalities, or multiple modalities, in the same context simply provides an efficient way of gaining an overview. For example, it may show you that a certain pathway is highly upregulated based on your transcriptomics data but is downregulated or remains constant based on proteomics data. That is something which you might want to pick up in your further analysis.
FLG: What do you think are some of the current challenges and limitations in multi-omics research? We’ve heard a lot about spatial, we’ve heard a little bit about temporal and people trying to add more and more context, but what do you think are the limitations there?
For Reactome, the biggest limitation is that we are representing a kind of generic cell. This is what we curate for, what the data represents and what the analysis is meant for. Of course, a lot of research is now focused on specific cell types and on single-cell data. But it would be impossible to do the very labour-intensive and time-intensive careful manual curation in Reactome on a sufficient scale to represent even just the major tissues of the human body. So, our philosophy is that we provide this reaction space for a generic cell, which you can then specialise by providing context-specific, tissue-specific or cell-type-specific expression data and projecting the generic reaction space into your specific context. That approach has limitations. I would say that this is our biggest challenge, and how to do this best is an ongoing topic of research.
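One very simple way to picture such a projection is sketched below: keep only those reactions of the generic cell whose participating genes are all expressed above a cut-off in the context of interest. This is an editorial sketch under strong assumptions, not the method Reactome uses; the reaction identifiers, gene names, expression values and the threshold are all hypothetical.

```python
def project_reactions(reactions, expression, threshold=1.0):
    """Keep reactions whose participating genes are all expressed.

    reactions:  mapping of reaction id -> set of participating gene names
    expression: mapping of gene name -> expression level in one context
    Genes missing from the expression data are treated as not expressed.
    """
    active = {}
    for reaction_id, genes in reactions.items():
        if all(expression.get(gene, 0.0) >= threshold for gene in genes):
            active[reaction_id] = genes
    return active


# Hypothetical generic reaction space and one tissue-specific expression profile.
generic_reactions = {
    "R1": {"GENE_A", "GENE_B"},
    "R2": {"GENE_C"},
    "R3": {"GENE_A", "GENE_D"},
}
tissue_expression = {"GENE_A": 5.0, "GENE_B": 2.0, "GENE_C": 0.1}

tissue_reactions = project_reactions(generic_reactions, tissue_expression)
print(sorted(tissue_reactions))
```

A hard threshold is the crudest possible choice; it ignores protein abundance, isoforms and partial redundancy between genes, which hints at why the answer above describes the projection as an open research question.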
FLG: Your team develop data infrastructure resources, like the Omics Discovery Index. Could you tell us about those and also the importance of data standardization in the age of big data?
These resources aim to support the FAIR principles by making data more discoverable, in the case of the Omics Discovery Index (OmicsDI), and by making data stably referenceable through identifiers.org. We are aiming to provide a single discovery space for data from multiple omics modalities. In the Omics Discovery Index, we index data from multiple resources across proteomics, transcriptomics and metabolomics to overcome siloing, which often happens in the specialised resources, because each resource needs to provide specific features. You cannot have just one giant database in which you lump all data.
As a result, scientists conducting multi-modal experiments often face the challenge of having to split their data across various repositories. However, with OmicsDI, we can now view all the different datasets in context, including those associated with a particular publication. This allows the user to search across modalities to zoom in quickly on relevant data.
But we are not aiming to collect the actual data, which would be a humongous task. Instead, we collect the metadata, not only from providers at the European Bioinformatics Institute but also from beyond, across four continents (currently 24 databases), so that you have a comprehensive discovery capability across many resources relevant to multi-omics research.
FLG: Going back to big data – how do you see multi-omics research benefiting from the use of big data? And also some of the limitations of trying to mine that raw data and turn it into informative data, and how AI and ML can help?
This is a core topic here at the Festival. It is impossible to bring all the data together in one place for analysis. So, the analysis needs to come to the data. Many providers and speakers here are offering, or at least exploring, federated systems that allow you to ask the same question, or run the same algorithm, across many data sets in many different locations. This addresses two problems: (a) the larger the data is, the more immovable it becomes, and (b) privacy concerns. For example, sometimes a hospital says the data cannot leave its premises, so instead, you bring the analysis to the data.
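The idea of bringing the analysis to the data can be sketched in a few lines. In this minimal example, each site runs the same function on its own values and shares only aggregate summaries, which a coordinator then combines into a global mean and variance. It is an illustration only: the class and function names and the toy measurements are assumptions, and real federated systems add authentication, auditing and safeguards against leaking information through small aggregates.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class LocalSummary:
    """Aggregate statistics that are allowed to leave a site."""
    n: int
    total: float
    total_sq: float


def summarise_locally(values: Sequence[float]) -> LocalSummary:
    """Runs inside each site; the raw values never leave it."""
    return LocalSummary(len(values), sum(values), sum(v * v for v in values))


def combine(summaries: Iterable[LocalSummary]) -> Tuple[float, float]:
    """Runs at the coordinator; returns the pooled mean and population variance."""
    summaries = list(summaries)
    n = sum(s.n for s in summaries)
    if n == 0:
        raise ValueError("no observations reported by any site")
    mean = sum(s.total for s in summaries) / n
    variance = sum(s.total_sq for s in summaries) / n - mean ** 2
    return mean, variance


# Hypothetical measurements held at two separate sites.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]

pooled_mean, pooled_variance = combine(
    [summarise_locally(site_a), summarise_locally(site_b)]
)
print(pooled_mean, pooled_variance)
```

The coordinator reaches the same result it would get from pooling the raw values, yet it only ever sees three numbers per site.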
FLG: Where do you see the field of multi-omics systems biology going in the next 5-10 years?
If we reflect on the predictions made five years ago, most of them would likely have been inaccurate. It’s similar to attempting to predict the future through a crystal ball.
I think that multi-omics research, on a single-cell basis, is where lots of interesting data will come from in the future. Not only a single cell, in resolution, but a single cell in its tissue context. This is where we will see lots of cool data.
FLG: Thank you. So finally, I hope you’ve enjoyed the Festival so far. What is the reason you attended and why do you think other people in this space should come to the Festival?
This is my first time attending, and I came on a hunch, because based on the title, the event doesn’t seem directly related to my main focus, which is primarily on proteomics-oriented and network resources. But it is a hugely interesting event. It was an eye-opening experience to see the impact of genomics research in the clinic. It is remarkable how much progress has been made, and it’s impressive to see research on big data now making an impact in clinical practice.