Understanding protein structure and function is central to our knowledge of many biological processes, yet only a tiny percentage of proteins have had their structure accurately determined using conventional methods. DeepMind’s AlphaFold, a recent artificial intelligence (AI) endeavour, has successfully predicted around 200 million structures, covering almost every known protein. However, billions of potential proteins from previously unknown organisms have been identified through large-scale metagenomic sequencing, and much remains to be learned about them.
In a bid to speed up the process of protein structure prediction even further and learn more about these mysterious proteins, Meta AI have developed their own AI technology to challenge AlphaFold. In a recent pre-print, Alexander Rives and his team describe a new large language model and use it to characterise over 600 million novel protein structures from metagenomic data.
Slow and steady is the standard
The current standard method of tertiary structure prediction from sequence data begins with multiple sequence alignment of related proteins. This process takes sequences from multiple proteins and finds areas of overlap. This data can be used to fill in any potential gaps and to predict structure based on the properties of the constituent amino acids. Although faster than experimental methods such as X-ray crystallography, this approach can be computationally strenuous and time-consuming, and the vast majority of known proteins do not have an accurate predicted structure.
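To make the idea of finding areas of overlap concrete, here is a minimal sketch in Python. It assumes the sequences have already been aligned (gaps marked with "-") and simply reports the most common amino acid in each column, along with how strongly the sequences agree. The sequences and the function name column_consensus are made up for illustration; real alignment tools are far more sophisticated.

```python
from collections import Counter

def column_consensus(aligned):
    """Return (residue, agreement) for each column of a pre-aligned set of sequences.

    Hypothetical helper for illustration: `aligned` is a list of equal-length
    strings in which "-" marks a gap.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")

    result = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            # Every sequence has a gap here, so nothing can be inferred.
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # Agreement is the fraction of all sequences that share this residue.
        result.append((residue, count / len(column)))
    return result

# Toy, made-up alignment of three related sequences.
alignment = [
    "MKT-YIAK",
    "MKTAYLAK",
    "MRTAYIAK",
]

columns = column_consensus(alignment)
consensus = "".join(residue for residue, _ in columns)
agreement = [round(score, 2) for _, score in columns]
```

Here the gap in the first sequence is filled by the residue that the other two sequences agree on, which is the intuition behind using related proteins to fill in missing information.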
The language of protein structure prediction
The constraints of the above methods highlight the need for a breakthrough in the field, and this has come about in the form of AI. DeepMind have already made history with their AI technology, yet AlphaFold still has its drawbacks – it can take up to a few minutes to predict one structure, and time is precious when there are billions of proteins to work with.
Meta AI’s attempt at efficient protein structure prediction uses a large language model. Large language models are a form of AI that learn to predict patterns in language, filling in gaps and predicting what comes next. They can even go so far as to capture something of the meaning of words. Using these models to predict protein structure is a novel concept, based on the idea that there is an underlying pattern to how related proteins have evolved. By providing amino acid sequences to a language model as if they were sentences, the model should be able to make predictions about other sequences. As the model is exposed to more data, its capabilities grow, and there is certainly no shortage of data.
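The fill-in-the-gap idea can be illustrated with a deliberately tiny sketch in Python. This is not Meta AI's model: instead of a neural network, it just counts which amino acid most often appears between a given pair of neighbours in some training sequences, then uses those counts to guess a hidden residue. The sequences, the "_" mask symbol and the function names are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_context_counts(sequences):
    """Count which residue appears between each (left, right) pair of neighbours."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(1, len(seq) - 1):
            counts[(seq[i - 1], seq[i + 1])][seq[i]] += 1
    return counts

def predict_masked(counts, sequence, mask="_"):
    """Guess the single masked residue from its two immediate neighbours.

    Returns None when the mask sits at either end of the sequence or when
    the neighbouring pair was never seen during training.
    """
    i = sequence.index(mask)
    if i == 0 or i == len(sequence) - 1:
        return None
    candidates = counts.get((sequence[i - 1], sequence[i + 1]))
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Toy, made-up training sequences standing in for a protein database.
training = ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK", "MRTAYIAK"]
model = train_context_counts(training)

# Hide one residue and ask the model to fill it in.
filled = predict_masked(model, "MKTAY_AK")
```

A real language model looks at the whole sequence rather than two neighbours and learns from hundreds of millions of examples, but the training signal is similar in spirit: hide part of the sequence and learn to predict it.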
Meta AI have developed a large language model that has successfully predicted a high-resolution structure for over 617 million different proteins, tripling the catalogue of known protein structures in just two weeks. The state-of-the-art technology was capable of performing this momentous task without using the standard multiple sequence alignment step or needing to search databases for related proteins, reducing the monetary, computational and time-related costs of protein structure prediction even further than AlphaFold already has. The use of metagenomic DNA from a variety of sources, such as the human microbiome and soil, meant that the range of sequences used in the initial trial was diverse and the resulting proteins were mostly novel, native to previously unknown unicellular organisms. This contrasts with AlphaFold’s catalogue, which is mostly made up of similar structures of known proteins. Describing the findings, research lead Alexander Rives commented that “they offer the potential for great insight into biology.”
What does this mean in practice?
Protein structure is subject to strict constraints, namely the hydrophobicity, charge and size of the constituent amino acids. A mutation that alters the amino acid sequence and its properties can have huge implications for a protein’s structure, causing it to misfold and potentially lose its normal function. The proteins we see today are the result of millions of years of structural evolution, and our understanding of how proteins function matters for all aspects of life, including health. Cystic fibrosis, for example, is caused by malfunction in the CFTR protein due to misfolding.
Meta AI aim to characterise billions of proteins with their new technology, which will improve research in relevant fields, and the team has published all of the predicted structures in an open-source Atlas for use in research. As more sequences are provided, the technology will become even more efficient, potentially revolutionising protein research as we know it. However, it is important to note that the accuracy of the technology is not yet confirmed.