Since the discovery in 1953 of the double-helical structure of DNA, and its role in encoding genetic information, genetics has exploded as a science, driving the rapid development of the biotechnology industry and the use of genomics as a powerful source of health information. But to carry out their function, DNA sequences must be converted into messages that can be used to produce proteins, the complex functional molecules within our bodies. The linear DNA sequences encode, and are eventually translated into, linear sequences of amino acids, which are the building blocks of proteins.
Proteins are involved in almost every important activity within a living organism, from fighting off invading pathogens, digesting food and building structures such as muscle fibres or hair, to providing oxygen to cells and acting as messengers between cells. Proteins can undertake this vast array of different functions as a direct consequence of their structure. Although they are composed of a string of amino acids in a particular order, they do not remain one-dimensional but instead fold into complex three-dimensional shapes. Since there are 20 different types of amino acids, each with specific chemical properties, and proteins can range in size from tens to thousands of amino acids, a folded protein can display a vast range of chemical and functional characteristics. Understanding how proteins fold, and into what shape, in order to provide their exquisite functional specificity is therefore essential to understanding how organisms function.
The protein-folding conundrum
Although a protein is almost completely defined by its linear sequence, it has been all but impossible to predict the shape into which it will fold. In 1972, Nobel Prize winner Christian Anfinsen predicted that one day it would be possible to determine a protein’s three-dimensional shape based solely on its linear sequence. But for nearly 50 years this problem remained a grand challenge for biologists.
The problem is that a protein can theoretically fold into about 10^300 different conformations, and it would take an impossibly long time for a protein molecule to sample every one of them. It is tempting to assume that proteins fold into the correct conformation as they are synthesised, block by block, but this does not appear to be the case, and unfolded (denatured) proteins can almost always be coaxed back into their correctly folded state. In many cases, specific structures can be identified within proteins, such as α-helices and β-sheets, that form secondary building blocks and contribute to the final structure.
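To give a feel for the scale of that number, the short Python sketch below estimates how long an exhaustive search would take. The figure of roughly 10^300 conformations comes from the text above; the sampling rate of 10^13 conformations per second and the age of the universe (about 13.8 billion years) are illustrative assumptions, not measured properties of any particular protein.

```python
import math

# From the article: roughly 10^300 possible conformations.
LOG10_CONFORMATIONS = 300
# Assumption for illustration only: one new conformation tried
# every 0.1 picoseconds, i.e. 10^13 conformations per second.
LOG10_RATE_PER_SECOND = 13
# Approximate age of the universe (about 13.8 billion years) in seconds.
AGE_OF_UNIVERSE_SECONDS = 4.35e17


def exhaustive_search_log10_seconds(log10_conformations, log10_rate_per_second):
    """Return log10 of the seconds needed to try every conformation once."""
    return log10_conformations - log10_rate_per_second


log10_seconds = exhaustive_search_log10_seconds(
    LOG10_CONFORMATIONS, LOG10_RATE_PER_SECOND
)
log10_universe_ages = log10_seconds - math.log10(AGE_OF_UNIVERSE_SECONDS)

print(f"Exhaustive search: about 10^{log10_seconds:.0f} seconds")
print(f"That is about 10^{log10_universe_ages:.0f} times the age of the universe")
```

Working in powers of ten keeps the arithmetic readable. Whatever sampling rate is assumed, the conclusion is the same: real proteins, which fold quickly, cannot be finding their shape by trial and error over every conformation.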
But the key question remains. Out of all the possible configurations, how does each protein spontaneously fold into one particular shape, allowing it to carry out its specific biological role? Given the importance of a protein’s 3D structure, attempts at the rational development of proteins as therapeutics are often hindered by this problem.
AlphaFold
Every two years, the organisers of the Critical Assessment of Protein Structure Prediction (CASP) hold a competition to determine the most effective computational system for predicting protein structure. The format is straightforward: competitors are given linear amino acid sequences for about 100 proteins and are required to predict their structures, which are then scored against the experimentally determined conformations.
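As a rough illustration of what it means to measure a prediction against a known conformation, the Python sketch below computes the root-mean-square deviation (RMSD) between two sets of atomic coordinates. RMSD is used here as a simple stand-in; CASP's own scoring is more involved. The function name, the toy coordinates and the assumption that the two structures have already been superimposed are all illustrative.

```python
import math


def rmsd(predicted, reference):
    """Root-mean-square deviation between two equally sized lists of
    (x, y, z) coordinates, assuming they are already superimposed."""
    if len(predicted) != len(reference):
        raise ValueError("structures must contain the same number of atoms")
    if not predicted:
        raise ValueError("structures must contain at least one atom")
    squared_error = sum(
        math.dist(p, r) ** 2 for p, r in zip(predicted, reference)
    )
    return math.sqrt(squared_error / len(predicted))


# Toy example (made-up coordinates in angstroms): the prediction is the
# reference shifted by exactly 1 angstrom along the z axis.
reference_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]

print(f"RMSD: {rmsd(predicted_atoms, reference_atoms):.2f} angstroms")
```

A lower value means the predicted atoms sit closer to their experimentally determined positions. An error of around one angstrom is roughly the width of an atom.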
Last year, a new AI system, AlphaFold, developed by London-based DeepMind, outclassed all opposition, successfully predicting the structure of the test proteins to within the width of about one atom. Previously, the structures of about 3,500 human proteins had been painstakingly determined using experimental techniques such as X-ray crystallography and NMR. But thanks to AlphaFold, predicted 3D structures are now available for virtually all of the roughly 20,000 human proteins.
The AI of AlphaFold
Transformers are a neural network architecture that has been used extensively in Machine Learning systems since its introduction by Google Brain in 2017. AlphaFold’s development team created a new type of transformer designed specifically to work with three-dimensional structures. In simple terms, a folded protein can be visualised as a ‘spatial graph’ in which amino acids are the nodes, and edges connect amino acids that lie in close proximity. AlphaFold attempts to interpret the structure of this graph, while reasoning over the virtual graph that it is building. The model is structured to maximise information flow through recursive hypotheses that create increasingly accurate predictions of the underlying physical structure of the protein, and it can determine highly accurate structures in a matter of days.
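The ‘spatial graph’ idea can be sketched in a few lines of Python. This is only an illustration of the data structure described above, not AlphaFold’s implementation: the function name, the made-up coordinates (one point per amino acid) and the 5 angstrom distance cutoff are all assumptions chosen for the example.

```python
import math
from itertools import combinations


def build_contact_graph(coords, cutoff):
    """Return {residue index: set of neighbouring residue indices}, linking
    every pair of residues whose positions lie within `cutoff` of each other."""
    graph = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Hypothetical positions (in angstroms) for a five-residue chain that
# doubles back on itself, so residue 4 ends up beside residue 1.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]

graph = build_contact_graph(toy_coords, cutoff=5.0)
for residue, neighbours in graph.items():
    print(residue, sorted(neighbours))
```

In this toy chain, residues 1 and 4 are linked even though they are three positions apart in the sequence. Edges of this kind, between amino acids that are distant in the chain but close in space, are what distinguish the folded shape from the linear sequence.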
There are a few potential drawbacks. Because AlphaFold was trained on publicly available datasets of known protein structures, it may not accurately predict the shapes of unusual new proteins. And because it does not reveal the mechanism or rules by which proteins fold, the protein-folding problem cannot yet be considered solved from an academic perspective.
The future
DeepMind plans to release structures for nearly every protein whose genetic sequence is known to science, more than one hundred million in total. The contribution of AI to structural biology, and most importantly to the design of innovative medicines, has already begun. At Plextek, we are looking to play our part by refining and developing Machine Learning and AI systems that use computational processes to improve performance and utility over time.