APBio: an exploration of protein synthesis with A.I.

David Dickson
13 min read · May 7, 2021

Introduction

The goal of this project was to explore the synthesis of novel proteins with the help of artificial intelligence and machine learning. To do this, we consulted the Ellington Lab at the University of Texas, where graduate students Danny Diaz and James Loy built a pipeline for the task.

Background

Proteins and basic Biochemistry

In order to understand the fundamentals of the project, it is important to understand some basic biochemistry. Amino acids are organic compounds composed of nitrogen, carbon, hydrogen, and oxygen, along with a variable side-chain group. Most amino acids share the same basic backbone, but the side chain differs between them, giving each amino acid its own unique structure and chemical properties. The human body needs 20 different amino acids to grow and function properly, and about 500 naturally occurring amino acids are known, although only 21 of them are used to build proteins.

These amino acids can be strung together to form a protein. Like an individual amino acid, a protein has its own unique characteristics, but its functions can be far more complex and interesting than those of any single amino acid. Some proteins speed up chemical reactions, while others serve as building blocks for living organisms. Hemoglobin and insulin are two well-known examples.

The primary structure of a protein is the linear chain of amino acids encoded by DNA. Due to chemical forces, this chain then tends to fold or bend into a unique 3D shape that gives the protein its character. Usually, parts of a protein fold into either an alpha helix or a beta sheet, known as the secondary structure. These shapes then bend further into a larger conglomerate called the tertiary structure. Finally, some proteins also have a quaternary structure, which consists of multiple tertiary structures coming together and interacting with each other. Examples of this folding process are shown in the image below.

As mentioned before, the functions of these proteins are determined by the shapes which are formed by this folding process. For example, the flexible arms of antibodies are able to protect the human body from disease by recognizing, binding to, and targeting pathogens for destruction by the rest of the immune system.

Protein Families

Proteins with similar structures often descend from a common ancestral protein. A group of proteins related in this way is known as a “protein family.” In our project, we wanted to use the Pfam database to explore the performance of 3D CNNs when provided with datasets of randomly chosen proteins, proteins related by function (e.g., hydrolases), and proteins related by common ancestry (proteins in the same protein family).

The Pfam database itself is curated manually and analyzed for homology using profile Hidden Markov Models (HMMs). Profile HMMs are probabilistic models built from curator-defined, family-representative seed sequences. By applying the profile HMMs to a larger dataset, the UniProt Knowledgebase, Pfam can find the entire collection of proteins related by common ancestry. In cases where a single profile HMM cannot encompass an entire superfamily of proteins, multiple Pfam entries are released to represent the full sequence family.
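As a rough illustration of how such a homology search works (our own sketch, not part of Pfam's curation or the APBio pipeline), the snippet below calls HMMER's hmmscan from Python to search a query sequence against a local copy of the Pfam-A profile HMMs. The file paths are placeholders.

```python
import subprocess

# Assumed local files: a copy of the Pfam-A profile HMMs (prepared with
# `hmmpress`) and a FASTA file containing the query protein sequence.
PFAM_HMM_DB = "Pfam-A.hmm"
QUERY_FASTA = "query.fasta"

# Run HMMER's hmmscan and write a parseable per-target table to hits.tbl.
subprocess.run(
    ["hmmscan", "--tblout", "hits.tbl", PFAM_HMM_DB, QUERY_FASTA],
    check=True,
)

# Each non-comment line lists a matching Pfam family, its accession, and an E-value.
with open("hits.tbl") as table:
    for line in table:
        if line.startswith("#"):
            continue
        fields = line.split()
        family, accession, evalue = fields[0], fields[1], fields[4]
        print(f"{family} ({accession}): E-value {evalue}")
```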

Synthesizing proteins and modern academic work

Until somewhat recently, proteins were difficult to create synthetically outside of the human body. A major breakthrough occurred in 1972, when Paul Berg developed a technique to cut and paste DNA into an organism. This led to scientists being able to get bacteria to produce insulin in 1982. While there are many challenges with this process, the basic idea remains the same: find a protein of importance, obtain the DNA that encodes it, and paste that code into a bacterium, which will then produce the protein.

Now scientists and engineers are exploring the idea of synthesizing novel proteins that do not exist in nature, largely because they believe it is possible to improve the function of some proteins. For example, one protein of interest, known by its PDB ID 6IJ6, is an enzyme that breaks down plastic. The bacterium that produces this protein needs only a small amount of energy per day, so the enzyme is capable of breaking down just a minuscule amount of plastic per day. Teams are working to improve this protein and then obtain the improved genetic code, so as to create a protein that could break down plastic on a much larger scale.

Application to Machine Learning

Dataset

Within the last 50 years, scientists have been able to collect three-dimensional structures of these proteins, many of which can be found in the RCSB Protein Data Bank (link in references). Over 175,000 of these protein structures are publicly available for researchers to use, and the number of available structures has grown rapidly, giving engineers enough data to apply machine learning techniques.
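As a minimal sketch (our own, not part of the APBio pipeline), the snippet below downloads a coordinate file from the RCSB PDB by its four-character ID; 6IJ6 is the plastic-degrading enzyme mentioned earlier.

```python
import urllib.request

# Any four-character PDB ID works here; 6IJ6 is the plastic-degrading enzyme
# discussed in this article.
pdb_id = "6IJ6"

# RCSB serves coordinate files at a predictable URL per entry.
url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
urllib.request.urlretrieve(url, f"{pdb_id}.pdb")
print(f"Saved {pdb_id}.pdb")
```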

Designing a Self-Supervised Learning Task

The general goal of the pipeline created by the Ellington Lab is to use a Convolutional Neural Network (CNN) to analyze the images of the proteins and then design an improved protein from a specific family.

As mentioned earlier, the shape of a protein determines the function of the protein, and the individual amino acids within the protein determine the shape. Thus, in order to alter the shape (and therefore the function) of a protein, we want to understand how one amino acid will affect the shape of the entire protein.

The Ellington Lab pipeline centers a microenvironment around an amino acid. It then discards the protein atoms outside that microenvironment, removes the centered amino acid itself, and uses that amino acid's identity as the label.
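A rough sketch of this idea (ours, not the actual ExampleMaker code), using Biopython: pick a residue, keep only the atoms of other residues within a fixed box around its alpha carbon, and record the residue's name as the label. The box radius is an assumption, not the pipeline's actual value.

```python
import numpy as np
from Bio.PDB import PDBParser

BOX_RADIUS = 10.0  # Angstroms; assumed, not the pipeline's exact box size

parser = PDBParser(QUIET=True)
structure = parser.get_structure("6IJ6", "6IJ6.pdb")

residues = [r for r in structure.get_residues() if "CA" in r]
target = residues[0]                 # residue whose identity we will predict
label = target.get_resname()         # e.g. "GLU"
center = target["CA"].get_coord()    # center the microenvironment on its C-alpha

# Keep only atoms that belong to *other* residues and fall inside the box.
microenvironment = [
    atom
    for atom in structure.get_atoms()
    if atom.get_parent() is not target
    and np.all(np.abs(atom.get_coord() - center) <= BOX_RADIUS)
]

print(f"Label: {label}, atoms in microenvironment: {len(microenvironment)}")
```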

Formatting the Data

A CNN cannot be given a raw list of atoms and atomic coordinates. Instead, we must create a discrete picture of the atoms in 3D space. We do this by placing the atoms of the microenvironment into voxels (the 3D analogue of pixels) within a discretized box and running the result through the model. Putting this together, we arrive at an image classification problem: predict the amino acid at the center of the microenvironment.
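A minimal voxelization sketch (our own simplification of the pipeline's channel extractors), assuming the `microenvironment` atom list and `center` from the previous snippet. Here each voxel simply counts the atoms that fall inside it, whereas the real pipeline stores richer channels such as element type and charge.

```python
import numpy as np

GRID_SIZE = 20        # voxels per side; assumed
VOXEL_SIZE = 1.0      # Angstroms per voxel; assumed, not the pipeline's exact value

# One channel for simplicity; the real pipeline uses several (element, charge, ...).
grid = np.zeros((GRID_SIZE, GRID_SIZE, GRID_SIZE, 1), dtype=np.float32)

for atom in microenvironment:
    # Shift coordinates so the box is centered on the target residue's C-alpha.
    offset = (atom.get_coord() - center) / VOXEL_SIZE + GRID_SIZE / 2
    i, j, k = offset.astype(int)
    if 0 <= i < GRID_SIZE and 0 <= j < GRID_SIZE and 0 <= k < GRID_SIZE:
        grid[i, j, k, 0] += 1.0   # count atoms per voxel

print(grid.shape)  # (20, 20, 20, 1) -- the kind of 4D tensor fed to the 3D CNN
```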

CNN Architecture

A convolutional neural network (CNN) is used in the APBio machine learning model to learn the structure of proteins. CNNs are very good at recognizing spatial patterns and are often used for image recognition tasks. Because the three-dimensional structure of a protein matters so much, a CNN is a natural fit for learning these structures. Each convolutional layer learns progressively more specific features of the dataset: the first convolutions learn very abstract features of a protein, such as whether its atoms are concentrated in a single area or spread across the entire 3D volume, while deeper convolutional layers learn more specific features, such as what types of atoms the protein is made of and where they are positioned within the protein.

(Figure from [5].)
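Here is a hedged sketch of what such a 3D CNN classifier could look like in Keras. The layer sizes are our own choices, not the Ellington Lab's actual architecture; the input shape matches the (20, 20, 20, 1) voxel grids from the earlier snippet, and the output is one of the 20 standard amino acids.

```python
import tensorflow as tf

NUM_AMINO_ACIDS = 20
GRID_SHAPE = (20, 20, 20, 1)   # matches the voxel grid above; channel count is assumed

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=GRID_SHAPE),
    # Early 3D convolutions pick up coarse spatial features of the microenvironment.
    tf.keras.layers.Conv3D(32, kernel_size=3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling3D(pool_size=2),
    # Deeper convolutions capture more specific local arrangements of atoms.
    tf.keras.layers.Conv3D(64, kernel_size=3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling3D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    # One output probability per candidate amino acid at the center.
    tf.keras.layers.Dense(NUM_AMINO_ACIDS, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```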

Using this 3D data for prediction is important because a protein's structure determines both the tasks it can perform and the family it belongs to. Seeing which known proteins most closely resemble a test structure can help protein engineers fashion new, higher-performing sequences at a much better rate: the model has been shown to raise the success rate of protein sequences in lab tests from around 1 percent with random guessing to around 20–50 percent.

In a paper by C. Rao and Y. Liu, the use of a three-dimensional CNN was analyzed, demonstrating the usefulness of these networks in capturing features of 3D environments. [5]

The Ellington Lab’s Pipeline: APBio

A large part of what makes all of this possible is the APBio pipeline, which converts the atomic data we take from the RCSB Protein Data Bank into data structures readable by machine learning libraries. In this case, all of the atomic coordinate files we gathered are converted by the pipeline into 4D tensors, which the CNN model can then read to produce meaningful results.

The first step in the pipeline is to take the atomic coordinate files, which list all the atoms that make up a protein and their locations, and transform them into a JSON file containing this information. The next step is to feed this JSON description of the protein's microenvironment into the “ExampleMaker,” which separates the atoms the model cares about and makes predictions on from the rest of the atomic microenvironment. This matters because fully digitizing these microenvironments would be incredibly resource-heavy for our CPUs and GPUs, and only a select set of atoms is actually necessary for the model to make predictions. This set of important atoms is called the atomic collection.

From there, the atomic collection of each protein is sent into two components: the discretized space and the channel extractors. The discretized space is where we create the voxels that hold the data points for each atom in the 3D coordinate space; these voxels store properties such as charge and other chemical characteristics of the molecules within the microenvironment. The channel extractors take the atomic collection, extract all the data needed for the new microenvironment, and send it to the discretized space to be stored in the appropriate voxels. The voxels are then converted into 4D tensors, which is necessary because the raw voxels are not in the proper form to feed into our TensorFlow model.

The final step is to feed these 4D tensors into a queue, which in turn sends them into the CNN model to be processed. The queue is necessary because the pipeline is multithreaded, with each thread running the pipeline on a separate protein; the queue ensures that the microenvironment tensors are fed into the model in a fair and seamless manner. This entire process is illustrated below: the larger box represents the process on a single thread, and the multicolored outlined boxes represent the multiple threads running to process each protein.
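To make the final queueing step concrete, here is a hedged sketch (ours, not the actual APBio code) of how worker threads could process proteins in parallel and feed a shared queue that the training loop drains. The `voxelize_protein` function is a hypothetical stand-in for the ExampleMaker and channel-extractor stages described above.

```python
import queue
import threading

import numpy as np

tensor_queue = queue.Queue(maxsize=64)   # buffers 4D tensors between threads
SENTINEL = None                          # signals that a worker is finished

def voxelize_protein(pdb_path):
    """Hypothetical stand-in for the ExampleMaker + channel-extractor stages."""
    # The real pipeline would parse the file, pick a microenvironment, and fill
    # voxel channels; here we return a dummy 4D tensor of the assumed shape.
    return np.zeros((20, 20, 20, 1), dtype=np.float32)

def worker(pdb_paths):
    # Each thread processes its own proteins and pushes tensors onto the queue.
    for path in pdb_paths:
        tensor_queue.put(voxelize_protein(path))
    tensor_queue.put(SENTINEL)

pdb_shards = [["1AQH.pdb"], ["6IJ6.pdb"]]          # one shard per thread (placeholder files)
threads = [threading.Thread(target=worker, args=(shard,)) for shard in pdb_shards]
for t in threads:
    t.start()

finished = 0
while finished < len(threads):
    tensor = tensor_queue.get()
    if tensor is SENTINEL:
        finished += 1
        continue
    # In the real pipeline, the tensor would be handed to the CNN model here.
    print("Got tensor with shape", tensor.shape)

for t in threads:
    t.join()
```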

Experimentation with Dataset

Working with the pipeline, we set out to beta-test the APBio library and explore how altering the CNN's parameters affects fine-tuning of our model. We also wanted to characterize the relationship between the proteins we train on and the test results.

The first experiment we tried was to look into sample size in relation to accuracy.

Due to limited time and computational power, we had to limit this search to 1,000 samples from the hydrolase dataset, training for 10 epochs each. The longest training time with this scheme was about 6.5 hours. However, we do begin to see a general increase in accuracy even with this limited dataset. The prediction task is to predict probable amino acids for a protein that efficiently degrades a polymer found in plastic (PDB ID: 6IJ6).
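A hedged sketch of how such a sample-size sweep could be driven, assuming the `model` from the CNN sketch above and a hypothetical `load_hydrolase_samples` helper that returns voxel grids and amino-acid labels. This is an illustration of the experiment's shape, not the exact script we ran.

```python
import numpy as np
import tensorflow as tf

def load_hydrolase_samples(n):
    """Hypothetical loader: returns n voxel grids and their amino-acid labels."""
    x = np.random.rand(n, 20, 20, 20, 1).astype(np.float32)   # placeholder data
    y = np.random.randint(0, 20, size=n)                      # placeholder labels
    return x, y

x_test, y_test = load_hydrolase_samples(200)   # held-out microenvironments (e.g. from 6IJ6)
base_model = model                             # the 3D CNN defined in the earlier sketch

for sample_size in (100, 500, 1000):
    x_train, y_train = load_hydrolase_samples(sample_size)
    run_model = tf.keras.models.clone_model(base_model)   # fresh weights for each run
    run_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
    run_model.fit(x_train, y_train, epochs=10, verbose=0)
    _, accuracy = run_model.evaluate(x_test, y_test, verbose=0)
    print(f"{sample_size} samples -> test accuracy {accuracy:.1%}")
```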

For 100 samples, we achieved an accuracy of 1.92%, and it appears that the model guesses that the amino acid is glutamic acid every time. It achieves this accuracy score as only 5 glutamic acids exist in the wild type structure.

For 500 samples, we achieved an accuracy score of 24.9%, and it appears that the model begins to assign probabilities with more variability.

For 1000 samples, we only slightly improve upon the previous score with 26.4% accuracy. We believe the marginal returns of increasing sample size could be due to bias from over-represented amino acids in the larger dataset.

We then explored the effect of increasing the number of epochs and iterations that we train the model on. For this, we used the hydrolase dataset with 1000 different proteins at 1.0–1.5 Angstrom resolution. This time, we used a protein that assists in carbohydrate metabolism (PDB ID: 1AQH).

With a single epoch, we see an accuracy score of 10%, though the guess is always glycine; because glycine appears in the wild-type protein with fairly high frequency, the accuracy is still somewhat high.

With 10 epochs, we see an accuracy of 26.4%. This performs significantly better than the single-epoch model, which was underfitting the dataset.

With 25 epochs, we see an accuracy of 26.8%. This only slightly improves on the 10-epoch model above; with the extra epochs, the model learns parameters specific to the training data that may have been missed with fewer iterations.

With 50 epochs, we see no improvement in the accuracy, and the confusion matrix is exactly equivalent to that of the model trained on 25 epochs. We believe this is because the variance within the sample is too low, so the model is able to capture the majority of feature combinations with fewer iterations.

We then repeated the experiment with a dataset curated from the Pfam database. There are two protein families associated with this structure, PF00128 and PF02806. We combined the two datasets to generate 704 proteins to sample from.

Unlike the hydrolase dataset, the Pfam dataset became less performant with more epochs: the accuracy score was an identical 13.7% at both 10 and 20 epochs and dropped to 8.5% at 50 epochs. We believe this is a result of low variation within the combined datasets, leading to overfitting at higher numbers of training iterations. The smaller size of the dataset is likely also a direct factor.

Finally, we repeated the experiment on a random dataset of 1,000 PDB structures with 1.0–1.5 Angstrom resolution.

At 10 epochs, we achieved a score of 10.9% accuracy, then as we increased the number of epochs, we received lower accuracy scores of 8.93% at 20 epochs and 9.15% at 50 epochs. We believe that this is because the dataset contains too many unrelated proteins and the model is introduced to too much variance with too small of a dataset.

For all of our datasets, 10 epochs achieves nearly the optimal score, with the hydrolase dataset as the only exception. The hydrolase dataset also achieves the greatest accuracy score, 26.4%. This tells us that the variance and size of the function-related dataset are close to ideal compared with a completely random dataset (which likely has too many unrelated patterns that confuse the model given too few samples) and the protein-family-related dataset (which likely suffers from too little variance, as the model converges to extremely high training scores within very few epochs).

Conclusion

During the course of the project, our team went through several stages of learning and gained knowledge of machine learning in the bioinformatics field. We first learned about protein families and protein structures, which allowed us to set a clear goal for the project from the beginning. Then, we explored datasets from the RCSB database and learned about the APBio machine learning model built by the Ellington Lab. This 3D CNN model takes in voxelized atom datasets and feeds them through convolutional layers, allowing the model to learn about protein structure in 3D space. Lastly, we experimented with the model using different datasets and fine-tuning the hyperparameters. In the end, we successfully completed the goal of the project, which was to explore the synthesis of novel proteins. We also look forward to working with the Ellington Lab to build a more robust model.

Team Members:

Johnson Zhang, Ryan Davis, Peter Wagenaar

David Dickson, Alex Stahl, David Rollins

Made possible with the help of the Ellington Lab at the University of Texas at Austin.
