Application of Artificial Intelligence Methods to Predict Subcellular Locations of Proteins
The mouse, rat, and human genomes each contain approximately 20,000 protein-coding genes. The proteins encoded by these genes mediate most cellular functions. Nearly half of these proteins have well-characterized roles in cells, while the others have not been studied extensively and their cellular functions remain poorly characterized. A major step forward in the last two years has been the application of artificial intelligence approaches to predict properties of poorly characterized proteins (https://www.nature.com/articles/d41586-021-02037-0). An important question in characterizing unannotated proteins is "In what subcellular compartment does a particular uncharacterized protein reside?" Whether a protein is located in the nucleus, the endoplasmic reticulum, the plasma membrane, or some other compartment is an important determinant of its function. In prior studies, we applied protein mass spectrometry (proteomic) techniques to analyze the protein composition of subcellular fractions obtained by differential centrifugation. A previous student used these data to construct "Virtual Western Blots" (https://esbl.nhlbi.nih.gov/Databases/VirtualBlotResource/). The current student project is to set up and train artificial neural networks to predict the subcellular locations of all uncharacterized proteins from the differential-centrifugation proteomic data. Steps in the project include: 1) k-means clustering to determine the optimal set of subcellular compartments, which will be named based on recognized cellular structures; 2) curation of sets of known proteins that map to the assigned cellular structures; 3) training of a single-hidden-layer neural net on the protein sets identified in step 2; 4) validation of the trained neural net using known proteins held back from the training step; and 5) application of the trained neural net to predict the subcellular locations of all previously uncharacterized proteins.
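The five steps above can be sketched in a few dozen lines of Python with NumPy. This is only an illustration under stated assumptions, not the project's actual pipeline: the fractionation profiles are synthetic random data standing in for the proteomic measurements, the "curated" labels are simulated, and every name and setting (kmeans, train_mlp, predict, three compartments, five fractions, eight hidden units) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-in for the proteomic data. Each row is one
# protein; each column is its abundance in one centrifugation fraction.
N_FRACTIONS, N_COMPARTMENTS, N_PER = 5, 3, 60
centers = np.eye(N_COMPARTMENTS, N_FRACTIONS) * 4.0 + 1.0
profiles = np.vstack(
    [c + rng.normal(0.0, 0.3, (N_PER, N_FRACTIONS)) for c in centers]
)
true_labels = np.repeat(np.arange(N_COMPARTMENTS), N_PER)
x = (profiles - profiles.mean(axis=0)) / profiles.std(axis=0)  # z-score columns


def kmeans(data, k, n_iter=50):
    """Plain Lloyd's k-means; returns assignments and within-cluster inertia."""
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = data[assign == j].mean(axis=0)
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1), float((d.min(axis=1) ** 2).sum())


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def train_mlp(data, y, n_hidden=8, n_classes=N_COMPARTMENTS, lr=0.5, epochs=1000):
    """Single hidden layer (tanh) with softmax output, full-batch gradient descent."""
    w1 = rng.normal(0.0, 0.5, (data.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(0.0, 0.5, (n_hidden, n_classes))
    b2 = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        h = np.tanh(data @ w1 + b1)
        p = softmax(h @ w2 + b2)
        g_out = (p - onehot) / len(data)       # cross-entropy gradient at logits
        g_hid = (g_out @ w2.T) * (1.0 - h**2)  # backpropagate through tanh
        w2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        w1 -= lr * (data.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return w1, b1, w2, b2


def predict(params, data):
    w1, b1, w2, b2 = params
    return softmax(np.tanh(data @ w1 + b1) @ w2 + b2).argmax(axis=1)


# Step 1: compare inertia across k to help choose the number of compartments.
inertias = {k: kmeans(x, k)[1] for k in range(1, 7)}

# Step 2: ASSUMPTION: curation is simulated by marking about half of the
# proteins as "known"; in the project this would be manual curation.
known = rng.random(len(x)) < 0.5
known_idx = rng.permutation(np.flatnonzero(known))

# Steps 3 and 4: train on most known proteins, validate on the held-back rest.
n_val = len(known_idx) // 4
val_idx, train_idx = known_idx[:n_val], known_idx[n_val:]
params = train_mlp(x[train_idx], true_labels[train_idx])
val_accuracy = float((predict(params, x[val_idx]) == true_labels[val_idx]).mean())

# Step 5: predict compartments for the "uncharacterized" proteins.
predicted = predict(params, x[~known])

print("inertia by k:", {k: round(v, 1) for k, v in inertias.items()})
print("held-back validation accuracy:", val_accuracy)
print("predicted compartment counts:", np.bincount(predicted, minlength=N_COMPARTMENTS))
```

On real data, the inertia values from step 1 would be only one input into choosing the compartments, since the named structures also have to make biological sense, and the held-back accuracy from step 4 would indicate how far the step 5 predictions can be trusted.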
The resulting work will be published with the BESIP student as first author.
Intern Name: Ryan Hsu
Institution: Yale University
Project Title: Neural-net-based prediction of subcellular localization of all expressed proteins in kidney collecting duct cells