Classification of Disordered Residues in Intrinsically Disordered Proteins

Thesis Statement

Predicting protein disorder is a many-to-many sequence labeling problem: every amino acid in the input sequence receives its own output. We designed a neural network composed of bidirectional ConvLSTM blocks and skip connections to predict the probability that each amino acid in a protein sequence is disordered.
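To make the many-to-many framing concrete, here is a minimal sketch of the input and output shapes: one integer per residue goes in, and one ordered/disordered label per residue comes out. The vocabulary, the function names (encode_sequence, label_residues), the 0.5 threshold, and the example probabilities are all illustrative assumptions, not details of the actual implementation.

```python
# Hypothetical vocabulary: the 20 standard amino acids.
# Index 0 is reserved here for padding or unknown letters (an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence):
    """Map each residue letter to an integer index (0 for unknown letters)."""
    return [AA_TO_INDEX.get(aa, 0) for aa in sequence.upper()]


def label_residues(probabilities, threshold=0.5):
    """Turn per-residue disorder probabilities into labels (1 = disordered)."""
    return [1 if p >= threshold else 0 for p in probabilities]


# One input token per residue, one output label per residue.
encoded = encode_sequence("MKVLA")
# Made-up probabilities standing in for a model's per-residue output.
labels = label_residues([0.9, 0.7, 0.2, 0.1, 0.6])
```

The input and output lists always have the same length, which is what distinguishes this setup from classifying a whole protein with a single label.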

Technologies:


Dataset

We used the dataset provided by SPINE-D. Its authors obtained 4229 non-redundant, high-resolution protein sequences from the Protein Data Bank (PDB) and the Database of Protein Disorder (DisProt). These include 4157 X-ray crystallography structures (deposited in the PDB before August 5, 2003) and 72 fully disordered proteins from DisProt v5.0. The chains were randomly split into a training set of 2700 chains, a validation set of 300 chains, and a test set of 1229 chains.
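A random split with these sizes can be sketched as follows. This is only an illustration of the partitioning arithmetic: the function name split_chains, the seed, and the placeholder chain identifiers are assumptions, and the real SPINE-D split was produced by its authors and is not reproduced by this code.

```python
import random


def split_chains(chain_ids, n_train=2700, n_val=300, seed=0):
    """Shuffle chain identifiers and cut them into train/validation/test lists."""
    shuffled = list(chain_ids)
    # A seeded generator keeps the split reproducible (the seed value is arbitrary).
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything that remains
    return train, val, test


# Placeholder identifiers standing in for the 4229 chains.
chain_ids = [f"chain_{i}" for i in range(4229)]
train, val, test = split_chains(chain_ids)
```

With 4229 chains, the remainder after taking 2700 and 300 is the 1229-chain test set.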


Neural Network Architecture

The protein sequence is fed into an embedding layer to obtain a continuous representation of each amino acid. This representation goes into a sequence feature extraction block that learns significant sequence features and condenses the relevant neighborhood relationships into a 512-dimensional vector for each amino acid. This vector, which summarizes the information in the protein sequence relevant to that particular amino acid, is used to classify the residue as either ordered or disordered. The feature extractor consists of a series of modified bidirectional ConvLSTM blocks with skip connections for better feature reuse and more robust feature representation. It is followed by a time-distributed dense block, which distills the features and produces the ordered/disordered prediction for each residue. We also used a customized loss function to address the imbalance between ordered and disordered amino acids.
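The exact form of the customized loss is not given here. One common way to handle this kind of class imbalance is a class-weighted binary cross-entropy, sketched below as an assumption rather than as the loss actually used. The function names (weighted_bce, inverse_frequency_weight) and the inverse-frequency weighting rule are illustrative.

```python
import math


def weighted_bce(y_true, y_pred, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with errors on disordered residues (label 1)
    scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def inverse_frequency_weight(y_true):
    """Assumed weighting rule: ratio of ordered to disordered residues."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return negatives / positives if positives else 1.0


# Toy example: three ordered residues, one disordered residue.
y_true = [0, 0, 0, 1]
y_pred = [0.1, 0.2, 0.1, 0.3]
pos_weight = inverse_frequency_weight(y_true)
loss = weighted_bce(y_true, y_pred, pos_weight)
```

Because disordered residues are the minority class, weighting them more heavily penalizes a model that simply predicts "ordered" everywhere.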