Best Representation of Protein Sequences
20 months ago
khaeuk • 70


I am currently working on a project that applies neural networks to protein sequences, and I am stuck on the best way to represent those sequences. I will be getting the protein sequence data from PDB, but each sample will have a different sequence length. Ideally, I would like to represent all the protein sequences at a fixed size so they can be passed into the neural network.

I found some packages that can give me numerical representations of features (descriptors), but each descriptor has a different dimension. For example, amino acid composition has dimension 20, dipeptide composition has dimension 400, and the autocorrelation descriptors have dimension 240.
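Each of these descriptors has a fixed dimension no matter how long the sequence is, so one option is to concatenate them into a single vector. Here is a minimal sketch of the first two descriptors in plain Python. The function names (`aa_composition`, `dipeptide_composition`, `featurize`) are made up for illustration and are not from any particular package, and skipping non-standard residues is an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aa_composition(seq):
    """Fraction of each standard residue in seq (always length 20)."""
    # Assumption: non-standard symbols (X, B, Z, ...) are simply skipped.
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero on empty input
    return [residues.count(a) / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair in seq (always length 400)."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with a non-standard residue are skipped
            counts[pair] += 1
    total = sum(counts.values()) or 1
    return [counts[d] / total for d in DIPEPTIDES]


def featurize(seq):
    """Concatenate both descriptors into one fixed-size vector (20 + 400 = 420)."""
    return aa_composition(seq) + dipeptide_composition(seq)
```

Calling `featurize` on a 10-residue sequence and on a 500-residue sequence gives a vector of the same length in both cases, which is what a fixed-size input layer needs. The trade-off is that composition-style descriptors discard most of the residue order.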

I did think about aligning the sequences to get a fixed length, but then I am unsure how to represent insertions/deletions for each descriptor. In any case, I would like to know: what are some good ways to represent protein sequences?
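On the insertion/deletion question, one commonly used option (a sketch, not something from the packages mentioned above) is to encode each aligned position as a one-hot vector over 21 symbols: the 20 residues plus a gap character. The name `one_hot_aligned`, the use of `-` as the gap symbol, and mapping unknown symbols to the gap channel are all assumptions made for this example.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus a gap symbol
INDEX = {c: i for i, c in enumerate(ALPHABET)}


def one_hot_aligned(aligned_seq, length):
    """Encode a gapped sequence as `length` rows of 21 values each.

    Sequences shorter than `length` are right-padded with gaps and
    longer ones are truncated, so the output shape is always the same.
    """
    padded = aligned_seq.upper()[:length].ljust(length, "-")
    rows = []
    for c in padded:
        row = [0] * len(ALPHABET)
        # Assumption: symbols outside the alphabet are treated as gaps.
        row[INDEX.get(c, INDEX["-"])] = 1
        rows.append(row)
    return rows
```

For example, `one_hot_aligned("AC-D", 6)` returns 6 rows of 21 values, with the gap channel set at the third position and at the two padded positions. This only covers per-position encodings; whole-sequence descriptors such as composition do not need alignment at all.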

Thank you so much!

15 months ago
Asaf 5.6k

From my limited knowledge of NNs, I think that you shouldn't represent 3D structures using hand-made features. The great benefit of NNs is that the machine can generate features itself. I think you would want to use the angles between AAs as input, together with the AAs themselves, so you'll have a vector of 20+ dimensions per residue. You do have a problem with unequal lengths, but I think you can overcome this within the layers of the network, using a function that takes input of varying size and outputs a fixed-size vector. It all depends on what you're trying to predict.
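One simple example of such a function is global pooling over the sequence axis. The sketch below is plain Python with hypothetical names (`per_residue_features`, `global_pool`). It builds one row per residue (20 one-hot values plus any extra per-residue values such as angles, supplied by the caller) and then collapses the rows into a single vector by taking the mean and the max of each column. In a real network the pooling would normally sit after learned layers (convolutional or recurrent) rather than directly on the raw input; that part is left out here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """20-value indicator vector for one residue (all zeros if non-standard)."""
    return [1.0 if residue == a else 0.0 for a in AMINO_ACIDS]


def per_residue_features(seq, angles=None):
    """One row per residue: one-hot values plus optional extra values.

    `angles` is an assumed input: a list with one tuple of floats per
    residue (for example two backbone angles), or None to omit them.
    """
    rows = []
    for i, r in enumerate(seq.upper()):
        extra = list(angles[i]) if angles is not None else []
        rows.append(one_hot(r) + extra)
    return rows


def global_pool(rows):
    """Collapse L rows of width d into one vector of width 2*d.

    The first d values are column means, the last d are column maxima,
    so the output size does not depend on the sequence length L.
    """
    d = len(rows[0])
    n = len(rows)
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    peak = [max(r[j] for r in rows) for j in range(d)]
    return mean + peak
```

With no angles, a 3-residue and a 300-residue sequence both pool to 40 values. With two angles per residue, both pool to 44.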

