6.6 years ago
bzamith26
Hello!
I need to extract features for protein functions. The functions are organized in a hierarchy, so one approach I thought of is to represent each node of the hierarchy as a vector containing its path from the root. Something like this: https://imgur.com/a/gmvTv
I would like to know if anyone knows another way of extracting features for protein functions, ideally something more grounded in biology, but any suggestion is welcome. Thank you very much!
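A minimal sketch of the path-from-root idea described above, not taken from the linked image. The toy hierarchy and the names `PARENT`, `path_from_root`, and `path_vector` are made up for illustration, and the code assumes each node has exactly one parent (a tree). GO is a directed acyclic graph in which a term can have several parents, so for GO this would need to collect all ancestors rather than a single path.

```python
# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "metabolism": "root",
    "transport": "root",
    "amino_acid_metabolism": "metabolism",
    "lipid_metabolism": "metabolism",
}


def path_from_root(node, parent=PARENT):
    """Return the list of nodes from the root down to `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def path_vector(node, parent=PARENT):
    """Binary vector over all nodes (sorted by name): 1 if the node is on the path."""
    order = sorted(parent)
    on_path = set(path_from_root(node, parent))
    return [1 if name in on_path else 0 for name in order]


if __name__ == "__main__":
    print(path_from_root("lipid_metabolism"))
    print(path_vector("lipid_metabolism"))
```

With this encoding, two functions that share a long prefix of their paths get vectors that overlap in many positions, so the hierarchy is reflected in vector similarity.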
What kind of functions are you interested in? Is it Gene Ontology biological process annotations? Are you trying to derive feature vectors representing protein functions? What are you trying to achieve with these features?
Hi Jean! Thank you for your reply. I want to use machine learning to classify protein functions while making use of interaction data. So I would need both proteins and protein functions described as feature vectors (which I currently only have for proteins). I want to use the Gene Ontology database as well as FunCat, both of which are hierarchical.
It's still not entirely clear how you plan to use the data. Do you want to use GO and FunCat as input or for validation? What interaction data do you want to use? Regardless, keep in mind that not all machine learning algorithms require a vector representation. For example, many algorithms, such as support vector machines, work with kernels, and computing a kernel doesn't always require vectors. For examples of kernels derived from a variety of data types (including GO annotations), look at this paper of mine and at this tutorial.
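To illustrate the point that a kernel can be computed without building vectors, here is a minimal sketch that compares proteins directly through their sets of GO annotations using the Jaccard (Tanimoto) similarity. This is only one possible set kernel and is not necessarily the one used in the paper or tutorial mentioned above. The protein names and their annotation sets are made up, and `jaccard_kernel` and `gram_matrix` are hypothetical helper names.

```python
# Hypothetical annotations: protein -> set of GO term identifiers.
ANNOTATIONS = {
    "P1": {"GO:0006810", "GO:0008152"},
    "P2": {"GO:0008152"},
    "P3": {"GO:0006810", "GO:0008152", "GO:0006629"},
}


def jaccard_kernel(a, b):
    """Jaccard similarity between two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def gram_matrix(items, kernel):
    """Return (names, K), where K[i][j] = kernel(items[names[i]], items[names[j]])."""
    names = sorted(items)
    matrix = [[kernel(items[i], items[j]) for j in names] for i in names]
    return names, matrix


if __name__ == "__main__":
    names, K = gram_matrix(ANNOTATIONS, jaccard_kernel)
    for name, row in zip(names, K):
        print(name, [round(value, 3) for value in row])
```

A matrix like `K` can be handed to a kernel method that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.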
[..................]
I don't know the predictive bi-clustering tree algorithm; could you share a reference? The difficulty with feature-based representations is finding features that are both relevant to the problem at hand and informative. In the case of GO, you could simply create a binary vector with one entry per function you care about. As for interaction data, you could use the rows of the graph adjacency matrix as vectors.
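Both suggestions can be sketched in a few lines. The example below is illustrative only: the vocabulary of GO terms, the protein names, and the interaction edges are made up, and `binary_function_vector` and `adjacency_rows` are hypothetical helper names. It assumes an undirected, unweighted interaction graph.

```python
def binary_function_vector(annotated_terms, vocabulary):
    """One entry per term in `vocabulary`: 1 if the protein is annotated with it."""
    return [1 if term in annotated_terms else 0 for term in vocabulary]


def adjacency_rows(edges):
    """Map each protein to its row of the (undirected) adjacency matrix."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    rows = {node: [0] * len(nodes) for node in nodes}
    for a, b in edges:
        rows[a][index[b]] = 1
        rows[b][index[a]] = 1
    return nodes, rows


if __name__ == "__main__":
    # Hypothetical list of functions of interest.
    vocabulary = ["GO:0006629", "GO:0006810", "GO:0008152"]
    print(binary_function_vector({"GO:0008152"}, vocabulary))

    # Hypothetical interaction edges between proteins.
    nodes, rows = adjacency_rows([("P1", "P2"), ("P2", "P3")])
    for node in nodes:
        print(node, rows[node])
```

For large interaction networks the adjacency rows are long and mostly zero, so a sparse representation may be preferable to plain lists.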
Here and here are good references on PCTs (Predictive Clustering Trees). Bi-Predictive Clustering Trees are a new idea; I know of a few papers, but they are still under review. Once they are published, I will update this!
"As for interaction data, you could use the rows of the graph adjacency matrix as vectors." = Great suggestion! I'll definitely consider that. Thanks!
Thanks for the links.