Question

Cluster Protein abundance matrix only on similar expression shape to find "coexpressed" Proteins


7.7 years ago

Benni ▴ 30

I have a matrix with proteins in the rows and conditions in the columns. The values are relative changes in protein abundance compared to a standard condition, e.g. 1 means no change, 1.5 means 50% more, and 0.5 means 50% less. The values range from 0 up to 6000, but most are close to 1. The matrix can also be log2-transformed to obtain approximately normally distributed data.

My goal is to find protein clusters that show similar expression behavior across the conditions. I worked with Python and the scikit-learn clustering module. First I tried k-means. I had to log2-transform the data to reduce the influence of outliers, but I still get clusters separated by the magnitude of their change values, not by their shape. In the example pictures you can see that proteins that fluctuate around 0 (or around 1 without log2) are clustered together, while proteins with higher fold changes are separated from the others. https://pl.vc/pjfik / https://pl.vc/8dggb

Then I tried agglomerative clustering. Here I also had to log2-transform the data to reduce the separation of the outliers. I used the "cosine" metric (http://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_clustering_metrics.html#example-cluster-plot-agglomerative-clustering-metrics-py), because, as stated there, "The cosine distance is invariant to a scaling of the data". The clusters look better, but there are still huge clusters with values around 0 (or around 1 without log2). https://pl.vc/q0ikc / https://pl.vc/1jh54b

Are there clustering algorithms designed specifically for this, perhaps developed with biological data in mind? Extra question: I thought about using biclustering algorithms. Which one would be appropriate, and are there implementations for Python, R, etc.?

Cluster Protein Coexpression • 1.6k views

updated 7.7 years ago by Jean-Karim Heriche 27k • written 7.7 years ago by Benni ▴ 30

score 0 · Answer 1 · 2016-08-15

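The problem is not the clustering algorithm but the way you measure similarity between your proteins. Cosine similarity is invariant to scaling but not to shifts: adding a constant offset to a profile changes its cosine similarity to other profiles. Try Pearson's correlation coefficient instead, which is invariant to both.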
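To illustrate the difference, here is a minimal sketch using only NumPy. The profiles are made-up log2 fold changes, and the helper names (`cosine_similarity`, `pearson_similarity`, `correlation_distance_matrix`) are hypothetical, not part of any library. It relies on the fact that Pearson's correlation is the cosine similarity of the mean-centered profiles.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two profiles (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearson_similarity(a, b):
    """Pearson correlation, computed as the cosine of mean-centered profiles."""
    return cosine_similarity(a - a.mean(), b - b.mean())


def correlation_distance_matrix(profiles):
    """Pairwise distance 1 - r between rows (proteins).

    np.corrcoef treats each row as one variable, so rows are proteins
    and columns are conditions. Constant rows have zero variance and
    would yield NaN, so they should be filtered out beforehand.
    """
    return 1.0 - np.corrcoef(profiles)


# Made-up log2 fold-change profiles over five conditions (assumption).
base = np.array([0.1, -0.2, 0.3, -0.1, 0.2])
shifted = base + 3.0   # same shape, constant offset
scaled = base * 4.0    # same shape, larger amplitude

cos_shift = cosine_similarity(base, shifted)        # clearly below 1
pearson_shift = pearson_similarity(base, shifted)   # 1 up to rounding
pearson_scaled = pearson_similarity(base, scaled)   # 1 up to rounding

# Distance matrix for four profiles; the last one is the mirror image of base.
profiles = np.vstack([base, shifted, scaled, -base])
dist = correlation_distance_matrix(profiles)
```

A matrix like `dist` could then be handed to a hierarchical clustering routine that accepts precomputed distances. Note that 1 - r treats anti-correlated profiles as maximally distant; if those should be grouped together, 1 - |r| would be one alternative.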
If you don't care about the absolute values but only about the change from one condition to the next, then your similarity measure should reflect this. Given a fixed order of conditions, you could compute the derivative at condition i (e.g. approximating it as (value(i+1) - value(i-1))/2) and use these derivatives to compute your similarity/distance measure.
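The derivative idea can be sketched as follows, assuming NumPy and a fixed, meaningful order of the columns. The function names and the toy profiles are hypothetical, and the choice of cosine distance on the derivatives is only one possible option; any other distance could be applied to the same derivative matrix.

```python
import numpy as np


def central_differences(profiles):
    """Approximate the derivative at each interior condition i as
    (value(i+1) - value(i-1)) / 2, row by row.

    The first and last conditions have no two-sided neighbours and are
    dropped, so the output has two fewer columns than the input.
    """
    profiles = np.asarray(profiles, dtype=float)
    return (profiles[:, 2:] - profiles[:, :-2]) / 2.0


def cosine_distance_matrix(rows, eps=1e-12):
    """Pairwise cosine distance between rows.

    Rows with (near-)zero norm, i.e. flat profiles, are mapped to the
    zero vector and end up at distance 1 from everything, including
    themselves.
    """
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / np.maximum(norms, eps)
    return 1.0 - unit @ unit.T


# Toy profiles over six ordered conditions (made-up values).
a = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 1.0])
b = a + 5.0    # same shape as a, shifted upwards
c = 3.0 - a    # mirror image of a

deriv = central_differences(np.vstack([a, b, c]))
dist = cosine_distance_matrix(deriv)
```

Because a constant offset cancels in the difference, `a` and `b` produce identical derivative rows, while the mirrored profile `c` produces the negated row.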