Question

Clustering method for p-value data

0

4.9 years ago

mnsp088 ▴ 100

Hi everyone!

I have run some statistical tests that resulted in the data below. Now I want to cluster column 2 based on the p-values shown in column 3. I tried mcl with the following command:

mcl input_Data.txt --abc -I 1.5 -o mclOutput

The result was that Y1, Y2, and Y3 each formed their own individual cluster, while Y4-Y11 formed one cluster together. I was expecting that Y1-Y3 would at least form a cluster together, since their values are quite similar. Perhaps mcl is not the optimal clustering method for this kind of data? Any suggestions?

X1  Y1  9.98E-196
X1  Y2  6.88E-193
X1  Y3  3.32E-184
X1  Y4  0.000254
X1  Y5  0.00032
X1  Y6  0.000765
X1  Y7  0.00117
X1  Y8  0.00278
X1  Y9  0.0148
X1  Y10 0.0175
X1  Y11 0.0474

Thank you.

clustering mcl • 1.2k views

ADD COMMENT • link updated 4.9 years ago by Mensur Dlakic ★ 27k • written 4.9 years ago by mnsp088 ▴ 100

2

Well, on a log scale the distance between Y1 and Y2 is larger than the whole span of Y4-Y11, and the distance between Y2 and Y3 is even larger, so the result makes sense. There isn't much point in clustering on only one variable; you can instead divide the values into groups arbitrarily.

ADD REPLY • link 4.9 years ago by Asaf 10k

0

Thanks, but I will be producing hundreds of these kinds of tables, so I would prefer to automate the clustering/dividing-into-groups task.

ADD REPLY • link 4.9 years ago by mnsp088 ▴ 100

4

Optimal granularity of a clustering is often in the eye of the beholder. Anyway, I am curious as to what the purpose is. Since the magnitude of a p-value says nothing about what's been measured, p-values are not of much use for anything except to try to avoid false positives. Plot the ranked -log(p-values) and look at the shape of the curve; most likely you'll have a few extreme values and a very long low tail.
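A minimal sketch of this suggestion, assuming base-10 logs (the base only rescales the curve) and a crude text bar in place of a real plot. The function name `ranked_neg_log10` is made up for illustration; the p-values are the ones from the table in the question.

```python
import math

# P-values copied from the table in the question.
pvalues = {
    "Y1": 9.98e-196, "Y2": 6.88e-193, "Y3": 3.32e-184,
    "Y4": 0.000254, "Y5": 0.00032, "Y6": 0.000765,
    "Y7": 0.00117, "Y8": 0.00278, "Y9": 0.0148,
    "Y10": 0.0175, "Y11": 0.0474,
}

def ranked_neg_log10(pvals):
    """Return (label, -log10(p)) pairs sorted from most to least significant."""
    scores = {label: -math.log10(p) for label, p in pvals.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = ranked_neg_log10(pvalues)
for rank, (label, score) in enumerate(ranked, start=1):
    # Text stand-in for a plot: one '#' per 5 units of -log10(p).
    print(f"{rank:2d} {label:4s} {score:8.2f} {'#' * int(score // 5)}")
```

With this table the first three rows get long bars and the remaining eight get none, which is the "few extreme values and a long low tail" shape described above.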

ADD REPLY • link 4.9 years ago by Jean-Karim Heriche 27k

0

Thank you, basically the purpose is to assign the most representative Y for X1. I do not want to go with the smallest p-value because there are others such as Y2 and Y3 which are also extremely small and significant, so I am trying to group the Y's in some meaningful automated way. For example something that outputs: the most representative Y for X1 is Y1, Y2, and Y3.

The plot is here: https://imgur.com/u3RpJ4M Are you suggesting using a cutoff based on the density plot?

ADD REPLY • link 4.9 years ago by mnsp088 ▴ 100

1

How do you define most representative? Typically, this could be the median. On the face of it, p-values are the wrong thing to use because they do not have a direct relation to the values that were tested and a low p-value indicates an extreme outcome (under the null hypothesis of the test). Of course I can be wrong here because I don't know the details of your data. In the linked density plot, the distribution is clearly bimodal so this could represent two clusters but my suggestion was to simply look at the values plotted in decreasing order.

ADD REPLY • link 4.9 years ago by Jean-Karim Heriche 27k

1

Thank you, basically the purpose is to assign the most representative Y for X1.

I don't think you will be able to do this by clustering. If I understand your setup correctly, none of the Y variables have relationships to each other. That is, X1 is the only "hub" that indirectly connects the Y variables. If so, even though extremely low p-values are what you want, they end up nullifying the edges between X1 and the Ys, because each p-value is used as an edge weight.

You may want to convert the third column into -log10(p-values). That way the more statistically significant Ys will at least be assigned stronger weights. Next, play with the inflation factor (-I) in a 0.5-8 range and see if that makes any difference. I suspect that all Y variables that have -log10(p-values) > 0 will end up in the same cluster, which is probably not what you want.
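A rough sketch of that conversion, assuming the input is whitespace-separated "source target p-value" lines as in the question. The helper name `to_neg_log10_abc` and the choice to skip rows with a p-value of 0 are assumptions of this sketch, not something mcl requires.

```python
import math

def to_neg_log10_abc(lines):
    """Convert 'source target p-value' lines to tab-separated
    'source target -log10(p)' lines (label/abc-style input)."""
    converted = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        source, target, pvalue = fields[0], fields[1], float(fields[2])
        if pvalue <= 0:
            continue  # -log10 is undefined at 0; such rows need a separate decision
        converted.append(f"{source}\t{target}\t{-math.log10(pvalue):.4f}")
    return converted

# Three rows taken from the table in the question.
rows = ["X1  Y1  9.98E-196", "X1  Y4  0.000254", "X1  Y11 0.0474"]
abc_lines = to_neg_log10_abc(rows)
print("\n".join(abc_lines))
```

The converted lines can be written to a file and passed to mcl with the same --abc option used in the question.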

ADD REPLY • link 4.9 years ago by Mensur Dlakic ★ 27k

0

Thanks, Mensur Dlakic! I tried converting to -log10 and re-running. Now all Y's are in the same cluster (regardless of the -I inflation value), as you suspected.

ADD REPLY • link 4.9 years ago by mnsp088 ▴ 100

score 5 · Accepted Answer · 2019-06-28

5

4.9 years ago

Mensur Dlakic ★ 27k

I was expecting that Y1-Y3 would at least form a cluster together since their values are quite similar.

For all practical purposes, the edges connecting X1 to Y1-Y3 are all zeros. Since Y1-Y3 don't have any direct edges between each other, they end up as singletons.

With this data I don't think any clustering approach would give you the cluster you were expecting, though I understand that in your mind Y1-Y3 are statistically significant with regard to X1 and you expect them to group together.

ADD COMMENT • link 4.9 years ago by Mensur Dlakic ★ 27k

0

Thank you for the explanation, I think that makes sense. Can you elaborate further on why the edges from X1 to Y1-Y3 are zeros? I think I need to look into 1D discretization/segmentation instead for this problem.
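One very simple form of 1D segmentation, sketched here as an illustration only: sort the -log10(p-values) and cut at the largest gap between neighbours. The largest-gap rule and the function name `split_at_largest_gap` are assumptions of this sketch, not a method proposed anywhere in this thread.

```python
import math

# P-values copied from the table in the question.
pvalues = {
    "Y1": 9.98e-196, "Y2": 6.88e-193, "Y3": 3.32e-184,
    "Y4": 0.000254, "Y5": 0.00032, "Y6": 0.000765,
    "Y7": 0.00117, "Y8": 0.00278, "Y9": 0.0148,
    "Y10": 0.0175, "Y11": 0.0474,
}

def split_at_largest_gap(pvals):
    """Split labels into (top, rest) at the largest gap in sorted -log10(p)."""
    if len(pvals) < 2:
        return list(pvals), []
    # Sort from most significant (largest -log10 p) to least significant.
    scores = sorted(((-math.log10(p), label) for label, p in pvals.items()),
                    reverse=True)
    # Gap between each score and the next one down the ranking.
    gaps = [scores[i][0] - scores[i + 1][0] for i in range(len(scores) - 1)]
    cut = gaps.index(max(gaps)) + 1
    top = [label for _, label in scores[:cut]]
    rest = [label for _, label in scores[cut:]]
    return top, rest

top, rest = split_at_largest_gap(pvalues)
print(top)   # ['Y1', 'Y2', 'Y3']
print(rest)
```

This always produces exactly two groups, so for a table with no clearly separated top group the cut is arbitrary; some minimum gap size would probably be needed before trusting it across hundreds of tables.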

ADD REPLY • link 4.9 years ago by mnsp088 ▴ 100

2

Because mcl is a graph clustering algorithm and, given the way your input file is set up, each line is read as an edge: node X1 is connected to node Y1 with edge weight 9.98E-196, which is effectively zero.
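To make that reading concrete, here is a small sketch that parses the same kind of lines into a weighted graph. It only illustrates the "node, node, weight" interpretation and does not reimplement mcl; the name `read_abc` is made up, and treating the edges as undirected is an assumption of the sketch.

```python
from collections import defaultdict

def read_abc(lines):
    """Build an undirected weighted adjacency map from 'nodeA nodeB weight' lines."""
    graph = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        a, b, weight = fields[0], fields[1], float(fields[2])
        graph[a][b] = weight
        graph[b][a] = weight
    return graph

# Four rows taken from the table in the question.
rows = [
    "X1  Y1  9.98E-196",
    "X1  Y2  6.88E-193",
    "X1  Y3  3.32E-184",
    "X1  Y11 0.0474",
]
graph = read_abc(rows)

print(graph["Y1"])          # Y1's only neighbour is X1, with a near-zero weight
print("Y2" in graph["Y1"])  # no direct Y1-Y2 edge exists in the input
```

Read this way, the least significant row (Y11) carries by far the strongest edge, and the Y nodes are linked to each other only through X1.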

ADD REPLY • link 4.9 years ago by Jean-Karim Heriche 27k