Question

Optimal # clusters


6.8 years ago

mforde84 ★ 1.4k

This is more of a general question / situation.

I was recently brought on to a project which generates network fusions between mRNA and miRNA data from cancer patients. The deliverable is 3 distinct clusters / groups of patients which show significant differences in molecular profiles and survival.

For the sake of posterity, I redid the analysis that the previous bioinformatician did, and generated the same raw results. However, there was a major discrepancy in how we interpreted the optimal number of clusters. In total, we both tested ~150 different parameter sets. To find the optimal solution and # of clusters, I took the parameter set with the highest median silhouette width. The previous bioinformatician, on the other hand, did some sort of PCA on each dataset individually prior to the network fusion, then took the result with the highest median silhouette width for the predetermined # of clusters.

Overall, I'm interpreting 2 clusters as optimal with a silhouette width of ~0.8, and he's interpreting 3 clusters as optimal with a silhouette width of ~0.07. So a pretty stark difference. Interestingly enough, I think both solutions capture one of the groups quite well, in particular the better survival group, because no matter how the patients are clustered, that group sits far apart from the other one or two groups. The remaining two groups overlap to a large degree and have similar functional pathway analyses and survival. So my thinking is that it may be better to just take the 2 clusters as the optimal solution: if not for the sheer fact that it's more statistically sound to do so, then because biologically the 2 suspect groups are very similar to one another. It would also be easier to test for purposes of validation. I mean, it would be impossible to validate 3 groups if 2 of the groups are artificial, right?
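To make the comparison concrete, here is a minimal sketch of the median silhouette width calculation on toy data. It is not the actual pipeline: the 1-D `points`, both label vectors, and the helper name `silhouette_widths` are made up for illustration, and plain absolute differences stand in for whatever distance would be derived from the fused network. The toy data has one well-separated group plus a second blob that the 3-cluster labelling splits into two overlapping groups.

```python
from statistics import median


def silhouette_widths(dist, labels):
    """Per-sample silhouette widths from a square distance matrix.

    Assumes at least two distinct labels are present.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical toy data: one distant group and one blob near 10-12.5.
points = [0.0, 0.5, 1.0, 1.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5]
dist = [[abs(p - q) for q in points] for p in points]

two_clusters = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
# Same blob split into two interleaved (overlapping) groups.
three_clusters = [0, 0, 0, 0, 1, 2, 1, 2, 1, 2]

median_two = median(silhouette_widths(dist, two_clusters))
median_three = median(silhouette_widths(dist, three_clusters))
print(f"2 clusters: median silhouette width = {median_two:.2f}")
print(f"3 clusters: median silhouette width = {median_three:.2f}")
```

On this toy data the well-separated group keeps a high silhouette width under either labelling, while the samples in the split blob drop to around zero or below, which drags the 3-cluster median down. That is the same qualitative pattern as the ~0.8 versus ~0.07 described above, though the real numbers depend on the actual distances.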

Would anyone mind giving me advice on this? I'm running up against a wall with my colleagues. I don't want to rock the boat, but I honestly believe we made a mistake. I try to bring it up and present the evidence, but one person in particular is just dead set on ignoring it outright as some sort of statistical facade (interesting to note, it's not the bioinformatician).

Ridiculously frustrated over here.

cluster • 1.4k views


score 1 · Answer 1 · 2017-07-11

I can't give you advice from a "scientific" perspective, but perhaps I can go off on a tangent about the "psychological" dimension of it.

If I got the gist of it right, your collaborators are hoping to match their expectation of seeing 3 groups, hence the reluctance to accept your results. There is a danger in pushing too hard, even if you are right, especially since, as you point out, the results are similar and, in the end, the conclusions may end up being similar.

The best strategy is to steer the discussion so that they are the ones who come up with the rationale for why 2 groups are better than 3. In my opinion, it is not a battle worth fighting head on, since there could be many other issues to deal with down the line.

Finally, in my experience, the models are almost never quite right. It might be that in reality there are neither 2 nor 3 groups; instead you have a mixture of data whose projection looks more like 2 groups in some cases and more like 3 in others.