Question

Validating Results from NMF Clustering & Consensus clustering

4.8 years ago

David_emir ▴ 490

Hi,

I am running NMF/Consensus Clustering on my cancer samples to cluster them into subgroups. My question is how to assess the resulting clusters. Can I get a p-value or something similar, so that I can say my clustered samples are fine and validated?

Regards,

Dave

RNA-Seq validation • 3.3k views

updated 4.8 years ago by anbarasu.la ▴ 20 • written 4.8 years ago by David_emir ▴ 490

score 3 · Answer 1 · 2019-07-02

4.8 years ago

Kevin Blighe 87k

Hey David, I was hoping for Jean-Karim Heriche or chris86 to answer, as they have more experience in clustering.

Although they call it 'Consensus Clustering', one should still obtain a consensus on the cluster solution from other programs / metrics. Others with which I'm familiar include:

Jaccard index
M3C
Gap Statistic
Elbow method
Silhouette method
Tree cut height (simplistic but difficult to completely dismiss it as a metric)
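
As an illustration of one of these metrics, here is a minimal sketch of the mean silhouette width in plain Python. It assumes Euclidean distance and small data; the function name `silhouette_score` and the toy points are made up for this example, and in practice an established implementation (for instance in the R cluster package or scikit-learn) would be the better choice.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette width over all samples, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Toy data (hypothetical): two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # matches the true groups
bad = silhouette_score(points, [0, 1, 0, 1])   # splits each group in half
print(good, bad)
```

Values near 1 suggest compact, well-separated clusters, while values near 0 or below suggest samples sitting between clusters or in the wrong one. Comparing the score across candidate numbers of clusters is the usual way to apply it.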

I have also recently been utilising Seurat's functionality for finding clusters in data. By default, it builds a KNN (k-nearest neighbours) graph and uses the Jaccard index to score the overlap between neighbourhoods.
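
To make the KNN plus Jaccard idea concrete, here is a rough sketch in plain Python. It is not Seurat's implementation: the helper names `knn_sets` and `jaccard`, the toy points, the use of Euclidean distance, and the choice to count each point as its own neighbour are all assumptions made for illustration.

```python
from math import dist


def knn_sets(points, k):
    """For each point, the index set of its k nearest neighbours (itself included)."""
    neighbours = []
    for p in points:
        order = sorted(range(len(points)), key=lambda j: dist(p, points[j]))
        neighbours.append(set(order[:k]))
    return neighbours


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data (hypothetical): two groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
nn = knn_sets(points, k=3)

same_group = jaccard(nn[0], nn[1])   # neighbourhoods overlap completely
cross_group = jaccard(nn[0], nn[3])  # neighbourhoods share nothing
print(same_group, cross_group)
```

Samples whose neighbourhoods overlap strongly get a high edge weight, and a graph clustering step then groups the strongly connected samples together.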

Regarding p-values, I believe Consensus Clustering has already applied some statistical validation of the clusters that it derives (?).

Kevin

1

As mentioned by Kevin, there are many ways of scoring the quality of a clustering result and none is perfect, as they generally make some assumptions about the structure of the data and/or what a good clustering should be. In many cases, it is not easy to say what represents a good value for the score. However, the scores can be useful in deciding between different clusterings. Ultimately, what matters is how relevant/interpretable the outcome is. For example, you may get a very good clustering by some measure but find that its granularity is too fine, splitting what you consider should be one group into two. Ideally, you want your clustering to give you some insight into the biological question you're interested in and maybe generate some hypothesis that you can then test independently (either by looking at the data differently or by doing an experiment).

4.8 years ago by Jean-Karim Heriche 27k

score 2 · Answer 2 · 2019-07-02

Hi David,

You can use the sigclust package (https://cran.r-project.org/web/packages/sigclust/index.html) to assess the statistical significance of your clustering. It gives you simulated p-values based on both empirical quantiles and Gaussian quantiles.

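If you want to compare the results from NMF and CC, you can use the Rand index (or the adjusted Rand index). The Rand index has a value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of samples and 1 indicating that they are exactly the same. The adjusted Rand index corrects for chance agreement, so it is close to 0 for random labelings, can be negative, and is 1 for identical clusterings. You can use the fossil package (https://cran.r-project.org/web/packages/fossil/index.html) to compute both.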
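
For reference, here is a small sketch of how the two indices are computed from pair counts, in plain Python. The function name `rand_indices` and the example label vectors are hypothetical; it is only meant to show what is being measured, not to replace the fossil functions.

```python
from collections import Counter
from math import comb


def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two samples")
    total_pairs = comb(n, 2)

    # Sample pairs placed together in both labelings, in A only counts, in B only counts.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Agreements: pairs together in both, plus pairs separated in both.
    agreements = both + (total_pairs - in_a - in_b + both)
    rand = agreements / total_pairs

    # Adjusted index: subtract the agreement expected by chance.
    expected = in_a * in_b / total_pairs
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        ari = 1.0  # degenerate case, e.g. every sample in one cluster in both
    else:
        ari = (both - expected) / (max_index - expected)
    return rand, ari


# Hypothetical subgroup assignments for six samples.
nmf_labels = [1, 1, 1, 2, 2, 2]
cc_labels = ["a", "a", "b", "b", "b", "b"]
ri, ari = rand_indices(nmf_labels, cc_labels)
print(round(ri, 3), round(ari, 3))
```

The cluster names themselves do not matter, only which samples end up together, so NMF clusters 1/2 and consensus clusters a/b can be compared directly.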

Hope this helps!

Anbarasu