Assessing Cluster Reliability/Stability In Microarray Experiments
14.0 years ago
toni ★ 2.2k

In a microarray experiment, what people often do first (after the preprocessing and normalization steps) is hierarchical clustering, to observe how arrays and genes "naturally" organize themselves. That way you can identify subgroups of arrays (or genes, depending on what you are interested in) with a similar profile, depending on the distance and linkage you chose.
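
For concreteness, the kind of first look I mean is something along these lines (a minimal R sketch; expr just stands in for a normalized genes x arrays matrix):

    expr <- matrix(rnorm(1000 * 20), nrow = 1000,
                   dimnames = list(NULL, paste0("array", 1:20)))  # placeholder data
    d  <- as.dist(1 - cor(expr))           # correlation-based distance between arrays
    hc <- hclust(d, method = "average")    # the distance and linkage choices matter here
    plot(hc)                               # dendrogram of how the arrays group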

But when looking at this first picture of your data, you often identify arrays that would also fit nicely in another cluster. In contrast, each cluster has a set of arrays that forms the heart of the cluster, i.e. the reason why this cluster exists.

Is there any known method (bootstrapping?) or R package to assess cluster robustness/stability? To identify the heart of a cluster, exclude a non-relevant array from a cluster and move it to a cluster where it is more relevant?

Most of all, if you have experience with these questions, which method would you advocate?

Regards,

tony

clustering microarray gene • 5.7k views

+1 for a very precise and well formatted question.

... thank you !

14.0 years ago
Michael 54k

In their critical and excellent 2006 review, Allison et al. provide a good overview of methods for microarray data analysis. I personally would mostly agree with their conclusion:

We believe that unsupervised classification is overused; first, little information is available about the absolute validity or relative merits of clustering procedures; second, the evidence indicates that the clusterings that are produced with typical sample sizes (<50) are generally not reproducible; third, and most importantly, unsupervised classification rarely seems to address the questions that are asked by biologists, who are usually interested in identifying differential expression.

However, clustering can be useful for a first-glance overview of the values, and also to check whether arrays correlate well and whether technical/biological replicates fall into the same cluster. An alternative, if you are interested in how well arrays or experimental conditions correlate overall, is to cluster the correlation matrix of the arrays (or of joint replicate measurements) and to plot a heatmap of the array correlation coefficients instead of the expression heatmap.
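
A minimal sketch of that alternative, with expr standing in for a normalized genes x arrays matrix (placeholder data below):

    expr <- matrix(rnorm(1000 * 20), nrow = 1000)    # placeholder: normalized genes x arrays
    cc <- cor(expr)                                  # array-by-array correlation matrix
    hc <- hclust(as.dist(1 - cc), method = "average")
    heatmap(cc, Rowv = as.dendrogram(hc), Colv = "Rowv",
            scale = "none", main = "array-array correlations")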

Cluster analysis is an exploratory technique; as such, there is no way to determine what the "real" outcome is. Any solution that helps you discover something is good and valid. (There does exist an optimal solution for the clustering problem, given the optimization criterion of the algorithm (e.g. minimize intra- versus inter-cluster distance), but this is only of theoretical relevance, because all algorithms are approximate.) Thus, you can freely exchange cluster members of a hierarchical cluster analysis manually as you see fit; you might be able to find a solution that suits the data better.

Some things to look at:

  • Some time ago we internally tried out different cluster validation techniques (cluster indices); the one that convinced me most conceptually is the Figure of Merit (FOM), Yeung et al., 2001. They use "jackknifing" (leave-one-out), which is related to the bootstrapping you mention.
  • Some more cluster indices are implemented in the R function cluster.stats in the fpc package
  • or in randIndex from the flexclust package (both are shown in the short sketch after this list)
  • Also: the clusterRepro package looks like something to try (I haven't tried it myself)
  • And another article with overviews of cluster indices used for microarrays: Smolkin & Ghosh, 2003
  • And yet another article, by Datta & Datta from 2006, includes two indices that use external knowledge (e.g. GO); one of them, the biological stability index (BSI), is another stability index.
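
A quick sketch of how the fpc and flexclust functions mentioned above might be called (expr again stands in for a normalized genes x arrays matrix; placeholder data below):

    library(fpc)         # cluster.stats
    library(flexclust)   # randIndex

    expr <- matrix(rnorm(1000 * 20), nrow = 1000)         # placeholder: normalized genes x arrays
    d   <- as.dist(1 - cor(expr))                         # distance between arrays
    cl2 <- cutree(hclust(d, method = "average"), k = 2)   # two alternative partitions
    cl3 <- cutree(hclust(d, method = "average"), k = 3)

    cluster.stats(d, cl2)$avg.silwidth   # e.g. average silhouette width of the 2-cluster solution
    randIndex(table(cl2, cl3))           # (adjusted) Rand index between the two partitions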

BTW.: it seems to be time for cluster-indices-indices to evaluate the performance of different cluster-indices ;)

My personal conclusion, however, was that all these methods evaluate something very different, and they do not help a millimetre towards answering the biological questions.

Thank you very much Michael for this very clear, philosophical and detailed answer. I'll have a look at the literature and package links.

14.0 years ago
Ian Simpson ▴ 960

I have just written an R package called clusterCons, which is an implementation of the method for clustering robustness assessment described in:

S. Monti. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52, July 2003.

It follows something very similar to the re-sampling approach very nicely described by PhiS (above). The package is in alpha at the moment, but is already being used by quite a few other groups as part of our functionality testing; we are just finishing a paper describing the package and its application to cluster and gene prioritisation. The method described by Monti is very simple and elegant and has been cited in >100 papers to date.

The clusterCons package can use any kind of clustering, provided that the clustering function returns a result that can be formatted as a cluster membership list, so it could be used with supervised clustering (all you have to do is write a small, very simple custom wrapper for any new function). It is currently written to use the methods provided by the 'cluster' package in R, which are all unsupervised (so you can currently use 'agnes', 'pam', 'hclust', 'diana' and 'kmeans' out of the box). If you are interested in trying it, you can get it through CRAN or SourceForge, and I am very happy to help you on your way if you decide to try it out.
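
For intuition only, here is a rough sketch of the consensus idea from Monti et al. written in base R; this is not the clusterCons API, and expr is just a placeholder for a normalized genes x arrays matrix:

    set.seed(1)
    expr <- matrix(rnorm(1000 * 20), nrow = 1000)  # placeholder: normalized genes x arrays
    n <- ncol(expr); k <- 3; B <- 100              # arrays, assumed cluster number, resamplings
    hits <- counts <- matrix(0, n, n)              # co-clustering and co-sampling counts
    for (b in 1:B) {
      idx  <- sample(n, size = round(0.8 * n))     # subsample 80% of the arrays
      cl   <- kmeans(t(expr[, idx]), centers = k)$cluster
      same <- outer(cl, cl, "==")                  # which resampled arrays fall together
      hits[idx, idx]   <- hits[idx, idx] + same
      counts[idx, idx] <- counts[idx, idx] + 1
    }
    consensus <- hits / pmax(counts, 1)    # proportion of runs in which pairs co-cluster
    heatmap(consensus, scale = "none")     # robust clusters show up as clean blocks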

Unfortunately, although I understand where he is coming from, I actually disagree with Michael and with the review by Allison, and I am both a biologist and a computational biologist. I think that supervised clustering is often used in the belief that it will provide biologically meaningful clusters, when in fact it produces clusters that are heavily biased by the supervising information. The problem is that the biological supervising information is almost always of very poor and often unverifiable quality. It could be, for example, transcription factor binding data (hugely biased, sparse and noisy), ChIP-chip/ChIP-seq (very high noise), or patient classification (unquantified, obtuse, unverifiable or just plain wrong), to name but a few. I'm not saying that supervised clustering doesn't have a place, but I avoid it like the plague and much prefer to follow up unsupervised clustering (with robustness measures) with some proper biological validation. I hope this doesn't come across as antagonistic to Michael, who clearly has a lot of experience with clustering as well; I just wanted to let you know how we handle the problem.

This is far from being antagonistic to what I said; my intent was to promote neither supervised nor unsupervised cluster analysis. Did you intend to view this as a contradiction? Indeed, I am convinced that most of the time a well-planned experiment can be broken down into a series of hypothesis tests. Even though classification (un- and supervised) has its merits, in my experience the wish to cluster something arises from not knowing what else to do with the data. For plotting heatmaps, though, hierarchical clustering is essential as a technique to order rows and columns.

@Michael: "they do not help a millimetre" certainly qualifies as non-promotion. :-)

Dear Ian

I am currently trying to use clusterCons for testing cluster stability and robustness. I am new to R, and I have tried using your package for clustering and studying the results on the datasets listed in Monti et al.'s publication.

Since I am only trying to reproduce that analysis and understand how the package works, I realise that the functions in the package are already clearly defined. I would like to know whether there are any reference materials for understanding the output expected from these functions; the PDF explains how to run them, but I would like to understand and analyze their results in more detail.

Thanks!

Aparna

Hi Aparna, welcome to BioStar. Please add this as a comment to Ian's answer; that way he will be notified. If you add it as an answer he will possibly not notice it at all. If you have a more specific question about clusterCons, please post it as a new question on BioStar.

14.0 years ago
Phis ★ 1.1k

To test the robustness/stability of groups of samples in clustering, bootstrapping is typically used (or related techniques such as jackknifing), and the results are summarised in a (typically majority-rule) consensus tree. That your input data for the clustering come from microarrays is largely irrelevant. Essentially, in bootstrapping you are asking what proportion of your input data supports a certain grouping. It is done by making a pseudo-replicate of your input data (N samples x M data rows), keeping the number of samples N constant, but choosing M new data rows at random, with replacement, like so (here in R, with Data being a matrix of M data rows by N samples):

    idx <- sample(M, M, replace = TRUE)    # resample the M row indices with replacement
    DataReplicate <- Data[idx, ]           # pseudo-replicate with the same dimensions as Data

where each row of Data holds the values of one data row across all samples. So, in the end, your pseudo-replicate has the same dimensions (N samples x M data rows) as your input data matrix. The entire clustering procedure is then repeated for each pseudo-replicate, and the clustering results are compared by consensus-tree methods.
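
Putting the whole procedure together, a rough sketch in R (Data, B and k below are placeholders; counting how often pairs of samples co-cluster at a fixed k stands in for the consensus-tree summary):

    Data <- matrix(rnorm(500 * 12), nrow = 500)   # placeholder: M data rows x N samples
    M <- nrow(Data); N <- ncol(Data)
    B <- 500; k <- 3                              # number of pseudo-replicates and of groups
    support <- matrix(0, N, N)                    # how often two samples cluster together
    for (b in 1:B) {
      boot <- Data[sample(M, M, replace = TRUE), ]     # bootstrap pseudo-replicate
      cl   <- cutree(hclust(dist(t(boot))), k = k)     # cluster the N samples
      support <- support + outer(cl, cl, "==")         # count co-clustering of sample pairs
    }
    support <- support / B    # proportion of pseudo-replicates supporting each pairing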
