Question

Unsupervised subtype discovery

7.4 years ago

tucanj ▴ 100

I am attempting to discover subtypes of a disease from gene expression data (20,000 genes x 80 samples). I do not know the number of subtypes.

I can only find 1 review comparing methods: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-497

Are there any accepted protocols or methods? I see:

Differing methods of filtering genes beforehand (e.g. top 5000 genes by median absolute deviation). Is there an optimal number of genes to include?
Different algorithms (k-means, consensus clustering, Mclust)

Any input appreciated!

microarray R • 2.0k views

updated 5.4 years ago by Min Dai ▴ 160 • written 7.4 years ago by tucanj ▴ 100

score 1 · Answer 1 · 2018-11-24

To update this: for a single platform, I have found empirically on TCGA RNA-seq data that the best current algorithms available in R are M3C and CLEST. These results are not published yet. M3C is an improved version of the Monti et al. consensus clustering algorithm, and CLEST has been around for a long time and works well. It is best to try a few different algorithms and see which works best on your data.

I would use only the most variable genes and try a few thresholds for how many to keep.
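
As a rough illustration of "most variable genes, a few thresholds", here is a minimal Python sketch that ranks genes by median absolute deviation (the filter mentioned in the question) and keeps the top N. The gene names, expression values, and the helper `top_genes_by_mad` are made up for the example; on real data the thresholds would be in the thousands (e.g. the top 5000 from the question) rather than 1 to 3.

```python
from statistics import median


def mad(values):
    """Median absolute deviation of one gene's expression values."""
    m = median(values)
    return median(abs(v - m) for v in values)


def top_genes_by_mad(expr, n_top):
    """Return the names of the n_top genes with the largest MAD.

    expr maps gene name -> list of expression values (one per sample).
    """
    ranked = sorted(expr, key=lambda g: mad(expr[g]), reverse=True)
    return ranked[:n_top]


# Toy data (hypothetical): 4 genes x 5 samples.
expr = {
    "geneA": [1.0, 1.1, 0.9, 1.0, 1.05],  # nearly constant
    "geneB": [0.5, 4.0, 0.4, 4.2, 2.0],   # varies strongly
    "geneC": [2.0, 2.5, 1.5, 2.2, 1.8],   # varies moderately
    "geneD": [3.0, 3.0, 3.0, 3.0, 3.0],   # constant
}

# Try a few thresholds and cluster each filtered set separately.
for n_top in (1, 2, 3):
    print(n_top, top_genes_by_mad(expr, n_top))
```

If the clustering you get is stable across neighbouring thresholds, that is some reassurance that it is not an artefact of the cut-off.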

score 0 · Answer 2 · 2016-12-03

I'd say there are no hard and fast rules about how many genes to include or which algorithm to pick; it's data-dependent. If data quality is good and there truly are subtypes to be found, you could probably succeed using any one of the algorithms you list. If you are looking for a case study, this paper (https://www.ncbi.nlm.nih.gov/pubmed/20129251) is an example of a successful approach to unsupervised subtype discovery (in glioblastoma). They used consensus average linkage hierarchical clustering on 1740 genes.
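
To make the idea of consensus clustering concrete, here is a minimal Python sketch: cluster many random subsets of the samples, then record how often each pair of samples lands in the same cluster. Note the assumptions: it uses plain k-means as the base clusterer, not the average linkage hierarchical clustering of the paper, and the function names, the 80% subsampling fraction, and the toy data are all made up for illustration. It is not the implementation used by any particular R package.

```python
import random


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Recompute each center; keep the old one if its cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def consensus_matrix(samples, k, n_resamples=50, frac=0.8, seed=0):
    """For each pair of samples, the fraction of resamples in which they
    co-cluster, counted only over resamples where both were drawn."""
    rng = random.Random(seed)
    n = len(samples)
    together = [[0] * n for _ in range(n)]
    drawn = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        labels = kmeans([samples[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                drawn[i][j] += 1
                together[i][j] += int(labels[a] == labels[b])
    return [[together[i][j] / drawn[i][j] if drawn[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Toy data (hypothetical): 10 samples x 2 features, two well-separated groups.
group1 = [[0.1 * i, 0.2 * i] for i in range(5)]
group2 = [[10 + 0.1 * i, 10 + 0.2 * i] for i in range(5)]
samples = group1 + group2

cm = consensus_matrix(samples, k=2)
print([round(v, 2) for v in cm[0]])  # consensus of sample 0 with every sample
```

Values near 1 or 0 for most pairs suggest stable clusters at that k; lots of intermediate values suggest the split is not robust. Repeating this over a range of k is the usual way to get a handle on the number of subtypes.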

score 0 · Answer 3 · 2018-11-24

I'd say you'd better filter out some genes, because genes that don't contribute to subtype identification may add noise to your data. The underlying assumption is that there should be some common features among different subtypes, but measurement introduces variation in those common features. Therefore, I recommend applying feature selection or feature extraction before clustering. For example, you can try singular value decomposition or non-negative matrix factorization.
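
For the feature extraction route, here is a minimal Python sketch of non-negative matrix factorization using the standard multiplicative updates of Lee and Seung. It is written in plain Python only to stay dependency-free; in practice you would use a library implementation. The helper names, the rank of 2, and the toy matrix are assumptions made for the example.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative genes x samples matrix V as W (genes x rank)
    times H (rank x samples) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Toy data (hypothetical): 6 genes x 6 samples with two expression blocks.
V = [[5.0 if (g < 3) == (s < 3) else 0.0 for s in range(6)] for g in range(6)]

W, H = nmf(V, rank=2)
print("reconstruction error:", round(frob_error(V, W, H), 4))
for row in H:
    print([round(h, 2) for h in row])  # one row per component, one column per sample
```

Each column of H is a low-dimensional description of one sample, and those columns are what you would then cluster. Keep in mind that NMF factors are only defined up to scaling and ordering of the components, so results can differ between random starts.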