Unsupervised subtype discovery
3.5 years ago
tucanj • 70
Canada

I am attempting to discover subtypes of a disease from gene expression data (20 000 genes x 80 samples). The number of subtypes is not known in advance.

I can only find 1 review comparing methods: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-497

Are there any accepted protocols or methods? I see:

  1. Differing methods of filtering genes beforehand (e.g. top 5000 genes by median absolute deviation). Is there an optimal number of genes to include?
  2. Different algorithms (k-means, consensus, Mclust)
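
To make point 1 concrete, here is a minimal Python sketch of ranking genes by median absolute deviation (MAD) and keeping the top N at a few thresholds. The gene names, the toy values, and the helper names `mad` and `top_genes_by_mad` are made up for illustration. It uses only the standard library; on a real 20 000 x 80 matrix a vectorized library would likely be a better fit.

```python
from statistics import median

def mad(values):
    """Median absolute deviation of one gene's expression values."""
    m = median(values)
    return median(abs(v - m) for v in values)

def top_genes_by_mad(expr, n_keep):
    """Return the names of the n_keep genes with the largest MAD.

    expr: dict mapping gene name -> list of expression values (one per sample).
    """
    ranked = sorted(expr, key=lambda g: mad(expr[g]), reverse=True)
    return ranked[:n_keep]

# Toy data (hypothetical): 4 genes x 6 samples.
expr = {
    "geneA": [1.0, 1.1, 0.9, 1.0, 1.05, 0.95],  # nearly constant
    "geneB": [0.0, 5.0, 0.2, 5.1, 0.1, 4.9],    # bimodal across samples
    "geneC": [2.0, 2.5, 3.0, 3.5, 4.0, 4.5],    # gradual trend
    "geneD": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],    # constant
}

# Try a few thresholds and compare which genes survive the filter.
for n_keep in (1, 2, 3):
    print(n_keep, top_genes_by_mad(expr, n_keep))
```

Running the clustering step on each filtered set and checking whether the sample groupings stay stable is one way to judge whether the threshold matters for a given dataset.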

Any input appreciated!

microarray R • 894 views
16 months ago
chris86 • 290
United Kingdom, London

To update this: for a single platform, I have found empirically, on TCGA RNA-seq data, that the best current algorithms for this (available in R) are M3C and CLEST. These results are not published yet. M3C is an improved version of the Monti et al. consensus clustering algorithm, and CLEST has been around for ages and works well. It is best to try a few different ones and see what works best on your data.

I would use the most variable genes only and try a few thresholds.

15 months ago
Ahill ♦ 1.5k
United States

I'd say there are no hard and fast rules about how many genes to include or what algorithm to pick; it's data-dependent. If data quality is good and there are truly sub-types to be found, you could probably succeed using any one of the algorithms you list. If you are looking for a case study, this paper (https://www.ncbi.nlm.nih.gov/pubmed/20129251) is an example of a successful approach to unsupervised sub-type discovery (in glioblastoma). They used consensus average linkage hierarchical clustering on 1740 genes.
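
As a rough illustration of the general idea behind consensus clustering with average linkage (not the paper's actual pipeline), the Python sketch below repeatedly subsamples the samples, clusters each subsample, and records how often each pair lands in the same cluster. Euclidean distance, the 80% subsampling fraction, the number of resamples, the toy data, and all function names are assumptions chosen for the example.

```python
import math
import random
from itertools import combinations

def average_linkage(points, k):
    """Agglomerative clustering with average linkage; returns k lists of indices."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        # Mean Euclidean distance over all cross-cluster pairs (assumed metric).
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the two clusters with the smallest mean pairwise distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters

def consensus_matrix(samples, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of resamples in which each pair of samples co-clusters."""
    rng = random.Random(seed)
    n = len(samples)
    together = [[0] * n for _ in range(n)]
    drawn = [[0] * n for _ in range(n)]
    size = max(k, round(frac * n))
    for _ in range(n_resamples):
        subset = rng.sample(range(n), size)
        clusters = average_linkage([samples[i] for i in subset], k)
        for i, j in combinations(subset, 2):
            drawn[i][j] += 1
            drawn[j][i] += 1
        for cluster in clusters:
            members = [subset[c] for c in cluster]
            for i, j in combinations(members, 2):
                together[i][j] += 1
                together[j][i] += 1
    return [[1.0 if i == j
             else together[i][j] / drawn[i][j] if drawn[i][j]
             else float("nan")
             for j in range(n)] for i in range(n)]

# Toy data (hypothetical): 8 samples x 2 features, two well-separated groups.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
           [5.0, 5.1], [5.2, 4.9], [4.9, 5.3], [5.1, 5.0]]
cm = consensus_matrix(samples, k=2)
for row in cm:
    print([round(x, 2) for x in row])
```

Consensus values near 0 or 1 for most pairs suggest a stable partition at that k; comparing the matrices across several values of k is the usual way this kind of approach is used when the number of subtypes is unknown.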

16 months ago
Minstein • 100

I'd say you'd better filter out some genes, because genes that don't contribute to subtype identification may add noise to your data. The underlying assumption is that there should be some common features among different subtypes, but measurement introduces variation in those common features. Therefore, I recommend applying feature selection or feature extraction before clustering. For example, you can try singular value decomposition or nonnegative matrix factorization.
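
One way to picture the SVD suggestion: center each gene, then use the leading singular vector over samples as a single extracted feature. The Python sketch below computes only that leading component by power iteration so that it stays self-contained; in practice a linear algebra library's SVD routine would be the usual choice. The toy matrix and the function names `center_rows` and `leading_singular_triplet` are hypothetical.

```python
import math
import random

def center_rows(X):
    """Subtract each gene's mean so that components reflect variation, not level."""
    centered = []
    for row in X:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered

def leading_singular_triplet(X, n_iter=200, seed=0):
    """Power iteration for the largest singular value and its vectors.

    X: list of rows (genes), each a list of values (samples).
    Returns (u, sigma, v) with u over genes and v over samples.
    """
    n_rows, n_cols = len(X), len(X[0])
    # Random start: a constant vector would be orthogonal to row-centered data.
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n_cols)]
    for _ in range(n_iter):
        u = [sum(x * vj for x, vj in zip(row, v)) for row in X]                  # u = X v
        v = [sum(X[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = X^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(x * vj for x, vj in zip(row, v)) for row in X]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v

# Toy data (hypothetical): 3 genes x 6 samples.
# Genes 1 and 2 separate samples 0-2 from samples 3-5; gene 3 is mostly noise.
X = [
    [1.0, 1.2, 0.9, 4.0, 4.1, 3.9],
    [5.0, 5.1, 4.8, 1.0, 1.2, 0.9],
    [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],
]
u, sigma, v = leading_singular_triplet(center_rows(X))
print("sample scores:", [round(x, 2) for x in v])
print("gene loadings:", [round(x, 2) for x in u])
```

The sample scores in `v` can then be clustered in place of the full gene matrix. The overall sign of a singular vector is arbitrary, so only the relative signs and magnitudes are meaningful.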
