I am trying to discover subtypes of a disease from gene expression data (20,000 genes × 80 samples). I do not know the number of subtypes in advance.
I can only find one review comparing methods: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-497
Are there any accepted protocols or methods? I see:
- Different ways of filtering genes beforehand (e.g. keeping the top 5000 genes by median absolute deviation). Is there an optimal number of genes to include?
- Different clustering algorithms (k-means, consensus clustering, Mclust)
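
To make the filtering step concrete, here is a minimal sketch of what I understand by "top N genes by median absolute deviation". It is pure Python on made-up toy data (200 genes x 80 samples, not my real matrix), and the function names `mad` and `top_genes_by_mad` are my own, not from any package. It only covers the filtering; which clustering algorithm to run afterwards is still the open question.

```python
import random
from statistics import median


def mad(values):
    """Median absolute deviation of one gene's expression values."""
    center = median(values)
    return median([abs(v - center) for v in values])


def top_genes_by_mad(expr, n_top):
    """Return row indices of the n_top genes with the largest MAD.

    expr is a list of rows: one row per gene, one column per sample.
    """
    scores = [mad(row) for row in expr]
    ranked = sorted(range(len(expr)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_top]


# Toy data (made up for illustration): the first 20 genes are shifted in
# the second half of the samples, mimicking two hypothetical subtypes;
# the remaining genes are pure noise.
random.seed(0)
n_genes, n_samples, n_informative = 200, 80, 20
expr = []
for g in range(n_genes):
    row = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
    if g < n_informative:
        row = [v + (4.0 if s >= n_samples // 2 else 0.0)
               for s, v in enumerate(row)]
    expr.append(row)

# Keep the 20 most variable genes; on real data this would be e.g. 5000.
kept = top_genes_by_mad(expr, n_top=20)
filtered = [expr[i] for i in kept]  # filtered matrix: 20 genes x 80 samples
```

On the toy data the filter keeps the genes that separate the two groups, but that is by construction; on real data the cutoff (`n_top`) is exactly the parameter I am unsure about.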
Any input appreciated!