Question

Unsupervised selection of inter-cluster highly variable genes

0

5.5 years ago

elb ▴ 250

Hi guys,

I have a big data.frame of RNA-Seq counts in which rows are genes while columns are samples.

I clustered this large matrix and identified 6 major clusters. Some genes are common to all clusters, i.e. their expression does not vary much across the samples (around 100 patients), while other genes characterize a single cluster because their expression differs between clusters. For example, in one cluster 10 genes are highly expressed, while in all the other clusters the same genes are poorly expressed and do not vary substantially among themselves. Is there a way to select the highly "significant" or variable genes that characterize each cluster with respect to the others, so that I end up with a list of cluster-specific genes whose expression is specific to that cluster? I know that one way is to compute a log2 fold change, but I would like to perform this analysis in an unsupervised way, without having to choose the comparisons for the fold-change calculation. Can anyone help me with some ideas or references so that I can select the relevant cluster-specific genes?

Thank you in advance

e.

R RNA-Seq variance • 1.3k views


score 2 · Accepted Answer · 2018-10-16

2

5.5 years ago

Kevin Blighe 87k

You are looking for genes that characterise each of your sample clusters. I recently published work in which I did this with metabolomics data. Clusters were first identified via Partitioning Around Medoids (PAM), and the high/low metabolites in each cluster were then identified via the medoid values returned for each metabolite in each cluster. I later showed via ANOVA that the most interesting metabolites did indeed differ [statistically] between the clusters.
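To make the ANOVA check concrete, here is a minimal sketch of a one-way ANOVA F statistic for a single gene (or metabolite) across clusters. It uses only the Python standard library; the function name `anova_f` and the toy values are made up for illustration and are not taken from the published analysis. It returns the F statistic only, so a p-value would still need an F distribution (for example from a statistics package).

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic for one feature.

    values: expression values, one per sample.
    labels: cluster label of each sample, in the same order.
    """
    # Group the values by cluster label.
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)

    k = len(groups)   # number of clusters
    n = len(values)   # number of samples
    grand_mean = mean(values)

    # Between-cluster and within-cluster sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

    if ss_within == 0:
        # No within-cluster variation: F is undefined or infinite.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy example (hypothetical data): 9 samples in 3 clusters.
toy_values = [1, 2, 3, 2, 3, 4, 5, 6, 7]
toy_labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
f_stat = anova_f(toy_values, toy_labels)
```

A large F means the cluster means differ a lot relative to the spread within clusters. When this is run over thousands of genes, the resulting p-values would need multiple-testing correction.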

In your case, you have not described your clustering method or how you identified the 6 major clusters. One thing that you may consider is transforming your data to Z-scores and then taking Z > +2 as high expression in a particular cluster and Z < -2 as low expression in a particular cluster. In this way, you can define a set of genes that characterise each cluster.
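A rough sketch of the Z-score idea, using only the Python standard library. Several things here are assumptions rather than part of the suggestion itself: Z-scores are computed per gene across all samples, each cluster is summarised by the mean Z of its samples, and symmetric cut-offs of +2 and -2 are the defaults. The function name `cluster_marker_genes` and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def cluster_marker_genes(expr, labels, hi=2.0, lo=-2.0):
    """Return, for each cluster, the genes with high or low mean Z-score.

    expr:   dict mapping gene name -> list of normalised expression values,
            one value per sample.
    labels: list of cluster labels, one per sample, in the same order.
    hi, lo: Z-score cut-offs (assumed defaults; tune them for your data).
    """
    clusters = sorted(set(labels))
    markers = {c: {"high": [], "low": []} for c in clusters}

    for gene, values in expr.items():
        mu = mean(values)
        sd = pstdev(values)
        if sd == 0:
            continue  # a constant gene cannot characterise any cluster

        # Z-score of this gene in every sample.
        z = [(v - mu) / sd for v in values]

        for c in clusters:
            # Assumption: summarise a cluster by the mean Z of its samples.
            z_cluster = mean([zi for zi, lab in zip(z, labels) if lab == c])
            if z_cluster > hi:
                markers[c]["high"].append(gene)
            elif z_cluster < lo:
                markers[c]["low"].append(gene)
    return markers


# Toy example (hypothetical data): 12 samples in clusters A, B and C.
toy_labels = ["A"] * 2 + ["B"] * 2 + ["C"] * 8
toy_expr = {
    "g_up_A": [10, 10] + [1] * 10,        # high only in cluster A
    "g_down_B": [8, 8, 0, 0] + [8] * 8,   # low only in cluster B
    "g_flat": [5] * 12,                   # no variation at all
}
markers = cluster_marker_genes(toy_expr, toy_labels)
```

Note that a gene can only reach a mean Z beyond the cut-off in a cluster that is small relative to the whole cohort, so with large clusters a less strict threshold may be needed.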

Kevin


1

Thank you, Kevin, for your answer. I simply normalized my data and then performed an unsupervised hierarchical cluster analysis (HCA) with Pearson correlation as the distance measure. I have no reference samples. The clusters then emerged. I appreciate your work a lot, and I think I can draw inspiration from it. Thanks a lot.