Question

Spearman correlation between two genes

0

5.5 years ago

Biologist ▴ 290

I would like to check the correlation between two genes based on RNA-Seq data. I know that rho is Spearman's correlation coefficient.

A correlation coefficient between -1 and 0 indicates negative correlation.

A correlation coefficient between 0 and 1 indicates positive correlation.

When can you tell that the correlation is so strong? When I checked the correlation between two genes I see that rho is 0.2. Can I say now that the two genes have a strong positive correlation? Is there any specific value to say that it is strong?

One more basic question: I used RNA-Seq data to look at the correlation between the two genes. Should I use only tumor samples, or should I also include normal samples when checking the correlation?

thanks

RNA-Seq correlation spearman geneexpression • 4.1k views

updated 5.5 years ago by Alex Reynolds 35k • written 5.5 years ago by Biologist ▴ 290

1

I think there is more detail needed here about how the data is generated, but taking the statements at face value:

When can you tell that the correlation is so strong? When I checked the correlation between two genes I see that rho is 0.2. Can I say now that the two genes have a strong positive correlation? Is there any specific value to say that it is strong?

If the correlations (as you correctly stated) are in the interval [-1, 1], with positive and negative corresponding to the 'direction' of correlation, then I would say the answer is no, you cannot say that 0.2 is a strong correlation. You can say it is a positive correlation, but 0.2 is much closer to 0 than it is to 1.

For argument's sake, a strong positive correlation would perhaps be > 0.5. Similarly, a strong negative correlation might be < -0.5. The strength of the correlation is measured by the magnitude of the number, and the sign gives the direction. That is to say, a correlation of 0 means the variables are entirely uncorrelated (for Spearman's rho, that their ranks show no monotonic relationship).
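
To make the sign and magnitude concrete, here is a minimal sketch in plain Python that computes Spearman's rho as the Pearson correlation of the ranks. The expression values and gene names are made up for illustration, and `average_ranks` and `spearman_rho` are hypothetical helper names, not functions from any library.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up expression values for hypothetical genes across six samples.
gene_a = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
gene_b = [3.0, 3.5, 5.0, 9.0, 20.0, 50.0]  # always rises with gene_a
gene_c = [50.0, 20.0, 9.0, 5.0, 3.5, 3.0]  # always falls as gene_a rises
gene_d = [5.0, 1.0, 9.0, 2.0, 7.0, 4.0]    # no consistent ordering

print("a vs b:", spearman_rho(gene_a, gene_b))  # perfectly monotonic increasing
print("a vs c:", spearman_rho(gene_a, gene_c))  # perfectly monotonic decreasing
print("a vs d:", spearman_rho(gene_a, gene_d))  # weak, close to 0
```

Because rho only uses ranks, the first pair scores as a perfect positive correlation even though the relationship between the raw values is not linear.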

For the last question, I would say the answer depends on the hypothesis you're aiming to test.

5.5 years ago by Joe 21k

1

Here is a good explanation of why these rules of thumb do more harm than good: https://doi.org/10.1111/j.1467-9639.2009.00387.x

The main point of this article is that correlation has always to be linked to a problem it is applied to. Thus, in one situation, r = 0.50 may be thought of as very strong whereas, in another, as very weak.

5.5 years ago by Michael 54k

0

Very true. I was having a discussion earlier today about the arbitrariness of Fold Change cutoffs and how there isn't a one-size-fits-all rule. Same logic applies.

It's all relative (which is kinda what I was alluding to by saying that we need more detail about this particular experiment).

5.5 years ago by Joe 21k

score 4 · Answer 1 · 2018-10-07

Maybe any two genes picked at random are likely to have zero correlation for your dataset — but who knows, really?

One way to know is to use your real data, generate a bunch of correlations from it, and see how things look in aggregate.

When you don't know whether observing a statistic is significant or not, one approach is to use bootstrap sampling.

One advantage of bootstrap sampling is that it is non-parametric. That is, you don't need to make as many assumptions about the underlying distribution of statistics in your population.

You could sample pairs of genes with replacement, calculate their Spearman rho correlations (or whatever statistic), and use that set of correlations to get summary statistics and build a confidence interval.

For instance, maybe you grab two genes at random 1000 times, calculating 1000 rhos. From those 1000 rhos, you can say something about the mean or median rho you'd expect to see over random combinations of any two genes, within some level of accuracy, i.e., confidence interval.

You could say that the correlation of any two random genes will fall within some confidence interval around the population mean correlation, about 95% of the time.

From that, if your two genes of interest have a correlation score outside that confidence interval, you might say the correlation of their signals is "significant" in that a correlation (or anti-correlation) that "strong" is less likely to arise by chance. This may or may not be biologically interesting, but that's a separate question.
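
The procedure above could be sketched roughly as follows. This is a toy example under stated assumptions: the expression matrix is random noise generated in the script, all gene names are hypothetical, and the interval is taken as the central 95% of the sampled rhos (2.5th to 97.5th percentile), which is only one reading of the interval described above. It uses plain Python so it runs without extra packages.

```python
import random
from statistics import mean, quantiles


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_samples = 30

# Toy background: 200 hypothetical genes with independent random expression.
background = {
    f"gene_{i}": [rng.gauss(0, 1) for _ in range(n_samples)] for i in range(200)
}

# Two hypothetical genes of interest, constructed here to be co-expressed.
gene_x = [rng.gauss(0, 1) for _ in range(n_samples)]
gene_y = [v + rng.gauss(0, 0.1) for v in gene_x]

# Draw 1000 random gene pairs and record rho for each pair.
names = list(background)
rhos = []
for _ in range(1000):
    a, b = rng.sample(names, 2)  # two distinct genes per draw
    rhos.append(spearman_rho(background[a], background[b]))

# Central 95% of the sampled rhos: first and last of the 39 cut points.
cuts = quantiles(rhos, n=40)
low, high = cuts[0], cuts[-1]

observed = spearman_rho(gene_x, gene_y)
outside = observed < low or observed > high

print(f"median background rho: {cuts[19]:.3f}")
print(f"central 95% of background rhos: [{low:.3f}, {high:.3f}]")
print(f"observed rho: {observed:.3f} (outside interval: {outside})")
```

With real data, `background` would hold the per-gene expression vectors from your own samples, and `gene_x` and `gene_y` would be the two genes you care about, measured over the same samples.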