Finding Correlation between methylation and RNA-Seq data and their significance
5.6 years ago

I had a couple of questions.

First - Is there any specific reason to pick the Spearman or the Pearson correlation? Which one is usually preferred, and why?

Second - To compute the significance of the correlation, a common method is to compute a t-statistic from the correlation value and then derive the p-value from the t-distribution. But my question is: doesn't the underlying quantity need to be normally distributed for this to hold? Any clarity on the statistics side would be helpful.
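For reference, the statistic described above is usually written as follows for a Pearson correlation. This is the standard textbook form rather than anything specific to this dataset; here r is the sample correlation, n is the number of paired samples, and the null hypothesis is zero correlation. The t reference distribution is exact when the two variables are jointly normal, and otherwise it is an approximation.

```latex
t = r \sqrt{\frac{n - 2}{1 - r^{2}}},
\qquad
t \sim t_{n-2} \quad \text{under } H_0 : \rho = 0
```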

correlation RNA-Seq methylation • 1.4k views
5.6 years ago

The Spearman Correlation is typically used because it is non-parametric. I'm not sure how much it would matter in this case, but (in the very general sense) you may sometimes prefer a metric that takes the magnitude of difference (not just ranking) into consideration. For example, if expression is similar but rankings can vary in a way that isn't biologically meaningful (particularly among genes with lower expression), my personal opinion is that use of the Pearson Correlation may be preferable (even if there are theoretical arguments about the normality assumption).
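To make the ranking-versus-magnitude point concrete, here is a minimal sketch in plain Python (no external packages). The six samples and their values are made up for illustration, and `pearson`, `ranks` and `spearman` are helper names I am introducing here, not functions from any package. Within each group the small differences reshuffle the ranks, so the Spearman value is noticeably weaker than the Pearson value even though the two groups are cleanly separated.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical values: percent methylation and expression for 6 samples.
meth = [90, 91, 89, 10, 11, 9]
expr = [1.0, 1.2, 1.1, 8.0, 8.1, 7.9]

print("Pearson: ", round(pearson(meth, expr), 3))
print("Spearman:", round(spearman(meth, expr), 3))
```

Whether that within-group rank noise is biologically meaningful is a judgment call for your data; the sketch only shows that the two metrics can disagree for this reason.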

I believe you are describing the strategy for p-value calculation in the R cor.test() function. Again, you are right about there being certain assumptions (and there may be overall strategies that are in fact better in a particular circumstance); however, each individual feature may be more normally distributed among control samples than the overall set of values for a given sample (if you are talking about a per-feature test). If you had 100% methylation with low expression and 0% methylation with high expression (in a 2-group comparison, with good concordance between replicates), then you would also have a strong negative correlation (even though the overall methylation distribution for that feature is bimodal); however, if you have outlier samples, it is possible you may want to use some other sort of test/score (although, if you already have filtered for differential methylation, that should help with replicate concordance on one end).
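As a rough illustration of the 2-group scenario above, the following sketch pools made-up tumor and non-tumor samples for one feature and computes the Pearson correlation along with the t-statistic, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, which is the standard form for testing a Pearson correlation against zero. The sample values and the helper names `pearson` and `t_statistic` are assumptions for illustration only. The p-value lookup is left out because the Python standard library has no t-distribution.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_statistic(r, n):
    """t-statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))

# Hypothetical feature: near 100% methylation with low expression in
# non-tumor samples, near 0% methylation with high expression in tumor.
meth_nontumor = [96, 94, 97, 95, 93]
expr_nontumor = [1.2, 0.9, 1.1, 1.0, 1.3]
meth_tumor = [4, 6, 3, 5, 7]
expr_tumor = [8.1, 7.8, 8.3, 8.0, 7.9]

# Pooling both groups gives a bimodal methylation distribution.
meth = meth_nontumor + meth_tumor
expr = expr_nontumor + expr_tumor

n = len(meth)
r = pearson(meth, expr)
t = t_statistic(r, n)
df = n - 2

print(f"r = {r:.4f}, t = {t:.2f}, df = {df}")
```

The pooled correlation comes out strongly negative even though the pooled methylation values are clearly bimodal, while the values within each group are tightly clustered.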

To be clear, it is very possible that you can use something different than a correlation that works better for your particular project. However, my opinion is that you may find the identification of candidates reasonable (even with what you are describing, with assumptions of normality that aren't exactly met, particularly if performing one test for the distribution per sample).

In the case of differential methylation (particularly with BS-Seq data), you may find that a standard statistical test on percent methylation may have relatively low power (in which case, differences would be even less significant with the non-parametric test, or maybe even less significant for certain more complicated tests with a beta-binomial distribution, or the glm() logistic regression p-value for percent methylation). This may not necessarily be bad (if you want clear differences with good concordance between replicates). However, if you focus on the normality assumption to the extent that you define a methylation distribution that looks very different than your original signal (particularly if it makes some large methylation differences less significant, and small differences near 0% and 100% more significant), then I think it is at least worth seeing what differences with more direct measurements look like (so, comparing methods with different strategies for your project). I also think visualizing the percent methylation values is worthwhile, even if you have a p-value calculated with some sort of transformation (or count-based test).


Hi Charles,

Thanks for replying. You are correct that I am describing the strategy that the R function cor.test() uses to compute the p-value. I have two groups, tumor and non-tumor, so a bimodal distribution does exist, but for the correlation I have to use both groups together. This makes me uncomfortable about using the above function to compute p-values for the correlation.

Are there any other approaches that exist?


It's kind of hard for me to give advice for your specific project.

In general, it is important to have ways to critically assess the results, independent of the method used for analysis (or at least some way to gauge the results to decide whether they look reasonable or not).

For example, I apologize that my earlier response was long. However, if you have correlations close to -1 or 1, perhaps take a look at those and plot the percent methylation values for that feature. This may be harder to see for patient data compared to cell line data. However, if you see such results with a bimodal distribution overall, that means it wasn't really a serious concern (if you can get candidates with your expected trend, I would say that is what really matters).
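One way to pick out features worth plotting is sketched below in plain Python. The feature names, values, the 0.9 cutoff and the function name `flag_strong_correlations` are all hypothetical; the idea is only to list features whose correlation is close to -1 or 1 so that their percent methylation values can be inspected by eye.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns NaN if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def flag_strong_correlations(features, threshold=0.9):
    """Return (name, r) pairs with |r| >= threshold, strongest first."""
    hits = []
    for name, (meth, expr) in features.items():
        r = pearson(meth, expr)
        if abs(r) >= threshold:  # NaN compares False, so it is skipped
            hits.append((name, r))
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical per-feature data: (percent methylation, expression).
features = {
    "feature_A": ([95, 92, 96, 5, 8, 4], [1.0, 1.3, 0.8, 7.9, 8.2, 8.4]),
    "feature_B": ([50, 52, 49, 51, 48, 50], [3.0, 2.9, 3.1, 3.0, 3.2, 2.8]),
    "feature_C": ([10, 12, 9, 88, 91, 90], [1.1, 0.9, 1.0, 6.8, 7.2, 7.0]),
}

flagged = flag_strong_correlations(features, threshold=0.9)
for name, r in flagged:
    print(f"{name}: r = {r:.3f}  -> plot percent methylation for this feature")
```

The cutoff is arbitrary; the point is to end up with a short list you can actually look at, not to treat the threshold as a significance test.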

While I would expect more heterogeneity in a patient dataset, I think a cell line example from the COHCAP paper can show what I am trying to explain:

COHCAP PCA Figure 2

This is actually a PCA plot, because it is harder to directly insert the image from Supplemental Figure S6. However, you can see differentially methylated regions with a negative correlation with gene expression that were identified with a Pearson Correlation, where the overall distribution was bimodal (but the distribution within either group would probably have been roughly normal).

In terms of the differential methylation step, I have some comparisons for those 4 genes in the Protocol Exchange paper:

Protocol Exchange Table 2

In terms of the comparison with gene expression, if the strategies described in COHCAP are not sufficient, some sort of custom analysis may be useful. I've also heard of WIMSi, but I have to admit that I haven't really tested it.
