Question

Appropriate statistics to test for differences in heterogeneous features between transcript groups

9.9 years ago

Floris Brenk ★ 1.0k

Hi all,

I have a statistics question that I've been thinking about for a while without finding a satisfactory answer. I would really like to hear your thoughts, and I apologize in advance if a similar question has already been asked (I couldn't find one).

To summarize the problem: imagine we have two sets of transcripts (group A and group B) and want to know whether
1) transcripts in group A are more or less likely than transcripts in group B to contain CG-rich regions in their body
2) the proportion of sequence (base pairs) that sits in a CG-rich region is higher or lower for transcripts in group A than for transcripts in group B

In principle (as I understand it), question 1) could be addressed with Fisher's exact test, by building a 2x2 contingency table of how many transcripts in A overlap a CpG island, how many do not, how many transcripts in B overlap a CpG island, and how many do not. To start with, is this correct? For question 2), however, it seems strange to apply the same strategy: the numbers are much larger, so p-values tend to be very small even for very small differences...
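As an illustration of the 2x2 table described above, here is a minimal sketch of a two-sided Fisher's exact test in plain Python. The counts are invented, and `fisher_exact_two_sided` is just a name chosen for this example; an established statistics package would normally be the safer choice.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (the small factor guards against floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows = group A, group B; columns = overlaps a CpG island, does not.
p_value = fisher_exact_two_sided(30, 70, 15, 85)
print(f"two-sided p = {p_value:.4g}")
```

The test only sees the four counts, so it answers "does the fraction of overlapping transcripts differ?" and says nothing about how much sequence is covered.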

I am honestly quite confused at the moment and I would find it extremely helpful to have some feedback.

Thank you in advance for any help you might be able to provide!

statistics CG cpg • 2.3k views

updated 9.9 years ago by mikhail.shugay 3.5k • written 9.9 years ago by Floris Brenk ★ 1.0k

Ram · Answer 1 · 2014-05-26

9.9 years ago

Devon Ryan 104k

1) This could be rewritten as asking whether a dichotomous outcome depends on a continuous variable, which is a logistic regression (you can do this in R). You could do a Fisher's test, but then you're setting an artificial threshold for "CG-rich", which may not be as useful.
2) Here you would set an artificial threshold for "CG-rich" and then perform a similar test.
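To make the logistic-regression idea concrete, here is a rough sketch fitted by Newton-Raphson in plain Python. It assumes one possible reading of the setup: the dichotomous outcome is group membership (B = 1, A = 0) and the continuous variable is a per-transcript CG measure. The data are invented and `fit_logistic` is a hypothetical helper; in R the usual route would be `glm(..., family = binomial)`.

```python
from math import exp

def fit_logistic(x, y, n_iter=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient of the log-likelihood (g) and entries of X'WX (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) * delta = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Hypothetical per-transcript CG measure (fraction) and group labels (A = 0, B = 1).
cg = [0.35, 0.40, 0.42, 0.48, 0.50, 0.55, 0.58, 0.60,
      0.45, 0.50, 0.52, 0.57, 0.60, 0.63, 0.66, 0.70]
group = [0] * 8 + [1] * 8

b0, b1 = fit_logistic(cg, group)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

A positive slope means higher CG values go with higher odds of belonging to group B. This sketch does not compute standard errors or p-values and will not converge if the groups are perfectly separated.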

Anyway, it would seem more interesting to ask whether the GC content differs between the groups, which is a simple t-test or (better yet, since GC content is bounded and so subject to ceiling/floor effects) a Wilcoxon test.
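For two independent groups, the Wilcoxon test in question is the rank-sum (Mann-Whitney U) form. Below is a minimal sketch using the normal approximation in plain Python. It uses midranks for ties but applies no tie or continuity correction, so treat it as a rough guide only; the GC values are invented and `rank_sum_test` is a name chosen for this example.

```python
from math import erfc, sqrt

def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (U statistic for sample a, z score, p-value).
    """
    values = sorted(list(a) + list(b))
    # Assign each distinct value the average of the 1-based positions it occupies.
    rank_of = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2
        i = j
    n1, n2 = len(a), len(b)
    r1 = sum(rank_of[v] for v in a)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return u1, z, p

# Hypothetical per-transcript GC fractions for the two groups.
gc_a = [0.41, 0.44, 0.47, 0.52, 0.55, 0.58]
gc_b = [0.49, 0.53, 0.57, 0.60, 0.63, 0.66]
u, z, p = rank_sum_test(gc_a, gc_b)
print(f"U = {u}, z = {z:.3f}, p = {p:.4g}")
```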

9.9 years ago by Devon Ryan 104k

Hmm, you say it would be more interesting to ask whether the CG content differs between the two groups, and this would answer exactly my second question. But do you mean simply quantifying the CG content of each transcript in both groups and throwing the two vectors into a Wilcoxon test? That is, bluntly counting the percentage of the sequence that is C or G and seeing whether, on average, transcripts in group A are more or less CG-rich than transcripts in group B?
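The "bluntly count" step could look roughly like this in plain Python. The transcript IDs and sequences are invented, and `gc_fraction` is a hypothetical helper that ignores N and other ambiguous symbols, which is one of several reasonable conventions.

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T/U) in seq that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    total = gc + at
    # Sequences with no unambiguous bases have no defined GC fraction.
    return gc / total if total else float("nan")

# Hypothetical transcripts, keyed by made-up IDs.
group_a = {"txA1": "ATGCGCGCTA", "txA2": "GGGCCCATAT"}
group_b = {"txB1": "ATATATGCAT", "txB2": "TTTAAGCATA"}

gc_a = [gc_fraction(s) for s in group_a.values()]
gc_b = [gc_fraction(s) for s in group_b.values()]
print(gc_a, gc_b)
```

The two resulting lists are the vectors that would go into the test.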

9.9 years ago by Floris Brenk ★ 1.0k

I should have clarified that, without knowing the biological background to your question, I'd find it more interesting to look simply at GC differences. For that, yeah, just throw the vectors into a Wilcoxon test, which is simple enough.

Perhaps you're looking at differences in CpG island occupancy of coding regions or something like that, in which case my answers to (1) and (2) above would probably be more useful.

updated 4.3 years ago by Ram 43k • written 9.9 years ago by Devon Ryan 104k

Ram · Answer 2 · 2014-05-26

There is nothing wrong with reporting a low p-value when the number of transcripts is high, as long as you report an effect size alongside it, e.g. an odds ratio of 1.3 for group A versus group B. P-values are there to tell you whether the difference between your variables is statistically significant. If it is, you can then examine the size of that difference and discuss whether it really matters in a biological sense.
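As a sketch of reporting an effect size next to the p-value, the odds ratio for a 2x2 table and an approximate (Wald) 95% confidence interval can be computed as below. The counts are invented, `odds_ratio_ci` is a name chosen for this example, and the formula assumes that no cell is zero.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for the table [[a, b], [c, d]] with a Wald interval.

    z = 1.96 gives an approximate 95% interval; all four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical large counts: rows = group A, group B; columns = CG-rich, not CG-rich.
or_, lo, hi = odds_ratio_ci(5600, 4400, 5000, 5000)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

With counts this large the interval is narrow and a test would give a very small p-value, yet the odds ratio itself stays modest, which is the quantity to weigh biologically.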

PS. Just in case: if you're comparing several groups of genes using pairwise tests, don't forget to adjust the p-values for multiple testing.
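The post does not name an adjustment method. As one common option, here is a minimal sketch of the Benjamini-Hochberg (false discovery rate) adjustment in plain Python; the p-values are invented and `benjamini_hochberg` is a name chosen for this example. A Bonferroni correction (multiplying each p-value by the number of tests, capped at 1) would be the more conservative alternative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four pairwise group comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```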