I wanted to share two posts and ask for your take on them...and also write some thoughts out for myself to put ideas together :)
I came across these two posts on Cross Validated and thought they could be relevant for this forum too:
Here are my thoughts: there are lots of R packages that can be used to find differentially expressed genes (DEGs). One of my favorites is limma, which tests each gene with a moderated t-statistic, "borrowing" information from all the genes in order to estimate the variance of a single gene (correct me if I'm wrong). However, in my microarray analysis classes I was taught to use t.test for DEGs when we had "enough" samples.
The word "enough" always confused me: 3, 5, 10, 100, 10^89?! I didn't know that the t-test was initially developed for very small samples (as few as 4), and although more elegant ways of evaluating real differences between the means of two samples have been developed, t.test is still widely used. So, what are your thoughts on using t.test for DEGs with N > 20-30? Would you completely discard it? Would you still use it for big sample sizes?
Now let's say we have more than two conditions. Here the obvious example is ANOVA, which, by definition, analyzes the variance across samples. A good and detailed description is in this file from the Jackson Laboratory. Now, if we have just a "few" (same as "enough", a very irritating word) samples, then estimating the variance can be tough, and that's where tools like limma come in handy: it uses values from the other genes to "shrink" each gene's variance estimate. But then, let's say we have N > 30: would you still consider using plain ANOVA?
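To put the "shrinking" idea in concrete terms, here is a minimal sketch for a single gene in a two-group comparison. It is plain Python rather than limma itself, and it takes the prior variance (`prior_var`) and prior degrees of freedom (`prior_df`) as given inputs, which is an assumption of this sketch: limma estimates both from all the genes by empirical Bayes. The function name and the toy numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def ordinary_and_moderated_t(group_a, group_b, prior_var, prior_df):
    """Return (ordinary t, moderated t) for one gene measured in two groups.

    prior_var and prior_df are ASSUMED inputs in this sketch; limma
    estimates them from the variances of all genes.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2  # residual degrees of freedom for this gene
    # Pooled within-group variance of this gene alone
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    # Shrunken variance: weighted average of the prior and the gene's own variance
    shrunk_var = (prior_df * prior_var + df * pooled_var) / (prior_df + df)
    diff = mean(group_b) - mean(group_a)
    scale = sqrt(1 / n_a + 1 / n_b)
    t_ordinary = diff / (sqrt(pooled_var) * scale)
    t_moderated = diff / (sqrt(shrunk_var) * scale)
    return t_ordinary, t_moderated


# Toy values (made up): a gene whose variance happens to be tiny in 3 vs 3 samples
normal = [5.00, 5.02, 4.98]
tumor = [5.30, 5.32, 5.28]
t_ord, t_mod = ordinary_and_moderated_t(normal, tumor, prior_var=0.05, prior_df=4.0)
print(f"ordinary t = {t_ord:.2f}, moderated t = {t_mod:.2f}")
```

With few samples, a gene can get a huge ordinary t just because its variance was underestimated by chance; the shrunken variance pulls that back. The moderated statistic is then compared to a t-distribution with `prior_df + df` degrees of freedom. As the number of samples grows, `df` dominates the weighted average and the two statistics converge, which is one way to think about the N > 30 question.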
Lastly, normality. I often underestimated the importance of normally distributed data, then started reading about the central limit theorem and parametric versus non-parametric testing. Quick recap: parametric testing = t.test/ANOVA; non-parametric testing = Wilcoxon rank-sum test/Mann-Whitney test, and I found it nicely mapped in a table here. So, my question is: when you analyze your data, how much weight do you put on its distribution? I.e., do you run a Shapiro-Wilk test to check for normality, or do you just go by how the distribution looks?
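For reference, the rank-based side of that recap can be sketched for one gene as follows. This is a simplified Mann-Whitney U test in plain Python, using the normal approximation for the p-value. The simplifications are assumptions of this sketch: there is no continuity correction and no tie correction of the variance, so treat it as a rough guide rather than a replacement for a proper statistics library.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, approximate p-value).
    """
    pooled = sorted((value, label)
                    for label, group in (("x", x), ("y", y))
                    for value in group)
    # Assign 1-based ranks, giving tied values their average rank
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, label) in zip(ranks, pooled) if label == "x")
    n_x, n_y = len(x), len(y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    # Mean and standard deviation of U under the null hypothesis (no ties)
    mu = n_x * n_y / 2
    sigma = sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    z = (u_x - mu) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, p_value


# Toy values (made up): two completely separated groups of 5
u, p = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(f"U = {u}, approximate p = {p:.4f}")
```

Because only the ranks enter the statistic, no normality assumption is needed, at the cost of some power when the data really are close to normal.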
A lot of this goes back to power, but let's assume that we are given a set of data to analyze and we are not designing the experiment from the get-go...or if we are, we have limited $$$...oh wait...I forgot that in research money is not a problem (..add sarcastic grin... lol) :)
As a small test, I ran limma and t.test on a set of 28 normal and 32 tumor samples from some CEL files we had in the lab. The list of DEGs with the same thresholds (p.value < 0.05 and log2FC > 1.5) is exactly the same, but the p-values, as expected, are lower with limma.
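The comparison itself boils down to filtering both result tables with the same cutoffs and comparing the resulting gene sets. A minimal sketch of that step is below; the gene names and numbers are made up for illustration (they are not the lab's data), and using the absolute log2FC so that down-regulated genes are kept is an assumption of this sketch.

```python
def select_degs(results, p_cutoff=0.05, lfc_cutoff=1.5):
    """Return the set of gene IDs passing both thresholds.

    `results` maps gene ID -> (p_value, log2_fold_change).
    Absolute log2FC is used here (an assumption of this sketch).
    """
    return {gene for gene, (p, lfc) in results.items()
            if p < p_cutoff and abs(lfc) > lfc_cutoff}


# Made-up numbers for illustration only
limma_results = {"geneA": (1e-6, 2.1), "geneB": (0.2, 0.3), "geneC": (4e-4, -1.8)}
ttest_results = {"geneA": (1e-4, 2.1), "geneB": (0.3, 0.3), "geneC": (9e-3, -1.8)}

limma_degs = select_degs(limma_results)
ttest_degs = select_degs(ttest_results)

shared = limma_degs & ttest_degs      # called by both methods
only_limma = limma_degs - ttest_degs  # called by limma only
only_ttest = ttest_degs - limma_degs  # called by t.test only
print(sorted(shared), sorted(only_limma), sorted(only_ttest))
```

When the p-values differ but sit far from the cutoff in both methods, the two sets come out identical, which matches what I saw.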
Alright, I think that's it for now...thanks for reading. I've thought about these things for a while, so I'm curious to see what others think.
I haven't mentioned analyses like SAM or resampling because otherwise the post would become too long, but feel free to share your ideas about them too.
Thanks :)