EdgeR differential gene expression has impossibly low seeming P values and FDRs
1
0
Entering edit mode
7.7 years ago

Hello,

I've been trying to figure out how to use EdgeR to get differential gene expression. I use Tophat->htseqcount for input. I have 4 different conditions and 2 timepoints and 3 replicates (so a total of 24 samples). Using either glmfit or glmQLFit I get many thousands of signficantly differentially expressed genes between two conditions. The P-values and FDRs seem impossibly small (in the range of 10^-85 small). (Also using cuffdiff on the same dataset, I get roughly 10 differentially expressed genes per condition, which is a very different number of genes. I know that Cuffdiff tends to be more conservative, but surely it isn't THAT much more conservative.

If I do a volcano plot to visualize the data, it looks like this: All the genes are negatively downregulated. I'm pretty sure there is something going on oddly but I'm not sure what it could be. Do I need to correct p-values somehow?

enter image description here

volcano plot comparison... really oddly skewed data

This is the code I used:

library(edgeR)
#import files into R
files<-c('x18hpf_control_s1_htseq.txt','x18hpf_control_s2_htseq.txt','x18hpf_control_s3_htseq.txt','x18hpf_5A_s1_htseq.txt','x18hpf_5A_s2_htseq.txt','x18hpf_5A_s3_htseq.txt','x18hpf_5B_s1_htseq.txt','x18hpf_5B_s2_htseq.txt','x18hpf_5B_s3_htseq.txt','x18hpf_5a5b_s1_htseq.txt','x18hpf_5a5b_s2_htseq.txt','x18hpf_5a5b_s3_htseq.txt','x21hpf_control_s1_htseq.txt','x21hpf_control_s2_htseq.txt','x21hpf_control_s3_htseq.txt','x21hpf_5a_s1_htseq.txt','x21hpf_5a_s2_htseq.txt','x21hpf_5a_s3_htseq.txt','x21hpf_5b_s1_htseq.txt','x21hpf_5b_s2_htseq.txt','x21hpf_5b_s3_htseq.txt','x21hpf_5a5b_s1_htseq.txt','x21hpf_5a5b_s2_htseq.txt','x21hpf_5a5b_s3_htseq.txt')
DB<-readDGE(files,columns=c(1,2))
DB$samples
#make a design matrix
timepoint<-factor(rep(c("t18","t21"),each=12))
groups<-factor(rep(c("wt","tbx5a","tbx5b","double"), each=3,times=2))
combinations<-paste(timepoint,groups,sep=".")
f<-factor(combinations,levels=unique(combinations))
design<-model.matrix(~0+f)
colnames(design)<-levels(f)
design
# Filter and QC -----------------------------------------------------------


#filter out genes with low counts
#use counts per million bs of differences in library sizes
#1 CPM is approx = to 10 in smallest sample (96943)
keep<-rowSums(cpm(DB)>1)>=2
DB<-DB[keep,keep.lib.sizes=FALSE]
DB<-estimateDisp(DB,design)

#So need a GLM model and a design matrix
fit<-glmFit(DB,design)
#wt vs 5a 18 hpf
lrt.18.5a<-glmLRT(fit,coef=2)
tab1<-topTags(lrt.18.5a, n=Inf)
write.csv(tab1,file="18hpf control genes vs 5a.csv")
# and similar for the other conditions

## try a QL F-test which is preferred as it reflects the uncertainty in estimating the dispersion/ is more robust and reliable apparently
fit<-glmQLFit(DB, design)
qlf.18wtvs5a<-glmQLFTest(fit,coef=2)
topTags(qlf.18wtvs5a)
tab1a<-topTags(qlf.18wtvs5a,n=Inf)
write.csv(tab1a,file="qlf 18hpf wt vs 5a.csv")
#no I am still literally finding THOUSANDS of genes
RNA-Seq R • 3.1k views
ADD COMMENT
0
Entering edit mode

Link to image of volcano plot https://postimg.org/image/yev3egsuv/f99b03e7/ because it didn't show up in earlier comment.

ADD REPLY
0
Entering edit mode

The web page http://postimg.cc/image/yev3egsuv/ is blocked by the Internet Security Filter because This Websense category is filtered: Suspicious Content.

Can you try another host for the image?

ADD REPLY
0
Entering edit mode

Images can now be seen in the original post (at least I can see them).

ADD REPLY
2
Entering edit mode
7.7 years ago

You're missing a rather important part, namely a contrast. At the moment, you're just testing whether genes have counts greater than 0, which should lead to a very very skewed p-value and fold-change distribution.

ADD COMMENT
0
Entering edit mode

Thank you!

So to test for instance if there is a difference between the first set of samples and the second set of samples, I should use the contrast c(1,-1,0,0,0,0,0,0) because it would be testing if there is a significant difference between the control/first sample and the second sample? This does seem to stop the data from being skewed and makes the p-values much more reasonable. There's a high FDR unfortunately, but that seems like it might be because of the data itself rather than coding problems.

lrt.18.5aCONTRAST<-glmLRT(fit,contrast=c(1,-1,0,0,0,0,0,0))
tab1aa<-topTags(lrt.18.5aCONTRAST,n=Inf)
write.csv(tab1aa,file="18hpf wt vs tbx5a contrast.csv")`

This is how my design matrix is set up:

> design
   t18.wt t18.tbx5a t18.tbx5b t18.double t21.wt t21.tbx5a t21.tbx5b t21.double

1       1         0         0          0      0         0         0          0
2       1         0         0          0      0         0         0          0
3       1         0         0          0      0         0         0          0
4       0         1         0          0      0         0         0          0
5       0         1         0          0      0         0         0          0
6       0         1         0          0      0         0         0          0
7       0         0         1          0      0         0         0          0
8       0         0         1          0      0         0         0          0
9       0         0         1          0      0         0         0          0
10      0         0         0          1      0         0         0          0
11      0         0         0          1      0         0         0          0
12      0         0         0          1      0         0         0          0
13      0         0         0          0      1         0         0          0
14      0         0         0          0      1         0         0          0
15      0         0         0          0      1         0         0          0
16      0         0         0          0      0         1         0          0
17      0         0         0          0      0         1         0          0
18      0         0         0          0      0         1         0          0
19      0         0         0          0      0         0         1          0
20      0         0         0          0      0         0         1          0
21      0         0         0          0      0         0         1          0
22      0         0         0          0      0         0         0          1
23      0         0         0          0      0         0         0          1
24      0         0         0          0      0         0         0          1
attr(,"assign")
[1] 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$f
[1] "contr.treatment"

improved volcano plot http://imgur.com/a/hulvx

ADD REPLY
0
Entering edit mode

Yup, that looks like the correct result (sorry the p-values aren't better).

ADD REPLY

Login before adding your answer.

Traffic: 1974 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6