Too many differentially expressed genes in RNA-seq?
3
4
Entering edit mode
6.2 years ago

Hello all,

I have some RNA-seq data from an unusual biological system, and as a result I need to look for genes that are differentially expressed genes between two samples, one of which was super stressed. My DESeq2 results said that more than 70% of genes had an adjusted p-value below 0.1! I was expecting a lot of differentially expressed genes, but this seems very high. Is there such a thing as 'too many' DEGs? Is there a way I can test whether my data fall into this category? Does anyone know of any solutions to this issue? Thanks in advance!

RNA-Seq DESeq2 • 8.4k views
ADD COMMENT
2
Entering edit mode

Note that one of the first steps of the DESeq2 process is library normalization, where DESeq normalizes based on assuming that the genes whose expression is the median is expressed the same between your samples. If this fundamental assumption is wrong, it's going to affect the results you get. You might need to think about some other kind of normalization.

ADD REPLY
0
Entering edit mode

Thanks for this response! I liked what Tiago mentioned about cutoffs so I made a volcano plot. In this plot, the blue is everything with padj < 1e-10, green is everything with log2 fold change > 5, and yellow is both. Would be really interested to hear anyone's thoughts on this. Not sure if it's ideal to just choose cutoffs that 'look good'. I'll look into other normalization methods like you suggest, though admittedly I'm not sure where to start...

ADD REPLY
1
Entering edit mode

Here is a little tip for your volcano plot: Make a column with information about who is DEG, and who is not

DESeq.Result$super <- DESeq.Result$log2FoldChange >  1  & DESeq.Result$padj < 0.01
DESeq.Result$sub   <- DESeq.Result$log2FoldChange < -1  & DESeq.Result$padj < 0.01
DESeq.Result$threshold <- as.factor(abs(DESeq.Result$log2FoldChange) > 1 & DESeq.Result$padj < 0.01)

volcano_plot <- ggplot(data=DESeq.Result, aes(x=log2FoldChange, y=-log10(padj))) +
  geom_point(data=DESeq.Result, size=1, colour="gray") +
  geom_point(data=DESeq.Result[DESeq.Result$super==TRUE, ], size=2, colour="#CC0000") +
  geom_point(data=DESeq.Result[DESeq.Result$sub  ==TRUE, ], size=2, colour="#000099") +
  xlab("log2 fold change") +
  ylab("-log10 p-value adjusted") +
  ggtitle("My Title")+
  scale_x_continuous() +
  scale_y_continuous() +
  theme_bw() +
  theme(axis.title.y = element_text(face="bold", size=16),
        axis.title.x = element_text(face="bold", size=16, colour="black"),
        axis.text = element_text(size=12),
        legend.title =element_blank() ,
        legend.text = element_text(size = 12)) +
  theme(plot.title = element_text(hjust = 0.5))
volcano_plot
ADD REPLY
3
Entering edit mode
6.2 years ago
h.mon 35k

This question has been asked before, see previous discussion at Question: Differential expression for two very different samples. As swbarnes2 said, one assumption of DESeq2 and edgeR is that not too many genes are differentially expressed. edgeR allows the user to mitigate this issue with the argument logratioTrim in the function calcNormFactors() - see here. I don't know DESeq2 has a similar option - I believe it doesn't.

Regarding cut-offs: the author of DESeq2 argues that instead of applying a cut-off after a regular test for differential expression (that is, a test against the null hypothesis of zero LFC), it is better to test if the LFC is above a certain threshold. He argues that the later approach keeps the nominal fdr, while the former doens't. See discussions on DESeq2 vignette and paper, and at the BioConductor forum here and here. Use the arguments lfcThreshold and altHypothesis from results() to set the desired LFC threshold and test.

ADD COMMENT
0
Entering edit mode

Whoops, I don't know how I missed that earlier post. I will give the edgeR docs a read. I also like what you suggested about the L2FC cutoff--it seems like an intuitive way to compare very different samples. Thank you!!!

ADD REPLY
2
Entering edit mode
6.2 years ago
tiago211287 ★ 1.4k

It depends on your target organism and tissue. Also, adjusted p-values(padj) of 0.1 aren't big. 0.1 = 10%. In general, a good cutoff for considering a gene as a DEG is less than 0.01 (1%). Better to visualize it in the minus (-log10(padj)) scale so that anything between 0 and 1 will be a non-DEG, and everything bigger than 2 may be considered as a DEG. Keep in mind that a better approach is to use a cutoff for the fold change as well. Considering only the padj may not be relevant.

ADD COMMENT
1
Entering edit mode
6.2 years ago
JJ ▴ 670

I agree it depends on the biology behind it and usually you know if you expect a lot or not. I also agree with Tiago that the p value cutoff should be set stricter. Standard cutoff is 0.05. Additionally you could set a cutoff for the fold change and/or for the FPKM/RPKM values. Standard cutoff for log2FC is usually < -1 and > 1.

ADD COMMENT
0
Entering edit mode

Yes - I strongly agree with having a fold-change filter.

While it is a slightly different topic, I thought this was an issue when the number of differentially expressed genes was approaching the number of genes in the genome (which would lead to inappropriate power calculations): https://disqus.com/by/charleswarden/

ADD REPLY

Login before adding your answer.

Traffic: 2526 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6