RNAseq: What's the accurate way to test whether a set of N genes is differentially expressed?
1
0
Entering edit mode
5.0 years ago
francois ▴ 80

In RNAseq data, I want to test statistically whether a set of 17 genes are differentially expressed in a condition. I have two conditions: control treated/drug treated. My quick & dirty way of looking at it was: I computed up/downregulation vs control by simply dividing the counts in drug treated by the counts in control treated. I thus have 17 fold-change for my 17 genes (eg. 0.8, 1.2, etc.). I averaged these fold-changes (average is 0.85, so these genes seem to be downregulated as a set in average) To have an idea whether this result had anything surprising, I computed the same average for 10,000 random sets of 17 genes. 90% of the random sets are over 0.85, 10% are below.

What's the formal way of doing this?

RNA-Seq • 1.3k views
ADD COMMENT
1
Entering edit mode

The formal way is to test all genes at the same time, there are tools like edegR, DESeq2 or limma voom which are designed for this kind of analysis. Read the manual and it will guide you thru the analysis.

ADD REPLY
0
Entering edit mode

Sure, but I believe that is a different question. Why would I not be able to specifically look a set of genes and ask whether they are down/upregulated as a set?

ADD REPLY
0
Entering edit mode

Because RNA-seq data gives info on all genes. If you are interested in a panel you should use qPCR for example.

ADD REPLY
0
Entering edit mode

The best strategy, in my opinion, is to do DE analysis first with e.g. edgeR or limma voom, followed by roast analysis which tests whether your 17 genes are enriched in the set of downregulated genes.

ADD REPLY
0
Entering edit mode

Because genes in an RNA-seq experiment are not independent measurements. You should use one of the established tools Benn mentioned for both count normalization and differential analysis. Did you do any normalization of the counts in your approach? Using the above frameworks also allows you to claulcate significances based on the expression levels and inter/intra-sample variance which is preferred and recommended. As a review I would reject such a hand-made approach you propose as it is not standardized and not really reliable. Lowly-expressed genes tend to have high (and artificially high) fold changes.

ADD REPLY
0
Entering edit mode

You need data from all the genes to normalize properly. Manipulating raw counts without normalizing is a really bad idea.

ADD REPLY
1
Entering edit mode

The sequencing depth may vary between groups, so plain read-counts are misleading. If your control had better RNA handling, because they were processed before lunch, it may have more reads overall, and then all genes appear downregulated. Tools made for differential expression know this and account for total reads by 'normalizing' the readcounts into something else. Try DESeq2 or edgeR to get per-gene information. For evaluating if the set is wholly shifted, I like your random sets method, but be sure to account for both up and downregulation, maybe look at abs(x-1). And 10K is not a large number of 17-sets (there's >10^56 ways to choose 17 from 15K genes, so try to run a few million).

ADD REPLY
0
Entering edit mode
5.0 years ago

You're looking for GSEA (gene set enrichment) types of methods. Note that these may not have fold-changes, since that often has no meaning.

ADD COMMENT

Login before adding your answer.

Traffic: 2644 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6