Question

How many genes to include for a GSEA analysis

1

7.0 years ago

moxu ▴ 510

This seems to be a simple question.

You have a list of genes from a DEG analysis, with p-values, FDRs, logFCs, etc. Previously, for GSEA I kept only genes with FDR < 0.25 or 0.05, ranked them by logFC (in other words, pre-ranked the genes by logFC), and then ran GSEA. Now I am wondering whether this is a good approach:

1. There might be too many genes (typically ~50%). Assuming there are usually 4~5 pathways involved and each pathway has about 500 genes, the top 2,000 genes might be enough to include in the GSEA analysis.
2. I am not sure logFC is the best way to rank genes. Maybe use -log(PValue) as the magnitude of the rank score and the sign of logFC as the sign of the score? I.e., use sign(logFC) * (-log(PValue)) as the rank score?
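
A minimal sketch of the signed score proposed above, assuming base-10 logs (the post does not say which base) and hypothetical gene names and values. The `p_floor` clamp is an added assumption to avoid taking the log of zero.

```python
import math

def signed_log_p(log_fc, p_value, p_floor=1e-300):
    """Return sign(logFC) * -log10(p); p is clamped at p_floor (assumption)."""
    p = max(p_value, p_floor)
    sign = (log_fc > 0) - (log_fc < 0)  # +1, 0, or -1
    return sign * -math.log10(p)

# Hypothetical DE results: gene -> (logFC, p-value)
de_results = {
    "GENE_A": (2.5, 1e-8),
    "GENE_B": (-1.2, 1e-4),
    "GENE_C": (4.0, 0.2),
    "GENE_D": (-0.3, 0.9),
}

# Rank from most up-regulated/significant to most down-regulated/significant.
ranked_by_signed_p = sorted(
    de_results, key=lambda g: signed_log_p(*de_results[g]), reverse=True
)
```

Note that GENE_C has the largest logFC but lands in the middle of this ranking because its p-value is weak, which is the behaviour the proposal is after.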

Googled briefly but didn't find a convention.

Thanks.

rna-seq next-gen gene • 12k views

ADD COMMENT • link updated 2.7 years ago by sontiroy • 0 • written 7.0 years ago by moxu ▴ 510

0

Your first point is asking about a good threshold or filter for your gene list. Typically, this would depend on what you're interested in. For example, you could be interested in only the strongest effects and therefore take only the most extreme logFC. I could also imagine situations in which you are only interested in certain categories of genes, maybe because you have some prior knowledge. On the second point, you have to consider what the parameter used for ranking represents: logFC represents the strength of the effect while log(p-value) represents "unexpectedness". To me, effect strength is more relevant than p-value because, without any other information, I wouldn't trust a small variation even if it is associated with a small p-value. Another way of putting it is that statistical significance doesn't imply biological relevance but a strong effect is likely to have some biological impact.

ADD REPLY • link 7.0 years ago by Jean-Karim Heriche 27k

0

Please see my reply below to igor -- one of my experiences is that "true signals" (low p-values) should be weighted much more than "big signals" (large abs(logFC)s).

ADD REPLY • link 7.0 years ago by moxu ▴ 510

score 3 · Answer 1 · 2017-04-17

3

7.0 years ago

igor 13k

According to the GSEA documentation:

The GSEA algorithm does not filter the expression dataset and does not benefit from your filtering of the expression dataset. During the analysis, genes that are poorly expressed or that have low variance across the dataset populate the middle of the ranked gene list and the use of a weighted statistic ensures that they do not contribute to a positive enrichment score. By removing such genes from your dataset, you may actually reduce the power of the statistic.
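
To illustrate why genes in the middle of the list contribute little, here is a simplified sketch of a weighted running-sum enrichment statistic of the kind the quote describes. It is not the GSEA implementation (there is no permutation test or normalization), and the function name, toy scores, and gene names are all hypothetical.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Simplified weighted running-sum statistic over a ranked gene list."""
    hit_norm = sum(abs(scores[g]) ** weight for g in ranked_genes if g in gene_set)
    n_miss = sum(1 for g in ranked_genes if g not in gene_set)
    if hit_norm == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")
    running, best = 0.0, 0.0
    for g in ranked_genes:
        if g in gene_set:
            # Hits step up in proportion to their ranking score.
            running += abs(scores[g]) ** weight / hit_norm
        else:
            # Misses step down by a constant amount.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

toy_scores = {"A": 5.0, "B": 4.0, "C": 0.1, "D": -0.1, "E": -3.0, "F": -4.0}
toy_ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)

es_strong = enrichment_score(toy_ranked, toy_scores, {"A", "B"})
es_mixed = enrichment_score(toy_ranked, toy_scores, {"A", "C"})
# Step contributed by the near-zero gene C in the {"A", "C"} set:
weak_step = abs(toy_scores["C"]) / (abs(toy_scores["A"]) + abs(toy_scores["C"]))
```

In this toy case the near-zero gene C adds a step of under 0.02 to the running sum, so leaving such genes in the list barely moves the score for sets that contain them.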

And additionally in the wiki:

We hopefully will be able to devote some time to investigating this, but in the mean time, we are recommending use of the GSEAPreranked tool for conducting gene set enrichment analysis of data derived from RNA-seq experiments. In particular: Prior to conducting gene set enrichment analysis, conduct your differential expression analysis using any of the tools developed by the bioinformatics community (e.g., cuffdiff, edgeR, DESeq, etc). Based on your differential expression analysis, rank your features and capture your ranking in an RNK-formatted file. The ranking metric can be whatever measure of differential expression you choose from the output of your selected DE tool. For example, cuffdiff provides the (base 2) log of the fold change.
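
A minimal sketch of capturing a ranking in an RNK-style file, assuming the format is two tab-separated columns (feature name, then ranking score); that layout should be checked against the GSEA file-format documentation. The gene names and logFC values are hypothetical, and the whole list is written without filtering, in line with the quote above.

```python
import io

def write_rnk(scores, handle):
    """Write gene<TAB>score lines, highest score first."""
    for gene, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        handle.write(f"{gene}\t{score:.6g}\n")

# Hypothetical logFC values from a DE tool, used here as the ranking metric.
log_fc = {"GENE_A": 2.5, "GENE_B": -1.2, "GENE_C": 0.05, "GENE_D": -0.3}

buffer = io.StringIO()  # swap for open("ranked.rnk", "w") to write a real file
write_rnk(log_fc, buffer)
rnk_text = buffer.getvalue()
```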

ADD COMMENT • link 7.0 years ago by igor 13k

0

The reason I created this post is that my recent experience hinted that -log(P) is more trustworthy than logFC. One of my projects is to find the pathways differentially regulated by a certain compound in a drug-resistant cell line. We treated the sensitive & resistant cell lines with two different compounds at two different concentrations, in addition to DMSO. GSEA found the same up-regulated pathway for the two compounds: TNFA signaling via NFKB. This makes a lot of sense to us -- and maybe to you as well, because TNFA & NFKB are famous tumor-related genes. What is interesting is that genes of this pathway occur frequently among those with extremely low p-values (7 out of 20), while the genes with extreme logFCs do not include a single gene of this pathway (0/20), even after FDR < 0.05 filtering. Genes with extreme logFCs usually have relatively high p-values. This makes me think that maybe we should weigh "true signals" much more than "big signals".
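
The comparison described above (pathway genes among the top genes by p-value versus the top genes by |logFC|) can be sketched as follows. The table, gene names, and pathway membership are made-up toy values, not the poster's data, and N is 3 rather than 20 only to keep the example short.

```python
def top_n_hits(table, pathway, key, n):
    """Count pathway members among the first n rows after sorting by key."""
    top = sorted(table, key=key)[:n]
    return sum(1 for row in top if row["gene"] in pathway)

# Toy DE table: modest but consistent changes vs. large but noisy changes.
toy_table = [
    {"gene": "G1", "logFC": 0.8, "p": 1e-12},
    {"gene": "G2", "logFC": 1.1, "p": 1e-10},
    {"gene": "G3", "logFC": -0.9, "p": 1e-9},
    {"gene": "G4", "logFC": 6.0, "p": 1e-3},
    {"gene": "G5", "logFC": -5.5, "p": 5e-3},
    {"gene": "G6", "logFC": 4.8, "p": 2e-3},
]
toy_pathway = {"G1", "G2", "G3"}

hits_by_p = top_n_hits(toy_table, toy_pathway, key=lambda r: r["p"], n=3)
hits_by_fc = top_n_hits(toy_table, toy_pathway, key=lambda r: -abs(r["logFC"]), n=3)
```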

ADD REPLY • link 7.0 years ago by moxu ▴ 510

1

It depends on how you are calculating fold changes. If you have an extreme outlier, it can push the fold change up or down by a lot. DESeq2, for example, performs "shrinkage" of the fold change to account for variance. In a way, the fold change has the significance built in in that case. I am not sure which other packages perform the same type of adjustment.
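
A small illustration of the outlier effect mentioned above, using hypothetical replicate counts. The median-based variant is only a simple robust comparison chosen for this sketch; it is not DESeq2's shrinkage procedure.

```python
import math
from statistics import mean, median

def log2_fold_change(treated, control, summary=mean):
    """log2 ratio of a per-group summary (mean by default)."""
    return math.log2(summary(treated) / summary(control))

# Hypothetical counts for one gene; the third treated replicate is an outlier.
control_counts = [10, 12, 11]
treated_counts = [11, 10, 500]

fc_mean = log2_fold_change(treated_counts, control_counts)
fc_median = log2_fold_change(treated_counts, control_counts, summary=median)
```

With these numbers the mean-based log2 fold change is close to 4, while the median-based one is 0, so a single replicate accounts for the entire apparent effect.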

ADD REPLY • link 7.0 years ago by igor 13k

0

I used edgeR, which does Bayes shrinkage as well. But, still ...

Too bad this website does not host images; otherwise I would be happy to upload some to demonstrate.

ADD REPLY • link 7.0 years ago by moxu ▴ 510

0

Actually, it does host images.

ADD REPLY • link 4.9 years ago by Ömer An ▴ 260

0

Hi Moxu,

I think that whatever gene list we obtain is simply what remains after applying a filter, at a chosen threshold, to exclude false signals. Even though it will contain some false positives, we should do our biological analysis as though they are all true positives. Thus, examining logFC makes far more biological sense for determining which genes are closely associated with our phenotype.

ADD REPLY • link 2.7 years ago by sontiroy • 0