Question

Input GSEA Pre-ranked list

5

8.8 years ago

stevenlang123 ▴ 210

Hey y'all

I'm currently trying to run GSEA using a pre-ranked gene list but I'm not sure if my input file is correctly formatted, because my results seem to be mostly insignificant.

My input looks something like this (roughly 16,000 genes), where the ranking statistic is the negative log of the p-value obtained through an association test:

GENE      neg_log_Chi_permutation
ARHGAP4   0.928986
C16orf3   1.496821
HOPX      0.975562
FAM3D     1.132781
HTR2C     1.276158
UGCG      0.064802
VPS13D    0.123508
VWF       0

My results show over 300 gene sets as enriched, but many of them have an FDR q-value close to 1 despite a high NES. What could I be doing wrong?

Best,
Steven

gene next-gen RNA-Seq genome software • 29k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.8 years ago by stevenlang123 ▴ 210

2

The original paper describing GSEA actually uses an FDR cutoff of less than 0.25. Additionally, pay attention to p = 0 when you take -log(p).
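To illustrate the p = 0 caveat: -log(0) is infinite, which cannot be used as a finite ranking score. Below is a minimal sketch of one possible workaround, assuming it is acceptable to replace zeros with a small floor. The function name `neg_log10_p` and the choice of floor (half the smallest non-zero p-value) are assumptions for illustration, not something GSEA prescribes.

```python
import math

def neg_log10_p(p_values, floor=None):
    """Return -log10(p) for each p-value, replacing zeros with a small floor."""
    positive = [p for p in p_values if p > 0]
    if floor is None:
        # Assumption: use half of the smallest non-zero p-value as the floor;
        # fall back to a tiny constant if every p-value is zero.
        floor = min(positive) / 2 if positive else 1e-300
    # max(p, floor) only changes p-values that are zero (or below the floor).
    return [-math.log10(max(p, floor)) for p in p_values]

# Hypothetical p-values, including an exact zero.
scores = neg_log10_p([0.05, 0.0, 1.0, 0.001])
```

With this floor, the gene with p = 0 gets the largest finite score instead of infinity, and p = 1 maps to a score of zero.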

ADD REPLY • link 8.8 years ago by Zhilong Jia ★ 2.2k

0

I do have several instances of p=1, and subsequently instances where genes are ranked the same. How should that be reconciled?
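Before deciding how to handle ties, it may help to measure how many genes share a score. This is a small sketch using only the standard library; `tied_groups` and the gene `GENE_X` are hypothetical names for illustration, and the sketch only counts ties rather than saying how GSEA treats them.

```python
from collections import Counter

def tied_groups(scores):
    """Map each score shared by more than one gene to the number of genes with it."""
    counts = Counter(scores.values())
    return {score: n for score, n in counts.items() if n > 1}

# VWF and HOPX use values from the question; GENE_X is a made-up entry with p = 1.
example = {"VWF": 0.0, "GENE_X": 0.0, "HOPX": 0.975562}
ties = tied_groups(example)
```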

ADD REPLY • link 8.8 years ago by stevenlang123 ▴ 210

5

I just noticed an error in your method. You should combine the sign of the logFC with the -log of the p-value; as it stands, both up- and down-regulated DE genes are ranked at the top of the .rnk file. Alternatively, rank by another metric, such as logFC or the t statistic. Additionally, GSEA uses all genes, not just the DE genes.
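The signed metric described above can be sketched as follows. This is a minimal illustration, assuming each gene has an effect direction (here a logFC) and a strictly positive p-value; the gene names and numbers are hypothetical, and `signed_rank_metric` is not part of any GSEA tool.

```python
import math

def signed_rank_metric(log_fc, p_value):
    """Return sign(logFC) * -log10(p). Assumes 0 < p_value <= 1."""
    sign = (log_fc > 0) - (log_fc < 0)  # +1, 0 or -1
    return sign * -math.log10(p_value)

# Hypothetical (logFC, p-value) pairs.
genes = {
    "GENE_A": (2.1, 0.001),
    "GENE_B": (-1.8, 0.0001),
    "GENE_C": (0.1, 0.9),
}

# Sort from most positive to most negative score.
ranked = sorted(
    ((gene, signed_rank_metric(fc, p)) for gene, (fc, p) in genes.items()),
    key=lambda item: item[1],
    reverse=True,
)

# Two tab-separated columns (gene, score), as in a .rnk file.
rnk_text = "\n".join(f"{gene}\t{score:.6f}" for gene, score in ranked)
```

With a signed score, strongly down-regulated genes land at the bottom of the list rather than next to the strongly up-regulated ones.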

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.8 years ago by Zhilong Jia ★ 2.2k

0

Hi Zhilong,

Regarding "pay attention to p=0 when you -log(p)", should those genes with p=1 be filtered out prior to the GSEA analysis?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.6 years ago by sebastiangeorge01081983 • 0

0

I suggest that all genes/probes be included in the GSEA input.

ADD REPLY • link 8.6 years ago by Zhilong Jia ★ 2.2k

Ram · Answer 1 · 2015-07-13

7

8.8 years ago

GouthamAtla 12k

You should use all the genes from your dataset and rank them. Here is a nice post on ranking the DE genes for GSEA analysis.

http://genomespot.blogspot.com.es/2014/09/data-analysis-step-8-pathway-analysis.html

ADD COMMENT • link updated 17 months ago by Ram 43k • written 8.8 years ago by GouthamAtla 12k

Ram · Answer 2 · 2015-07-13

3

8.8 years ago

andrew ▴ 560

I'm not an expert on GSEA, but there are at least two possibilities that I can see worth considering.

1. Your data is not significant. I know it's hard to believe, but it happens.
2. If your input list contains roughly 16k genes, you are covering almost all of the protein-coding genes, so there is very little room for "enrichment". The ranking of your gene list is supposed to account for this, but it remains one of the classic limitations of any enrichment analysis.

Others may have alternative points to consider.

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.8 years ago by andrew ▴ 560

0

Thanks for your feedback. In the case of the latter, can the pre-ranked gene list be truncated? Or does post hoc modification bias the experiment?

ADD REPLY • link 8.8 years ago by stevenlang123 ▴ 210