We have a transcriptome of a non-model organism. Comparing two conditions, our kallisto-based analysis yielded about 6000 differentially expressed genes (DEGs). KEGG metabolic pathways of a related organism were used to classify DEGs and check pathway enrichment. These gene sets are relatively small (3-100 genes per set). Although the number of DEGs is high, the number of sets with a significant adjusted p-value is quite small: only 10-15 sets are enriched at a padj cutoff of 0.05. We see the same result when we apply Fisher's exact test.
We're newbies in the field and it feels like we've missed something. What are we doing wrong? Sorry if I've left out any details.
Why is that wrong? What reasons do you have to expect more gene sets to be enriched in your experiment? What I don't understand is how you used KEGG metabolic pathways of a related organism "to classify DEGs and check pathway enrichment". Did you first classify the DEGs based on your knowledge and then run a GSEA for each group of DEGs? Usually one tests the enrichment of each gene set against all genes, regardless of whether they are DEGs or not.
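To make the point about testing against all genes concrete, here is a minimal Python sketch of an over-representation test. It is not the original poster's pipeline: the function names, the toy gene IDs, and the choice of Benjamini-Hochberg correction are all assumptions made for illustration. It uses the one-sided hypergeometric test, which is equivalent to a one-sided Fisher's exact test, with the full set of tested genes as the background.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n DEGs are drawn from N background genes,
    K of which belong to the pathway (one-sided Fisher equivalent)."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def pathway_enrichment(pathways, degs, background):
    """Test every pathway against the same background of all tested genes."""
    background = set(background)
    degs = set(degs) & background
    names, pvals = [], []
    for name, genes in pathways.items():
        genes = set(genes) & background  # drop genes that were never tested
        if not genes:
            continue
        k = len(genes & degs)
        names.append(name)
        pvals.append(hypergeom_upper_tail(k, len(genes), len(degs), len(background)))
    padj = benjamini_hochberg(pvals)
    return {name: (p, q) for name, p, q in zip(names, pvals, padj)}

# Toy data (hypothetical): 20 tested genes, 6 of them DEGs.
background = [f"g{i}" for i in range(20)]
degs = background[:6]
pathways = {
    "A": ["g0", "g1", "g2", "g3"],     # all four members are DEGs
    "B": ["g4", "g10", "g11", "g12"],  # one member is a DEG
}
result = pathway_enrichment(pathways, degs, background)
```

Note that when a large share of the background is differentially expressed, a pathway needs a DEG fraction well above that share to come out significant, which is harder for small sets.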
We blastx'ed our transcripts against the proteins of a related organism whose genes are classified into KEGG pathways, so we can now assign matching transcripts to these pathways. Based on previous experiments, there is strong evidence that some groups should be enriched with DEGs, but they aren't; also, the states being compared are radically different at the physiological level.
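The grouping step described here could look roughly like the following Python sketch. It assumes BLAST tabular output (`-outfmt 6`, default column order), keeps the best hit per transcript by bit score, and uses a hypothetical `protein_to_pathways` lookup; the e-value cutoff, the function names, and the toy records are illustrative assumptions, not the actual settings used.

```python
from collections import defaultdict

def best_hits(blast_lines, max_evalue=1e-5):
    """Keep the highest-bitscore hit per transcript from outfmt 6 lines."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:  # assumed cutoff, tune for your data
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}

def transcripts_by_pathway(hits, protein_to_pathways):
    """Group transcripts into pathways via their best-hit protein."""
    sets = defaultdict(set)
    for transcript, protein in hits.items():
        for pathway in protein_to_pathways.get(protein, ()):
            sets[pathway].add(transcript)
    return dict(sets)

# Toy records (hypothetical transcript and protein IDs).
lines = [
    "tx1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200",
    "tx1\tprotB\t60.0\t100\t40\t0\t1\t300\t1\t100\t1e-10\t80",
    "tx2\tprotB\t85.0\t120\t18\t0\t1\t360\t1\t120\t1e-40\t180",
    "tx3\tprotC\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t25",
]
protein_to_pathways = {
    "protA": ["glycolysis"],
    "protB": ["glycolysis", "TCA cycle"],
}
hits = best_hits(lines)
gene_sets = transcripts_by_pathway(hits, protein_to_pathways)
```

Transcripts without a hit above the cutoff end up in no pathway, so it matters whether they are still counted in the background when testing enrichment.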