Question

Enrichment analysis with a greater number of genes than differential expression analysis

0

Entering edit mode

3.3 years ago

rnaka09 ▴ 30

Hello guys, I'm analyzing the differential expression genes in 4 groups, with data from RNAseq. I made this analysis from the "Deseq2" package and this is an result example of 1 group:

#out of 20172 with nonzero total read count
#adjusted p-value < 0.1
#LFC > 0 (up)       : 20, 0.099%
#LFC < 0 (down)     : 2, 0.0099%
#outliers [1]       : 15, 0.074%
#low counts [2]     : 9575, 47%
#(mean count < 41)

After that, I went to do the enrichment analysis with the "goseq" package, with the following commands:

#Nelore - atividade
all_genes_N_atv=row.names(res_N_atv)
de_genes_N_atv=all_genes_N_atv[which(res_N_atv$padj<=0.1)]
de_N_atv_vector=c(t(de_genes_N_atv))
all_genes_N_vector_atv=c(t(all_genes_N_atv))
gene.vector.N.atv=as.integer(all_genes_N_vector_atv%in%de_N_atv_vector)
names(gene.vector.N.atv)=all_genes_N_vector_atv
table(gene.vector.N.atv)

pwf_N_atv=nullp(gene.vector.N.atv, "bosTau4",'ensGene')
go_N_atv=goseq(pwf_N_atv,'bosTau4','ensGene')
head(go_N_atv)
enriched_N_atv=go_N_atv$category[go_N_atv$over_represented_pvalue<=0.1]

capture.output(for(go in enriched_N_atv) {
  print(GOTERM[[go]])
  cat("--------------------------\n")
}, file = "enriched_N_atv.txt")

First, the organisms supported by this package list 5 genomes of the organism I study (beef cattle). The reference genome I used was more current, ARS-UCD1.2, obtained from Ensembl. However, I don't know if I really chose the right Genome for the analysis, because I didn't find which of the goseq options is the one I need, so I used "bosTau4" and "ensGene".

Second, enrichment analysis shows me more genes than what I got from differential expression analysis:

> length(enriched_N_atv)
[1] 257

> table(gene.vector.N.atv)
gene.vector.N.atv
    0     1 
27585    22

What could have happened?

goseq R RNA-Seq • 866 views

ADD COMMENT • link 3.2 years ago by rnaka09 ▴ 30

1

Entering edit mode

The goseq output should give you the gene sets, which are significantly enriched in your differntially expressed genes, so length(enriched_N_atv) should give you the number of significant gene sets, not genes? Can you show the content of enriched_N_atv?

The advantage of GOseq is, that it is able to correct for the bias in the different length of genes of the different GO-terms. So since you already use GOseq, you could include this bias correction in the analysis during execution of the nullp function.

ADD REPLY • link 3.2 years ago by caggtaagtat ★ 1.9k

0

Entering edit mode

Hello!

Thanks for the info!

enriched_N_atv has:

> summary(enriched_N_atv)
   Length     Class      Mode 
      257 character character

> str(enriched_N_atv)
 chr [1:257] "GO:0001802" "GO:0001803" "GO:0001805" "GO:0001810" "GO:0001812" "GO:0001820" ...

ADD REPLY • link 3.2 years ago by rnaka09 ▴ 30

1

Entering edit mode

Hi, so as you see, the result object enriched_N_atv holds not gene names, but gene ontology term IDs, like "GO:0001803". A gene ontology term is basically a list of genes, which belong for example to a certain biological process. In case of "GO:0001803", its longer name is "regulation of type III hypersensitivity", so it holds genes, which were in the past described to play a role in this process. Using the GOseq package correctly, this result would tell you, that genes e.g. from the gene list "GO:0001803" were significantly enriched in your list of significantly expressed genes. It seems weird, though, that you get 257 differential GO terms with only 22 differentially expressed genes.