Question

What is the difference between this Biomart Code, Org.Hs.Db code and SQL code?

5.6 years ago

sinifdosyalari12h ▴ 20

My aim is to get all the genes annotated to a Gene Ontology (GO) term as Entrez IDs. I currently have three different solutions that achieve this. Below are my example solutions for human and GO:0005634 (nucleus).

Biomart

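library(biomaRt)
# Connect to the human gene dataset of the Ensembl BioMart
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Entrez IDs of the genes annotated to GO:0005634 (nucleus).
# Note: newer Ensembl releases name this attribute 'entrezgene_id'.
gene.data <- getBM(attributes = c('entrezgene'),
                   filters = 'go',
                   values = "GO:0005634",
                   mart = ensembl)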

org.Hs.eg.db

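library(org.Hs.eg.db)
# org.Hs.egGO2ALLEGS maps a GO ID to the Entrez IDs annotated to that term or to any of its child terms
gene_list <- data.frame(mget("GO:0005634", org.Hs.egGO2ALLEGS)[[1]])
print(gene_list)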

Running an SQL query on the GO servers

SELECT
  gene_product.symbol AS gp_symbol
FROM term
  INNER JOIN association ON (term.id=association.term_id)
  INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN db ON (association.source_db_id=db.id)
WHERE
  term.acc = 'GO:0005634'
  AND species.ncbi_taxa_id = 9606;

You can try running the same query at this link. The first two solutions give me Entrez IDs, but the last one gives gene symbols, and I think there is no way to get Entrez IDs directly from the Gene Ontology database (please correct me if I am wrong). So I use the mygene library in Python to convert the gene symbols to Entrez IDs, searching for each symbol in both the symbol scope and the alias scope.

When I compare the Entrez gene IDs obtained from the three solutions with each other, I get this:

[Figure: Venn diagram comparing the Entrez gene ID sets returned by the three solutions]
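A comparison like this can also be tabulated with base R set operations, which makes the size of each overlap explicit. The following is only a minimal sketch: the vector names `biomart_ids`, `orgdb_ids` and `go_sql_ids` are hypothetical and the values are made-up toy IDs, not real results from the three solutions.

```r
# Toy Entrez ID vectors standing in for the three result sets (made-up values).
biomart_ids <- c(1, 2, 3, 4)              # numeric, as a database query might return
orgdb_ids   <- c("2", "3", "4", "5", "6")
go_sql_ids  <- c("3", "4", "6", "7")

# Coerce everything to character so all three sets have the same type.
id_sets <- lapply(
  list(biomart = biomart_ids, orgdb = orgdb_ids, go_sql = go_sql_ids),
  as.character
)

# Pairwise overlap matrix: entry [i, j] is the number of IDs shared by
# sets i and j; the diagonal holds the size of each set.
overlap <- sapply(id_sets, function(a) {
  sapply(id_sets, function(b) length(intersect(a, b)))
})
print(overlap)

# IDs returned by all three methods.
in_all <- Reduce(intersect, id_sets)

# IDs returned only by the org.Hs.eg.db-style set.
only_orgdb <- setdiff(id_sets$orgdb, union(id_sets$biomart, id_sets$go_sql))
```

Inspecting the IDs that are unique to one set (for example `only_orgdb`) is usually more informative than the diagram alone, because those genes can be looked up individually in each source.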

So my question is:

Why do these return such different results?

Another problem that I have is:

Converting all gene symbols into gene IDs

Using the mygene Python library for human and nucleus, I am able to get 4955 Entrez gene IDs, and I am left with 980 gene symbols that could not be converted into Entrez IDs. Below are 6 of the gene symbols that the mygene library is not able to convert:

'A2RUA4', 'B3KY84', 'ENSP00000368480', 'OTTHUMP00000081030', 'Q14547', 'XP_933608'
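The six identifiers listed above have the shape of protein accessions from several databases rather than of gene symbols, which may be why a lookup by symbol or alias fails for them. As a rough sketch, the base R function below guesses the namespace of an identifier from its pattern. The function name `classify_id` and its labels are hypothetical, and the guess is purely pattern-based: a real gene symbol such as B3GAT1 also fits the UniProt accession pattern, so the result should be treated as a hint only.

```r
# Guess the likely namespace of each identifier from its shape alone.
# This is a heuristic sketch, not a validated parser.
classify_id <- function(ids) {
  # UniProt accession format (6 or 10 characters).
  uniprot <- "^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"

  out <- rep("other (possibly a gene symbol)", length(ids))
  out[grepl(uniprot, ids)] <- "UniProt accession"
  out[grepl("^ENSP[0-9]+$", ids)] <- "Ensembl protein ID"
  out[grepl("^OTTHUMP[0-9]+$", ids)] <- "Havana/Vega protein ID"
  # RefSeq protein accessions, with an optional version suffix such as ".1".
  out[grepl("^[NXY]P_[0-9]+(\\.[0-9]+)?$", ids)] <- "RefSeq protein accession"

  names(out) <- ids
  out
}

example_ids <- c("A2RUA4", "B3KY84", "ENSP00000368480",
                 "OTTHUMP00000081030", "Q14547", "XP_933608", "TP53")
print(classify_id(example_ids))
```

Grouping the unconverted entries this way would show which ones need to be queried in a protein accession scope instead of a symbol scope.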

I mentioned more about that problem in this link but couldn't reach a conclusion.

Any help on my problems would be appreciated and I am also open to new solutions.

Tags: R, biomart, sql, gene ontology, entrez

Updated 5.6 years ago by Jean-Karim Heriche • written 5.6 years ago by sinifdosyalari12h

tagging: Mike Smith

Comment by GenoMax, 5.6 years ago

score 2 · Answer 1 · 2018-09-06

The first one uses Ensembl's BioMart, whereas the third one queries the GO MySQL database directly. Ensembl has a different set of annotations than GO. Ensembl used to produce its own annotations; it may now take them from GO, but with a time lag, which would mean an older version than the current GO. org.Hs.eg.db also uses a specific version of GO (in the current version of the package, GO from 2018-03-28).
In short: different databases = different results.