What is the difference between this biomaRt code, org.Hs.eg.db code and SQL code?
5.6 years ago

My aim is to get all the genes annotated to a Gene Ontology (GO) term as Entrez Gene IDs. I currently have 3 different solutions that do this. Below are my example solutions for human and GO:0005634 (nucleus).

Biomart

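library(biomaRt)
# Connect to the Ensembl mart for human genes
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Note: newer Ensembl releases name this attribute 'entrezgene_id'
gene.data <- getBM(attributes = c('entrezgene'),
                   filters = 'go',
                   values = "GO:0005634",
                   mart = ensembl)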

org.Hs.eg.db

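library(org.Hs.eg.db)
# org.Hs.egGO2ALLEGS maps a GO ID to the Entrez IDs annotated to that term
# or to any of its child terms
gene_list <- data.frame(mget("GO:0005634", org.Hs.egGO2ALLEGS)[[1]])
print(gene_list)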

Running an SQL query on the GO servers

 SELECT
 gene_product.symbol AS gp_symbol
 FROM term
 INNER JOIN association ON (term.id=association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
 INNER JOIN species ON (gene_product.species_id=species.id)
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN db ON (association.source_db_id=db.id)
 WHERE
 term.acc = 'GO:0005634'
 AND
 species.ncbi_taxa_id='9606';

You can try running the same query at this link. The first two solutions return Entrez IDs, but the last one returns gene symbols, and I think there is no way to get Entrez IDs directly from the Gene Ontology database (please correct me if I am wrong). So I use the mygene library in Python to convert the gene symbols to Entrez IDs, searching for each symbol in both the symbol scope and the alias scope.

When I compare the three sets of Entrez gene IDs with each other, I get this:

[Venn diagram of the Entrez gene ID sets returned by the three methods]
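A comparison like this can also be expressed as plain numbers, which makes it easier to see where the sets diverge. The following is a minimal sketch, not the poster's actual code: the function name `compare_id_sets` and the toy ID lists are assumptions made for illustration, and the values are not results from any of the three queries. It normalizes every ID to a string first, because a mismatch between integer and string IDs, or missing values coming back from one source, can make the overlap look smaller than it is.

```python
from itertools import combinations


def compare_id_sets(named_sets):
    """Return pairwise overlap statistics for several collections of gene IDs.

    IDs are converted to stripped strings and missing/empty values are dropped,
    so 7157 and "7157" count as the same ID.
    """
    normalized = {
        name: {str(x).strip() for x in ids if x is not None and str(x).strip()}
        for name, ids in named_sets.items()
    }
    report = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(normalized.items(), 2):
        shared = ids_a & ids_b
        union = ids_a | ids_b
        report[(name_a, name_b)] = {
            "shared": len(shared),
            "only_first": len(ids_a - ids_b),
            "only_second": len(ids_b - ids_a),
            "jaccard": len(shared) / len(union) if union else 0.0,
        }
    return report


# Toy inputs for illustration only (hypothetical, not real query output).
example = {
    "biomart": [7157, 672, None, "1956"],
    "orgdb": ["7157", "672", "5290"],
    "go_sql": ["7157", "1956"],
}
overlap_report = compare_id_sets(example)

for pair, stats in overlap_report.items():
    print(pair, stats)
```

Looking at the `only_first` and `only_second` counts per pair shows which source contributes the extra genes, which is a useful first step before asking why the databases disagree.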

So my question is:

Why do these return such different results?

Another problem that I have is:

Converting all gene symbols into gene IDs

Using the mygene Python library for human and the nucleus term, I can obtain 4955 Entrez gene IDs, which leaves 980 gene symbols that could not be converted. Below are 6 of the gene symbols that mygene cannot convert into Entrez IDs:

'A2RUA4', 'B3KY84', 'ENSP00000368480', 'OTTHUMP00000081030', 'Q14547', 'XP_933608'
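The six examples above do not look like gene symbols: they appear to follow the formats of UniProt accessions, Ensembl protein IDs, Havana (OTTHUMP) protein IDs and RefSeq protein accessions. That would explain why a lookup restricted to the symbol and alias scopes fails for them. The sketch below sorts unconverted identifiers by apparent type so that each group could be retried against a more suitable scope. The function name `classify_identifier`, the labels, and the regular expressions are assumptions for illustration; the patterns are heuristics, and which mygene scopes match each type should be checked against the MyGene.info documentation.

```python
import re

# Heuristic patterns (assumptions for illustration). The UniProt pattern
# follows the accession format that UniProt documents.
PATTERNS = [
    ("uniprot", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("ensembl_protein", re.compile(r"^ENSP\d{11}$")),
    ("havana_protein", re.compile(r"^OTTHUMP\d{11}$")),
    ("refseq_protein", re.compile(r"^[NXY]P_\d+(?:\.\d+)?$")),
]


def classify_identifier(identifier):
    """Return a label for the first pattern the identifier matches."""
    for label, pattern in PATTERNS:
        if pattern.match(identifier):
            return label
    # Anything else may be a real gene symbol or an unrecognized format.
    return "symbol_or_unknown"


# The six unconverted identifiers quoted in the post.
unconverted = [
    "A2RUA4", "B3KY84", "ENSP00000368480",
    "OTTHUMP00000081030", "Q14547", "XP_933608",
]
classified = {ident: classify_identifier(ident) for ident in unconverted}

for ident, label in classified.items():
    print(ident, "->", label)
```

Grouping the 980 leftovers this way would show how many are protein-level accessions rather than symbols, and therefore how many might still be convertible through a different identifier type.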

I described that problem in more detail at this link but couldn't reach a conclusion.

Any help on my problems would be appreciated and I am also open to new solutions.

R biomart sql gene ontology entrez • 1.7k views

tagging: Mike Smith

5.6 years ago

The first solution queries Ensembl's BioMart, whereas the third queries the GO MySQL database directly. Ensembl's annotation set differs from GO's: Ensembl used to produce its own annotations, and although it may now take them from GO, it would do so with a time lag, so it reflects an older release than the current GO. org.Hs.eg.db is also tied to a specific GO release (in the current version of the package, GO from 2018-03-28).
In short: different databases = different results.
