Convert RefseqID to EntrezID
14 months ago
tomoya • 0

Hi, I have a set of genes with RefSeq IDs (e.g. XM_020713141.1) and I want to convert them to Entrez IDs (e.g. 101165603) for further analysis. I found a similar question that said clusterProfiler is suitable for this purpose: [GeneBank accession 2 Entrez gene id ][1]. However, I work with medaka (Oryzias latipes), and when I looked for a medaka annotation database among the Bioconductor annotation packages, I found packages only for the major species. Is there a way to access the NCBI medaka annotation database to convert the IDs? Or could you suggest another method to solve this problem? I would be grateful for any help.

gene R • 358 views

I am not sure, but you can try converting them with DAVID or use Ensembl: http://grch37.ensembl.org/index.html


Sorry, I don't think it has Oryzias latipes.


Maybe UniProt Retrieve/ID mapping, which is user-friendly, could help: https://www.uniprot.org/uploadlists/ You can submit a list of RefSeq IDs, add an Entrez column to the output table, and then download it.


Moving this to a comment. Once you select RefSeq id as input, the only output option is UniProtKB id. So this may require two passes if it works at all.


Thank you for the many suggestions! They were very useful, and I successfully got almost all of the Entrez IDs using biomaRt.

However, I still have some questions. Although I got almost all of the Entrez IDs, some are missing (the results show NA), for example XR_002293119.2, XM_004081009.3 and XM_023961859.1. But when I search for them on the NCBI website, I find that their Entrez IDs are 101158738, 101170377 and 101155047.

I also tried changing the attribute from entrezid to wikigene_id, but the results were the same (all NA). Do you think this is due to a difference in database versions, and is there a way to obtain these Entrez IDs?


Since you are interested in Entrez IDs and starting with RefSeq accessions, why not use an NCBI tool? EDirect works fine for this.

printf 'XR_002293119.2\nXM_004081009.3\nXM_023961859.1' \
    | epost -db nuccore -format acc \
    | elink -db nuccore -target gene -name nuccore_gene \
    | esummary -format uid
101170377
101158738
101155047

I think it is because those genes are not part of the current Ensembl release (so either you wait for an update or use vkkodali's method): http://www.ensembl.org/Multi/Search/Results?q=XM_023961859


Thank you both very much for your comments. I understand that this is because these genes are not included in the current Ensembl release, and that EDirect can solve this.

Thanks to vkkodali's comment, I noticed that if I want to do the same thing from R, I can use reutils or rentrez. Based on the command above, I tried the following:

library(reutils)
multiple.ids <- c("XR_002293119.2","XM_004081009.3","XM_023961859.1")
refseq <- epost(multiple.ids, "nuccore")
refseq2 <- elink(refseq, dbFrom = "nuccore", dbTo = "gene", linkname = "nuccore_gene")
esummary(refseq2)

But I can't obtain the Entrez IDs as in the output above.

I was wondering if you could help me again. Thank you.


Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

esummary(refseq2)

You have to specifically ask for esummary -format uid. I am not sure how you do that in R.


Sorry for posting twice. I see now that I need to reply like this.

I see. I need to specify the format, but I am still struggling with how to specify "uid" in reutils.


Have you checked to see what is in refseq2?

Edit: I see

> head(refseq2)
List of linked UIDs from database ‘nuccore’ to ‘gene’.
[1] "101170377" "101158738" "101155047"

Oh, I already had the results. Thank you for pointing that out!

Sorry for asking so many times, but I have one more question. The order of the outputs is not the same as the order of the inputs. I'd like to either keep the outputs in input order or extract both IDs (RefSeq ID and Entrez ID) side by side, so that I can tell which RefSeq ID is linked to which Entrez ID. I thought the "correspondence" option would keep the order, but it doesn't work.
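
One way to keep each RefSeq accession paired with its Entrez Gene ID is to run the lookup once per accession instead of posting the whole list at once. The sketch below is only an illustration, written in shell with EDirect rather than reutils. It reuses the epost/elink/esummary pipeline posted earlier in this thread; the function names `lookup_gene_id` and `map_accessions` are made up for this example. It assumes EDirect is installed and on the PATH, and it will be slower than a batch query because every accession costs its own round trip to NCBI.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the Gene ID(s) linked to one RefSeq accession,
# using the same EDirect pipeline as the batch command in this thread.
lookup_gene_id() {
    printf '%s\n' "$1" \
        | epost -db nuccore -format acc \
        | elink -db nuccore -target gene -name nuccore_gene \
        | esummary -format uid
}

# Hypothetical helper: read accessions from stdin (one per line) and print
# "accession<TAB>gene_id" in input order. Several linked genes are joined
# with commas; accessions with no linked gene are reported as NA.
map_accessions() {
    local acc gene_ids
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue
        # </dev/null keeps the lookup from consuming the loop's input
        gene_ids=$(lookup_gene_id "$acc" < /dev/null | paste -sd ',' -)
        printf '%s\t%s\n' "$acc" "${gene_ids:-NA}"
    done
}
```

For example, `printf 'XR_002293119.2\nXM_004081009.3\nXM_023961859.1\n' | map_accessions > refseq_to_entrez.tsv` would write a two-column table (the output file name is arbitrary) that can be read back into R with `read.delim`.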

14 months ago
benformatics • 870
ETH Zurich

This assumes the IDs you have are all RefSeq predicted mRNAs (e.g. XM_####).

R solution:

library(biomaRt)

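## connect to the Ensembl medaka (Oryzias latipes) gene dataset
mart <- useMart("ENSEMBL_MART_ENSEMBL",dataset="olatipes_gene_ensembl",host="www.ensembl.org")
## newer Ensembl releases may name this attribute 'entrezgene_id'; check listAttributes(mart) if 'entrezgene' is not found
BM.info <- getBM(attributes=c('entrezgene','refseq_mrna_predicted'),mart = mart)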

## make a function to strip the trailing version suffix (e.g. ".1" or ".13") from the accessions
trim.numbers <- function(name){ sub("\\.[0-9]+$","",name) }

## match your trimmed refseq IDs to the dataframe and pull out the corresponding entrez id - example below
BM.info$entrezgene[match(trim.numbers('XM_020713141.1'),BM.info$refseq_mrna_predicted)]
[1] 101165603
## how it can be used with multiple ids...
## select ids
multiple.ids <- c("XM_020704464","XM_011491436","XM_020702270","XM_023957409","XM_011476326")
## find entrez ids
BM.info$entrezgene[match(trim.numbers(multiple.ids),BM.info$refseq_mrna_predicted)]
[1] 101165143 101173426 101155179 101167210 101162526
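
The version-trimming step can also be done on the command line before the accessions reach R. This is a minimal sketch, assuming one accession per line on stdin; `strip_version` is a name made up here. Anchoring the pattern at the end of the line and matching one or more digits means a multi-digit version such as `.13` is removed whole, and accessions without a version pass through unchanged.

```shell
# Hypothetical helper: drop a trailing ".<digits>" version suffix
# from each accession read on stdin.
strip_version() {
    sed -E 's/\.[0-9]+$//'
}
```

For example, `printf 'XM_020713141.1\n' | strip_version` prints `XM_020713141`.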
14 months ago
vkkodali ♦ 1.1k
United States

Point-and-click

  1. Go to Batch Entrez and upload your list of RefSeq accessions. Choose 'Nucleotide' as the database. Click the 'Retrieve' button.
  2. Once you are on the results page, you will find the 'Find related data' widget on the right-hand side. From the drop-down list, choose 'Gene'. Click the 'Find Items' button.
  3. If you just want the list of the unique identifiers, use the 'Send To' menu on the top right corner and choose 'UI List' as the format.

EDirect

Check out bit.ly/entrez-direct for more information. The command to use here would be this:

epost -db nuccore -input <input_file> -format acc \
    | elink -db nuccore -target gene -name nuccore_gene \
    | esummary -format uid

If you need to do this using R, you may want to check out packages such as reutils and rentrez.
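
Accession lists exported from a spreadsheet can carry Windows line endings, stray whitespace, blank lines, or duplicates, any of which may confuse a batch upload. The sketch below is one possible way to tidy such a file before handing it to epost. It assumes a plain-text file with one accession per line; `clean_accessions` and the file names in the usage note are hypothetical.

```shell
# Hypothetical helper: normalise an accession list read on stdin.
# - removes carriage returns left by Windows line endings
# - trims leading and trailing spaces or tabs
# - drops empty lines and repeated accessions, keeping first-seen order
clean_accessions() {
    tr -d '\r' \
        | awk '{ gsub(/^[ \t]+|[ \t]+$/, "") } length($0) > 0 && !seen[$0]++'
}
```

A possible use is `clean_accessions < raw_ids.txt > ids.txt`, followed by the EDirect command above with `ids.txt` as the input file.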
