How to retrieve nucleotide sequences from gene IDs of NCBI's "Gene" database?
6.3 years ago
john ▴ 130

Hello people

I would like to retrieve the nucleotide sequences for a set of gene entries in the NCBI "Gene" database.

As an example, I would like to retrieve the sequences of all genes returned by this query:

"txid511145[Organism:noexp]"

URL: https://www.ncbi.nlm.nih.gov/gene/?term=txid511145%5BOrganism%3Anoexp%5D

The only way I have found so far is to download the full genome to which the genes refer and extract each sequence locally according to its start position and length. Is there a better way?
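
The local approach described above can be sketched in shell. This is only a minimal sketch: `extract_region` is a hypothetical helper name, the genome is assumed to be a FASTA file whose first record is the chromosome, and coordinates are assumed to be 1-based and inclusive.

```shell
# extract_region FASTA START END [STRAND]
# Hypothetical helper: prints positions START..END (1-based, inclusive) of the
# first record in FASTA. STRAND is "plus" (default) or "minus"; "minus" gives
# the reverse complement, as needed for genes annotated as "complement".
extract_region() {
    awk -v s="$2" -v e="$3" -v strand="${4:-plus}" '
        BEGIN {
            # Complement table; any other symbol (e.g. N) is passed through.
            split("A T C G G C T A a t c g g c t a", p, " ")
            for (i = 1; i <= 16; i += 2) comp[p[i]] = p[i + 1]
        }
        /^>/   { n++; next }          # header line: start of a new record
        n == 1 { seq = seq $0 }       # join wrapped lines of the first record
        END {
            region = substr(seq, s, e - s + 1)
            if (strand == "minus") {
                rc = ""
                for (i = length(region); i >= 1; i--) {
                    b = substr(region, i, 1)
                    rc = rc ((b in comp) ? comp[b] : b)
                }
                region = rc
            }
            print region
        }
    ' "$1"
}
```

For example, `extract_region genome.fasta 1000 1500 minus` would print the reverse-complemented slice for a gene on the complement strand (`genome.fasta` and the coordinates are placeholders).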

Thanks

gene id ncbi • 3.5k views

I do, actually, but I can't figure out how.


What query did you try?


The query and the URL are both in my question.


I meant an esearch/eutils query. Check the link and try to build an eUtils query.


Sorry, my bad! What I meant was to use the Entrez Direct command-line tools: https://www.ncbi.nlm.nih.gov/books/NBK179288/

With them you can build complex queries and chain them, so that the results of one query are piped into the next. See if that helps. There is also a YouTube video about it.


Oh, that's a nice one; I hadn't found it. Until now I have used the R package rentrez for the calls, but this one seems to be more powerful. Unfortunately, I still do not get what I want. Here is my query:

esearch -db gene -query 'txid511145[Organism:noexp]' | efetch -format fasta

But this again just returns the entries of the gene db. Example:

ID: 945651
99. dnaC
DNA biosynthesis protein [Escherichia coli str. K-12 substr. MG1655]
Other Aliases: b4361, ECK4351, JW4325, dnaD
Annotation: NC_000913.3 (4600238..4600975, complement)

I could use the last line to extract the corresponding FASTA sequence locally, but I would like to know whether the NCBI server can do this for me.
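
One way to let the server do the slicing is to turn each Annotation line into an efetch call with a sequence range. The sketch below only prints the commands rather than running them. `annotations_to_efetch` is a hypothetical helper name, and it assumes every Annotation line has the shape shown in the example above (accession, then `(start..stop)` with an optional `complement`). It relies on the `-seq_start`, `-seq_stop` and `-strand` options of efetch.

```shell
# annotations_to_efetch: reads gene records on stdin and prints one efetch
# command per "Annotation:" line (hypothetical helper, prints only, no network).
annotations_to_efetch() {
    awk '
        /^Annotation:/ {
            acc = $2                          # e.g. NC_000913.3
            coords = $3                       # e.g. "(4600238..4600975,"
            gsub(/[(),]/, "", coords)         # strip brackets and comma
            split(coords, pos, /\.\./)        # pos[1] = start, pos[2] = stop
            strand = ($0 ~ /complement/) ? 2 : 1   # 2 = reverse complement
            printf "efetch -db nuccore -id %s -seq_start %s -seq_stop %s -strand %d -format fasta\n", acc, pos[1], pos[2], strand
        }'
}
```

Piping the printed lines to `sh` would run them, one request per gene, which is slow for several thousand genes, so this is probably only practical for small gene sets.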

6.3 years ago
Joseph Hughes ★ 3.0k

I think something like this, using the elink function of the E-utilities, should work:

esearch -db gene -query 'txid511145[Organism:noexp]' | elink -target nuccore | efetch -format fasta

Since you know the accession number of your genome, you are much better off starting from that. The following retrieves all coding sequences for the reference genome:

esearch -db nuccore -query 'NC_000913.3' | efetch -format fasta_cds_na

This gives a total of 4319 sequences. NC_000913.2 is an older version of the same accession.
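
A quick way to check such a count is to count the FASTA header lines in the output. The helpers below are a minimal sketch with hypothetical names; they read FASTA from stdin and assume every record starts with a `>` header line.

```shell
# count_fasta_records: prints the number of FASTA records read from stdin.
count_fasta_records() {
    grep -c '^>'
}

# fasta_headers: prints the header of each record, without the leading ">".
fasta_headers() {
    sed -n 's/^>//p'
}
```

For example, appending `| count_fasta_records` to the efetch pipeline above would print the number of sequences returned, and `| fasta_headers` would list what each entry is.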


Oh, that looked so good, but the result is really not what I hoped for. The returned FASTA has just 38 entries, whereas this E. coli strain should have 4516 genes. One of the entries is also the whole genome. I don't really know what these results refer to anyway, since all the genes map to only two entries in the nuccore db: "NC_000913.3" and "NC_000913.2".


Does your starting point have to be the taxid? The problem with starting from a taxid is that it is not very precise. It sounds like you know the two full reference genomes that you want to extract genes from, so why not start from those accession numbers?


That works for me. Yes, the starting point is the organism, hence the txid, but that's okay. I will check which genome is the best one and work with that from here on. Thanks!
