Question

biopython esearch not giving all children taxIDs

6.6 years ago

yarmda ▴ 40

I know similar questions have been posted before:

How To Retrieve All Sequences, From Ncbi, That Belong To A Specific Txid And Its Sub Txids?

C: Refseq Proteins For A Given Taxid

But I am having trouble retrieving the sequences for the child taxa of a given taxID.

For instance,

from Bio import Entrez
record = Entrez.read(Entrez.esearch(db='protein', term="txid1392[Organism]"))
record['IdList']

This returns just one list of protein UIDs for the Bacillus anthracis species (taxID 1392), not a list for each organism below this taxon. Thus, Entrez.efetch only returns one set of protein sequences.

Dropping the [Organism] doesn't change this behavior. Am I missing something?

biopython ncbi entrez • 1.8k views

updated 5.5 years ago by tsrmhathesh • 0 • written 6.6 years ago by yarmda ▴ 40

Unfortunately, I think you might have to list all the child taxon identifiers explicitly - but try exploring the web interface for building an advanced query first in case that shows a better solution.

6.6 years ago by Peter 6.0k

Actually, I think it may have been an issue with a default retmax of 20.

6.6 years ago by yarmda ▴ 40

Oh good. I should have tried the example myself really to confirm my hunch. Thanks!

6.6 years ago by Peter 6.0k

Hi, I am currently learning Biopython. Does it still have influence and importance in industry these days?

5.5 years ago by tsrmhathesh • 0

Hi, this comment is not appropriate to this (very old) thread.

If you wish to ask a question, please create your own thread. If you do, I strongly encourage you to search the forum first (since questions like this are asked often - and are of dubious usefulness). If you cannot find something that satisfies you, ask a question, but please add much more information and detail and make the question as specific as possible.

5.5 years ago by Joe 21k

score 2 · Accepted Answer · 2017-09-21

Your code is fetching only the UIDs from the first page of results (esearch returns 20 by default). You need to pass the retmax parameter to fetch all records. See the corrected code below:

 from Bio import Entrez
 record = Entrez.read(Entrez.esearch(db='protein', retmax=770094, term="txid1392[Organism]"))
 record['IdList']

I set retmax=770094 because this taxon had 770094 protein records at the time of writing. The current total is reported in record['Count'].
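If you would rather not hard-code the record count, here is a rough sketch that reads the total from the Count field and pages through the results with retstart. The helper names (esearch_page, all_ids), the page size, and the email address are my own placeholders, not anything from the thread. As far as I know, NCBI documents a cap of 10,000 UIDs per esearch request, so paging may be needed anyway for a taxon this large; treat that limit as an assumption and check the current E-utilities documentation.

```python
def esearch_page(term, retstart, retmax, db="protein"):
    """Run one esearch request and return the parsed result."""
    # Imported here so the paging logic below can be exercised offline.
    from Bio import Entrez

    # Placeholder address; NCBI asks callers to supply a real contact email.
    Entrez.email = "your.name@example.org"
    handle = Entrez.esearch(db=db, term=term, retstart=retstart, retmax=retmax)
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def all_ids(term, page_size=10000, search=esearch_page):
    """Collect every UID matching `term` by requesting successive pages."""
    ids = []
    total = None
    while total is None or len(ids) < total:
        result = search(term, retstart=len(ids), retmax=page_size)
        total = int(result["Count"])  # total number of matching records
        page = list(result["IdList"])
        if not page:  # stop if the server returns an empty page
            break
        ids.extend(page)
    return ids
```

Calling all_ids("txid1392[Organism]") should then return the full UID list, at the cost of one request per page. The search argument is only there so the paging loop can be checked with a stand-in function instead of a live NCBI call.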

Alternatively,

You can also get the list of sequences for txid1392[Organism] from the NCBI website.

Go to the NCBI Entrez search, type txid1392[Organism], and choose the Protein database from the dropdown list (this will fetch all protein sequences for txid1392[Organism]).
Click the "Send to" button and choose "File" (you can choose FASTA format for downloading the sequences).
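
The same FASTA download can probably be scripted with Entrez.efetch once you have the UID list. This is only a sketch: the function names, the batch size of 200, the output path, and the email address are assumptions of mine, and for very large sets NCBI's history server (usehistory="y" on esearch) may be a better fit.

```python
def batches(ids, size):
    """Yield consecutive slices of `ids`, each at most `size` items long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def download_fasta(ids, out_path, batch_size=200, db="protein"):
    """Fetch FASTA records for `ids` in batches and append them to one file."""
    # Imported here so the batching helper above can be used without Biopython.
    from Bio import Entrez

    # Placeholder address; NCBI asks callers to supply a real contact email.
    Entrez.email = "your.name@example.org"
    with open(out_path, "w") as out:
        for batch in batches(ids, batch_size):
            handle = Entrez.efetch(
                db=db, id=",".join(batch), rettype="fasta", retmode="text"
            )
            try:
                out.write(handle.read())
            finally:
                handle.close()
```

For example, download_fasta(record['IdList'], "txid1392_proteins.fasta") would write the sequences for the UIDs returned by the search above. Batching keeps each request small, which seems kinder to the NCBI servers than one enormous efetch call.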