Question

How To Retrieve Genbank Records With Range Of Accession Numbers

6

13.5 years ago

Daniel Standage 4.1k

A publication I was reading provided two ranges of GenBank accession numbers for supplementary data.

The ESTs from GR_Ea and GR_Eb were deposited in GenBank under accession nos. CO069431–CO100583 and CO100584–CO132899.

If I search GenBank by a single accession number, I have no problem pulling up a record, but I obviously don't want to do this for thousands of EST records. Is there a way to provide a range of accession numbers (as above) and retrieve all of these records at once? I am using GenBank's web interface right now, but I wouldn't mind knowing how to do this on the command line as well.

Thanks!

genbank • 39k views

ADD COMMENT • link updated 24 months ago by cmdcolin ★ 3.8k • written 13.5 years ago by Daniel Standage 4.1k

Ram · Answer 1 · 2010-11-03

13

13.5 years ago

Rm 8.3k

Try this

http://www.ncbi.nlm.nih.gov/nucest?term=CO069431:CO100583[accn]

Or upload a file containing a list of accession numbers.

NCBI Batch download: http://www.ncbi.nlm.nih.gov/sites/batchentrez?db=Nucleotide

For ESTs, use db=Nucest:

http://www.ncbi.nlm.nih.gov/sites/batchentrez?db=Nucest
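Both options above can be prepared from the command line. The following is a minimal sketch, not an NCBI-provided tool: accession_range and accn_range_query are hypothetical helper names. The first prints one accession per line, which is the kind of file Batch Entrez accepts for upload; the second builds the same range term used in the URL above.

```shell
# Hypothetical helper: print every accession from PREFIX+START to PREFIX+END,
# zero-padded to WIDTH digits. Pass START and END without leading zeros,
# because the shell would otherwise try to read them as octal numbers.
accession_range() {
  prefix=$1 start=$2 end=$3 width=$4
  fmt="%s%0${width}d\n"
  i=$start
  while [ "$i" -le "$end" ]; do
    printf "$fmt" "$prefix" "$i"
    i=$((i + 1))
  done
}

# Hypothetical helper: build an Entrez range term such as
# CO069431:CO100583[accn], as used in the search URL above.
accn_range_query() {
  printf '%s:%s[accn]' "$1" "$2"
}

# Example use (not run here):
#   accession_range CO 69431 100583 6 > GR_Ea_ids.txt   # file for Batch Entrez
#   accession_range CO 100584 132899 6 > GR_Eb_ids.txt
```

If the EDirect tools mentioned elsewhere in this thread are installed, the range term could presumably be passed to esearch with -query and piped to efetch -format fasta; which database name to give -db (nucest in this answer, nucleotide in the later one) depends on where NCBI currently keeps the records.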

ADD COMMENT • link 13.5 years ago by Rm 8.3k

2

Yet another pearl from the sea of NCBI...

ADD REPLY • link 13.5 years ago by Khader Shameer 18k

1

Cool! I didn't know about this 'accn' field!

ADD REPLY • link 13.5 years ago by Pierre Lindenbaum 161k

1

Useful link: How To: Download a large, custom set of records from NCBI: http://www.ncbi.nlm.nih.gov/guide/howto/dwn-records/

ADD REPLY • link 13.5 years ago by Rm 8.3k

0

Great. This is what I was looking for. The filters are powerful...now I just need a reason to take the time to learn them!

ADD REPLY • link 13.5 years ago by Daniel Standage 4.1k

0

http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html#display_table

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.4 years ago by Rm 8.3k

Ram · Answer 2 · 2010-11-03

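You could try the following shell script (shown for your first range only):

# Fetch each accession in CO069431..CO100583 as FASTA, one request at a time.
j=69431
while [ "$j" -le 100583 ]
do
   acn=$(printf "CO%06d" "$j")
   # NCBI now serves the E-utilities over HTTPS only
   curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${acn}&rettype=fasta"
   sleep 0.4   # stay under NCBI's limit of 3 requests per second without an API key
   j=$((j+1))
done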

>gi|48738912|gb|CO069431.1|CO069431 GR__Ea26A01.r GR__Ea Gossypium raimondii cDNA clone GR__Ea26A01 3', mRNA sequence
GTGACCAGAGGCTACTTGATGCTAGCCTCTCGAGACCTCAGGCGTGCTAGAGCCGCAGCTCTCAACATCG
TCCCGACTTCACTGGTGCGGCAAAGGCCGTGGCCCTTGTACTCCCTACTCTCAAAGGCAAACTTAACGGC
ATCGCATTGCGTGTACCAACACCAAATGTGTCGGTGGTGGACCTAGTGGTCCAGGTTTCAAAGAAGACGT
TTGCTGAAGAGGTGAACGCTGCTTTCAAAGAGAGTGCAGAGAAAGAGCTACAGGGTATACTTTCAGTGTG
TGAAGAACCCCTCGTTTCAGTGGACTTCAGGTGCTCTGATGTGTCCTCCACCGTTGATGCATCACTCACC
ATGGTCATGGGAGATGACATGGTTAAGGTGATTGCTTGGTATGACAATGAGTGGGGCTACTCTCAAAGGG
TTGTGGATTTGGCTGACATTGTTGCCAATAGCTGGAAGTGATTTCAATGTGCTATACATACATATATGCA
TAACAATGTCACCGATGGTTGATTTTTGCATGCTCACTTCATTTTTATTCTTTCGGCTTCAGCAATTTCT
CATTTTGTCAAGGCTACTATATAATCTGTAATGTAATGTGGGATACATACATTCTCTAATATGCTTATGG
AATAAA

>gi|48738913|gb|CO069432.1|CO069432 GR__Ea26A02.f GR__Ea Gossypium raimondii cDNA clone GR__Ea26A02 5', mRNA sequence
AAAAAAAATTGGCCCTTTTTTTTAAAAAAAAGAGAAAAAGGGTCTTTGCCCCCAAAAAAAAAACCCCCCA
GGAATTTTTTCCCAAAATTCGGGGGACCCCCAAAAATTAAACAGGGAAATTGGCAATTTTACCCCCCCCC
CCCCCCCGGGGGGGGAAATTTAAGGGGAAAAAACCCAAAACAAAAGGGGGGCCCCCGGGTGGGGGGGGGA
CCCAATTCAGGACCCCCCCCCTCGGGGGGTCAAAAACCCGGGTTAAAAAACTTAAGAAACCCCTTTCCCA
GTTTCAGGGAAAATTTCTCCCCCCTTTTCGGGGGCTTCATTGGCTTTTTCAGCAGGGGGAAAGACATTTT
CCCATTCTTCCCTTCCAAAAAAAAACCCCGGCCCAAATTGGGGGGCCCCCCGCACCTGTCAAGGGGGGCA
CCAGGGGGCGGGCCCAGGGTTTCTTTAAAAAAAATGGGCAAAAAGGGGAAAGCTAATCCGGGCCCCCTAA
ACCCAAAAGCTTGTTTCCCTGGCCCCCC
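One request per accession means more than thirty thousand requests for the first range alone. The efetch id parameter also accepts a comma-separated list, so the loop can be batched. The sketch below is only an illustration under that assumption: accession_batches is a hypothetical helper name, and the batch size of 200 in the usage comment is an arbitrary choice.

```shell
# Hypothetical helper: print the accessions PREFIX+START..PREFIX+END
# (zero-padded to WIDTH digits) as comma-separated batches of at most
# SIZE accessions, one batch per line.
accession_batches() {
  prefix=$1 start=$2 end=$3 width=$4 size=$5
  fmt="%s%0${width}d"
  i=$start
  batch=""
  count=0
  while [ "$i" -le "$end" ]; do
    acc=$(printf "$fmt" "$prefix" "$i")
    batch="${batch:+$batch,}$acc"      # append with a comma unless batch is empty
    count=$((count + 1))
    if [ "$count" -eq "$size" ]; then
      printf '%s\n' "$batch"
      batch=""
      count=0
    fi
    i=$((i + 1))
  done
  # Emit the final, possibly shorter, batch.
  [ -z "$batch" ] || printf '%s\n' "$batch"
}

# Example use (not run here): one efetch request per batch, appended to one file.
#   accession_batches CO 69431 100583 6 200 | while IFS= read -r ids; do
#     curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${ids}&rettype=fasta" >> GR_Ea.fasta
#     sleep 0.4
#   done
```

Batching trades per-record output files for far fewer requests; the combined FASTA can be split afterwards if separate files are needed.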

score 1 · Answer 3 · 2017-05-09

1

6.9 years ago

cmdcolin ★ 3.8k

You can use the NCBI EDirect tools (brew install brewsci/bio/edirect) and run something like

cat file_with_ids.txt | while read -r p; do echo "$p"; esearch -db nucleotide -query "$p" | efetch -format fasta > "$p.fasta"; done

or, more simply,

cat file_with_ids.txt | while read -r p; do echo "$p"; efetch -db nucleotide -id "$p" -format fasta > "$p.fasta"; done

I mention both just because I have seen esearch piped to efetch in NCBI docs elsewhere, but if you already have the ID it seems easier to pass it directly to efetch.

Note that you might also need to install Mozilla::CA manually from CPAN, since the Homebrew formula doesn't seem to handle that properly.
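A loop like the ones above can misbehave if the ID file has Windows line endings, stray whitespace, or blank lines, or if a command inside the loop reads from the same standard input as the loop itself. The following is a hedged sketch of a more defensive variant: clean_ids and fetch_each are hypothetical names, and redirecting efetch's input from /dev/null is a precaution based on my understanding that EDirect commands may consume piped input.

```shell
# Hypothetical helper: print one cleaned ID per line from the given file,
# stripping carriage returns and surrounding whitespace and dropping
# empty lines.
clean_ids() {
  tr -d '\r' < "$1" |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Hypothetical helper: fetch each cleaned ID into its own FASTA file.
# efetch reads from /dev/null so it cannot swallow the remaining IDs.
fetch_each() {
  clean_ids "$1" | while IFS= read -r acc; do
    echo "$acc"
    efetch -db nucleotide -id "$acc" -format fasta < /dev/null > "$acc.fasta"
  done
}

# Example use (not run here):
#   fetch_each file_with_ids.txt
```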

ADD COMMENT • link 24 months ago by cmdcolin ★ 3.8k

1

Thanks for the command. It was very helpful!!

ADD REPLY • link 5.6 years ago by Prakki Rama ★ 2.7k

0

Hi Colin,

I provided the list of IDs in the text file, but it's not downloading the files. Where do I get the index file?

bash fetch.sh

Missing idxfile for option -i.

EFETCH - retrieve entries from sequence databases.

Synopsis: efetch -options [database:]<query>

Databases: SWissprot/SP, PIR, WOrmpep/WP, EMbl, GEnbank/GB, ProDom, ProSite

Options:
-a            Search with Accession number
-f            Fasta format output
-q            Sequence only output (one line)
-s <#>        Start at position #
-e <#>        Stop at position #
-o            More options and info...

-D <dir>      Specify database directory
-H            Display index header data
-p            Display entrynames in search path
-r            Print sequence in 'raw' format
-m            Fetch from mixed mini database
-M            Mini format output
-b            Do NOT reverse the order of bytes
                          (SunOS, IRIX do reverse, Alpha not)
-d <dbfile>   Specify database file (avoid this)
-i <idxfile>  Specify index file (avoid this)
-l <divfile>  Specify division lookup table (avoid this)
-B <database> Specify database (archaic)
-A            Only return entryname for accession number
-n <name>     Give the sequence this name
-x            Don't require query to match entry's name exactly (avoid)
-w            For Wormpep: also fetch cross-referenced SwissProt entry
-h            shows this help text

Environment:
SWDIR      = SwissProt directory - database and EMBL index files
PIRDIR     = PIR        -- " --
WORMDIR    = Wormpep    -- " --
EMBLDIR    = EMBL       -- " --
GBDIR      = Genbank    -- " --
PRODOMDIR  = ProDom     -- " --
PROSITEDIR = ProSite    -- " --
DBDIR      = User's own -- " -- (fasta format)

SEQDB      database file (default SwissProt)
SEQDBIDX   index file
DIVTABL    division lookup table

Ex. setenv DBDIR /pubseq/seqlibs/embl/

Note that Prodom family consensus seqs can be fetched by PD:_#

by Erik Sonnhammer (esr@sanger.ac.uk) Version 2.1,

ADD REPLY • link 24 months ago by sunnykevin97 ▴ 980

0

Hi @sunnykevin97, what command did you run (e.g. what is in fetch.sh)? Also, my post doesn't use -i; it uses -id.

ADD REPLY • link 24 months ago by cmdcolin ★ 3.8k

0

cat file_with_ids.txt | while read p; do echo $p; esearch -db nucleotide -query $p | efetch -format fasta > $p.fasta; done;

ADD REPLY • link 24 months ago by sunnykevin97 ▴ 980

0

I couldn't tell you without more info. Try to give as much info as possible when asking questions; this saves everyone time. For reference, this works for me:

esearch -db nucleotide -query CO069432.1 | efetch -format fasta

Fundamentally, all my command does is run that in a loop.

ADD REPLY • link 24 months ago by cmdcolin ★ 3.8k
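The usage text pasted earlier in this thread is signed by Erik Sonnhammer and lists local databases such as SwissProt and PIR, which suggests that a different program also named efetch is being found first on the PATH. One rough way to check is sketched below. efetch_next_to_esearch is a hypothetical helper, and the rule it applies (efetch should sit in the same directory as esearch) is only a heuristic, not an official EDirect check.

```shell
# Hypothetical helper: succeed only if the efetch found on PATH lives in the
# same directory as esearch, as one would expect when both come from a
# single EDirect installation.
efetch_next_to_esearch() {
  efetch_path=$(command -v efetch) || return 1
  esearch_path=$(command -v esearch) || return 1
  [ "$(dirname "$efetch_path")" = "$(dirname "$esearch_path")" ]
}

# Example use (not run here):
#   efetch_next_to_esearch || echo "check which efetch runs: $(command -v efetch)"
```

If the check fails, calling efetch by its full path inside the EDirect directory, or putting that directory earlier on the PATH, would be the obvious things to try.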

Ram · Answer 4 · 2010-11-03

Pretty much the same answer as in a previous question, Downloading Fasta Files

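# You could make an array of the IDs you need to fetch and loop over it.
use strict;
use warnings;
use Bio::DB::GenBank;

my $gb  = Bio::DB::GenBank->new();
my $seq = $gb->get_Seq_by_id('MUSIGHBA1'); # unique ID
# Start/end pairs; subseq() coordinates are 1-based and inclusive.
my @seqCoords = (
  [1, 100],
  [1000, 1100],
);
my $subseq = $seq->subseq($seqCoords[0][0], $seqCoords[0][1]);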
# then, look at the blast modules and SearchIO to see how to start blasting and parsing
# http://www.bioperl.org/wiki/HOWTOs