Biostar Beta. Not for public use.
Question: Fetching Genbank Entries For List Of Accession Numbers.
7
Entering edit mode

I have a looong list of accession numbers, for which I need to fetch genbank entries. Usually, I used to use epost-efetch workflow for long lists. But now I can not, because epost doesn't accept accession numbers as IDs.

So the question is:

a) is there a way of batch downloading using accession numbers? if not b) is there a way of mapping Accession number to normal ids?

ADD COMMENTlink 6.9 years ago Sanjarbek Hudaiberdiev • 90 • updated 3.7 years ago ifreecell • 170
Entering edit mode
0

can you give some example of your accession number? where is it from?

ADD REPLYlink 6.9 years ago
Leszek
4.0k
Entering edit mode
0

Ex: A22237,A22239,A32021,A32022,A33397 Those are accessions from NCBI. When you post them using epost, it gives this error: "IDs contain invalid characters which was treated as delimiters." So it appears to me that epost doesn't accept non-numeric characters for ID field. I tried to change the letters to their ascii codes, didn't help.

ADD REPLYlink 6.9 years ago
Sanjarbek Hudaiberdiev
• 90
Entering edit mode
0

One more thing that I noticed today: All the BioXXX libraries just stop when they get error from epost. But I noticed that along with error, epost still returns WebEnv and query_key. But what it does is that, it takes the accession number, trims out the non-numeric characters, and searches for resultant GID. So, A22237 turns to 22237. Don't know what to do. Such a tiny problem taking up lot's of time.

ADD REPLYlink 6.9 years ago
Sanjarbek Hudaiberdiev
• 90
8
Entering edit mode

Python 3 + BioPython with command-line interface. Basically, a less hardcoded and flexible version of Leszek's code, that doesn't fail when the list of accessions is so big, that GI retrieval results in HTTP Error 414: Request-URI Too Large

#! /usr/bin/env python3


import argparse
import sys
import os

import Bio.Entrez


RETMAX = 10**9
GB_EXT = ".gb"


def parse_args(arg_lst):
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", type=str, required=True,
                        help="A file with accessions to download")
    parser.add_argument("-d", "--database", type=str, required=True,
                        help="NCBI database ID")
    parser.add_argument("-e", "--email", type=str, required=False,
                        default="some_email@somedomain.com",
                        help="An e-mail address")
    parser.add_argument("-b", "--batch", type=int, required=False, default=100,
                        help="The number of accessions to process per request")
    parser.add_argument("-o", "--output_dir", type=str, required=True,
                        help="The directory to write downloaded files to")

    return parser.parse_args(arg_lst)


def read_accessions(fp):
    with open(fp) as acc_lines:
        return [line.strip() for line in acc_lines]


def accessions_to_gb(accessions, db, batchsize, retmax):
    def batch(sequence, size):
        l = len(accessions)
        for start in range(0, l, size):
            yield sequence[start:min(start + size, l)]

    def extract_records(records_handle):
        buffer = []
        for line in records_handle:
            if line.startswith("LOCUS") and buffer:
                # yield accession number and record
                yield buffer[0].split()[1], "".join(buffer)
                buffer = [line]
            else:
                buffer.append(line)
        yield buffer[0].split()[1], "".join(buffer)

    def process_batch(accessions_batch):
        # get GI for query accessions
        query = " ".join(accessions_batch)
        query_handle = Bio.Entrez.esearch(db=db, term=query, retmax=retmax)
        gi_list = Bio.Entrez.read(query_handle)['IdList']

        # get GB files
        search_handle = Bio.Entrez.epost(db=db, id=",".join(gi_list))
        search_results = Bio.Entrez.read(search_handle)
        webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]
        records_handle = Bio.Entrez.efetch(db=db, rettype="gb", retmax=batchsize,
                                           webenv=webenv, query_key=query_key)
        yield from extract_records(records_handle)

    accession_batches = batch(accessions, batchsize)
    for acc_batch in accession_batches:
        yield from process_batch(acc_batch)


def write_record(dir, accession, record):
    with open(os.path.join(dir, accession + GB_EXT), "w") as output:
        print(record, file=output)


def main(argv):
    args = parse_args(argv)
    accessions = read_accessions(os.path.abspath(args.input))
    op_dir = os.path.abspath(args.output_dir)
    if not os.path.exists(op_dir):
        os.makedirs(op_dir)
    dbase = args.database
    Bio.Entrez.email = args.email
    batchsize = args.batch

    for acc, record in accessions_to_gb(accessions, dbase, batchsize, RETMAX):
        write_record(op_dir, acc, record)


if __name__ == "__main__":
    main(sys.argv[1:])
ADD COMMENTlink 3.8 years ago Eli Korvigo • 150
Entering edit mode
0

If I want to download the full genbank file of my query id , I found this scripty can't work.and then ,I set the rettype to genbank , It doesn't work . So ,what should I do ? I want to download thousands of genome files

ADD REPLYlink 3.5 years ago
mark900830
• 0
Entering edit mode
0

Do you mean that the file has no sequence in it? There are some entries in GenBank that do not provide sequences, only annotation and metadata. If it's not the case, then I'll gladly help you and update the code. Waiting for your feedback.

ADD REPLYlink 3.5 years ago
Eli Korvigo
• 150
Entering edit mode
0

your code is pretty cool . I mean I want to download complete genome file . I just don't know how to select the rettype . I have solved the issue . If want to download complete genome file (genbank full), the rettype should be gbwithparts(rettype="gbwithparts") ,which means genbank file with protein sequences. Thank you so much ,and thanks for your nice code.

ADD REPLYlink 3.5 years ago
mark900830
• 0
Entering edit mode
0

Hi, probably little off topic but I wasn't able to find anywhere how to pass seq_start and seq_stop optional arguments to list of queries. What I mean is, when, like in this script, you search batch of accs, obtain web-env key to easy e-fetch what you want in one step. Now e-fetch accepts seq_start and seq_stop controlling what part of sequence you get, but how to pass this information to e-fetch for multiple sequences is a mystery to me. (From what I've tried it returns only first entry correctly omitting others or it omits seq_start and seq_stop and return those entries.

Any suggestion would be appreciated. Thanks

ADD REPLYlink 3.4 years ago
massa.kassa.sc3na
• 20
Entering edit mode
0

Hi , Would it be possible to get the fasta sequence given a list of accessions instead of the genbank files ??

ADD REPLYlink 2.9 years ago
David
• 150
Entering edit mode
0

Yes, you need to change the rettype argument from gb to fasta

ADD REPLYlink 2.8 years ago
Eli Korvigo
• 150
5
Entering edit mode

You can try this:

#!/usr/bin/env python
"""Fetch GenBank entries for given accessions. 

USAGE:
  python acc2gb.py A22237 A22239 A32021 A32022 A33397 > out.gb
or
  cat ids | python acc2gb.py > out.gb

DEPENDENCIES:
Biopython
"""

import sys
from Bio import Entrez

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"

#load accessions from arguments
if len(sys.argv[1:]) > 1:
  accs = sys.argv[1:]
else: #load accesions from stdin  
  accs = [ l.strip() for l in sys.stdin if l.strip() ]
#fetch
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
for i,acc in enumerate(accs):
  try:
    sys.stderr.write( " %9i %s          \r" % (i+1,acc))  
    handle = Entrez.efetch(db=db, rettype="gb", id=acc)
    #print output to stdout
    sys.stdout.write(handle.read())
  except:
    sys.stderr.write( "Error! Cannot fetch: %s        \n" % acc)  

EDIT
The same using epost:

import sys
from Bio import Entrez

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"
batchSize    = 100
retmax       = 10**9

#load accessions from arguments
if len(sys.argv[1:]) > 1:
  accs = sys.argv[1:]
else: #load accesions from stdin  
  accs = [ l.strip() for l in sys.stdin if l.strip() ]
#first get GI for query accesions
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
query  = " ".join(accs)
handle = Entrez.esearch( db=db,term=query,retmax=retmax )
giList = Entrez.read(handle)['IdList']
sys.stderr.write( "Found %s GI: %s\n" % (len(giList), ", ".join(giList[:10])))
#post NCBI query
search_handle     = Entrez.epost(db=db, id=",".join(giList))
search_results    = Entrez.read(search_handle)
webenv,query_key  = search_results["WebEnv"], search_results["QueryKey"] 
#fecth all results in batch of batchSize entries at once
for start in range( 0,len(giList),batchSize ):
  sys.stderr.write( " %9i" % (start+1,))
  #fetch entries in batch
  handle = Entrez.efetch(db=db, rettype="gb", retstart=start, retmax=batchSize, webenv=webenv, query_key=query_key)
  #print output to stdout
  sys.stdout.write(handle.read())
ADD COMMENTlink 6.9 years ago Leszek 4.0k
Entering edit mode
0

I usually use this way. But it's extremely slow when you want to download thousands of entries and sometimes falls to timeout. Using epost would be much better. But, apparently there's no way of doing that.

ADD REPLYlink 6.9 years ago
Sanjarbek Hudaiberdiev
• 90
Entering edit mode
0

you can do epost easily with biopython - you just have to be sure to provide valid accessions, otherwise the script will crash... have a look here how to do it: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114

ADD REPLYlink 6.9 years ago
Leszek
4.0k
Entering edit mode
0

I'm having the same issue with epost not taking accession numbers but integer UIDs instead...

ADD REPLYlink 6.7 years ago
junyinglim
• 0
Entering edit mode
0

Look at edited option: you can provide any id that is accepted by NCBI and the script will get UIDs of these automatically and print fasta...

ADD REPLYlink 6.7 years ago
Leszek
4.0k
Entering edit mode
0

This works fantastic! Just couldn't think myself of searching by accessions and getting ['IdList'] back. Thanks Leszek.

ADD REPLYlink 6.6 years ago
Sanjarbek Hudaiberdiev
• 90
Entering edit mode
1

you can find more tricks in biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc110

ADD REPLYlink 6.6 years ago
Leszek
4.0k
Entering edit mode
0

@ Leszec I want to retrieve gene symbols using RefSeq protein(stored in text file) how can i do this using the program when i run the script saving it as .py its showing "SYNTAX ERROR" I use Python2.7 Help me to work on the script Thanks in advance

ADD REPLYlink 2.8 years ago
lakshmi.bioinformatics
• 20
5
Entering edit mode

I have successfully used wget to download more than 1000 plastid Genbank files using Accessions (with version, NC_018523.1). It's not fast, but works.

mkdir -p gb fa
cut -f 1 plastid.accs | while read -r acc
do
    wget -O gb/"${acc}.gb" \
       "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=gb&sendto=on&id=$acc"
#    sleep 1s
#    wget -O fa/"${acc}.fa" \
#       "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=fasta&sendto=on&id=$acc"
done

To map accession to gi, you could try

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz # ~800MB
gunzip nucl_gb.accession2taxid.gz
grep -f plastid.accs nucl_gb.accession2taxid | cut -f  1, 4 > Acc_GI.csv
ADD COMMENTlink 3.7 years ago ifreecell • 170
Entering edit mode
0

I have the same question . I try your method. It works.But If I want to download the full genbank file . what should I do? change the " dopt=gb"? what type should I use?

ADD REPLYlink 3.5 years ago
mark900830
• 0
Entering edit mode
0

Replacing dopt=gb with dopt=gbwithparts should work, but I have not try yet.

ADD REPLYlink 3.4 years ago
ifreecell
• 170
Entering edit mode
0

It's not fast, but works.

ADD REPLYlink 19 months ago
Prakki Rama
♦ 2.3k
2
Entering edit mode

The hit-it-on-the-head-with-a-hammer alternative is to batch download a whole section of genbank, then filter out the ones you want using bioperl which will take accessions.

i.e. something along the lines of

    foreach my $acc (@acc){
            if ($seq->accession eq $acc){
                       $fileout->write_seq($seq);           ##etc. etc. etc.
            }
ADD COMMENTlink 6.9 years ago Daniel ♦ 3.7k
Entering edit mode
0

Any suggestions on how to download the whole data of genbank?

ADD REPLYlink 6.9 years ago
Sanjarbek Hudaiberdiev
• 90
Entering edit mode
0

There is the ftp if you want to batch DL ftp://ftp.ncbi.nlm.nih.gov/genbank/ But if you're only concerned about a certain taxa (prok, euk, human, whatever) you can just browser download it via http://www.ncbi.nlm.nih.gov/nuccore

ADD REPLYlink 6.9 years ago
Daniel
♦ 3.7k
0
Entering edit mode

I'm not too sure which accession number you mean and what normal ID you are referring to but http://david.abcc.ncifcrf.gov/conversion.jsp has the ability to map between various accessions (e.g. genbank, uniprot) and IDs (e.g. entrez, ensembl)

ADD COMMENTlink 6.9 years ago secretjess • 170
Entering edit mode
0

I'm working with GenBank. So, I need to map Genbank accession numbers to Genbank GI number. DAVID can't find the majority of the accessions that I submit. Moreover, it doesn't provide API, afaik.

ADD REPLYlink 6.9 years ago
Sanjarbek Hudaiberdiev
• 90
Entering edit mode
0

Ah, sorry that wasn't helpful. This looks like a similar question to yours (in reverse) so maybe the solution will be more useful: http://www.biostars.org/p/50383/

ADD REPLYlink 6.9 years ago
secretjess
• 170
Entering edit mode
1

I had a look on that thread. The problem is not solved there. If I had GIDs first, and wanted to convert them to Accessions, then that would be viable via epost-efetch:). But can't do the other way around. Thank you for your suggestions.

ADD REPLYlink 6.9 years ago
Sanjarbek Hudaiberdiev
• 90
0
Entering edit mode

two not so specific cents:

1) NCBI has a well documented but under-used collection of E-utilities you may want to try

2) Bio-perl has modules to fetch GenBank entries using different IDs.

ADD COMMENTlink 6.9 years ago Wen.Huang ♦ 1.2k
Entering edit mode
0

1) Yes, I like very much working with E-utils. It's the first time I got stuck with a problem. 2) True. There's no problem in fetching by accession number in one-request-at-a-time manner. But it's slow, due to 1/3 seconds limitation of NCBI. The problem comes when you want to try epost for batch downloading. Thank you for your suggestions!

ADD REPLYlink 6.9 years ago
Sanjarbek Hudaiberdiev
• 90
0
Entering edit mode

Depends on how long "long" is, but up to some length, you can even use the Entrez web interface to get these e.g. querying with "A22237 A22239 A32021 A32022 A33397" here http://www.ncbi.nlm.nih.gov gives you a link to this page:

http://www.ncbi.nlm.nih.gov/nuccore/A22237A22239A32021A32022A33397

and you can get these in a range of formats e.g. full genbank, fasta, etc., using the "Send to" functionality

ADD COMMENTlink 6.9 years ago aidan-budd 1.9k
0
Entering edit mode

The script provided by Leszek solves the problem.

ADD COMMENTlink 6.6 years ago Sanjarbek Hudaiberdiev • 90

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0