Question

Refseq Genbank To Fasta Format Failing With Contig Fields

1

Entering edit mode

12.8 years ago

Käpt'N Silico ▴ 20

I used to generate FASTA out of my GenBank source files using a simple conversion script:

#!/usr/bin/env python
from Bio import SeqIO

def wrap( text, width=80 ):
 for i in xrange( 0, len( text ), width ):
  yield text[i:i+width]

if __name__ == "__main__":
  status = progress()
  for record in SeqIO.parse( sys.stdin, "genbank"):
    try:
      gi = record.annotations["gi"]
    except KeyError:
      gi = None
    accession = record.id
    desc = record.description
    seq = record.seq
    locus = record.name
    print ">gi|%s|emb|%s|%s| %s" % (gi, accession, locus, desc)
    for block in wrap( seq ):
      print block

When I changed the sequence files to newer versions some of the resulting FASTA file sequences were just filled with Ns. After closer inspection of the GenBank source files, it turns out that they have replaced the ORIGIN block

ORIGIN
  sequence...

with a CONTIG block, something like

CONTIG      join(BX640437.1:1..347356,BX640438.1:51..347786,...)

It looks like GenBank is now using cross-references to build sequence entries. In my understanding this would require a multi-pass scan of the file. Is there a way to resolve this using BioPython?

Data source: Since the sequences are from the RefSeq collection, I assume they should be self-contained. This problem occurs with the current RefSeq release 47 that you can download via FTP.

I was working with BioPython 1.52 and 1.57 (latest).

Thanks for your suggestions.

biopython genbank refseq • 4.1k views

ADD COMMENT • link updated 12.8 years ago by Peter 6.0k • written 12.8 years ago by Käpt'N Silico ▴ 20

0

Entering edit mode

I checked with the NCBI and they told me that this is intended and you are supposed to download both, FASTA and Genbank files.

ADD REPLY • link 12.8 years ago by Käpt'N Silico ▴ 20

score 4 · Answer 1 · 2011-07-05

4

Entering edit mode

12.8 years ago

Peter 6.0k

This is not a problem with Biopython, but with the input data files.

If you use the website (or Entrez interface), you can choose between "GenBank" and "GenBank (full)". You should pick the "full" version if you want to ensure the sequence is included. If you download the GenBank files by FTP, they will usually have the full sequence.

If you can't re-download the GenBank file, it would in theory be possible to use the Entrez API to fetch each of the sequences references in the CONTIG string and join them together. This is however going to be complicated and error prone (e.g. network issues), so I would avoid this.

ADD COMMENT • link 12.8 years ago by Peter 6.0k

1

Entering edit mode

I suggest you email the NCBI to ask about the missing sequences in the GenBank RefSeq files. But in the meantime, in your position I would just download the FASTA files provided.

ADD REPLY • link 12.8 years ago by Peter 6.0k

0

Entering edit mode

Thanks, I made the data issue a little clearer. This happens with the official RefSeq release 47. My thinking was that is would be self-contained and thus be able to flatten the files in order to contain no sequence references any more without additional data.

ADD REPLY • link 12.8 years ago by Käpt'N Silico ▴ 20

0

Entering edit mode

Could you add the FTP link to the file(s) you are using? Somewhere under ftp://ftp.ncbi.nih.gov/refseq/ I presume... in most cases there is a FASTA file provided which you could just download instead (look for extension fna).

ADD REPLY • link 12.8 years ago by Peter 6.0k

0

Entering edit mode

Yes thanks, I know there are FASTA files which I already downloaded. This question was rather one of principle. I hate to download both, GenBank and FASTA to get sequence and annotation information. If it is not possible to convert GenBank to FASTA offline, there is no sense in providing two formats with overlapping information and cause traffic on the FTP site.

ADD REPLY • link 12.8 years ago by Käpt'N Silico ▴ 20