Question

How to write FULL genbank files with Biopython SeqIO module?

0

Entering edit mode

8.2 years ago

russianconcussion ▴ 30

Hi,

I am trying to split up the Synechococcus genbank files from NCBI Genbank into separate genbank files for each genome. Some of the genomes have several genbank files because they are draft assemblies. So, I import the SeqIO library from Bio, parse the conglomerated genbank files, put them into a dictionary of lists with their gb.name as the key, then iterate through the dictionary with SeqIO.write to write the genbank files. However, the output does not include the fasta nucleotide information that is usually in a FULL genbank file, just the features. I've looked through the BioPython Tutorial and previous questions on BioStars, but I am still wondering what am I doing wrong. Any help?

Original data: https://www.dropbox.com/s/kfwothubsrh5vtc/synechococcus.gb?dl=0

from collections import defaultdict
from Bio import SeqIO

gbDict = defaultdict(list)
gbs = SeqIO.parse("synechococcus.gb", "genbank")

for gb in gbs:                                
    gbDict[gb.name].append(gb)

for gb in gbDict:                             
    SeqIO.write(gbDict[gb], ".".join([gb,"gb"]), "genbank")

genbank Biopython • 5.0k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by russianconcussion ▴ 30

Ram · Answer 1 · 2016-02-11

The problem is your input data - some of the records simply don't have the sequence you want, just the features. Looking at your GenBank file, some of your records do have sequences in it - but there are others which do not, e.g. NZ_LFEK01000000 only has a WGS line.

Depending on how you get the data from the NCBI, they will either give you full GenBank files (with sequence) or summary records without the sequence, but instead a CONTIG or WGS line saying how the sequence is constructed from other records.

Also, I would rewrite your code to avoid loading everything into RAM in a dictionary, and also to avoid the slight data loss from from reading and writing the records. Instead, use the Biopython indexing to pull out the raw records:

from Bio import SeqIO
d = SeqIO.index("synechococcus.gb", "genbank")
print("You have %i sequences" % len(d))
for key in d:
    with open(key + ".gb", "wb") as handle:
        handle.write(d.get_raw(key))
print("Done")

This will name the files using record.id rather than record.name which you may or may not prefer.