Biopython drops GenBank LOCUS data?
2
0
Entering edit mode
8.3 years ago

Hello everyone,

I'm helping to write a Python application for storage and simple processing of RNA/DNA/protein sequences. I come from a software background and have no formal biology training at all (beyond high school). I just have a question about how/why Biopython pull information from GenBank.

I'm using Biopython to get a SeqRecord object for a GenBank ID using this function:

def get_seqrecord(gid, db="nucleotide"):
    try:
        handle = Entrez.efetch(db=db, id=gid, rettype="gb", retmode="text")
        return SeqIO.read(handle,"genbank")
    except:
        return None

This is fine, but the user also wants to store the "molecule type" for the sequence, from the LOCUS line, e.g.

LOCUS       JX978171               39269 bp    DNA     linear   INV 14-MAR-2013

would store "DNA"

LOCUS       AY994149                2957 bp    mRNA    linear   INV 01-JUN-2005

would store mRNA.

From what I can tell, Biopython parses that line and extracts the molecule type and uses it to set the sequence alphabet type. In the process, the precise molecule type is lost.

My questions are:

  • Is there a reason it does this? Should I not be relying on the molecule type from the LOCUS line?
  • Short of manually parsing the LOCUS line (which I would just do by copy-pasting the relevant Biopython code into my own function), is there a way I can get this information for a particular GenBank ID?

Thanks in advance for any help. Sorry if this is a really stupid question, I'm still learning huge swathes of this stuff.

sequence • 3.7k views
ADD COMMENT
1
Entering edit mode
5.1 years ago
Mathis DKZ ▴ 10

Even if this is an older post, some people might still look for this. Meanwhile it's possible to get the molculetype from the Locus:

for seq_record in SeqIO.parse(handle,"genbank"):  

        print(seq_record.annotations['molecule_type'])

All Locus Informations are name, sequence length, molecule type, molecule topology, Genbank division and date. This is how you get each out of a seq_record:

name = seq_record.name
seq_length = len(seq_record.seq)   
molecule_type = seq_record.annotations['molecule_type']
molecule_topology = seq_record.annotations['topology']
division = seq_record.annotations['data_file_divison']
date =  seq_record.annotations['date']

I hope this can help someone.

ADD COMMENT
0
Entering edit mode
8.3 years ago
Peter 6.0k

This is a known limitation, see https://github.com/biopython/biopython/issues/363

ADD COMMENT

Login before adding your answer.

Traffic: 2133 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6