Extract genbank features inside a given range
1
0
Entering edit mode
5.6 years ago

How to retrieve all features inside a given range from genbank record, using biopython?

f1 = FeatureLocation(10, 40, strand=+1)
recs = [rec for rec in SeqIO.parse('my_genbank.gb', "genbank")]
feats = [feat for feat in rec.features if feat.type == "CDS"]
for feat in feats:
    if feat.location in f1:
        print feat.location
        print '\n'

i've tried this, but i get this error

ValueError: Currently we only support checking for integer positions being within a FeatureLocation.

biopython genbank • 3.4k views
ADD COMMENT
0
Entering edit mode
5.6 years ago
Joe 21k

I think you don't need your first line. From a quick read of the docs, FeatureLocation seems to be more used in creating features at specific locations than testing them. Consequently its expecting a simple integer, but you're giving it an object.

for example, the following works with my test data:

recs = [rec for rec in SeqIO.parse('my_genbank.gb', "genbank")]
feats = [feat for feat in rec.features if feat.type == "CDS"]
for feat in feats:
    if 456 in feat:    # Notice its only a simple integer
        print feat

# ----- Output -----

type: CDS
location: [7:457](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: inference, Value: ['ab initio prediction:Prodigal:2.60', 'protein motif:Pfam:PF06841.6']
    Key: locus_tag, Value: ['PAU_01961']
    Key: product, Value: ['T4-like virus tail tube protein gp19']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MSTTADQIAVQYPIPTYRFVVTIGDEQMCFQSVSGLDISYDTIEYRDGVGNWLQMPGQRQRPTITLKRGIFKGQSKLYDWINSISLNQIEKKDISISLTDETGSNLLITWNIANAFPEKLTAPSFDATSNEVAVQEISLKADRVTVEFH']

Incidentally, you can use normal python 'slicing' nomenclature to extract subregions of a genbank and write them out in any format you like with SeqIO.write. I exploit this for creating sub-setted genbanks in this script for instance:

https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py

Particularly line 121.

ADD COMMENT
0
Entering edit mode

Thank you for your reply! Although, i still don't get it I need to check if the feature is inside a given range, not the other way around I try something like:

SeqIO.write(record, cwd+'/test_directory_for_trash/test_for_annotation.gb', "gb")
recs = [rec for rec in SeqIO.parse(cwd+'/test_directory_for_trash/test_for_annotation.gb', "genbank")]
feats = [feat for feat in rec.features if feat.type == "CDS"]
for feat in feats:
    if feat in [10000,12000]:
        print feat.location
        print '\n'

but i get this

ValueError: Locus identifier 'NW_016067404.1:1119446-1134925' is too long

ADD REPLY
0
Entering edit mode

I don't think your approach will really work in that case.

Try the following instead:

from __future__ import print_function
import sys
from Bio import SeqIO

rec = SeqIO.read(sys.argv[1], 'genbank')
feats = [feat for feat in rec.features if feat.type == "CDS"]
start, end = sys.argv[2].split(':')

desired = set(xrange(int(start),int(end),1))

for f in feats:
    span = set(xrange(f.location._start.position, f.location._end.position))
    if span & desired:
        print(f)

call the script as:

python scriptname.py genbank.gb 123:456

I adapted this slightly from here, so there are answers out there already if you need more info.

If you want to filter by strand still, you can do that in addition by testing f further.

ADD REPLY

Login before adding your answer.

Traffic: 2441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6