Question

Getting proteins from genomic interval of a genbank file

0

Entering edit mode

5.0 years ago

fhsantanna ▴ 610

I am trying to get all proteins from a genbank file considering a sequence interval. Here is the code of the script written in python:

#!/usr/bin/env python

#Script for getting proteins in a range from a genbank file
#Usage: python get_proteins.py file start:end

import sys
from Bio import SeqIO

rec = SeqIO.read(sys.argv[1], 'genbank')
feats = [feat for feat in rec.features if feat.type == "CDS"]
start, end = sys.argv[2].split(':')

desired = set(range(int(start),int(end),1))

for f in feats:
    span = set(range(f.location._start.position, f.location._end.position))
    if span & desired:
        print("%s,%s,%d:%d\n%s\n" % (
                    f.qualifiers['locus_tag'][0],
                    f.qualifiers['product'][0],
                    f.location._start.position,
                    f.location._end.position,
                    f.qualifiers["translation"][0]))

There are two problems: For some regions containing pseudogenes, printing "f.qualifiers["translation"][0]" breaks the script;

$ python get_proteins.py ../gbk/NZ_AOLM01000025.1.gbk 92686:113021 > island_proteins.faa
Traceback (most recent call last):
  File "get_proteins.py", line 24, in <module>
    f.qualifiers["translation"][0]))
KeyError: 'translation'

I know that I must use "try", but I was not successful in doing it.

The second problem is that some genbank files give the following error:

$ python get_proteins.py ../gbk/NZ_CM001555.1.gbk 530979:551299 > island_proteins.faa
Traceback (most recent call last): File "get_proteins.py", line 17, in> <module>
 span = set(range(f.location._start.position, f.location._end.position)) AttributeError: 'CompoundLocation' object
has no attribute '_start'

I did not understand this last problem.

Thank you.

genbank proteins cds biopython • 1.9k views

ADD COMMENT • link updated 5.0 years ago by Joe 21k • written 5.0 years ago by fhsantanna ▴ 610

0

Entering edit mode

I think I managed the problem:

#!/usr/bin/env python

#Script for getting proteins in a range from a genbank file
#Usage: python get_proteins.py file start:end


#!/usr/bin/env python

#This script is a modification of the script found in Peter Cock's site (http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/).
# Usage: python gbk2faa.py  start:end 

import sys
from Bio import GenBank
from Bio import SeqIO

input_handle  = open(sys.argv[1], "r")
output_handle = open(sys.argv[3], "w")

desired_interval = sys.argv[2].split(':')
start = int(desired_interval[0])
end = int(desired_interval[1])
#desired_interval = set(range(int(start),int(end),1))

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print("Dealing with GenBank record %s" % seq_record.id)
    for seq_feature in seq_record[start:end].features :
        try: # Without "try", it crashes when it finds a CDS without translation (pseudogene).
            if seq_feature.type=="CDS" :
                assert len(seq_feature.qualifiers['translation'])==1
                output_handle.write(">%s,%s,%s,%s\n%s\n" % (
                    seq_feature.qualifiers['locus_tag'][0],
                    seq_feature.qualifiers['product'][0],
                    seq_record.id,
                    seq_record.description,
                    seq_feature.qualifiers['translation'][0]))
                pass
        except:
            continue

output_handle.close()
input_handle.close()
print("Done")

ADD REPLY • link 5.0 years ago by fhsantanna ▴ 610

1

Entering edit mode

5.0 years ago

Pierre Lindenbaum 161k

using xslt and then filter the TSV output with awk

:

ADD COMMENT • link 5.0 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you very much. But for 100,000 files this is not feasible.

ADD REPLY • link 5.0 years ago by fhsantanna ▴ 610

0

Entering edit mode

But for 100,000 files this is not feasible.

a loop ?

ADD REPLY • link 5.0 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Sorry, you are right. I did not pay attention. Maybe your solution is faster than mine. I'll test it. Thank you again.

ADD REPLY • link 5.0 years ago by fhsantanna ▴ 610

0

Entering edit mode

Could you provide your awk code, please?

ADD REPLY • link 5.0 years ago by fhsantanna ▴ 610

1

Entering edit mode

something like (not tested):

awk -v B=123  -v E=456 '{x1=int($2);x2=int($3);b=x1<x2?x1:x2;e=x1<x2?x2:x1; if (!(b>int(E) || e<int(B))) print; }' input.tsv

ADD REPLY • link 5.0 years ago by Pierre Lindenbaum 161k

score 3 · Accepted Answer · 2019-05-07

I've built something around that exact code before, for this problem: https://github.com/jrjhealey/bioinfo-tools/blob/master/get_spans.py

It requires that the file be a single contiguous genbank (for now). It prints the features, but it should be easy to alter the output to get fasta sequences or similar via Biopython.