Biostar Beta. Not for public use.
Question: Getting proteins from genomic interval of a genbank file
0
Entering edit mode

I am trying to get all proteins from a genbank file considering a sequence interval. Here is the code of the script written in python:

#!/usr/bin/env python

#Script for getting proteins in a range from a genbank file
#Usage: python get_proteins.py file start:end

import sys
from Bio import SeqIO

rec = SeqIO.read(sys.argv[1], 'genbank')
feats = [feat for feat in rec.features if feat.type == "CDS"]
start, end = sys.argv[2].split(':')

desired = set(range(int(start),int(end),1))

for f in feats:
    span = set(range(f.location._start.position, f.location._end.position))
    if span & desired:
        print("%s,%s,%d:%d\n%s\n" % (
                    f.qualifiers['locus_tag'][0],
                    f.qualifiers['product'][0],
                    f.location._start.position,
                    f.location._end.position,
                    f.qualifiers["translation"][0]))

There are two problems: For some regions containing pseudogenes, printing "f.qualifiers["translation"][0]" breaks the script;

$ python get_proteins.py ../gbk/NZ_AOLM01000025.1.gbk 92686:113021 > island_proteins.faa
Traceback (most recent call last):
  File "get_proteins.py", line 24, in <module>
    f.qualifiers["translation"][0]))
KeyError: 'translation'

I know that I must use "try", but I was not successful in doing it.

The second problem is that some genbank files give the following error:

$ python get_proteins.py ../gbk/NZ_CM001555.1.gbk 530979:551299 > island_proteins.faa
Traceback (most recent call last): File "get_proteins.py", line 17, in> <module>
 span = set(range(f.location._start.position, f.location._end.position)) AttributeError: 'CompoundLocation' object
has no attribute '_start'

I did not understand this last problem.

Thank you.

ADD COMMENTlink 9 months ago fhsantanna • 440 • updated 9 months ago Joe 12k
Entering edit mode
0

I think I managed the problem:

#!/usr/bin/env python

#Script for getting proteins in a range from a genbank file
#Usage: python get_proteins.py file start:end


#!/usr/bin/env python

#This script is a modification of the script found in Peter Cock's site (http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/).
# Usage: python gbk2faa.py  start:end 

import sys
from Bio import GenBank
from Bio import SeqIO

input_handle  = open(sys.argv[1], "r")
output_handle = open(sys.argv[3], "w")

desired_interval = sys.argv[2].split(':')
start = int(desired_interval[0])
end = int(desired_interval[1])
#desired_interval = set(range(int(start),int(end),1))

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print("Dealing with GenBank record %s" % seq_record.id)
    for seq_feature in seq_record[start:end].features :
        try: # Without "try", it crashes when it finds a CDS without translation (pseudogene).
            if seq_feature.type=="CDS" :
                assert len(seq_feature.qualifiers['translation'])==1
                output_handle.write(">%s,%s,%s,%s\n%s\n" % (
                    seq_feature.qualifiers['locus_tag'][0],
                    seq_feature.qualifiers['product'][0],
                    seq_record.id,
                    seq_record.description,
                    seq_feature.qualifiers['translation'][0]))
                pass
        except:
            continue

output_handle.close()
input_handle.close()
print("Done")
ADD REPLYlink 9 months ago
fhsantanna
• 440
2
Entering edit mode

I've built something around that exact code before, for this problem: https://github.com/jrjhealey/bioinfo-tools/blob/master/get_spans.py

It requires that the file be a single contiguous genbank (for now). It prints the features, but it should be easy to alter the output to get fasta sequences or similar via Biopython.

ADD COMMENTlink 9 months ago Joe 12k
1
Entering edit mode

using xslt and then filter the TSV output with awk

:

ADD COMMENTlink 9 months ago Pierre Lindenbaum 120k
Entering edit mode
0

Thank you very much. But for 100,000 files this is not feasible.

ADD REPLYlink 9 months ago
fhsantanna
• 440
Entering edit mode
0

But for 100,000 files this is not feasible.

a loop ?

ADD REPLYlink 9 months ago
Pierre Lindenbaum
120k
Entering edit mode
0

Sorry, you are right. I did not pay attention. Maybe your solution is faster than mine. I'll test it. Thank you again.

ADD REPLYlink 9 months ago
fhsantanna
• 440
Entering edit mode
0

Could you provide your awk code, please?

ADD REPLYlink 9 months ago
fhsantanna
• 440
Entering edit mode
1

something like (not tested):

awk -v B=123  -v E=456 '{x1=int($2);x2=int($3);b=x1<x2?x1:x2;e=x1<x2?x2:x1; if (!(b>int(E) || e<int(B))) print; }' input.tsv
ADD REPLYlink 9 months ago
Pierre Lindenbaum
120k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0