How to extract protein sequence of selected genes from gene bank like format
0
0
Entering edit mode
2.5 years ago
AP ▴ 80

Hello everyone,

I used this program called antismash to predict the secondary metabolite clusters in my genome of interest. The output I got is gene bank like format whish is shown below:

LOCUS       scaffold_10            47160 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  scaffold_10.
ACCESSION   scaffold_10
VERSION     scaffold_10
KEYWORDS    .
SOURCE
  ORGANISM
            .
COMMENT     ##antiSMASH-Data-START##
            Version      :: 6.0.1-25e43a7(changed)
            Run date     :: 2021-11-05 17:39:45
            NOTE: This is a single cluster extracted from a larger record!
            Orig. start  :: 2049841
            Orig. end    :: 2097001
            ##antiSMASH-Data-END##
FEATURES             Location/Qualifiers
     protocluster    1..47160
                     /aStool="rule-based-clusters"
                     /contig_edge="False"
                     /core_location="join{[2069841:2070128](+),
                     [2070175:2070449](+), [2070507:2071973](+),
                     [2072023:2074178](+), [2074232:2074784](+),
                     [2074838:2077001](+)}"
                     /cutoff="20000"
                     /detection_rule="cds(PKS_AT and (PKS_KS or ene_KS or mod_KS
                     or hyb_KS or itr_KS or tra_KS))"
                     /neighbourhood="20000"
                     /product="T1PKS"
                     /protocluster_number="1"
                     /tool="antismash"
     proto_core      join(20001..20287,20335..20608,20667..22132,22183..24337,
                     24392..24943,24998..27160)
                     /aStool="rule-based-clusters"
                     /tool="antismash"
                     /cutoff="20000"
                     /detection_rule="cds(PKS_AT and (PKS_KS or ene_KS or mod_KS
                     or hyb_KS or itr_KS or tra_KS))"
                     /neighbourhood="20000"
                     /product="T1PKS"
                     /protocluster_number="1"
     cand_cluster    1..47160
                     /SMILES="CC(=O)C(=O)O"
                     /candidate_cluster_number="1"
                     /contig_edge="False"
                     /detection_rules="cds(PKS_AT and (PKS_KS or ene_KS or
                     mod_KS or hyb_KS or itr_KS or tra_KS))"
                     /kind="single"
                     /product="T1PKS"
                     /protoclusters="1"
                     /tool="antismash"
     region          1..47160
                     /candidate_cluster_numbers="1"
                     /contig_edge="False"
                     /product="T1PKS"
                     /region_number="1"
                     /rules="cds(PKS_AT and (PKS_KS or ene_KS or mod_KS or
                     hyb_KS or itr_KS or tra_KS))"
                     /tool="antismash"
     CDS             1899..2720
                     /ID="CDS_13110"
                     /NRPS_PKS="Domain: PKS_KR (26-149). E-value: 7.6e-08.
                     Score: 24.2. Matches aSDomain:
                     nrpspksdomains_jgi.p_FusspF11_1_766668_PKS_KR.1"
                     /NRPS_PKS="type: other"
                     /gene="jgi.p_FusspF11_1_766668"
                     /gene_functions="biosynthetic-additional (smcogs)
                     SMCOG1001:short-chain dehydrogenase/reductase SDR (Score:
                     161.6; E-value: 3.9e-49)"
                     /gene_kind="biosynthetic-additional"
                     /phase="0"
                     /source="prediction"
                     /transl_table=1
                     /translation="MTPPPKSALESDPAVGAQRRFSVTGNAVVTGGAGVLGLHACDALL
                     EHGLEGLMILDVNPAQSQGQITSLQNKFPRAKIMALKVDVTDENVVNAAMEETARVLGS
                     IDTLICFVGVVGCVETLDMPVPQWRKILDINTTGSFICAQAAARQMVKRGRGGSIVFVA
                     SISAHRVNYPQPQAAYNVSKSALLMLKSCLAAEWARYGIRTNSISPGYMDTILNEGDGI
                     AEHRKIWAEHNPSGRMGAPSELTGTVVLLASSAGSYINGADIVVDGGGIVL"
     aSDomain        1977..2345
                     /aSDomain="PKS_KR"
                     /aSTool="nrps_pks_domains"
                     /database="nrpspksdomains.hmm"
                     /detection="hmmscan"
                     /domain_id="nrpspksdomains_jgi.p_FusspF11_1_766668_PKS_KR.1
                     "
                     /evalue="7.60E-08"
                     /label="jgi.p_FusspF11_1_766668_PKS_KR.1"
                     /locus_tag="jgi.p_FusspF11_1_766668"
                     /protein_end="149"
                     /protein_start="26"
                     /score="24.2"
                     /specificity="KR activity: inactive"
                     /specificity="KR stereochemistry: C1"
                     /tool="antismash"
                     /translation="AVVTGGAGVLGLHACDALLEHGLEGLMILDVNPAQSQGQITSLQN
                     KFPRAKIMALKVDVTDENVVNAAMEETARVLGSIDTLICFVGVVGCVETLDMPVPQWRK
                     ILDINTTGSFICAQAAARQ"
     aSModule        1977..2345
                     /domains="nrpspksdomains_jgi.p_FusspF11_1_766668_PKS_KR.1"
                     /incomplete
                     /locus_tags="jgi.p_FusspF11_1_766668"
                     /tool="antismash"
                     /type="pks"
     CDS_motif       1980..2042
                     /aSTool="nrps_pks_domains"
                     /database="abmotifs"
                     /detection="hmmscan"
                     /domain_id="nrpspksmotif_jgi.p_FusspF11_1_766668_0001"
                     /evalue="1.90E-05"
                     /label="PKSI-KR_m1"
                     /locus_tag="jgi.p_FusspF11_1_766668"
                     /protein_end="48"
                     /protein_start="27"
                     /score="17.3"
                     /tool="antismash"
                     /translation="VVTGGAGVLGLHACDALLEHG"
     CDS             join(3252..3263,3321..3443,3493..4428)
                     /ID="CDS_13111"
                     /NRPS_PKS="Domain: PKS_ER (28-277). E-value: 6.4e-13.
                     Score: 40.2. Matches aSDomain:
                     nrpspksdomains_jgi.p_FusspF11_1_794272_PKS_ER.1"
                     /NRPS_PKS="type: other"
                     /gene="jgi.p_FusspF11_1_794272"
                     /gene_functions="biosynthetic-additional (smcogs)
                     SMCOG1040:alcohol dehydrogenase (Score: 275.7; E-value:
                     7.6e-84)"
                     /gene_kind="biosynthetic-additional"
                     /phase="0"
                     /source="prediction"
                     /transl_table=1
                     /translation="MSQTNLSCVLYGPGKARFENRPVPSLKDAHDVIIRISYVGVCGSD
                     VHFWTDGGFARKVSEDQPLVMGHEASGIVRSIGPDVTLLKPGDRVAIEPGFSCRRCKQC
                     KDGRYNLCPKMKFAADPPLTQGTLSRFFSIPEDFAYKIPDSLSLEEAVLVEPLAVAVHG
                     IRLAGLEVGQRVLVQGSGTIGLLTAAVAKAYGAKQVYITDVNLDKIKFAKKYLECSAFI
                     PDLGSTPEENAARFKTETGLDDGVDAVIECTGVEASTQTGLLALSAGGVLVQVGLGKPV
                     QAIPIHAMSEKEIVLKTSFRYGPGDYEIALELLESGKVSVRPLISSITPFEKATEAWEK
                     TRKGKGIKNLIRGVQD"

and so on......

Now from this file I want to extract protein sequence of the CDS having PKS domain ( could be PKS_ER, PKS_KS or PKS_anything). In the output file I want to include LOCUS Id and CDS ID and location. I have never used python but I found a similar issue in Biostar using python and have made few lines of codes based on that. Please help me

from Bio import SeqIO
from Bio.SeqFeature import SeqFeature

gbk = "myfile_predicted_SMclusters.gbk"
fa = "myfile_PKS_protein.fa"
input_handle = open(gbk, "r")
output_handle = open(fa, "w")

for record in SeqIO.parse(input_handle, "genbank"):
    features = [feature for feature in record.features if feature.type == "CDS"]
    for feature in features:
        if feature.qualifiers["NRPS_PKS"][1] == "PKS":
        assert len(seq_feature.qualifiers['translation'])==1
                output_handle.write(">%s %s\n%s\n" % (
                    record.description,
                    feature.location,
                    SeqFeature(feature.location).extract(record.seq)))

output_handle.close()
input_handle.close()

The error is as follows:

if feature.qualifiers["NRPS_PKS"][1] == "PKS":
KeyError: 'NRPS_PKS'
Please help!!!

Thanks

python parse genbank • 728 views
ADD COMMENT

Login before adding your answer.

Traffic: 3063 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6