I've run TransDecoder on 9 Trinity transcriptomes, one from each of 9 eukaryotic species. Within the _longest_orfs.pep output files, there are multiple predicted protein sequences for each isoform (see data subset below). My goal is to isolate the longest complete protein sequence from each isoform. Can someone here provide me with a Python/Perl script that will identify these target protein sequences and write them, with their corresponding IDs, to a secondary file under the following conditions: (1) the protein sequence must be at least 50 amino acids in length, (2) the protein sequence must begin with a methionine (M) residue, and (3) the protein sequence must end with a stop codon (*).
>Chione_cancellata_TRINITY_DN43053_c4_g1_i2.p1 type:5prime_partial len:303 gc:universal Chione_cancellata_TRINITY_DN43053_c4_g1_i2:2-910(+)
GANAGSSAQETGANTESGKQGTGANAGSSAQETGANTESGKQGTGANAGSSGPKTGAAAVNEGQATSADAGSKPTGTNTQPSTGEAEVGGQADTETPLSAGPTGVEQGTAAETVESSPNAGEPAETLEGGHSSECQQFAYRNVGNKIVFDVEVDCQLSVDVTTAQASKTKGVGPDRILAEMQTSSGNKAAVGSCPELVTYTKPGGIIHIKTSQNCLIIIYPERKATGRKGAVPRNVLFTVDSGTQVETQAEGRQVKVAKKVTGKAEKVMGMKETVKVRQIKRTEKVTKPKKTGEGKAKANKI*
>Chione_cancellata_TRINITY_DN43053_c4_g1_i2.p2 type:complete len:132 gc:universal Chione_cancellata_TRINITY_DN43053_c4_g1_i2:813-418(-)
MPITFSALPVTFFATLTCLPSACVSTWVPLSTVNKTFRGTAPFLPVAFLSGYIIMRQFCDVLIWIIPPGFVYVTSSGHEPTAALFPLDVCISAKILSGPTPFVLDACAVVTSTDNWQSTSTSKTILLPTFR*
>Chione_cancellata_TRINITY_DN43053_c4_g1_i2.p3 type:complete len:117 gc:universal Chione_cancellata_TRINITY_DN43053_c4_g1_i2:985-635(-)
MCDSSIMQDLSLNNLIYVFFFKGSVLYFICFRFSLAGFLWFSHFFRSFNLSNFDCLLHAHNFLCFTCHFLCYLNLSTLSLRFYLGSTVYCEQNISWDSAFPACRFSFRIYYYETVL*
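To make the three conditions concrete, here is a rough Python sketch of the selection logic I have in mind. It is not a tested pipeline, and it rests on a few assumptions of mine: the isoform ID is taken to be the first word of the header with the trailing `.pN` removed, the 50-residue minimum is counted without the trailing `*`, and all function names (`read_fasta`, `isoform_id`, `is_complete`, `longest_complete_per_isoform`) are made up for illustration. It uses only the standard library.

```python
import re
import sys

MIN_LEN = 50  # assumption: minimum residues, NOT counting the trailing '*'


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def isoform_id(header):
    """Assumption: isoform ID = first word of the header minus '.pN'."""
    return re.sub(r"\.p\d+$", "", header.split()[0])


def is_complete(seq):
    """Conditions (1)-(3): length >= MIN_LEN, starts with M, ends with '*'."""
    return seq.startswith("M") and seq.endswith("*") and len(seq) - 1 >= MIN_LEN


def longest_complete_per_isoform(records):
    """Return {isoform_id: (header, seq)}, keeping the longest passing ORF."""
    best = {}
    for header, seq in records:
        if not is_complete(seq):
            continue
        key = isoform_id(header)
        # Ties keep the first ORF seen for that isoform.
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return best


def main(in_path, out_path):
    with open(in_path) as fin:
        best = longest_complete_per_isoform(read_fasta(fin))
    with open(out_path, "w") as fout:
        for header, seq in best.values():
            fout.write(f">{header}\n{seq}\n")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        print("usage: script.py <input.pep> <output.pep>", file=sys.stderr)
```

Saved under a hypothetical name such as `filter_orfs.py`, it would be run once per species as `python filter_orfs.py longest_orfs.pep filtered.pep`. On the subset above, `.p1` should be rejected because it does not start with M, and of the two remaining candidates the longer one, `.p2`, should be kept.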
Thank you in advance!