Question

TransDecoder: Filtering *_longest_orfs.pep Results

0

5.8 years ago

mduhon8 • 0

I've obtained TransDecoder output for 9 Trinity transcriptomes, each from a different eukaryotic species. Within the *_longest_orfs.pep output files, there are multiple predicted protein sequences for each isoform (see data subset below). My goal is to isolate the longest complete protein sequence from each isoform. Can someone here provide me with a Python/Perl script that will identify these target protein sequences and write them, with their corresponding IDs, to a secondary file under the following conditions: (1) the protein sequence must be at least 50 amino acids in length, (2) the protein sequence must begin with a methionine (M) residue, and (3) the protein sequence must end with a stop codon (*).

>Chione_cancellata_TRINITY_DN43053_c4_g1_i2.p1 type:5prime_partial len:303 gc:universal Chione_cancellata_TRINITY_DN43053_c4_g1_i2:2-910(+)
GANAGSSAQETGANTESGKQGTGANAGSSAQETGANTESGKQGTGANAGSSGPKTGAAAVNEGQATSADAGSKPTGTNTQPSTGEAEVGGQADTETPLSAGPTGVEQGTAAETVESSPNAGEPAETLEGGHSSECQQFAYRNVGNKIVFDVEVDCQLSVDVTTAQASKTKGVGPDRILAEMQTSSGNKAAVGSCPELVTYTKPGGIIHIKTSQNCLIIIYPERKATGRKGAVPRNVLFTVDSGTQVETQAEGRQVKVAKKVTGKAEKVMGMKETVKVRQIKRTEKVTKPKKTGEGKAKANKI*
>Chione_cancellata_TRINITY_DN43053_c4_g1_i2.p2 type:complete len:132 gc:universal Chione_cancellata_TRINITY_DN43053_c4_g1_i2:813-418(-)
MPITFSALPVTFFATLTCLPSACVSTWVPLSTVNKTFRGTAPFLPVAFLSGYIIMRQFCDVLIWIIPPGFVYVTSSGHEPTAALFPLDVCISAKILSGPTPFVLDACAVVTSTDNWQSTSTSKTILLPTFR*
>Chione_cancellata_TRINITY_DN43053_c4_g1_i2.p3 type:complete len:117 gc:universal Chione_cancellata_TRINITY_DN43053_c4_g1_i2:985-635(-)
MCDSSIMQDLSLNNLIYVFFFKGSVLYFICFRFSLAGFLWFSHFFRSFNLSNFDCLLHAHNFLCFTCHFLCYLNLSTLSLRFYLGSTVYCEQNISWDSAFPACRFSFRIYYYETVL*

Thank you in advance!

next-gen sequence cDNA TransDecoder transcriptome • 2.5k views

updated 5.8 years ago by h.mon 35k • written 5.8 years ago by mduhon8 • 0

score 0 · Answer 1 · 2018-07-16

5.8 years ago

h.mon 35k

Use seqkit fx2tab to convert FASTA to tab-delimited format, then grep for "type:complete" (no space after the colon, matching the headers in the question), then use seqkit tab2fx to convert back from tab to FASTA. Finally, if I am not mistaken, Trinity has a script to filter longest transcripts which might work for TransDecoder as well.
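If a standalone script is preferred, here is a minimal Python sketch of the three conditions from the question (at least 50 amino acids, starts with M, ends with *), keeping the longest qualifying ORF per isoform. It is not part of TransDecoder or Trinity. The function names are made up for illustration, and it rests on two assumptions: the isoform ID is the first header token with the trailing ".pN" removed, and the 50-residue minimum is counted without the terminal "*".

```python
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs; sequences may span several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def isoform_id(header):
    """Assumption: isoform ID = first header token minus the trailing '.pN'."""
    return header.split()[0].rsplit(".p", 1)[0]


def longest_complete_orfs(records, min_len=50):
    """Keep, per isoform, the longest peptide that starts with M and ends with *."""
    best = {}
    for header, seq in records:
        if not (seq.startswith("M") and seq.endswith("*")):
            continue
        # Assumption: the minimum length excludes the terminal '*'.
        if len(seq) - 1 < min_len:
            continue
        key = isoform_id(header)
        # Strict '>' means the first ORF encountered wins a tie.
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return list(best.values())


def main(in_path, out_path):
    with open(in_path) as fin:
        kept = longest_complete_orfs(read_fasta(fin))
    with open(out_path, "w") as fout:
        for header, seq in kept:
            fout.write(">" + header + "\n" + seq + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Saved under a hypothetical name such as filter_orfs.py, it would be run as "python filter_orfs.py longest_orfs.pep filtered.pep". On the data subset in the question, .p1 is dropped (5prime_partial, no leading M) and .p2 is kept over .p3 because it is longer.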
