Question

Is There A Perl Script To Select Longest Sequence Region(Translated) Between Stop Codons Translated Using Transeq?

1

Entering edit mode

10.2 years ago

biostar ▴ 170

Hi Everyone, I am trying to figure out the easiest way to extract the longest sequence from a translated frame of DNA sequences. The sequences I have were translated using transeq, now I need to extract the region with the longest gap between two stop codons (as indicated by * in the sequence below. The translated sequence look like this:

>Hh_TY_4
 KSHLATLH*LSQCH*NSMSISAF*VIYVYFE*S*HTFRIPFVKTF**FK**NSSIHNEKFVLFP**KDISSRWKSHSCCWRRERPKQFVCIWRWYSICQTSTRVRSRNNQENPVRN*HLQSQ*SEESQRCR
>Hh_TY_5
 KSHLATLH*LSQCH*NSMSISAF*VIYVYFE*S*HTFRIPFVKTFFKNSSIHNEKFVLFPKDISENPVRN*HLQSQ*SEESQRCR

And the output I want is like this:

>Hh_TY_4
KDISSRWKSHSCCWRRERPKQFVCIWRWYSICQTSTRVRSRNNQENPVRN
>Hh_TY_5
HTFRIPFVKTFFKNSSIHNEKFVLFPKDISENPVRN

Any suggestions on this would be really appreciated. Thanks!

perl translation • 3.5k views

ADD COMMENT • link updated 10.2 years ago by Kenosis ★ 1.3k • written 10.2 years ago by biostar ▴ 170

score 1 · Answer 1 · 2014-01-27

try this:

#!/usr/bin/perl

use strict;
use warnings;

$/ = "\n>";
while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    my @orfs = split (/\*/, $seq);
    shift @orfs; pop @orfs; # remove last and first, I think you want only if an ORF is BETWEEN two *, if not, just delete this line
    my $sel = shift @orfs;
    foreach my $next (@orfs) {
        $sel = $next if ((length $sel) < (length $next))
    }
    print ">$id\n$sel\n";
}

then run as: perl getLongestORF.pl < FASTA > OUT

score 1 · Answer 2 · 2014-01-27

Perhaps the following will be helpful:

use strict;
use warnings;
use List::Util qw/max/;

local ( $/, $" ) = qw/ > * /;

while (<>) {
    if ( my ( $id, @seq ) = /([^>\n]+)/g ) {
        my %h;
        push @{ $h{ length $_ } }, $_ for split /\*/, join '', @seq;
        print ">$id\n@{ $h{ max keys %h } }\n";
    }
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

>Hh_TY_4
KDISSRWKSHSCCWRRERPKQFVCIWRWYSICQTSTRVRSRNNQENPVRN
>Hh_TY_5
HTFRIPFVKTFFKNSSIHNEKFVLFPKDISENPVRN

This script handles cases where there is more than one 'longest' sequence region, separating those sequences with a "*". For example, a possible result might be:

>Hh_TY_6
HTFRIPFVKTFFKNSSIHNEKFVLFPKDISENPVRN*KDISSRWKSHSCCWRRERPKQFVCIWRWYSICQTST

This script builds a hash of arrays (HoA) for each set of sequence regions (strings), where the key is the length and the associated value is a reference to a list of sequence regions. max keys %h returns the key with the greatest value--thus the key whose value is a reference to the list of sequence regions with the greatest length. $h{ max keys %h } returns the reference to the list of those strings. @{ $h{ max keys %h } } produces an array of those strings, and since it's enclosed within double quotes, the array is interpolated. Because of the local $" = '*', multiple array elements will be displayed with a "*" between them during this interpolation.