How To Find Any Of Many Motifs In Certain Sequences Using Perl?
4
8
Entering edit mode
13.9 years ago
nikulina ▴ 300

Dear colleagues, I am applying for your help once more. Here is my perl script for counting the length between the binding sites and the start points of exons. So, there are some genes saved in the file called sequence.txt and some amount of binding sites in motif.txt. The thing I want to do is to count the length of the fragment for every gene if it has any of the binding sites from motif.txt. This current script does not work in the way I would like it. How should it be changed? Shall I add another while loop for $motif?

$string_filename = 'sequence.txt';  

open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");

$motif_filename = 'motif.txt';   

open(MOTIF, $motif_filename) || die("Couldn't read file $motif_filename\n"); 

local $/ = "\n>";  

   while (my $seq = <FILE>) {
chomp $seq;
$seq =~ s/^>*.+\n//;  
$seq =~ s/\n//g;  

$R = length $seq;  
  $motif = <MOTIF>; 
chomp $motif;
$motif =~ s/^>*.+\n//; 
$motif =~ s/\n//g; 
  if ( $seq =~ /$motif/ ) {     ## insert actual binding site 
    $M = $';
    $W = length $M;

    if ( $seq =~ /[A-Z]/) {    ##  exon start
        $K = $`;   
        $Z = length $K;  
        $x = $W + $Z - $R;     
        print "\n\ the distance is the following: $x\n\n";

    } else {  
        print "\n\ I couldn't find the start codon.\n\n";
    }
} else {  
    print "\n\ I couldn't find the binding site.\n\n";   
}

}

close MOTIF;   
close FILE;   

exit;
perl motif sequence • 8.0k views
ADD COMMENT
3
Entering edit mode

please, show us what your inputs look like.

ADD REPLY
1
Entering edit mode

There is a FASTQ-FASTA converter here - and Bioperl's Bio::SeqIO::fastq will handle some FASTQ formats.

ADD REPLY
0
Entering edit mode

the binding sites look like this

ADD REPLY
0
Entering edit mode

and the genes

ADD REPLY
0
Entering edit mode

binding site=A FASTQ file ??!!!!!

ADD REPLY
0
Entering edit mode

I recognise that it is a problem but i haven't found the way to convert it to fasts yet.

ADD REPLY
0
Entering edit mode

And further discussion of FASTQ-FASTA conversion at StackOverflow.

ADD REPLY
8
Entering edit mode
13.9 years ago
Neilfws 49k

I think you need to take a step back and think about the logic of your problem, before writing the code. Break your problem down into steps, like this:

  1. Read motifs into a data structure over which you can loop (e.g. an array)
  2. Read in your sequence(s)
  3. For each sequence, identify segments between binding site and exon start
  4. For each segment...
  5. For each motif, check for match to segment
  6. If at least one match, store/print segment length (and perhaps other useful information)

For tools to convert FASTQ-FASTA, see comments under your question.

As we said in your previous query, Bioperl makes this kind of task very trivial. There is a learning curve, but I would urge you to investigate - it will pay off in the long term. Using more "standard" code also makes it a lot easier for other people to read and debug.

ADD COMMENT
0
Entering edit mode

Thank you, I'll try to do something with it.

ADD REPLY
3
Entering edit mode
13.9 years ago

you should load your motifs into an array and then iterate over that motif array for each sequence

ADD COMMENT
1
Entering edit mode
13.9 years ago
nikulina ▴ 300

I eventually recognised how to modify my script to do the job I wished. Now it looks like the following:)

$string_filename = 'seq.txt';  
open(FILE, $string_filename) || die("Couldn't read file $string_filename\n");   
$motif_filename = 'motif.txt';   
open(MOTIF, $motif_filename) || die("Couldn't read file $motif_filename\n");   
local $/ = "\n>";

while (@motif = <MOTIF>) {              
  while (my $seq = <FILE>) {
    chomp $seq;
    $seq =~ s/^>*.+\n//;  
    $seq =~ s/\n//g;  
    $R = length $seq;    
    foreach $site (@motif) {
      chomp $site;
      $site =~ s/^>*.+\n//; 
      $site =~ s/\n//g;  
      if ( $seq =~ /$site/ ) {     
        $M = $';
        $W = length $M;
        if ( $seq =~ /[AGTC]/) {   
          $K = $`;   
          $Z = length $K;  
          $x = $W + $Z - $R;     
          print "\nthe distance is the following: $x\n\n";
                               } 
                             } 
                            } 
                           }  
                          } 
close FILE;    
exit;

Today I tried to do the same using BioPerl module and recognised it to be much more easier indeed. Thanks a million for your help and support:)

ADD COMMENT

Login before adding your answer.

Traffic: 2387 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6