Perl: Ignoring Repetitive K-Mers In Hash Construction
1
1
Entering edit mode
10.5 years ago
Neal ▴ 60

Hello all,

I am trying to find the most frequent k-mers with "n" mismatches. But I'm running into a problem while creating a hash table of such k-mers.

Sample Input:ACGTTGCATGTCGCATGATGCATGAGAGCT
k-mer size:4
mismatch:1
Output: GATG ATGC ATGT

My code's logic is as follows:

1. Extract all kmers of given size
2. For each kmer, find other kmers with given mismatch range
3. Print kmers which have maximum number of fuzzy matches

I am specifically stuck on point 2 above. I am using a while loop to iterate over all kmers. Problem is I need to count a particular kmer only once eg :

ACGT CGTT GTTG TTGC TGCA GCAT CATG ATGT TGTC GTCG TCGC CGCA GCAT CATG **ATGA** TGAT GATG ATGC TGCA GCAT CATG **ATGA** TGAG GAGA AGAG GAGC AGCT

a kmer like ATGA should be counted just once (it appears twice as I've highlighted above with ** for ease of viewing). While constructing the hash, it gets counted twice and all the fuzzy matches also get doubled. Similarly GCAT occurs thrice above, and while constructing the hash, it gets counted thrice.

I need a way to skip if the kmer already exists in the hash ( kmer being the key).

The brilliant code for fuzzy matching via subroutine was provided by Kenosis here: Perl :Fuzzy Matching Problem

My code is as under:

use 5.014;
use warnings;
use List::Util qw/max/;

$_ = "ACGTTGCATGTCGCATGATGCATGAGAGCT";
my $len_seq = length $_;
my $len_pat = 4;

my $i = 0;
my @all_kmer;

#Extract all kmers of given size
while($i <= $len_seq - $len_pat){
    my $substr = substr $_, $i, $len_pat;
    push @all_kmer, $substr;

    $i++
}


#Initialize hash of kmer counts    
my %counts;
my $j = 0;
while ($all_kmer[$j]){

foreach (@all_kmer){
        push( @{$counts{$all_kmer[$j]}},$_  ) if strDiffMaxDelta ($all_kmer[$j], $_, 1);

        }


    $j++;

    #Skip if key already exists #Need Proper Code for this
    if(exists $counts{$all_kmer[$j]} ){
        say "Key already exists for $all_kmer[$j]";
        $j=$j+3; # Need corrections / a way to skip to next element in @all_kmer
    }
 }

#Count fuzzy matches for each key in hash
my @fuzzy_kmer_counts;                                                                    
foreach my $key (keys %counts){
     push @fuzzy_kmer_counts, scalar @{$counts{$key}};  

}

#Determine maximum number of fuzzy matches
my $max = max(@fuzzy_kmer_counts);

#Print key(s) which have maximum number of fuzzy matches
foreach my $key (keys %counts){
     say "Key: $key\t" if scalar @{$counts{$key}} == $max;
     print OUT $key,"\t" if scalar @{$counts{$key}} == $max;
}


###########################
#Subroutine
###########################
sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}
perl • 4.2k views
ADD COMMENT
1
Entering edit mode

Just to be clear, is the expected output from your script the "GATG ATGC ATGT" above, or something different?

ADD REPLY
0
Entering edit mode

Hi Kenosis, thanks for your comment. The expected output is indeed "GATG ATGC ATGT" for the given sample input of "ACGTTGCATGTCGCATGATGCATGAGAGCT ", Each of these kmers,with 1 mismatch. has 5 fuzzy matches. Other kmers have fewer fuzzy matches.

ADD REPLY
0
Entering edit mode

Hi Kenosis, Michael's code snippet solves the logic where I was stuck. I had to check the existence of the key before assigning the values in arrays.

ADD REPLY
1
Entering edit mode
10.5 years ago
Michael 54k

Hm, maybe something like this, replacing your two nested loops?

 I: foreach my $i (@all_kmer) {
         next I if exists $counts{$i};
    J:  foreach my $j (@all_kmer) {
                    push( @{$counts{$i}, $j  ) if strDiffMaxDelta ($i, $j, 1);

         }
  }
ADD COMMENT
0
Entering edit mode

Hi Michael! Thanks for your input! It works perfectly! Just one small query, would it be possible to explain what the I: and J: mean and would the loops falter without them?

ADD REPLY
1
Entering edit mode

I: and J: are labels, they allow to e.g. jump out of the inner loop with next I; or out of any loop level. The construct works fine without, using just next, but I have started to label all my loops with sensible names for annotation, like in next LINE if /^#/.

ADD REPLY
1
Entering edit mode

I just noticed that the reason for the duplication is that you compute strDiffMaxDelta for all A:B and also for B:A. Then a solution as outlined here A: Comparing alignments in an iterative process with Perl is more efficient because it avoids the symmetric case.

ADD REPLY

Login before adding your answer.

Traffic: 1522 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6