Biostar Beta. Not for public use.
Question: Perl :Fuzzy Matching Problem
1
Entering edit mode

Hello all, I am looking to implement fuzzy matching in Perl using the module

use String::Approx

I need 3 mismatches. However, I am unable to implement the modifiers appropriately to get the desired outcome.

Input string :CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
Pattern ATTCTGGA
Output 6 7 26 27

So basically, I need the locations where the pattern matches input string approximately ( 3 mismatches)

My Code is as under:

use 5.014;
use warnings;
use String::Approx qw/amatch aindex/;


$_ = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat = "ATTCTGGA";
my $len_in = length $_;
my $len_pat = length $pat;


my $i = 0;
my @pattern;


while($i <= $len_in-$len_pat){ 
    my $substr = substr $_, $i, $len_pat;#Extract k-mer of length = $len_pat from position $i
    my $posn = index  $_,$substr, $i;#Extract start position of k-mer

    push @pattern, $substr if amatch ("$pat", ["S3"], $substr); #Push into @pattern if the extracted k-mer matches $pat by 3 substitutions 
    push @posn, $posn if amatch ("$pat", ["S3"], $substr); #Push into @posn the position, if the extracted k-mer matches $pat by 3 substitutions 

    $i++
}

say "@pattern";
say "@posn";

I would be grateful if it could be pointed out where I am going wrong. Right now, I get no output at all.

ADD COMMENTlink 6.3 years ago Neal • 40 • updated 8 months ago Biostar 20
6
Entering edit mode

You can calculate the "3 mismatches" between strings using Perl's exclusive-or operator (^) between those two strings, as it returns \x00 for each matching pair of characters, and a different value for non-matching characters. Given this, consider the following solution:

use strict;
use warnings;
use 5.014;

my $seq = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat     = "ATTCTGGA";
my $len_in  = length $seq;
my $len_pat = length $pat;

my %posn_pattern;

for my $i ( 0 .. $len_in - $len_pat ) {

    #Extract k-mer of length = $len_pat from position $i
    my $substr = substr $seq, $i, $len_pat;

    #Save position/pattern if the extracted k-mer matches $pat by 3 or fewer substitutions
    $posn_pattern{$i} = $substr if strDiffMaxDelta( $pat, $substr, 3 );
}

say "$_ => $posn_pattern{$_}" for sort { $a <=> $b } keys %posn_pattern;

sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}

Results:

6 => AATCCAGA
7 => ATCCAGAA
26 => ATTTCGGG
27 => TTTCGGGA

The strDiffMaxDelta subroutine counts the non-\x00 characters which indicates the number of character differences between the two sent strings, and then returns the result of the comparison of that number against the max delta (in this case 3). Of course, you can use $diffCount == $maxDelta if you want to return strings which have exactly three mismatches--in case I've misunderstood your desired results.

Hope this helps!

ADD COMMENTlink 6.3 years ago Kenosis ♦ 1.2k
Entering edit mode
0

Now that is a new one for me. Nice!

ADD REPLYlink 6.3 years ago
Chris Fields
♦ 2.1k
Entering edit mode
0

Many thanks Kenosis and exactly what was needed! This method is completely new for me! Would you be able to please explain this line of code my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;. Specifically what does () perform? I understand ^ is comparing $s1 and $s2. Also, it would be great if a resource explaining \x00 could be pointed out. I really would love to understand the logic. It's a brilliant code!

ADD REPLYlink 6.3 years ago
Neal
• 40
Entering edit mode
1

You're most welcome, nilan666! Am glad it worked for you.

Here's an explaination of that line:

my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
                 ^         ^            ^      ^
                 |         |            |      |
                 |         |            |      + - Do so globally
                 |         |            + - Match the XOR results for non-\x00 which indicates a character-pair difference
                 |         + - XOR the two strings against each other
                 + - 'Goatse' operator "=()="
  • \x00 is the hexidecimal value for the NULL character, and is the XOR operation's result when the two strings' characters are identical.
  • Goatse operator "=()=": right expression's evaluated in list context, w/results assigned to empty array "()="; that array is evaluated in scalar context "=()" w/results assigned to $diffCount
ADD REPLYlink 6.3 years ago
Kenosis
♦ 1.2k
Entering edit mode
1

Erm, Goatse operator.... wow. DO NOT GOOGLE THIS TERM WITHOUT THE 'OPERATOR' TERM ON THE END. Images would be deemed NSFL (Not Safe For Life)

ADD REPLYlink 6.3 years ago
Daniel
♦ 3.7k
Entering edit mode
0

But this is an awesome bit of code, thank you.

ADD REPLYlink 6.3 years ago
Daniel
♦ 3.7k
Entering edit mode
0

So if I understand correctly, using ^ , if $s1 and $s2 are identical \x00 would be true and $diffCount would be 0; conversely when $s1 and $s2 are different, ^\x00 stores the the character difference which gets assigned to $diffcount. This is awesome stuff!

ADD REPLYlink 6.3 years ago
Neal
• 40
Entering edit mode
1

Hi nilan666.

Perl's XOR on strings, e.g., $s1 ^ $s2, returns \x00 for each identical character-pair at position n in those two strings. So, 'Perl' ^ 'Perl' would return '\x00\x00\x00\x00', since those two strings are identical. The regex /[^\x00]/g globally matches only those characters in the XOR results which are not \x00. The result of globally matching the non-\x00 characters in the XOR results is assigned to $diffcount above, as it indicates the number of different characters when the two strings are compared character-by-character.

ADD REPLYlink 6.3 years ago
Kenosis
♦ 1.2k
Entering edit mode
0

Ah now I understand it much more clearly!! I reiterate that this code is sheer brilliance! By the way, I have used your subroutine in another code, but I am stuck on a particular logical challenge of that code. I would be grateful for your time if you could have a look here :Perl: Ignoring repetitive k-mers in hash construction

ADD REPLYlink 6.3 years ago
Neal
• 40

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0