Question

Regex motif question - 2 or more residues out of XXXX are D/E ?

9.8 years ago

Steve Barratt ▴ 30

I would like to use regular expressions to identify a motif in an amino acid sequence. Part of the motif is described as '2 or more out of XXXX are D or E'. Is there a way to specify this part directly with a regular expression, instead of writing out all the alternatives or using a more iterative approach?

I'm actually using this in the find box of my editor (Sublime Text), which accepts regex (I'm not sure which regex flavour it uses). Otherwise, Perl is where I would implement this.

Thanks!

edit: changed title slightly

edit: changed question to include or more.

motif regular-expressions perl • 2.8k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Steve Barratt ▴ 30

What makes you think a regular expression captures such a soft rule? There's not much regular about it. Regex are for phone numbers and email addresses. This could be solved quickly with a sweep procedure looking at all 4-mers along the sequence.
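A minimal sketch of such a sweep in Perl, assuming uppercase one-letter residue codes. The subroutine name `acidic_windows` and the test sequence are made up for illustration; the window width and minimum count are parameters so the same loop covers other "N out of M" rules.

```perl
use strict;
use warnings;

# Return the 0-based start positions of every window of length $width
# that contains at least $min residues that are D or E.
sub acidic_windows {
    my ($seq, $width, $min) = @_;
    my @hits;
    # The range is empty when the sequence is shorter than the window.
    for my $start (0 .. length($seq) - $width) {
        my $window = substr($seq, $start, $width);
        # tr/// with an empty replacement only counts the matching characters.
        my $count = ($window =~ tr/DE//);
        push @hits, $start if $count >= $min;
    }
    return @hits;
}

my $seq = 'XXDXEXXXXDD';    # made-up example sequence
for my $start (acidic_windows($seq, 4, 2)) {
    printf "%d\t%s\n", $start, substr($seq, $start, 4);
}
```

Because every start position is checked independently, overlapping windows are all reported.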

ADD REPLY • link 9.8 years ago by karl.stamm 4.1k

I agree: this problem (at least N out of M positions == X) can't be expressed compactly as a regular expression; the regex has to enumerate the possible cases. For example, for 2+ out of 4 == A, the six placements of two A's are enough, because . also matches A:

/AA..|A.A.|A..A|.AA.|.A.A|..AA/
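Writing the alternatives by hand gets error-prone as the window grows, so here is a rough sketch that generates them. The helper name `k_of_n_pattern` is hypothetical; it returns the alternation as a plain string so it can be pasted into an editor's find box or interpolated into a Perl regex. Only placements with exactly `$k` fixed positions are emitted, since `.` in the remaining positions already covers windows with more hits.

```perl
use strict;
use warnings;

# Build an alternation that matches a window of $n characters in which at
# least $k positions match $class (a character class such as '[DE]').
sub k_of_n_pattern {
    my ($class, $k, $n) = @_;
    my @alternatives;
    # Each bitmask selects a set of positions; bit 0 is the first character.
    for my $mask (0 .. 2**$n - 1) {
        my @bits = map { ($mask >> $_) & 1 } 0 .. $n - 1;
        my $set  = grep { $_ } @bits;    # number of selected positions
        next unless $set == $k;
        push @alternatives, join '', map { $_ ? $class : '.' } @bits;
    }
    return join '|', @alternatives;
}

my $pattern = k_of_n_pattern('[DE]', 2, 4);
print "$pattern\n";
for my $window (qw(DXEX XXDD XDXX)) {
    my $verdict = $window =~ /\A(?:$pattern)\z/ ? 'match' : 'no match';
    print "$window: $verdict\n";
}
```

The number of alternatives is "n choose k", which stays small for 2 out of 4 but grows quickly for longer windows; at that point a counting loop is likely the better tool.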

ADD REPLY • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by Michael 54k

Ram · Accepted Answer · 2014-06-28

9.8 years ago

Steve Barratt ▴ 30

I think I've figured it out now, using lookahead (?=pattern) to link two regular expressions like an AND:

~~(?=.?[DE]{1,4}.?[DE]{1,2}.?).{4}~~

The first part (the lookahead, in parentheses) requires that the text about to be matched by the second part contains at least two D or E residues, which may have other characters before, after or between them. The second part (after the parentheses) requires the match to be exactly four characters long.

EDIT: PLUS an alternate with two wildcard characters in the middle

(?=.?[DE]{1,4}.?[DE]{1,2}.?|[DE]..[DE]).{4}

I'm not sure how this would deal with overlapping motifs (I only came across regular expressions recently), but this is adequate for my needs for now.
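On the overlap question: a plain search consumes each four-character match, so a second window starting inside the first is not reported. One common idiom is to wrap the whole pattern in a capturing lookahead, which makes every match zero-width and lets the scan advance one character at a time. The sketch below uses that idiom with a slightly different lookahead, written here for illustration rather than taken from the post: it lists the pair placements by where the first D/E sits. As far as I can tell, the pattern above does not accept a window such as XXDD at a given start position, because neither alternative allows two leading non-acidic residues.

```perl
use strict;
use warnings;

# A pair of D/E residues inside 4 positions, grouped by the position of the
# first residue of the pair (0, 1 or 2), followed by the 4 characters themselves.
my $motif = qr/(?=[DE].{0,2}[DE]|.[DE].?[DE]|..[DE][DE]).{4}/;

# The outer lookahead is zero-width, so /g steps forward one character per
# match and overlapping windows are all captured.
sub overlapping_hits {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /(?=($motif))/g) {
        push @hits, [ $-[1], $1 ];    # [0-based start, matched 4-mer]
    }
    return @hits;
}

for my $hit (overlapping_hits('XXDXEXXXXDD')) {    # made-up example sequence
    print "$hit->[0]\t$hit->[1]\n";
}
```

The capturing-lookahead trick relies on Perl-style regex features; whether an editor's find box supports it depends on the regex engine it uses.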

ADD COMMENT • link 9.8 years ago by Steve Barratt ▴ 30

Unfortunately, this is incorrect. You can test your regex like so (it prints 1 for each input line that matches and 0 otherwise):

perl -ne ' @x = /(?=.?[DE]{1,4}.?[DE]{1,2}.?).{4}/; print scalar @x,"\n"; '

It doesn't work for patterns such as DXXD, DXXDX, XDXXDX, etc.

ADD REPLY • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by Michael 54k

Thanks! Very observant.

I've now added an inelegant alternate that mops those up :(

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Steve Barratt ▴ 30

Try Perl's transliteration operator:

use strict;
use warnings;

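# Note: this counts D/E over each whole line, not per 4-residue window.
while ( my $string = <DATA> ) {
    chomp $string;
    # tr/// with an empty replacement list counts matching characters
    # without modifying the string.
    my $count = $string =~ tr/deDE//;
    # Flag lines that contain two or more D/E residues.
    my $twoPlus = $count > 1 ? '*' : '';
    print "$string: $count$twoPlus\n";
}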

__DATA__
XXXXXXXXXX
DDXXXXXXDX
XXEXXXDXEX
XXDXXXXXXX
DEDEDEDEDE
DXXXXXXXXE

Output:

XXXXXXXXXX: 0
DDXXXXXXDX: 3*
XXEXXXDXEX: 3*
XXDXXXXXXX: 1
DEDEDEDEDE: 10*
DXXXXXXXXE: 2*

Hope this helps!

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Kenosis ★ 1.3k