Question

Matching peptides which differ by one letter?

0

Entering edit mode

6.0 years ago

reui879 ▴ 10

Is there any I can match peptides differing by only one letter in two different TSV files? My original peptides in file 1 contains a point mutation represented by a lowercase letter. File 2 contains mutated versions of the same peptides, but changed to the remaining 19 amino acids in the same location as the lowercase letter.

If there is a match I would also like to divide the binding affinity of the original in File 1, column 4 by the binding affinity of the mutations in File 2, column 4.

File 1:

TCGA-A3-3316-0  HLA-A0201   SLVFLPFnT   1.90<=WB
TCGA-A3-3316-0  HLA-A0201   KLLLAtKSL   1.70<=WB    
TCGA-A3-3316-0  HLA-A0201   sIWKHATPV   0.30<=SB    
TCGA-A3-3316-0  HLA-A0201   KVTSIQhWV   1.90<=WB
TCGA-A3-3316-0  HLA-A0201   MtYDRYVAI   1.10<=WB

File 2:

TCGA-A3-3316-0  HLA-A0201   KVTSIQAWV   2
TCGA-A3-3316-0  HLA-A0201   KVTSIQCWV   2.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQDWV   4.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQEWV   3.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQFWV   0.4
TCGA-A3-3316-0  HLA-A0201   KVTSIQGWV   7.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQIWV   1.2
TCGA-A3-3316-0  HLA-A0201   KVTSIQKWV   9
TCGA-A3-3316-0  HLA-A0201   KVTSIQLWV   1.4
TCGA-A3-3316-0  HLA-A0201   KVTSIQMWV   1.1
TCGA-A3-3316-0  HLA-A0201   KVTSIQNWV   3.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQPWV   1.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQQWV   3.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQRWV   7.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQSWV   3.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQTWV   2.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQVWV   1.6

script programming • 1.6k views

ADD COMMENT • link updated 6.0 years ago by cpad0112 21k • written 6.0 years ago by reui879 ▴ 10

0

Entering edit mode

You're going to need a custom script for it. For each entry in file1, you can replace the lower case character by a ., then use that as a word-bounded regex to get matching lines from your file2.

I'm not giving you code as this is an excellent opportunity to try your hand at some text processing code. I'd recommend Python because it's easier to code in.

ADD REPLY • link 6.0 years ago by Ram 43k

score 1 · Answer 1 · 2018-04-20

#!/usr/bin/env python

import sys
import re

f1 = 'file1.txt'
f2 = 'file2.txt'

aa_table = [ "F", "L", "I", "M", "V", "S", "P", "T", "A", "Y", "H", "Q", "N", "K", "D", "E", "C", "W", "R", "G" ]

patterns = []
affinities = {}

with open(f1, "r") as f1h:
    for f1l in f1h:
        (tcga, hla, res, aff) = f1l.rstrip().split('\t')
        idx = int([i for i, c in enumerate(res) if c.islower()][0])
        aa_table_filtered = [x for x in aa_table if x != res[idx].upper()]
        pat = res[:idx] + '[' + '|'.join(aa_table_filtered) + ']' + res[(idx+1):]
        aff_head, aff_sep, aff_tail = aff.partition('<=')
        patterns.append(pat)
        affinities[pat] = float(aff_head)

with open(f2, "r") as f2h:
    for f2l in f2h:
        (tcga, hla, res, aff) = f2l.rstrip().split('\t')
        aff = float(aff)
        for pattern in patterns:
            if re.search(r"%s" % (pattern), res):
                sys.stdout.write("%s\n" % '\t'.join([tcga, hla, res, str(affinities[pattern]/aff)]))
                break

Result:

$ ./findPointMutations.py
TCGA-A3-3316-0  HLA-A0201       KVTSIQAWV       0.95
TCGA-A3-3316-0  HLA-A0201       KVTSIQCWV       0.76
TCGA-A3-3316-0  HLA-A0201       KVTSIQDWV       0.4222222222222222
TCGA-A3-3316-0  HLA-A0201       KVTSIQEWV       0.5428571428571428
TCGA-A3-3316-0  HLA-A0201       KVTSIQFWV       4.749999999999999
TCGA-A3-3316-0  HLA-A0201       KVTSIQGWV       0.2533333333333333
TCGA-A3-3316-0  HLA-A0201       KVTSIQIWV       1.5833333333333333
TCGA-A3-3316-0  HLA-A0201       KVTSIQKWV       0.2111111111111111
TCGA-A3-3316-0  HLA-A0201       KVTSIQLWV       1.3571428571428572
TCGA-A3-3316-0  HLA-A0201       KVTSIQMWV       1.727272727272727
TCGA-A3-3316-0  HLA-A0201       KVTSIQNWV       0.5428571428571428
TCGA-A3-3316-0  HLA-A0201       KVTSIQPWV       1.2666666666666666
TCGA-A3-3316-0  HLA-A0201       KVTSIQQWV       0.5428571428571428
TCGA-A3-3316-0  HLA-A0201       KVTSIQRWV       0.2533333333333333
TCGA-A3-3316-0  HLA-A0201       KVTSIQSWV       0.5428571428571428
TCGA-A3-3316-0  HLA-A0201       KVTSIQTWV       0.76
TCGA-A3-3316-0  HLA-A0201       KVTSIQVWV       1.1874999999999998

score 1 · Answer 2 · 2018-04-20

For a problem like this one, it is unfathomable there is a python answer and not a perl answer. So here it is:

#!/usr/bin/env perl

use warnings;
use strict;

my $file1 = shift @ARGV;
my $file2 = shift @ARGV;
my %hash = ();

open(IN1, "<", $file1) or die "Can't open input file. Please provide a valid filename.\n";
while (<IN1>) {
    s/\R//;
    my @line = split( "\t", $_ );
    my ($affinity, @trash) = split( "<=", $line[3] );
    $hash{$line[2]} = $affinity;
}
close(IN1);

open(IN2, "<", $file2) or die "Can't open input file. Please provide a valid filename.\n";
while (<IN2>) {
    s/\R//;
    my @line = split( "\t", $_ );
    foreach my $key (keys %hash) {
        if ( hd( $key, $line[2] ) == 1 ){
            my $new_aff = $hash{$key} / $line[3];
            print "$line[2]\t$hash{$key}\t$line[3]\t$new_aff\n";
        }
    }
}
close(IN2);

# http://www.perlmonks.org/?node_id=500235
sub hd{ length( $_[ 0 ] ) - ( ( $_[ 0 ] ^ $_[ 1 ] ) =~ tr[\0][\0] ) }

Save it as pep_affinity.pl, and call it with ./pep_affinity.pl file1 file2.

One assumption of the script is the the only peptides with a one amino-acid difference between file1 and file2 are the mutated ones - it doesn't specifically checks for the lower-case letters. It is pretty inefficient, as for every peptide in file2, every peptide in file1 is looped through to check for the 1 amino-acid difference.

score 0 · Answer 3 · 2018-04-21

in R:

df1= read.csv("test1.txt", sep = "\t", stringsAsFactors = F, header = F)
df2= read.csv("test2.txt", sep = "\t", stringsAsFactors = F, header = F)

library(stringr)

df1$V6=as.numeric(str_split_fixed(df1$V4,"<",3)[,1])

library(fuzzyjoin)
df3=stringdist_inner_join(df1, df2, by=c("V3"="V3"), max_dist=1,ignore_case=T)
df4=df3[,c(1:3,6,10)]
df4$V7=df4$V6/df4$V4.y

output:

             V1.x      V2.x      V3.x  V6 V4.y        V7
1  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  2.0 0.9500000
2  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  2.5 0.7600000
3  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  4.5 0.4222222
4  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  3.5 0.5428571
5  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  0.4 4.7500000
6  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  7.5 0.2533333
7  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  1.2 1.5833333
8  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  9.0 0.2111111
9  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  1.4 1.3571429
10 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  1.1 1.7272727
11 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  3.5 0.5428571
12 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  1.5 1.2666667
13 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  3.5 0.5428571
14 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  7.5 0.2533333
15 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  3.5 0.5428571
16 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  2.5 0.7600000
17 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  1.6 1.1875000