Is there any way I can match peptides differing by only one letter in two different TSV files? The original peptides in File 1 each contain a point mutation represented by a lowercase letter. File 2 contains mutated versions of the same peptides, in which the residue at the position of the lowercase letter has been changed to each of the remaining 19 amino acids.
If there is a match I would also like to divide the binding affinity of the original in File 1, column 4 by the binding affinity of the mutations in File 2, column 4.
File 1:
TCGA-A3-3316-0 HLA-A0201 SLVFLPFnT 1.90<=WB
TCGA-A3-3316-0 HLA-A0201 KLLLAtKSL 1.70<=WB
TCGA-A3-3316-0 HLA-A0201 sIWKHATPV 0.30<=SB
TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.90<=WB
TCGA-A3-3316-0 HLA-A0201 MtYDRYVAI 1.10<=WB
File 2:
TCGA-A3-3316-0 HLA-A0201 KVTSIQAWV 2
TCGA-A3-3316-0 HLA-A0201 KVTSIQCWV 2.5
TCGA-A3-3316-0 HLA-A0201 KVTSIQDWV 4.5
TCGA-A3-3316-0 HLA-A0201 KVTSIQEWV 3.5
TCGA-A3-3316-0 HLA-A0201 KVTSIQFWV 0.4
TCGA-A3-3316-0 HLA-A0201 KVTSIQGWV 7.5
TCGA-A3-3316-0 HLA-A0201 KVTSIQIWV 1.2
TCGA-A3-3316-0 HLA-A0201 KVTSIQKWV 9
TCGA-A3-3316-0 HLA-A0201 KVTSIQLWV 1.4
TCGA-A3-3316-0 HLA-A0201 KVTSIQMWV 1.1
TCGA-A3-3316-0 HLA-A0201 KVTSIQNWV 3.5
TCGA-A3-3316-0 HLA-A0201 KVTSIQPWV 1.5
TCGA-A3-3316-0 HLA-A0201 KVTSIQQWV 3.5
TCGA-A3-3316-0 HLA-A0201 KVTSIQRWV 7.5
TCGA-A3-3316-0 HLA-A0201 KVTSIQSWV 3.5
TCGA-A3-3316-0 HLA-A0201 KVTSIQTWV 2.5
TCGA-A3-3316-0 HLA-A0201 KVTSIQVWV 1.6
You're going to need a custom script for it. For each entry in file1, you can replace the lowercase character with a `.` (the regex wildcard for any single character), then use that as a word-bounded regex to get the matching lines from your file2.
I'm not giving you code, as this is an excellent opportunity to try your hand at some text processing. I'd recommend Python because it's easy to write this kind of script in.
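If you get stuck, the approach above can be sketched roughly as follows. This is a minimal sketch, not a finished tool, and it rests on a few assumptions: fields are separated by whitespace (tabs included), the peptide is in column 3, and the affinity in column 4 may carry a suffix such as `<=WB` that has to be stripped before dividing. The function names (`leading_number`, `mutation_pattern`, `affinity_ratios`) are made up for illustration. It uses `fullmatch` on the peptide field instead of a word-bounded search over the whole line, which should be equivalent here.

```python
import re


def leading_number(field):
    """Return the leading numeric part of a field, e.g. '1.90<=WB' -> 1.9."""
    match = re.match(r"[0-9]*\.?[0-9]+", field)
    if match is None:
        raise ValueError(f"no number at start of {field!r}")
    return float(match.group())


def mutation_pattern(peptide):
    """Compile a regex where each lowercase letter becomes '.' (any residue)."""
    parts = ["." if ch.islower() else re.escape(ch) for ch in peptide]
    return re.compile("".join(parts))


def affinity_ratios(file1_lines, file2_lines):
    """Return (original, variant, ratio) tuples for every matching pair.

    ratio = affinity of the original (file 1) / affinity of the variant (file 2).
    Assumes whitespace-separated fields with the peptide in column 3 and the
    affinity in column 4; lines with fewer than 4 fields are skipped.
    """
    variants = []
    for line in file2_lines:
        fields = line.split()
        if len(fields) >= 4:
            variants.append((fields[2], leading_number(fields[3])))

    results = []
    for line in file1_lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        peptide = fields[2]
        original_affinity = leading_number(fields[3])
        pattern = mutation_pattern(peptide)
        for variant, variant_affinity in variants:
            # fullmatch ensures the whole peptide matches, not just a substring
            if pattern.fullmatch(variant) and variant_affinity != 0:
                results.append((peptide, variant, original_affinity / variant_affinity))
    return results


if __name__ == "__main__":
    file1 = ["TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.90<=WB"]
    file2 = [
        "TCGA-A3-3316-0 HLA-A0201 KVTSIQAWV 2",
        "TCGA-A3-3316-0 HLA-A0201 KVTSIQFWV 0.4",
    ]
    for original, variant, ratio in affinity_ratios(file1, file2):
        print(original, variant, ratio, sep="\t")
```

To run it on real files, pass open file handles instead of lists, for example `affinity_ratios(open("file1.tsv"), open("file2.tsv"))` (file names hypothetical). The sketch compares peptides only; if the sample ID and HLA allele in columns 1 and 2 must also agree, compare those fields before accepting a match.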