Compare two columns in several different files with Perl or Python
6.7 years ago

Hi All

I am a young scientist. I am trying to compare several tab-delimited files, each of which contains two columns (the first column is a list of IDs and the second is a list of numeric values assigned to those IDs), to find the entries that match across them. I need to return only the IDs that are present in all files, along with their values, where the values may differ by up to 20 in either direction (less or more).

Example:

       file 1                      file 2                     file 3

AYJT01000009.1  6703      AYJT01000009.1    6703        AYJT01000009.1  6713
AYJT01000020.1  3082      AYJT01000020.1    3082        AYJT01000020.1  3082
AYJT01000020.1  10479     AYJT01000045.1    4861        AYJT01000114.1  4191
AYJT01000045.1  4861      AYJT01000120.1    1003        AYJT01000118.1  2213
AYJT01000118.1  2209      AYJT01000123.1    3453        AYJT01000120.1  1003
AYJT01000120.1  1003      AYJT01000123.1    3453        AYJT01000123.1  1039       
AYJT01000123.1  3453      AYJT01000127.1    4084        AYJT01000123.1  3453
AYJT01000127.1  4405      AYJT01000146.1    121         AYJT01000127.1  4084
AYJT01000305.1  7736      AYJT01000305.1    7736        AYJT01000146.1  209
AYJT01000372.1  8646      AYJT01000372.1    8638        AYJT01000305.1  7736

output

     file 1                    file 2                 file 3 
AYJT01000009.1  6703    AYJT01000009.1  6703    AYJT01000009.1  6713
AYJT01000020.1  3082    AYJT01000020.1  3082    AYJT01000020.1  3082
AYJT01000120.1  1003    AYJT01000120.1  1003    AYJT01000120.1  1003
AYJT01000123.1  3453    AYJT01000123.1  3453    AYJT01000123.1  3453
AYJT01000305.1  7736    AYJT01000305.1  7736    AYJT01000305.1  7736
  • The value of AYJT01000009.1 is shifted up by 10 in file 3

I would greatly appreciate it if anyone could write me a script in Perl or Python. I look forward to your comments.

Thanks

perl python • 8.3k views

Which file has the value to compare the others to +/- 20?


All files contain IDs and values. Each file contains more than 10 thousand lines.


This question is so hard to understand and we already have two answers. Perhaps I am in the minority ...


While we're nitpicking, I don't know what a verdant scientist is. Please fix this oversight immediately.

6.7 years ago

Code (it can also handle more than three files):

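import itertools

files = ['file1.txt', 'file2.txt', 'file3.txt']
d = {}  # id -> {file index -> list of values seen for that id in that file}

for fi, f in enumerate(files):
    with open(f) as fh:
        for line in fh:
            sl = line.split()
            if len(sl) < 2:
                continue  # skip blank or malformed lines
            name = sl[0]
            val = int(sl[1])
            d.setdefault(name, {}).setdefault(fi, []).append(val)

for name, vals in d.items():
    # keep only ids that occur in every file
    if len(vals) == len(files):
        # try every combination of one value per file, in file order,
        # and print the first one whose values lie within 20 of each other
        for var in itertools.product(*(vals[fi] for fi in range(len(files)))):
            if max(var) - min(var) <= 20:
                out = '{}\t{}'.format(name, "\t".join(map(str, var)))
                print(out)
                break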

Output:

AYJT01000123.1  3453    3453    3453
AYJT01000305.1  7736    7736    7736
AYJT01000009.1  6703    6703    6713
AYJT01000020.1  3082    3082    3082
AYJT01000120.1  1003    1003    1003
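
For anyone who wants to check the matching rule without creating input files, here is a minimal sketch of the same idea as a function working on in-memory lists. The function name `first_match` and the `tol` parameter are my own choices for illustration, not part of the script above.

```python
import itertools

def first_match(values_per_file, tol=20):
    """Return the first combination (one value per file) whose spread
    max - min is at most tol, or None if no combination qualifies.

    first_match and tol are hypothetical names used for illustration.
    """
    for combo in itertools.product(*values_per_file):
        if max(combo) - min(combo) <= tol:
            return combo
    return None

# AYJT01000123.1 from the example: once in file 1, twice in file 2 and file 3
print(first_match([[3453], [3453, 3453], [1039, 3453]]))
# AYJT01000127.1: 4405 in file 1 is too far from 4084 in the other two files
print(first_match([[4405], [4084], [4084]]))
```

The first call should find the 3453 combination even though 1039 is tried first for file 3, and the second call should find nothing, which is why AYJT01000127.1 is missing from the output.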
6.7 years ago
st.ph.n ★ 2.7k

From what I understood from the original post, this should work (probably a long-handed version). Change *.txt to whatever you need to, to grab each file.

#!/usr/bin/env python
import glob
from collections import defaultdict

# id -> list of values collected across all files
ids = defaultdict(list)

def open_file(file):
    with open(file, 'r') as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            ids[fields[0]].append(int(fields[1]))

nfiles = 0
for file in sorted(glob.glob('*.txt')):
    nfiles += 1
    open_file(file)

for i in ids:
    # Assumes each id occurs at most once per file, so exactly nfiles
    # values means the id is present in every file.
    if len(ids[i]) == nfiles:
        # identical values give a difference of 0, so one test covers both cases
        if max(ids[i]) - min(ids[i]) <= 20:
            print('{}\t{}'.format(i, '\t'.join(map(str, ids[i]))))
6.7 years ago

If you're using Python, you could use a set to build a list of keys common to all inputs, then do a second pass over inputs to write filtered outputs. Something like the following (untested) script:

#!/usr/bin/env python

in_fns = ['in1.txt', 'in2.txt', 'in3.txt']
ks = None
# first pass: intersect the key sets of all input files
for in_fn in in_fns:
    file_ks = set()
    with open(in_fn, 'r') as in_handle:
        for line in in_handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            (k, v) = line.split()
            file_ks.add(k)
    ks = file_ks if ks is None else ks & file_ks
out_fns = ['out1.txt', 'out2.txt', 'out3.txt']
# second pass: write the lines whose key is common to all inputs
# (values are not compared here, so the +/-20 tolerance is not applied)
for in_fn, out_fn in zip(in_fns, out_fns):
    with open(in_fn, 'r') as in_handle, open(out_fn, 'w') as out_handle:
        for line in in_handle:
            line = line.strip()
            if not line:
                continue
            (k, v) = line.split()
            if k in ks:
                out_handle.write("%s\n" % (line))

This could be done with a command-line approach, but you asked for Python or Perl, so this is one way to do it.
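
A minimal sketch of the key-intersection step on its own, assuming whitespace-separated two-column lines. `common_keys` is a hypothetical helper name, and it takes lists of lines rather than file names so it can be tried without any input files. It only compares IDs; values are ignored.

```python
def common_keys(line_groups):
    """Return the set of first-column keys present in every group of lines.

    common_keys is a hypothetical helper used for illustration.
    """
    key_sets = []
    for lines in line_groups:
        keys = set()
        for line in lines:
            fields = line.split()
            if fields:  # ignore blank lines
                keys.add(fields[0])
        key_sets.append(keys)
    if not key_sets:
        return set()
    return set.intersection(*key_sets)

# A few rows taken from the example in the question
f1 = ["AYJT01000009.1\t6703", "AYJT01000045.1\t4861"]
f2 = ["AYJT01000009.1\t6703", "AYJT01000045.1\t4861"]
f3 = ["AYJT01000009.1\t6713", "AYJT01000114.1\t4191"]
print(sorted(common_keys([f1, f2, f3])))
```

With these rows, only AYJT01000009.1 is in all three groups, since AYJT01000045.1 is missing from the third.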


I am sorry for the confusion. Let me add more explanation:

I have 3 different text files, each containing two columns. As you can see in the example, some IDs have the same value in file 1, file 2 and file 3 (like AYJT01000020.1 3082, the second row), while other IDs are present in all files but with different values (like AYJT01000009.1, whose value is 6703 in file 1 and file 2 but 6713 in file 3).

The output should have 6 columns (two columns for each file), containing the IDs shared by all files whose values differ by at most 20 (6713 - 6703 = 10, which is less than 20 for AYJT01000009.1). In the example output above, there are 5 IDs that are present in all files with similar values.
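
To make the six-column layout concrete, here is a small sketch that formats rows once one matched value per file is already known for each ID. The name `six_column_rows` and the `matches` structure are assumptions for illustration; the matching itself is not done here.

```python
def six_column_rows(matches):
    """Format (id, values) pairs as tab-separated rows that repeat the id
    next to each file's value, giving two columns per file.

    six_column_rows and the matches structure are hypothetical.
    """
    rows = []
    for name, values in matches:
        rows.append("\t".join("{}\t{}".format(name, v) for v in values))
    return rows

# One matched value per file, taken from the example in the question
matches = [
    ("AYJT01000009.1", (6703, 6703, 6713)),  # 6713 - 6703 = 10, within 20
    ("AYJT01000020.1", (3082, 3082, 3082)),
]
for row in six_column_rows(matches):
    print(row)
```

With three files each row has six tab-separated fields; with more files the same function simply produces two more fields per file.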


AYJT01000020.1 is present twice in file 1, with values of 3082 and 10479. Can a given ID be present multiple times in one file?


Yes, it can. Another one is AYJT01000123.1 (twice in file 3).


Okay, please see my answer.


Wow, it works nicely. Thanks so much. I really appreciate the time you invested in writing this code!
