Biostars beta testing.
Question: intersection for several files

Dear all,

I have 6 files, each with a single column (gene ID), and I want to extract the intersection of those 6 files. Which command should I use? I know the command below works for 2 files, but not for 6:

cat file1 file2 | sort | uniq -d > intersection.out

Thanks and have a great day!

9 months ago mxlsherry1992 • 10 • updated 9 months ago manuel.belmadani • 830

what about:

cat file1 file2 file3 file4 file5 file6 | sort | uniq -c | perl -lae 'print "$F[1]" if ($F[0] == 6)' > intersection.out

9 months ago JC • 7.9k

You pretty much had it, cat file* | sort | uniq -d > intersection.out will work for all files starting with file in the current directory.

If you need to find multiple files in different subdirectories, replace the cat command with find /path/to/files -name "file*" -exec cat {} \;.

9 months ago manuel.belmadani • 830

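If by intersection the OP means that a line has to be present in every file, then the above does not work: uniq -d prints every line that occurs at least twice, not only the lines that occur in all files.

However, this does (it is really the same as JC's answer):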

cat file* | sort | uniq -c | awk '{if($1==6){print $2}}'
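
To see the difference on a small case, here is a minimal sketch with three toy files (the names a.txt, b.txt, c.txt and the gene IDs are made up for illustration and live in a temporary directory). geneB occurs in only two of the three files, so uniq -d reports it while the count-based filter does not:

```shell
#!/bin/bash
# Toy data: hypothetical file names and gene IDs, for illustration only.
tmpdir=$(mktemp -d)
printf 'geneA\ngeneB\n' > "$tmpdir/a.txt"
printf 'geneA\ngeneB\n' > "$tmpdir/b.txt"
printf 'geneA\ngeneC\n' > "$tmpdir/c.txt"

# Lines seen at least twice across all files (not a true intersection).
at_least_twice=$(cat "$tmpdir"/*.txt | sort | uniq -d)

# Lines seen exactly 3 times, i.e. once in each of the 3 files
# (this assumes no file lists the same ID twice).
in_all_three=$(cat "$tmpdir"/*.txt | sort | uniq -c | awk '$1 == 3 {print $2}')

echo "uniq -d gives:"
printf '%s\n' "$at_least_twice"
echo "count == 3 gives:"
printf '%s\n' "$in_all_three"

rm -r "$tmpdir"
```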
9 months ago 5heikki • 8.4k

It worked! Thank you

9 months ago mxlsherry1992 • 10

One caveat: if you have duplicates in your input files (i.e. if one file can list the same gene more than once), then you need to run something like sort | uniq on each file before combining them; otherwise you'll get false positives/negatives by looking for exactly 6 matches.

find /path/to/files -name "file*" -type f -exec bash -c 'sort "$1" | uniq' _ {} \; | sort | uniq -c | awk '{if($1==6){print $2}}'

Also, the awk part explicitly looks for 6 matches, so this won't work for an arbitrary number of files, only when there are exactly 6. Not a big deal if this is just a one-off use though.
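
Both caveats can also be handled in a single pass with awk alone. The following is a sketch of an alternative, not the command posted above: it counts each line at most once per file and compares that count with the number of files given on the command line. The function name intersect_all is made up for this example, and it assumes that at least one file is given and that no file is listed twice.

```shell
#!/bin/bash
# intersect_all (hypothetical helper): print the lines present in every file given.
# A line is counted at most once per file, so duplicates within a file
# do not inflate the count. Assumes each file is listed only once.
intersect_all() {
    awk '
        # First time this line is seen in the current file: count the file.
        !seen[FILENAME, $0]++ { count[$0]++ }
        END {
            for (line in count)
                if (count[line] == ARGC - 1)   # ARGC - 1 = number of input files
                    print line
        }
    ' "$@" | sort
}
```

It could be called as intersect_all file* > intersection.out. Unlike the uniq -c pipeline it compares whole lines rather than the first field, which is equivalent for single-column files.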

8 months ago manuel.belmadani • 830

Thanks for your kind suggestions! I checked the 6 files just now and they don't contain duplicates, so does that mean the result of "cat file* | sort | uniq -c | awk '{if($1==6){print $2}}'" is reliable? :) I also have a small question: if we want to use 4 files, for example, can we simply change awk '{if($1==6){print $2}}' to awk '{if($1==4){print $2}}'?

Thanks and have a great day!!

8 months ago mxlsherry1992 • 10

Yes, it should be good then, as far as I can see.

And yes, changing 6 to 4 would work if you have 4 files. If you think you'll have to do this again in the future, you might want to consider writing a short bash script. Something like this works for an arbitrary number of matching files:

#!/bin/bash
set -eu

## Generate test data with:
# yes | head -n100  | xargs -I@  bash -c 'echo $RANDOM'  | grep -o . | head -n100 | split -l25 - file 

PREFIX=$1 # For example, files starting with "file"

NFILES=$(find . -name "${PREFIX}*" -type f | wc -l) # Get the number of files

# De-duplicate each file on its own, then count in how many files each line occurs
find . -name "${PREFIX}*" -type f -exec bash -c 'sort "$1" | uniq' _ {} \; \
    | sort \
    | uniq -c \
    | awk -v nfiles="$NFILES" '{if($1==nfiles){print $2}}'

That way you can add your own sanity checks to make sure that things like duplicates, the number of files, etc. are all as expected, and it limits the risk of mistyping a command.
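
As one possible shape for such sanity checks, here is a minimal sketch. The helper names has_duplicates and check_inputs are made up for illustration; they only warn about duplicate lines and refuse an empty file list, and real checks would depend on your data.

```shell
#!/bin/bash
# has_duplicates (hypothetical helper): succeed if any line occurs
# more than once in the given file.
has_duplicates() {
    [ -n "$(sort "$1" | uniq -d)" ]
}

# check_inputs (hypothetical helper): fail if no files were given,
# and warn on stderr about every file that contains duplicate lines.
check_inputs() {
    if [ "$#" -eq 0 ]; then
        echo "error: no input files" >&2
        return 1
    fi
    local f
    for f in "$@"; do
        if has_duplicates "$f"; then
            echo "warning: $f contains duplicate lines" >&2
        fi
    done
}
```

A script could then run something like check_inputs "${PREFIX}"* || exit 1 before the counting pipeline.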

8 months ago manuel.belmadani • 830
