Intersection of several files
4.9 years ago

Dear all,

I have 6 files, each with a single column (gene ID), and I want to extract the intersection of those 6 files. Which command should I use? I know the command below works for 2 files, but not for 6:

cat file1 file2 | sort | uniq -d > intersection.out

Thanks and have a great day!

RNA-Seq sequencing script • 2.8k views

What about:

cat file1 file2 file3 file4 file5 file6 | sort | uniq -c | perl -lae 'print "$F[1]" if ($F[0] == 6)' > intersection.out

4.9 years ago

You pretty much had it: cat file* | sort | uniq -d > intersection.out will work for all files in the current directory whose names start with file.

If you need to find multiple files in different subdirectories, replace the cat command with find /path/to/files -name "file*" -exec cat {} \;.


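If by intersection OP means that a line has to be present in every file, then the above does not work: uniq -d prints every line that occurs at least twice, so it also reports IDs found in only two of the files.

However (the same as JC's answer, really):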

cat file* | sort | uniq -c | awk '{if($1==6){print $2}}'
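
To make the difference between the two commands concrete, here is a small sketch on made-up data (three toy files in a scratch directory; the file names and gene IDs are invented for illustration, and the count is 3 instead of 6 to keep it short):

```shell
# Toy data in a scratch directory (file names and gene IDs are made up).
tmp=$(mktemp -d)
printf 'geneA\ngeneB\ngeneC\n' > "$tmp/file1"
printf 'geneA\ngeneB\n'        > "$tmp/file2"
printf 'geneA\ngeneD\n'        > "$tmp/file3"

# uniq -d: IDs seen at least twice, i.e. present in two or more files.
in_two_or_more=$(cat "$tmp"/file* | sort | uniq -d)

# uniq -c plus a count filter: IDs seen exactly 3 times, i.e. present in all 3 files
# (this assumes no ID is repeated within a single file).
in_all=$(cat "$tmp"/file* | sort | uniq -c | awk '$1 == 3 { print $2 }')

echo "uniq -d:"
echo "$in_two_or_more"
echo "count == 3:"
echo "$in_all"

rm -r "$tmp"
```

Here geneB is in only two of the three files, so uniq -d reports it while the count filter does not.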

It worked! Thank you


One caveat: if you have duplicates in your input files (i.e. if one file can list the same gene more than once), then you need to deduplicate each file (something like sort | uniq) before combining them. Otherwise you'll get false positives/negatives when looking for exactly 6 matches.

find /path/to/files -name "file*" -type f -exec bash -c 'sort -u "$1"' _ {} \; | sort | uniq -c | awk '{if($1==6){print $2}}'

Also, the awk part explicitly looks for 6 matches, so this works only when there are exactly 6 files, not for an arbitrary number of them. Not a big deal if this is just a one-off, though.
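
A rough illustration of the duplicate problem, again on invented toy data with 3 files rather than 6. In this sketch geneA is in all three files but listed twice in file1 (a false negative for the naive count), while geneB is in only two files but also listed twice in file1 (a false positive):

```shell
# Toy data (made-up names): file1 contains within-file duplicates.
tmp=$(mktemp -d)
printf 'geneA\ngeneA\ngeneB\ngeneB\n' > "$tmp/file1"
printf 'geneA\ngeneB\n'               > "$tmp/file2"
printf 'geneA\n'                      > "$tmp/file3"

# Naive count: geneA occurs 4 times (missed), geneB occurs 3 times (wrongly reported).
naive=$(cat "$tmp"/file* | sort | uniq -c | awk '$1 == 3 { print $2 }')

# Deduplicate each file first (sort -u), then count across files.
deduped=$(for f in "$tmp"/file*; do sort -u "$f"; done \
    | sort | uniq -c | awk '$1 == 3 { print $2 }')

echo "naive:   $naive"
echo "deduped: $deduped"

rm -r "$tmp"
```

Only the deduplicated version reports geneA, the one ID that really is in every file.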


Thanks for your kind suggestions! I just checked the 6 files and they don't contain duplicates, so does that mean the result of "cat file* | sort | uniq -c | awk '{if($1==6){print $2}}'" is reliable? :) I also have a small question: if we want to use 4 files, for example, can we simply change awk '{if($1==6){print $2}}' to awk '{if($1==4){print $2}}'?

Thanks and have a great day!!


Yes, that should be good then, as far as I can see.

And yes, changing 6 to 4 would work if you have 4 files. If you think you'll have to do this again in the future, you might want to consider writing a short bash script. Something like this works for an arbitrary number of matching files:

#!/bin/bash
set -eu

## Generate test data with:
# yes | head -n100  | xargs -I@  bash -c 'echo $RANDOM'  | grep -o . | head -n100 | split -l25 - file 

PREFIX=$1 # For example, files starting with "file"

NFILES=$(find . -name "${PREFIX}*" -type f | wc -l) # Get the number of files

find . -name "${PREFIX}*" -type f -exec bash -c 'sort -u "$1"' _ {} \; \
    | sort \
    | uniq -c \
    | awk -v nfiles="$NFILES" '{if($1==nfiles){print $2}}'

That way you can add your own sanity checks to make sure things like duplicates, number of files etc. are all as expected, plus it limits risks of mistyping a command.
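
As an alternative sketch (not something proposed in the replies above), the same idea can be done in a single awk pass that ignores within-file duplicates and derives the number of files from its own arguments. The function name intersect and the toy data are made up for illustration, and the sketch assumes each file is passed once and that no file name contains an = sign (awk would treat such an operand as a variable assignment):

```shell
# intersect FILE...: print the lines that occur in every FILE given.
intersect() {
    awk '
        # Count each line at most once per file, so duplicates within a file are harmless.
        !seen[FILENAME, $0]++ { count[$0]++ }
        END {
            # ARGC - 1 is the number of file operands (ARGV[0] is awk itself).
            for (line in count)
                if (count[line] == ARGC - 1) print line
        }
    ' "$@" | sort
}

# Usage on toy data (made-up gene IDs).
tmp=$(mktemp -d)
printf 'geneA\ngeneA\ngeneB\n' > "$tmp/file1"
printf 'geneB\ngeneA\n'        > "$tmp/file2"
printf 'geneA\ngeneC\n'        > "$tmp/file3"

result=$(intersect "$tmp"/file*)
echo "$result"

rm -r "$tmp"
```

Because it compares whole lines ($0) rather than the second field of uniq -c output, this variant should also cope with IDs that contain spaces.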
