Intersection for several files
12 months ago

Dear all,

I have 6 files, each with a single column (gene ID), and I want to extract the intersection of those 6 files. Which command should I use? I know the command below works for 2 files, but not for 6:

cat file1 file2 | sort | uniq -d > intersection.out

Thanks and have a great day!


what about:

cat file1 file2 file3 file4 file5 file6 | sort | uniq -c | perl -lae 'print "$F[1]" if ($F[0] == 6)' > intersection.out

12 months ago
Canada

You pretty much had it, cat file* | sort | uniq -d > intersection.out will work for all files starting with file in the current directory.

If you need to find multiple files in different subdirectories, replace the cat command with find /path/to/files -name "file*" -exec cat {} \;.


If by intersection the OP means that a line has to be present in every file, then the above does not work: uniq -d prints any line that occurs at least twice.

However (the same as JC's answer, really):

cat file* | sort | uniq -c | awk '{if($1==6){print $2}}'
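
To make the difference between uniq -d and the counting approach concrete, here is a minimal sketch on toy data. The scratch directory, the three file names, and the gene IDs are made up for illustration and are not from the original question; it also assumes no file lists the same ID twice.

```shell
#!/bin/bash
set -eu

# Hypothetical toy data in a scratch directory (names and IDs are made up).
workdir=$(mktemp -d)
printf 'geneA\ngeneB\ngeneC\n' > "$workdir/file1"
printf 'geneB\ngeneC\ngeneD\n' > "$workdir/file2"
printf 'geneC\ngeneD\ngeneE\n' > "$workdir/file3"

# uniq -d keeps every line seen at least twice, i.e. shared by 2 or more files.
in_two_or_more=$(cat "$workdir"/file* | sort | uniq -d)

# Counting occurrences and requiring count == 3 keeps only lines found in all 3 files.
in_all=$(cat "$workdir"/file* | sort | uniq -c | awk '$1 == 3 {print $2}')

echo "uniq -d:"
echo "$in_two_or_more"
echo "count == 3:"
echo "$in_all"
```

On this data, uniq -d reports geneB, geneC and geneD, while the count-based filter reports only geneC, the one ID present in all three files.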

It worked! Thank you


One caveat: if your input files contain duplicates (i.e. one file can list the same gene more than once), you need to deduplicate each file, e.g. with sort | uniq, before combining them. Otherwise, looking for exactly 6 matches will give false positives/negatives.

find /path/to/files -name "file*" -type f -exec bash -c ' sort "$1" | uniq ' _ {} \; | sort | uniq -c | awk '{if($1==6){print $2}}'

Also, the awk part explicitly looks for exactly 6 matches, so this only works when there are exactly 6 files, not for an arbitrary number. Not a big deal if this is just a one-off, though.
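
Both points (duplicates within a file, and the hard-coded file count) can also be handled in a single awk call. The following is only a sketch of one alternative, not something proposed in this thread: it counts each line once per file and compares against ARGC - 1, which equals the number of input files as long as only file names are passed as arguments. The toy files and IDs are hypothetical.

```shell
#!/bin/bash
set -eu

# Hypothetical toy data: geneA is listed twice in file1 and is absent from file2.
workdir=$(mktemp -d)
printf 'geneA\ngeneA\ngeneB\n' > "$workdir/file1"
printf 'geneB\ngeneC\n'        > "$workdir/file2"

# Naive count: geneA reaches a count of 2 although it occurs in one file only.
naive=$(cat "$workdir"/file* | sort | uniq -c | awk '$1 == 2 {print $2}')

# Count each line at most once per file, then keep lines seen in every file.
# ARGC - 1 is the number of file arguments (assumes no var=value arguments).
robust=$(awk '!seen[FILENAME, $0]++ { count[$0]++ }
              END { for (id in count) if (count[id] == ARGC - 1) print id }' \
         "$workdir"/file* | sort)

echo "naive:"
echo "$naive"
echo "robust:"
echo "$robust"
```

Here the naive count wrongly includes geneA, while the per-file version returns only geneB. Note that this variant prints the whole line rather than just the second field of uniq -c output, which makes no difference for single-column files.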


Thanks for your kind suggestions! I checked the 6 files just now and they don't contain duplicates, so does that mean the result of "cat file* | sort | uniq -c | awk '{if($1==6){print $2}}'" is reliable? :) I also have a small question: if we want to use 4 files, for example, can we simply change awk '{if($1==6){print $2}}' to awk '{if($1==4){print $2}}'?

Thanks and have a great day!!


Yes, that should be fine then, as far as I can see.

And yes, changing 6 to 4 would work if you have 4 files. If you think you'll have to do this again in the future, you might want to consider writing a short bash script. Something like this works for an arbitrary number of matching files:

#!/bin/bash
set -eu

## Generate test data with:
# yes | head -n100  | xargs -I@  bash -c 'echo $RANDOM'  | grep -o . | head -n100 | split -l25 - file 

PREFIX=$1 # For example, files starting with "file"

NFILES=$(find . -name "${PREFIX}*" -type f | wc -l) # Get the number of files

find . -name "${PREFIX}*" -type f -exec bash -c ' sort "$1" | uniq ' _ {} \; \
    | sort \
    | uniq -c \
    | awk -v nfiles="$NFILES" '{if($1==nfiles){print $2}}'

That way you can add your own sanity checks to make sure things like duplicates, number of files etc. are all as expected, plus it limits risks of mistyping a command.
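
As an illustration of what such sanity checks might look like, here is a small sketch. The function names has_duplicates and check_inputs are hypothetical helpers introduced for this example, not part of the script above; the checks simply refuse to run without input files and warn when a file contains repeated lines.

```shell
#!/bin/bash
set -eu

# Hypothetical helper: succeeds if the given file contains any repeated line.
has_duplicates() {
    [ -n "$(sort "$1" | uniq -d)" ]
}

# Hypothetical helper: fail when no files are given, warn about duplicates.
check_inputs() {
    if [ "$#" -eq 0 ]; then
        echo "error: no input files given" >&2
        return 1
    fi
    local f
    for f in "$@"; do
        if has_duplicates "$f"; then
            echo "warning: duplicate lines in $f" >&2
        fi
    done
}
```

A call such as check_inputs file* before the main pipeline would then flag files that need deduplicating first.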
