Merge two files based on multiple matching columns
5.0 years ago
iya837 • 0

Hello,

I'm trying to join two files on multiple matching columns. I want to compare columns 1, 2, 4 and 5 of file 1 with columns 1, 2, 4 and 5 of file 2, and then write the matching lines to file 3 as column 3 of file 1 followed by all the columns of file 2. I've already tried several awk commands, for example:

1)

awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1$2$4$5]=$3;next} $1$2$4$5 in a{print $0, a[$1$2$4$5]}' file2 file1 > file3

2)

awk 'NR==FNR {a[$1$2$4$5] = $3; next} $1$2$4$5 in a' file2 file1 >file3

but I'm getting empty output. Could anyone help me with this issue? Thank you.

Here is an example of what my files look like.

File 1:

CHROM,POS,ID,REF,ALT
1,69270,rs201219564,A,G
1,69428,rs140739101,T,G
1,69496,rs150690004,G,A
1,69511,rs75062661,A,G

File 2:

Chr,Start,End,Ref,Alt,SIFT_score,SIFT_converted_rankscore,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,CADD_raw,CADD_raw_rankscore,CADD_phred
1,69511,69511,A,C,0.031,0.450,D,0.002,0.090,B,0.001,0.040,B,-1.52,0.370,N,2.835,0.402,21.5
1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047
1,69511,69511,A,T,1.0,0.010,T,0.0,0.026,B,0.0,0.013,B,2.06,0.004,N,-1.639,0.014,0.002
1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.970,0.187,10.49
1,69496,69496,G,C,0.348,0.124,T,0.988,0.604,D,0.865,0.597,P,-2.54,0.550,D,2.652,0.380,20.5
1,69496,69496,G,T,0.177,0.222,T,0.999,0.764,D,0.96,0.686,D,-2.69,0.574,D,1.403,0.232,12.80
1,69428,69428,T,A,0.878,0.026,T,0.981,0.577,D,0.899,0.620,P,-0.16,0.095,N,0.024,0.096,2.821
1,69428,69428,T,C,0.001,0.784,D,0.995,0.657,D,0.969,0.703,D,-5.34,0.846,D,4.533,0.607,24.3
1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.100,0.550,23.7

Wanted output:

ID,Chr,Start,End,Ref,Alt,SIFT_score,SIFT_converted_rankscore,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,CADD_raw,CADD_raw_rankscore,CADD_phred
rs140739101,1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.100,0.550,23.7
rs150690004,1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.970,0.187,10.49
rs75062661,1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047
annotation snps

I'd recommend using Python or R; writing an awk script for this means you'll spend more time hunting for the right one-liner than actually solving the problem.

Once the task is done you can explore other ways if you like, but this problem is sophisticated enough to warrant a better tool than awk.


Thank you for your answer. Indeed, I could use R or Python, but the files are huge (> 10 GB) and I'm currently working on a remote computational cluster that I can only access through the shell.


Python or Perl (yes, Perl) would work fine. Just use a streaming approach: keep only File1 in memory and read File2 line by line.
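If it helps, the same keep-one-file-in-memory, stream-the-other pattern can be written with the shell tools already used in this thread. This is only a sketch: it assumes comma-separated files with Unix line endings and no commas inside fields.

awk -F, -v OFS=, '
    NR==FNR { id[$1 FS $2 FS $4 FS $5] = $3; next }   # file1: store CHROM,POS,REF,ALT -> ID in memory
    { k = $1 FS $2 FS $4 FS $5 }                      # file2 is streamed; build its Chr,Start,Ref,Alt key
    k in id { print id[k], $0 }                       # on a match, print the ID followed by the whole file2 row
' file1 file2 > file3

The two header lines never match each other, so they are silently dropped; the header of the result would still have to be written separately.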


For starters, you could repeat $2 from File1 so it more closely resembles File2. File1::$1,$2,$2,$4,$5 will be the same as File2::$1,$2,$3,$4,$5. Once that's done, you should be able to use join to get the content from both files together.
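A sketch of that first step, assuming comma-separated input with a header line (file1_expanded.csv is just a placeholder name):

awk -F, -v OFS=, 'NR>1 { print $1, $2, $2, $4, $5, $3 }' file1 > file1_expanded.csv   # Chr,Start,End(=POS repeated),Ref,Alt,ID

Note that join only compares a single key field, so in practice the five columns still have to be pasted into one key, as in the answer below.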


I'll give it a try. Thank you

5.0 years ago

Use Linux join:

join -t ':' -1 1 -2 1  <(awk -F ',' '{print $1$2$4$5 ":" $0;}' file1.csv | sort -t ':' -k1,1 ) <(awk -F ',' '{print $1$2$4$5 ":" $0;}' file2.csv | sort -t ':' -k1,1 ) | cut -d ':' -f 2-

1,69428,rs140739101,T,G:1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.100,0.550,23.7
1,69496,rs150690004,G,A:1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.970,0.187,10.49
1,69511,rs75062661,A,G:1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047

I leave the final selection of the columns and the header as an exercise.
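For completeness, here is one possible way to finish that exercise. It is an untested sketch that assumes the same file1.csv/file2.csv names, Unix line endings, and that neither file contains ':' characters; file3.csv is just a placeholder for the result.

# write the wanted header once: "ID," followed by file2's own header line
printf 'ID,%s\n' "$(head -n 1 file2.csv)" > file3.csv
# join as above (skipping the header lines), then keep the 3rd comma-field
# of the file1 half (the rsID) followed by the whole file2 half
join -t ':' -1 1 -2 1 \
    <(awk -F ',' 'NR>1{print $1$2$4$5 ":" $0;}' file1.csv | sort -t ':' -k1,1) \
    <(awk -F ',' 'NR>1{print $1$2$4$5 ":" $0;}' file2.csv | sort -t ':' -k1,1) \
  | awk -F ':' '{ split($2, f1, ","); print f1[3] "," $3 }' >> file3.csv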

5.0 years ago
iya837 • 0

Thank you all for your answers. I finally solved my problem thanks to @steeldriver: it was due to line endings. For those who run into the same issue, I've put the command line below:

awk -F, 'BEGIN{RS="\r\n" ; OFS=FS} NR==FNR{a[$1 FS $2 FS $4 FS $5] = $0; next} {ind = $1 FS $2 FS $4 FS $5} ind in a {print $3, a[ind]}' file2 file1

RS="\r\n" : convert endings in Unix style line endings

Separating the fields with FS in the associative-array key just guards against false matches; if you simply concatenate the fields, you can't distinguish, say, "abc" + "def" from "abcd" + "ef", since both collapse to "abcdef".
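If you are not sure whether hidden carriage returns are the problem, here is a quick check (GNU coreutils assumed):

file file1 file2           # reports "with CRLF line terminators" when \r\n endings are present
head -n 2 file2 | cat -A   # each \r shows up as a trailing ^M just before the end-of-line marker $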


Or you could just run dos2unix on the file with the \r\n line endings to convert the file to Unix format with \n line endings.


Thank you!

