Biostar Beta. Not for public use.
Merge two files depending on multiple matching columns
0
Entering edit mode
14 months ago
iya837 • 0

Hello,

Im trying to join two files depending on multiple matching columns. I want to compare columns 1,2,4,5 from file 1 with columns 1,2,4,5 from file 2 and then merge matching lines in file 3 with column 3 of file 1 and all columns from files 2. I've already tried several awk command. For example :

1)

awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1$2$4$5]=$3;next} $1$2$4$5 in a{print $0, a[$1$2$4$5]}' file2 file1 > file3

2)

awk 'NR==FNR {a[$1$2$4$5] = $3; next} $1$2$4$5 in a' file2 file1 >file3

but i'm getting empty output. Could anyone help me with this issue ? Thank you

An example of what my files look like

File 1 :

CHROM,POS,ID,REF,ALT
1,69270,rs201219564,A,G
1,69428,rs140739101,T,G
1,69496,rs150690004,G,A
1,69511,rs75062661,A,G

File 2 :

Chr,Start,End,Ref,Alt,SIFT_score,SIFT_converted_rankscore,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,CADD_raw,CADD_raw_rankscore,CADD_phred
1,69511,69511,A,C,0.031,0.450,D,0.002,0.090,B,0.001,0.040,B,-1.52,0.370,N,2.8350.402,21.5
1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047
1,69511,69511,A,T,1.0,0.010,T,0.0,0.026,B,0.0,0.013,B,2.06,0.004,N,-1.639,0.014,0.002
1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.9700.187,10.49
1,69496,69496,G,C,0.348,0.124,T,0.988,0.604,D,0.865,0.597,P,-2.54,0.550,D,2.6520.380,20.5
1,69496,69496,G,T,0.177,0.222,T,0.999,0.764,D,0.96,0.686,D,-2.69,0.574,D,1.4030.232,12.80
1,69428,69428,T,A,0.878,0.026,T,0.981,0.577,D,0.899,0.620,P,-0.16,0.095,N,0.0240.096,2.821
1,69428,69428,T,C,0.001,0.784,D,0.995,0.657,D,0.969,0.703,D,-5.34,0.846,D,4.5330.607,24.3
1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.1000.550,23.7

Wanted Output :

ID,Chr,Start,End,Ref,Alt,SIFT_score,SIFT_converted_rankscore,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,CADD_raw,CADD_raw_rankscore,CADD_phred
rs140739101,1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.1000.550,23.7
rs150690004,1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.9700.187,10.49
rs75062661,1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047
annotation snps • 199 views
ADD COMMENTlink
0
Entering edit mode

I'd recommend using Python or R, as writing an awk script to do this will mean you spend more time finding the right script than on solving the problem.

Once the task is complete, you can explore on other ways if you like, but this problem is sophisticated enough to warrant a better tool than awk.

ADD REPLYlink
0
Entering edit mode

Thank you for your answer. Indeed i can use R or python but the files are HUGE (> 10G) and im currently working on it via a distant computational cluster than i can access only by bash command.

ADD REPLYlink
0
Entering edit mode

Python or Perl (yes, Perl) would work OK. Just look for streaming approaches for File2, storing only File1 in memory.

ADD REPLYlink
0
Entering edit mode

For starters, you could repeat $2 from File1 so it more closely resembles File2. File1::$1,$2,$2,$4,$5 will be the same as File2::$1,$2,$3,$4,$5. Once that's done, you should be able to use join to get the content from both files together.

ADD REPLYlink
0
Entering edit mode

I'll give it a try. Thank you

ADD REPLYlink
2
Entering edit mode
14 months ago
France/Nantes/Institut du Thorax - INSE…

use linux join:

join -t ':' -1 1 -2 1  <(awk -F ',' '{print $1$2$4$5 ":" $0;}' file1.csv | sort -t ':' -k1,1 ) <(awk -F ',' '{print $1$2$4$5 ":" $0;}' file2.csv | sort -t ':' -k1,1 ) | cut -d ':' -f 2-

1,69428,rs140739101,T,G:1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.1000.550,23.7
1,69496,rs150690004,G,A:1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.9700.187,10.49
1,69511,rs75062661,A,G:1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047

I leave the final selection of the columns and the header as an exercice.

ADD COMMENTlink
0
Entering edit mode
14 months ago
iya837 • 0

Thank You all for your answers. I finally solved my problem thanks to @steeldriver. It was due to line endings. For those who may have the same issue i put the command line below :

awk -F, 'BEGIN{RS="\r\n" ; OFS=FS} NR==FNR{a[$1 FS $2 FS $4 FS $5] = $0; next} {ind = $1 FS $2 FS $4 FS $5} ind in a {print $3, a[ind]}' file2 file1

RS="\r\n" : convert endings in Unix style line endings

separating the fields with FS i the associative array key string just guards against false matches; if you just concatenate fields you can't distinguish between "abcdef" and "abc""def"

ADD COMMENTlink
0
Entering edit mode

Or you could just run dos2unix on the file with the \r\n line endings to convert the file to Unix format with \n line endings.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1