Question

text file editing

0

Entering edit mode

5.8 years ago

Sam ▴ 150

Dear ALL

I have about 200 tab delimited and in first step I want merge them to have a comprehensive file, then want to summarize file to have another out put, so in summarize file just keep unique ID in 1st ,2nd and 3rd columns and for pos1 and pos2 add the number of number of lines with repetitive ID after dash symbol respectively but for other columns convert tab to / in each line and add the data of lines with repetitive ID after dash symbol respectively. I would greatly appreciate it if you kindly give me some feedback on this.

Thanks

annoPos gene  chr     pos1   pos2  ref    alt   genotype    quality MQ  ref-depth   alt-depth             
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82411   82411   G   T   hom 84.5    60  0   4
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82437   82437   A   C   hom 87.5    60  0   4
exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84195   84195   C   T   hom 183 60  0   17
exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84209   84209   G   C   hom 182 60  0   17
exonic  Eucgr.A00214.1.v2.0 Chr01   84631   84631   C   T   het 118 60  5   7
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84718   84718   C   T   het 80  60  7   5
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84754   84754   C   T   het 85  60  7   5
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   89492   89492   C   T   het 78  60  2   4

convert to this :

annoPos gene                            chr pos1    pos2    ref alt genotype    quality MQ  ref-depth   alt-depth 

 UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0    Chr01   82411-82437         82411-82437         G/T-A/C         hom-hom     0-0 4-4
 exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0    Chr01   84195-84209-84631   84195-84209-84631   C/T-G/C-C/T hom-hom-het 0-0-5 17-17-7
intronic Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0    Chr01   84718-84754-89492   84718-84754-89492   C/T-C/T-C/T het-het-het 7-7-2 5-5-4

awk bash • 1.4k views

ADD COMMENT • link updated 5.8 years ago by Pierre Lindenbaum 161k • written 5.8 years ago by Sam ▴ 150

score 4 · Answer 1 · 2018-06-28

@OP. Could you please clarify on alleles and quality?

UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82411   82411   G   T   hom 84.5    60  0   4
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82437   82437   A   C   hom 87.5    60  0   4

For these two lines, expected resultant alleles are G/T-A/C. Is it G/T-A/C or G/A-T/C?

If the alleles are G/T-A/C (as in OP):

$ awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$6"/"$7,$8,$11,$12}' test.txt | datamash -g1 unique 2 unique 3 collapse 4-9   | awk -v OFS="\t" '{gsub(",","-");gsub("-",",",$2)}1' 

AnnoPos gene    chr pos1    pos2    ref/alt genotype    ref-depth   alt-depth
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82411-82437 82411-82437 G/T-A/C hom-hom 0-0 4-4
exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84195-84209-84631   84195-84209-84631   C/T-G/C-C/T hom-hom-het 0-0-5   17-17-7
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84718-84754-89492   84718-84754-89492   C/T-C/T-C/T het-het-het 7-7-2   5-5-4

ps: screenshot in calc is attached here.

If the alleles are G/A-T/C, try this:

$ datamash -g1 unique 2 unique 3 collapse 4-8,11,12 < test.txt  | awk -v OFS="\t" '{gsub(",","-"); gsub("-","/",$6); gsub("-","/",$7);gsub("-",",",$2); print $1,$2,$3,$4,$5,$6"-"$7,$8,$9,$10}'

AnnoPos gene    chr pos1    pos2    ref-alt genotype    ref-depth   alt-depth
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82411-82437 82411-82437 G/A-T/C hom-hom 0-0 4-4
exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84195-84209-84631   84195-84209-84631   C/G/C-T/C/T hom-hom-het 0-0-5   17-17-7
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84718-84754-89492   84718-84754-89492   C/C/C-T/T/T het-het-het 7-7-2   5-5-4

ps: Display is not optimal. Took a screenshot of the tsv in calc.

ps2: Datamash is available in most of the linux repos, conda channels and home brew linux

score 3 · Answer 2 · 2018-06-28

using awk+sqlite3 , this is not exactly the desired output but I'm lazy, and you get the idea: using the group_by function and group_concat

rm -f tmp.db && tr "-" "_" < input.tsv | awk '/^annoPos/ {split($0,header);printf("create table T(");for(i=1;i<= NF;i++) printf("%s %s TEXT",(i==1?"":","),header[i]); printf(");\n"); next;} {printf("insert into T(");for(i=1;i<= NF;i++) printf("%s%s",(i==1?"":","),header[i]); printf(") values("); for(i=1;i<= NF;i++) printf("%s\"%s\"",(i==1?"":","),$i); printf(");\n");} END{printf("select annoPos,gene,chr,group_concat(pos1),group_concat(pos2),group_concat(ref),group_concat(alt) from T group by annoPos,gene,chr;\n");}'   | sqlite3  -header -separator $'\t'  tmp.db

annoPos gene    chr group_concat(pos1)  group_concat(pos2)  group_concat(ref)   group_concat(alt)
UTR3    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   82411,82437 82411,82437 G,A T,C
exonic  Eucgr.A00214.1.v2.0 Chr01   84631   84631   C   T
exonic  Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84195,84209 84195,84209 C,G T,C
intronic    Eucgr.A00214.1.v2.0,Eucgr.A00214.2.v2.0 Chr01   84718,84754,89492   84718,84754,89492   C,C,C   T,T,T