Question

Extract and retain only chromosome info in GWAS sumstats file

1

4.9 years ago

camerond ▴ 190

I have a GWAS sumstats file called sumstats.txt in the following format:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10:101574552:A:ATG      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10:10222597:AT:A        0       10      10222597        a       at      0.9997  0.01    0.9777

For every entry in the sumstats file I need to munge column 1 to retain only the chromosome, i.e. delete everything from the first colon onwards in that column, whilst leaving the rest of the file intact.

The final file would look like this:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10      0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10      0       10      10222597        a       at      0.9997  0.01    0.9777

My awk/sed knowledge is not great but I've had a bash (npi!!):

echo "10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812" | awk '{n=split($0,a,/[:_]/); print a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}'

Giving:

10   100968448       100968449       T/AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812

But I'm not sure how to get rid of the info after the colon or how to pipe the file in correctly. I also tried:

echo "10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812" | awk '{n=split($0,a,/[:_]/); print a[1];}'

Returning:

10

I'm looking for a bash/awk/sed one liner for this as I have quite a few sumstats files to process.

Any suggestions would be greatly appreciated.

bash GWAS sumstats awk sed munging • 934 views

updated 4.9 years ago by shawn.w.foley ★ 1.3k • written 4.9 years ago by camerond ▴ 190

1

4.9 years ago

Pierre Lindenbaum 161k

something like this ?

sed 's/^\([^:]*\):[^[:space:]]*/\1/' sumstats.txt
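A substitution along these lines can be checked on a single row before running it on a whole file. The sketch below is only an illustration and assumes the columns are tab-separated. It uses a variant of the pattern that writes [[:space:]] instead of \t inside the bracket, so it does not rely on GNU sed, and it does not append a tab of its own, because the separator already following column 1 is left in place. Counting fields afterwards shows whether a column was added or lost.

```shell
# First data row from the question, assumed here to be tab-separated.
row='10:100968448:T:AA\t0.3519\t10\t100968448\tt\taa\t1.0024\t0.01\t0.812'

# Delete from the first colon to the end of column 1, then print the new
# first field and the number of tab-separated fields (9 in the original row).
printf '%b\n' "$row" \
    | sed 's/^\([^:]*\):[^[:space:]]*/\1/' \
    | awk -F'\t' '{print $1, NF}'
# prints: 10 9
```

The header line contains no colon, so the same substitution leaves it untouched.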

score 3 · Accepted Answer · 2019-05-24

It looks like your goal is to split the first field on the colon, then print the rest of the line unchanged. Your awk script is splitting the whole line ($0), not the first field ($1). You can try:

awk 'BEGIN{OFS="\t";FS="\t"} {split($1,ARR,":"); $1 = ARR[1]; print $0}' inFile.txt > outFile.txt

This will yield:

SNP Freq.A1 CHR BP  A1  A2  OR  SE  P
10  0.3519  10  100968448   t   aa  1.0024  0.01    0.812
10  0.4493  10  101574552   a   atg 0.98906 0.0097  0.2585
10  0   10  10222597    a   at  0.9997  0.01    0.9777

The awk script is doing a few things:

BEGIN{OFS="\t";FS="\t"} sets both the Output Field Separator (OFS) and the input Field Separator (FS) to a tab.

split($1,ARR,":") splits just the first field on the colon, and stores each element in the array ARR.

$1 = ARR[1] redefines the first field as the first array element (just the chromosome in this case).

print $0 prints the entire line (with the newly defined first field).
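Since the question mentions having quite a few sumstats files, the same awk command could be wrapped in a loop. This is a minimal sketch, not a tested pipeline: the function name strip_snp_to_chr and the .chr.txt output suffix are made up for illustration, and the files are assumed to be tab-delimited as in the answer above.

```shell
# Hypothetical helper: for each tab-delimited sumstats file given as an
# argument, keep only the chromosome in column 1 and write the result to
# <name>.chr.txt next to the input. Input files are not modified.
strip_snp_to_chr() {
    for f in "$@"; do
        out="${f%.txt}.chr.txt"
        awk 'BEGIN{FS=OFS="\t"} {split($1, arr, ":"); $1 = arr[1]; print}' "$f" > "$out"
    done
}
```

It could then be called as strip_snp_to_chr sumstats1.txt sumstats2.txt. If a glob such as *.txt is used instead, note that on a second run it would also pick up the .chr.txt outputs from the first.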