I have a GWAS sumstats file called sumstats.txt in the following format:
SNP Freq.A1 CHR BP A1 A2 OR SE P
10:100968448:T:AA 0.3519 10 100968448 t aa 1.0024 0.01 0.812
10:101574552:A:ATG 0.4493 10 101574552 a atg 0.98906 0.0097 0.2585
10:10222597:AT:A 0 10 10222597 a at 0.9997 0.01 0.9777
For every entry in the sumstats file I need to munge column 1 to retain only the chromosomal information, effectively deleting everything after the first colon, whilst leaving the rest of the file intact.
The final file would look like this:
SNP Freq.A1 CHR BP A1 A2 OR SE P
10 0.3519 10 100968448 t aa 1.0024 0.01 0.812
10 0.4493 10 101574552 a atg 0.98906 0.0097 0.2585
10 0 10 10222597 a at 0.9997 0.01 0.9777
My awk/sed knowledge is not great but I've had a bash (npi!!):
echo "10:100968448:T:AA 0.3519 10 100968448 t aa 1.0024 0.01 0.812" | awk '{n=split($0,a,/[:_]/); print a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}'
Giving:
10 100968448 100968449 T/AA 0.3519 10 100968448 t aa 1.0024 0.01 0.812
But I'm not sure how to get rid of the info after the colon and pipe the file in correctly. I also tried:
echo "10:100968448:T:AA 0.3519 10 100968448 t aa 1.0024 0.01 0.812" | awk '{n=split($0,a,/[:_]/); print a[1];}'
Returning:
10
I'm looking for a bash/awk/sed one-liner for this as I have quite a few sumstats files to process.
Any suggestions would be greatly appreciated.
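For reference, here is a minimal sketch of one way this could be done. It assumes the file is space-delimited and that only column 1 needs trimming. The function names `chr_only` and `chr_only_awk` and the scratch file `sample_sumstats.txt` are made up for illustration. The attempts above split the whole line (`$0`), which is why the other columns get mixed into the result; working on the first field only avoids that.

```shell
# Hypothetical helper: print a sumstats file with column 1 cut at its first colon.
# The pattern is anchored to the start of the line, so only column 1 is touched;
# the header has no colon in column 1 and passes through unchanged.
chr_only() {
  sed 's/^\([^:[:space:]]*\):[^[:space:]]*/\1/' "$1"
}

# Hypothetical awk equivalent: edit field 1 on every line after the header.
# Changing $1 makes awk rebuild the line with single spaces (the default OFS);
# for a tab-delimited file, -F'\t' -v OFS='\t' would be needed.
chr_only_awk() {
  awk 'NR > 1 { sub(/:.*/, "", $1) } 1' "$1"
}

# Small sample taken from the question, written to a scratch file for the demo.
printf '%s\n' \
  'SNP Freq.A1 CHR BP A1 A2 OR SE P' \
  '10:100968448:T:AA 0.3519 10 100968448 t aa 1.0024 0.01 0.812' \
  '10:10222597:AT:A 0 10 10222597 a at 0.9997 0.01 0.9777' > sample_sumstats.txt

chr_only sample_sumstats.txt
```

To run it over many files, a loop such as `for f in *.txt; do chr_only "$f" > "${f%.txt}.chr.txt"; done` should work. The `.chr.txt` output suffix is just an example name.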
Many thanks @shawn.w.foley.
Your explanation will help me with future munging attempts.