Question

diFST with awk - subtract and divide column value by their average and stdev

0

Entering edit mode

6.5 years ago

pedroamandrade ▴ 40

Hi everyone,

I am trying to create a simple script to sum z-scores of several pairwise FST comparisons (basically, replicate the diFST statistic from this paper: http://www.pnas.org/content/107/3/1160.full)

I already found awk scripts to calculate average and standard deviation of a field, but I was looking for something more specific: integrate these values within the script itself. It would be something such as this:

DATASET
0.19393939  0.00000000  0.00000000
0.00000000  0.00000000  0.07407407
0.13658455  0.08052603  0.06750392
0.03704075  0.09428103  0.05263158
0.06046107  0.02883006  0.05263158
0.07591252  0.03383722  0.01593773
0.10475719  0.02040816  0.09956710
0.05010618  0.12556459  0.11561691
0.09554731  0.24450549  0.17279412
0.04499541  0.04938272  0.03703704

MOCK-UP SCRIPT
awk '{print ($1-(average$1))/(stdev$1) + ($2-(average$2))/(stdev$2) + ($3-(average$3))/(stdev$3)}'

OUTPUT
-0.2179645555
-2.2434375436
1.1659017773
-0.7342745412
-1.1920991459
-1.5720200178
0.419356693
1.1658334403
4.7111786566
-1.5024747631

If you could help with just the awk '{print ($1-(average$1))/(stdev$1)}' the rest I can easily do. Thank you for the help!

awk unix average standard deviation z-score • 2.9k views

ADD COMMENT • link updated 6.3 years ago by Biostar 20 • written 6.5 years ago by pedroamandrade ▴ 40

0

Entering edit mode

how about using R ?

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre, the reason I asked for awk is because I'm used to doing most simple calculations in this language.

ADD REPLY • link 6.5 years ago by pedroamandrade ▴ 40

0

Entering edit mode

R is good for handling row-oriented data, it's not the tool of choice for columns/whole-file.

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

This outputs the mean, variance, and sd. I'm not sure what you are doing after that (?)

awk '{sum=0; meandiff=0; variance=0; sd=0; z=0; for(i=1; i<=NF; i++) {sum+=$i; sumSq+=($i*$i)}; mean=sum/NF; for(i=1; i<=NF; i++) {meandiff+=(($i-mean)**2)}; variance=(meandiff/(NF-1)); sd=sqrt(variance); print mean"\t"variance"\t"sd}' DATASET

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Kevin,

My objective is to incorporate the mean and standard deviation of the column into the formula itself. For example, given a column $1, do:

($1-average$1)/(stdev$1)

Basically, I want to convert each value of $1 into a zscore (for example, following this: https://statistics.laerd.com/statistical-guides/standard-score-2.php). I know how to do this in several steps, first calculating the mean and standard deviation separately for the column and then incorporate these values into awk, but I imagine there must be a way to do this in a single step.

ADD REPLY • link 6.5 years ago by pedroamandrade ▴ 40

1

Entering edit mode

I see, I wrote my awk code above for summarising rows (not columns).

It may be complex in awk as you'd have to read over the file multiple times. The syntax in awk or that is: awk 'NR=FNR {first pass actions; next} {second pass actions}' DATASET DATASET

As Pierre said, what about R?

Here's a R script that should be executable in your shell that will convert your data to Z scores:

#!/usr/bin/env Rscript
result <- t(scale(t(read.table("DATASET"))))
write.table(result, "DATASET.Z.tsv", row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
result
              V1         V2          V3
 [1,]  1.1547005 -0.5773503 -0.57735027
 [2,] -0.5773503 -0.5773503  1.15470054
 [3,]  1.1363896 -0.3908140 -0.74557564
 [4,] -0.8203855  1.1139155 -0.29352998
 [5,]  0.7984454 -1.1216241  0.32317870
 [6,]  1.1048403 -0.2617382 -0.84310206
 [7,]  0.6313757 -1.1529593  0.52158360
 [8,] -1.1461710  0.6944072  0.45176383
 [9,] -1.0121541  0.9873858  0.02476832
[10,]  0.1902031  0.8912387 -1.08144180

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

1

Entering edit mode

Hi Kevin,

Thanks for the script! That does exactly what I needed!

ADD REPLY • link 6.5 years ago by pedroamandrade ▴ 40