diFST with awk - subtract and divide column value by their average and stdev
0
0
Entering edit mode
6.5 years ago

Hi everyone,

I am trying to create a simple script to sum z-scores of several pairwise FST comparisons (basically, replicate the diFST statistic from this paper: http://www.pnas.org/content/107/3/1160.full)

I already found awk scripts to calculate average and standard deviation of a field, but I was looking for something more specific: integrate these values within the script itself. It would be something such as this:

DATASET
0.19393939  0.00000000  0.00000000
0.00000000  0.00000000  0.07407407
0.13658455  0.08052603  0.06750392
0.03704075  0.09428103  0.05263158
0.06046107  0.02883006  0.05263158
0.07591252  0.03383722  0.01593773
0.10475719  0.02040816  0.09956710
0.05010618  0.12556459  0.11561691
0.09554731  0.24450549  0.17279412
0.04499541  0.04938272  0.03703704

MOCK-UP SCRIPT
awk '{print ($1-(average$1))/(stdev$1) + ($2-(average$2))/(stdev$2) + ($3-(average$3))/(stdev$3)}'

OUTPUT
-0.2179645555
-2.2434375436
1.1659017773
-0.7342745412
-1.1920991459
-1.5720200178
0.419356693
1.1658334403
4.7111786566
-1.5024747631

If you could help with just the awk '{print ($1-(average$1))/(stdev$1)}' the rest I can easily do. Thank you for the help!

awk unix average standard deviation z-score • 2.9k views
ADD COMMENT
0
Entering edit mode

how about using R ?

ADD REPLY
0
Entering edit mode

Hi Pierre, the reason I asked for awk is because I'm used to doing most simple calculations in this language.

ADD REPLY
0
Entering edit mode

R is good for handling row-oriented data, it's not the tool of choice for columns/whole-file.

ADD REPLY
0
Entering edit mode

This outputs the mean, variance, and sd. I'm not sure what you are doing after that (?)

awk '{sum=0; meandiff=0; variance=0; sd=0; z=0; for(i=1; i<=NF; i++) {sum+=$i; sumSq+=($i*$i)}; mean=sum/NF; for(i=1; i<=NF; i++) {meandiff+=(($i-mean)**2)}; variance=(meandiff/(NF-1)); sd=sqrt(variance); print mean"\t"variance"\t"sd}' DATASET
ADD REPLY
0
Entering edit mode

Hi Kevin,

My objective is to incorporate the mean and standard deviation of the column into the formula itself. For example, given a column $1, do:

($1-average$1)/(stdev$1)

Basically, I want to convert each value of $1 into a zscore (for example, following this: https://statistics.laerd.com/statistical-guides/standard-score-2.php). I know how to do this in several steps, first calculating the mean and standard deviation separately for the column and then incorporate these values into awk, but I imagine there must be a way to do this in a single step.

ADD REPLY
1
Entering edit mode

I see, I wrote my awk code above for summarising rows (not columns).

It may be complex in awk as you'd have to read over the file multiple times. The syntax in awk or that is: awk 'NR=FNR {first pass actions; next} {second pass actions}' DATASET DATASET

As Pierre said, what about R?

Here's a R script that should be executable in your shell that will convert your data to Z scores:

#!/usr/bin/env Rscript
result <- t(scale(t(read.table("DATASET"))))
write.table(result, "DATASET.Z.tsv", row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
result
              V1         V2          V3
 [1,]  1.1547005 -0.5773503 -0.57735027
 [2,] -0.5773503 -0.5773503  1.15470054
 [3,]  1.1363896 -0.3908140 -0.74557564
 [4,] -0.8203855  1.1139155 -0.29352998
 [5,]  0.7984454 -1.1216241  0.32317870
 [6,]  1.1048403 -0.2617382 -0.84310206
 [7,]  0.6313757 -1.1529593  0.52158360
 [8,] -1.1461710  0.6944072  0.45176383
 [9,] -1.0121541  0.9873858  0.02476832
[10,]  0.1902031  0.8912387 -1.08144180
ADD REPLY
1
Entering edit mode

Hi Kevin,

Thanks for the script! That does exactly what I needed!

ADD REPLY

Login before adding your answer.

Traffic: 2562 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6