Question

Log Transformation of Raw data before Robust Z-score analysis

3.5 years ago

oliver.hickman • 0

I have been given conflicting advice on how to analyse data.

I have cell counts as a readout of growth/viability. The data is in duplicate, with thousands of conditions and multiple cell lines: a large screening experiment.

I need to perform some form of Z-score analysis, plate by plate, to identify and rank hits.

I have performed a Z-score analysis on the raw data, and it looks fantastic. I have plotted distributions and generated hit lists - all fine.

However, I have been told I absolutely need to perform robust Z-score analysis instead, based on the MAD and the Abs values of the raw data - I have done this, checked it, and it looks terrible.

The hit lists make no sense when compared to the raw data and to the percent-of-control data, which I am using as a sanity check. They barely overlap with the hit lists from my normal Z-scores, and the new hits look like garbage.

Furthermore, someone else has suggested I need to do the robust Z-score analysis from the log transformed raw values.

Does anyone have any input on what I should be doing, why the robust Z-scores might make no sense and be so radically different from the normal Z-scores, and whether log transformation of the raw data will help or is necessary to fix this?

Thanks in advance.

cells Z scores statistics Robust Z-scores cancer • 3.7k views

updated 3.5 years ago by i.sudbery 19k • written 3.5 years ago by oliver.hickman • 0

score 0 · Answer 1 · 2020-10-16

Firstly, if you are working with count data, then working on the log scale is generally recommended as long as your counts are sufficiently high, unless you are going to use specifically count-based statistics. Counts are often very roughly log-normally distributed.
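As a rough illustration of the log step, here is a minimal Python sketch (standard library only). It assumes one plate's counts sit in a plain list; the `log2_counts` name, the base-2 choice and the pseudocount of 1 (there only to guard against zero counts) are my own assumptions, not part of any fixed recipe.

```python
import math

def log2_counts(counts, pseudocount=1.0):
    """Return log2(count + pseudocount) for every well on one plate.

    The pseudocount is an illustrative guard against log(0); with
    sufficiently high counts it barely changes the result.
    """
    if any(c < 0 for c in counts):
        raise ValueError("cell counts must be non-negative")
    return [math.log2(c + pseudocount) for c in counts]

# Hypothetical counts for five wells of one plate
plate = [0, 1, 3, 7, 1023]
logged = log2_counts(plate)
```

The Z-scores (normal or robust) would then be computed from `logged` rather than from `plate`.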

There are good theoretical reasons to favour robust Z-scores over normal Z-scores, but I wouldn't have expected it to make _that_ much of a difference to the ranking of hits. Indeed, the ranking within a plate should be identical under the two schemes, because each one subtracts a single plate-wide centre and divides by a single positive plate-wide spread, which cannot reorder the wells. All that should change is how plates compare to each other.
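The within-plate claim is easy to check on your own data. The sketch below (Python, standard library only) uses a made-up plate with one strong hit and hypothetical helper names; it compares the well ordering produced by the two scores. If the orderings differ on a real plate, that points to a problem in how the robust score is being calculated rather than to a real difference between the methods.

```python
import statistics

def z_scores(x):
    """Classic Z-score: centre on the mean, scale by the sample SD."""
    mu = statistics.mean(x)
    sd = statistics.stdev(x)
    return [(v - mu) / sd for v in x]

def robust_z_scores(x):
    """Robust Z-score: centre on the median, scale by the (unscaled) MAD."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    return [(v - med) / mad for v in x]

def rank_order(scores):
    """Well indices sorted from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])

# Hypothetical plate: seven ordinary wells and one strong hit (index 5)
plate = [520, 480, 505, 495, 510, 60, 490, 515]
same_ranking = rank_order(z_scores(plate)) == rank_order(robust_z_scores(plate))
```

Here `same_ranking` is `True`: the magnitudes of the scores differ a lot between the two schemes, but the order of the wells does not.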

How are you calculating the robust Z score? I would usually calculate it _either_:

robust_z_i = (x[i]-median(x))/mad(x)

or

robust_z_i = (x[i]-mean(x, trim=0.05))/mad(x)
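For anyone not working in R, the two formulas above can be sketched in Python with only the standard library. The function names are mine. Two details are assumptions worth checking against your own calculation: R's `mad()` multiplies the raw median absolute deviation by 1.4826 by default, and `mean(x, trim=0.05)` drops the lowest and highest 5% of values before averaging. The sketch mimics both.

```python
import statistics

MAD_SCALE = 1.4826  # default `constant` in R's mad(); makes MAD comparable to an SD

def mad(x, constant=MAD_SCALE):
    """Scaled median absolute deviation around the median."""
    med = statistics.median(x)
    return constant * statistics.median([abs(v - med) for v in x])

def trimmed_mean(x, trim=0.05):
    """Mean after dropping floor(n * trim) values from each end."""
    k = int(len(x) * trim)
    s = sorted(x)
    return statistics.mean(s[k:len(s) - k])

def robust_z(x, center="median", trim=0.05):
    """Robust Z-scores for one plate, centred on the median or a trimmed mean."""
    c = statistics.median(x) if center == "median" else trimmed_mean(x, trim)
    scale = mad(x)
    if scale == 0:
        raise ValueError("MAD is zero; more than half the wells share one value")
    return [(v - c) / scale for v in x]

# Hypothetical plate values (raw or log-transformed counts)
plate = [510, 495, 502, 488, 515, 120, 499, 507, 493, 501]
rz_median = robust_z(plate, center="median")
rz_trimmed = robust_z(plate, center="trimmed")
```

Note that the deviations are taken as absolute values only inside `mad`; the numerator keeps its sign, so low-count and high-count wells end up on opposite sides of zero.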

Finally, have you considered using a package specifically designed for analysing plate-based high-throughput screens, like cellHTS?

In the end, as with all things in biology, I would trust the controls over any theoretical considerations.