Z-score calculation from raw counts data
5.6 years ago
Biologist ▴ 290

I have RNA-Seq data with raw counts in a data frame 'counts', with samples as columns and genes as rows. It looks like this:

                           Sample1    Sample2       Sample3    Sample4       Sample5
ENSG00000243485.5            0            0            0            1            0
ENSG00000237613.2            0            0            0            0            0
ENSG00000238009.6           10            5            6           16           13
ENSG00000239945.1            0            0            0            0            0
ENSG00000239906.1            1            0            0            9            0
ENSG00000241860.6           16           65           19           56           82

The sample annotation looks like this:

Samples      N_stage          M_stage
Sample1         N0               M1
Sample2         N1               M0
Sample3         N0               MX
Sample4         N2               M0
Sample5         N3               MX

I want to make a heatmap that shows the sample stages, so I first converted the counts to logCPM:

logCPM <- cpm(counts, prior.count=2, log=TRUE)

logCPM looks like below:

                      Sample1      Sample2      Sample3      Sample4      Sample5
ENSG00000243485.5    0.3915381    0.3915381    0.3915381    1.0122705    0.3915381
ENSG00000237613.2    0.3915381    0.3915381    0.3915381    0.3915381    0.3915381
ENSG00000238009.6    2.7079614    2.1523650    2.3686665    3.6549466    3.5064449
ENSG00000239945.1    0.3915381    0.3915381    0.3915381    0.3915381    0.3915381
ENSG00000239906.1    0.8750013    0.3915381    0.3915381    2.9372348    0.3915381
ENSG00000241860.6    3.2731113    5.3940606    3.7562189    5.3507849    6.0161469

Then I converted logCPM to Z-scores:

Z.score <- t(scale(t(logCPM)))

                     Sample1       Sample2      Sample3      Sample4      Sample5
ENSG00000243485.5  -0.34352108  -0.34352108  -0.34352108   2.46775178  -0.34352108
ENSG00000237613.2  -0.07930516  -0.07930516  -0.07930516  -0.07930516  -0.07930516
ENSG00000238009.6  -0.40081457  -1.01397031  -0.77526011   0.64427767   0.48039136
ENSG00000239945.1   0.19393322   0.19393322   0.19393322   0.19393322   0.19393322
ENSG00000239906.1   0.11017422  -0.78784826  -0.78784826   3.94072835  -0.78784826
ENSG00000241860.6  -1.35704877   0.13799168  -1.01650998   0.10748697   0.57649536

Next I calculated the IQR and selected the top 10% of genes for plotting. Is the Z-score calculation above correct? I am a little confused about how to read it. Why does the gene ENSG00000238009.6 have negative values in the first three samples and positive values in Sample4 and Sample5? For the same gene, logCPM is greater than 2 in all samples, so why is the Z-score negative?
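The IQR-based selection is not shown in the post. A minimal sketch of one way to do it, assuming the IQR is computed per gene on the logCPM values (an assumption, since the post does not say which matrix was used) and using a simulated matrix, logCPM_toy, as a hypothetical stand-in for the real data:

```r
# Hypothetical stand-in for the logCPM matrix: 100 genes x 5 samples
set.seed(1)
logCPM_toy <- matrix(rnorm(500, mean = 3), nrow = 100,
                     dimnames = list(paste0("gene", 1:100),
                                     paste0("Sample", 1:5)))

# Per-gene interquartile range across samples
gene_iqr <- apply(logCPM_toy, 1, IQR)

# Names of the 10% of genes with the largest IQR
n_top <- ceiling(0.10 * nrow(logCPM_toy))
top_genes <- names(sort(gene_iqr, decreasing = TRUE))[seq_len(n_top)]

# Row-wise Z-scores for the selected genes only
z_top <- t(scale(t(logCPM_toy[top_genes, ])))
```

Selecting on logCPM rather than on the row-scaled matrix is a deliberate choice in this sketch: after row-wise scaling every gene has standard deviation 1, so a spread-based filter has much less to distinguish.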

RNA-Seq heatmap zscore logcpm • 6.1k views
5.6 years ago
Michael 54k

scale is a generic function whose default method centers and/or scales the columns of a numeric matrix.

Your columns are your samples, so with the two transposes you are scaling rows (genes) instead of columns. So no, not this:

t(scale(t(logCPM)))

A Z-score is, afaik, computed per column, i.e. per sample, scaling with respect to the overall variation within each sample. So this should be correct:

Z.score <- scale(logCPM)

How the gene ENSG00000238009.6 has negative values in the first three samples and positive values in the sample 4 and sample 5?

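You scaled by row, which sets each row mean to 0. If some values in a row are > 0, then others have to be < 0 for the row to sum to 0. A negative Z-score does not mean the logCPM is negative, only that it is below that gene's mean across samples.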
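A small worked example of this, using only the five logCPM values shown in the question for ENSG00000238009.6. This is a sketch: the exact numbers will differ from the table in the question if the full matrix contains more samples than the five shown, which is an assumption here.

```r
# logCPM values for ENSG00000238009.6, copied from the question
x <- c(Sample1 = 2.7079614, Sample2 = 2.1523650, Sample3 = 2.3686665,
       Sample4 = 3.6549466, Sample5 = 3.5064449)

# Row-wise Z-score: subtract the gene's own mean, divide by its own sd
z <- (x - mean(x)) / sd(x)

# The same thing via scale() on a one-row matrix, as in t(scale(t(logCPM)))
m <- matrix(x, nrow = 1, dimnames = list("ENSG00000238009.6", names(x)))
z_scale <- t(scale(t(m)))

print(round(z, 3))  # values relative to this gene's mean, in sd units
print(sign(z))      # below the mean in Sample1-3, above it in Sample4-5
```

All five logCPM values are above 2, but the three that lie below the gene's mean get a negative Z-score.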

But in the edgeR tutorial I see t(scale(t(logCPM))) used to show the expression data in a heatmap. And in this post [C: Adding column annotation to heatmap using Complexheatmap], Kevin scaled the data to Z-scores the same way.


That may be. It just depends on whether you want to scale columns, rows, or both. Either way you get a Z-score: one is per row (per gene), the other per column (per sample). Both normalizations make sense in some settings. Our confusion is really about which scaling is commonly called a "Z-score", and in fact both are.
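To make the difference concrete, here is a minimal sketch that applies both scalings to the six genes and five samples shown in the question. The names z_by_sample and z_by_gene are just illustrative.

```r
# logCPM values copied from the question (six genes, five samples)
genes <- c("ENSG00000243485.5", "ENSG00000237613.2", "ENSG00000238009.6",
           "ENSG00000239945.1", "ENSG00000239906.1", "ENSG00000241860.6")
logCPM <- matrix(c(
  0.3915381, 0.3915381, 0.3915381, 1.0122705, 0.3915381,
  0.3915381, 0.3915381, 0.3915381, 0.3915381, 0.3915381,
  2.7079614, 2.1523650, 2.3686665, 3.6549466, 3.5064449,
  0.3915381, 0.3915381, 0.3915381, 0.3915381, 0.3915381,
  0.8750013, 0.3915381, 0.3915381, 2.9372348, 0.3915381,
  3.2731113, 5.3940606, 3.7562189, 5.3507849, 6.0161469),
  nrow = 6, byrow = TRUE,
  dimnames = list(genes, paste0("Sample", 1:5)))

# Per column (per sample): each sample gets mean 0 and sd 1
z_by_sample <- scale(logCPM)

# Per row (per gene): each gene gets mean 0 and sd 1 across samples
z_by_gene <- t(scale(t(logCPM)))

print(round(colMeans(z_by_sample), 10))
print(round(rowMeans(z_by_gene), 10))
```

Two of these genes have the same value in all five samples, so their row-wise standard deviation is zero and their row-wise Z-score is undefined. Such genes are usually filtered out before row scaling.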


Thanks a lot for the reply.
