Log2(x + 1)-transformed gene expression data are not normally distributed
18 months ago
rin • 30

Hi all!

I am using raw count data from TCGA. As I want to compute the Z-score between tumor and normal samples, I first have to ensure that my data are normally distributed. So far, I have downloaded the raw counts, normalized them for GC content using the TCGAanalyze_Normalization() function from TCGAbiolinks, and log2(x+1)-transformed them. However, the resulting distribution is right-skewed and definitely not normal, as seen in qqnorm() plots.

How could I tackle that? I have been trying to figure it out for days, but I cannot find a solution.

Thanks a lot, R.

15 months ago
Benn 6.9k
Netherlands

Some data cannot be transformed into a normal distribution. RNA-seq count data fits a Poisson distribution or a negative binomial distribution. There is a great answer here about how RNA-seq data is distributed.
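To illustrate the point, here is a minimal sketch (not real TCGA data) that simulates negative binomial counts as a gamma-Poisson mixture and applies log2(x+1). The mean of 5, the dispersion of 2, and all function names are assumptions chosen for illustration only. The log transform reduces the skew of the raw counts, but the spike of zeros at log2(0+1) = 0 remains, which is one reason such data do not look normal in a Q-Q plot.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method.

    Suitable for modest lam only (exp(-lam) underflows for very large lam).
    """
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(mean, dispersion, rng):
    """Draw one NB count as a gamma-Poisson mixture.

    The resulting variance is mean + dispersion * mean**2.
    """
    shape = 1.0 / dispersion
    scale = mean * dispersion  # gamma mean = shape * scale = mean
    return sample_poisson(rng.gammavariate(shape, scale), rng)


def skewness(values):
    """Population skewness: third central moment over variance**1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical low-expression gene: mean 5, dispersion 2 (illustrative values).
rng = random.Random(0)
counts = [sample_negative_binomial(5.0, 2.0, rng) for _ in range(5000)]
logged = [math.log2(c + 1) for c in counts]
zero_fraction = sum(1 for c in counts if c == 0) / len(counts)

print("skewness of raw counts:", skewness(counts))
print("skewness after log2(x+1):", skewness(logged))
print("fraction of exact zeros:", zero_fraction)
```

With other means and dispersions the picture changes, so this is only meant to show why a single transformation is not guaranteed to give normality.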

RNA-seq data are typically fitted with a Poisson or NB distribution. Claiming that they fit those distributions is a bit strong, though.

15 months ago
Freiburg, Germany

This is expected: RNA-seq data should be right-skewed or multimodal.

@Devon Ryan @b.nota @russhh Really helpful link and answers! Thank you! The reason I want them to be normally distributed is to assess the change between tumor and normal expression by computing a Z-score. Would that be possible / have the same interpretation if they fit a Poisson or NB distribution?
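For reference, the kind of Z-score being asked about could be sketched as below, assuming it means "tumor expression of one gene relative to the mean and standard deviation of the normal samples, on the log2(x+1) scale". The function name and the toy counts are hypothetical. Note that the number itself can be computed for any distribution; it is only the reading of it as a normal-tail probability that relies on normality.

```python
import math
import statistics


def tumor_z_scores(tumor_counts, normal_counts):
    """Z-score each tumor value against the normal samples of the same gene.

    Assumption: counts are log2(x+1)-transformed first, and the sample
    standard deviation of the normal group is used as the scale.
    """
    normal_log = [math.log2(c + 1) for c in normal_counts]
    mu = statistics.mean(normal_log)
    sd = statistics.stdev(normal_log)  # needs at least two normal samples
    return [(math.log2(c + 1) - mu) / sd for c in tumor_counts]


# Toy counts for a single gene (made up so the arithmetic is easy to follow):
# the normal samples give log2(x+1) = 2, 3, 4, so mean 3 and sd 1.
normal = [3, 7, 15]
tumor = [63, 7]  # log2(x+1) = 6 and 3

print(tumor_z_scores(tumor, normal))
```

A per-gene test from a dedicated count-based or voom-style workflow would usually be preferred over reading such values as p-values.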

Try using limma or edgeR for this kind of analysis.

