Hi everyone,
I want to draw a heatmap of my RNA-seq data and was wondering, what kind of normalizations or transformations of the data are 'acceptable'.
The data is from an RNA-seq experiment with multiple treatments. I am using R, RPKM data from DEseq and pheatmap in this case, but the question is agnostic from this. The log2 data from the example plot is below.
1) Of course, plotting unscaled RPKM data is not satisfactory because of the highly different expression of genes.
2) People frequently use log-transformed RPKM data which mostly overcomes this problem, while still preserving information about different expression strength between genes and changes across conditions. This already gives already quite good results! However, depending on the color sheme, the attention is still drawn to the highest and lowest expressed genes, while the changes of medium-expressed genes don't stand out and tend to be overlooked.
Example: The expression changes in gene 9 and 10 are equally strong as for gene 5, but barely visible! Also, Gene 3 has the strongest expression changes, still appears just moderate.
I tried a few color shemes, but none could really avoid this problem, since the range of values spans from ~ -3 to ~ 12
3) The pheatmap package has a scale="column" option, that normalizes all columns (here: expression per gene) to Mean = 0 and SD = 1 by a simple transformation:
function (x)
{
m = apply(x, 1, mean, na.rm = T)
s = apply(x, 1, sd, na.rm = T)
return((x - m)/s)
}
I think, this transformation cannot be applied in this context, since the changes per gene are scaled and you cannot tell anymore, if the changes between treatments are small or large.
4) I figured that just centering the means (mean = 0 by substracting column means) WITHOUT applying the scaling (dividing by SD) might be a good transformation, since the strength of expression changes is preserved but the scale range is the same for every gene unlike in the 2nd example.
I really like the 4th approach, but I am not sure how this would be considered by statisticans or reviewers. Any thoughts, concerns and alternative approached on this? I am open to any comments!
Of course, an alternative approach would be plotting (log2-) fold changes relative to the control condition, which is then usally not shown, because it is = 1 for all cases. Personally I don't find this intuitive, but I will opt into it if all else fails.
log2 transformed RPKM data
| | condition 1| condition 2| condition 3| condition 4|
|:-------|-----------:|-----------:|-----------:|-----------:|
|Gene 1 | 7,8581343| 7,7760309| 5,9845672| 5,4520139|
|Gene 2 | -0,5718784| -0,2093000| -3,6001834| -3,2168730|
|Gene 3 | -0,8589436| -0,8816851| 4,6821235| 3,6949570|
|Gene 4 | 0,3470576| 0,6363893| 0,9525574| 2,9178951|
|Gene 5 | 10,4977433| 12,6749883| 9,3837503| 10,0322095|
|Gene 6 | -2,7455632| -0,7356815| -0,1536280| 1,5129004|
|Gene 7 | 0,4665848| -1,8998796| -0,3328291| 2,0449151|
|Gene 8 | -1,4584591| -3,0859498| -0,8403647| 0,2276394|
|Gene 9 | 4,3642826| 2,1160312| 4,0913918| 4,4801707|
|Gene 10 | 6,6822088| 6,9493600| 4,8194632| 4,6275799|
Hello Kevin, thank you for this really fast and comprehensive answer!
The whole sequencing and computational pipeline was run by a collaboration partner, so I have to consider this a black box. I am just reading the RPKM, l2fc and p values from the generated reports. Thus, i just intend to find a good representation of the data, without doing any testing.
Yeah, I tend to over-worry a lot ;-)! I will look at the zFPKM function!
Also thank you for pointing out the clustering issue! I tried clustering, but left it out for this minimal example.
Heinrich