How to graphically tell if data has been normalized?
5.7 years ago
Arindam Ghosh ▴ 510

Is there a way to tell graphically whether my RNA-seq data has been normalized? I used DESeq2 to normalize my feature count data and then plotted the log-transformed normalized counts as box plots. Is this a good way to assess it? I have seen this approach used in some articles. With median normalization, the median line would sit at the same level in every sample, and with upper-quartile normalization the third quartile would. DESeq2, however, uses the RLE (relative log expression) method, so how can I explain its effect graphically? One thing that is clear from this plot is that most of the counts across all samples now fall within a comparable range.

[Figure: box plot of unnormalized counts]

[Figure: box plot of normalized counts]

DESeq2 RNA-Seq normalization • 9.3k views
5.7 years ago

There are different ways to gauge [graphically] how effective a normalisation has been. Looking at your second plot, it would appear in this case that normalisation has been successful.
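As a rough illustration of what such box plots are checking under RLE-style normalisation, here is a minimal base-R sketch on simulated counts. The data, the sequencing depths and the helper name `median_of_ratios` are all made up for this example; with a real DESeq2 object one would normally take `sizeFactors(dds)` and `counts(dds, normalized=TRUE)` instead of recomputing anything by hand.

```r
# Simulated counts (NOT the poster's data): 200 genes x 4 samples that differ
# only in sequencing depth.
set.seed(1)
n_genes <- 200
base_expr <- rgamma(n_genes, shape = 2, scale = 50)
depth <- c(S1 = 0.5, S2 = 1, S3 = 2, S4 = 4)
counts <- sapply(depth, function(d) rpois(n_genes, lambda = base_expr * d))

# Median-of-ratios size factors: each sample is compared with a per-gene
# geometric mean, and the median ratio is taken as its size factor.
# Genes with a zero in any sample have a geometric mean of 0 and are skipped.
median_of_ratios <- function(counts) {
  log_geo_means <- rowMeans(log(counts))
  usable <- is.finite(log_geo_means)
  apply(counts, 2, function(sample_counts) {
    exp(median(log(sample_counts[usable]) - log_geo_means[usable]))
  })
}

size_factors <- median_of_ratios(counts)
norm_counts <- sweep(counts, 2, size_factors, "/")

# Relative log expression: each gene's log value minus its median across samples.
log_norm <- log2(norm_counts + 1)
rle_values <- sweep(log_norm, 1, apply(log_norm, 1, median), "-")
rle_medians <- apply(rle_values, 2, median)

# Before/after box plots plus the RLE plot.
par(mfrow = c(1, 3))
boxplot(log2(counts + 1), main = "Raw", ylab = "log2(count + 1)")
boxplot(log_norm, main = "Normalised", ylab = "log2(normalised count + 1)")
boxplot(rle_values, main = "RLE", ylab = "Deviation from gene median")
abline(h = 0, lty = 2)
```

If normalisation has worked, the boxes in the second panel line up and the RLE boxes in the third panel are centred close to zero with similar spread.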

Apart from box-and-whisker plots, one can also do:

Violin plot

Using regularised log or variance stabilised counts:

library(reshape2)
library(ggplot2)

# loggedCounts: genes x samples matrix of transformed counts,
# e.g. assay(rlog(dds)) or assay(vst(dds))
violinMatrix <- reshape2::melt(loggedCounts)
colnames(violinMatrix) <- c("Gene", "Sample", "Expression")

ggplot(violinMatrix, aes(x=Sample, y=Expression)) + geom_violin() + theme(axis.text.x = element_text(angle=45, hjust=1))

[Figure: violin plot]


pairwise sample scatter plots

Using regularised log or variance stabilised counts:

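library(car)
# car >= 3.0 expects the diagonal panel type as a list; older versions accepted diagonal="boxplot"
scatterplotMatrix(loggedCounts, diagonal=list(method="boxplot"), pch=".")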

[Figure: pairwise sample scatter plot matrix]


Dispersion plot

Looking at the unlogged, normalised counts, a dispersion plot gives a good idea of how well dispersion has been modelled as a function of the mean of normalised counts.

options(scipen=999)
plotDispEsts(dds, genecol="black", fitcol="red", finalcol="dodgerblue", legend=TRUE, log="xy", cex.axis=0.8, cex=0.3, cex.main=0.8, xlab="Mean of normalised counts", ylab="Dispersion")
options(scipen=0)

[Figure: dispersion plot]

------------------------------


More for outlier detection:

Bootstrapped hierarchical clustering (unsupervised - i.e. entire dataset)

Using regularised log or variance stabilised counts:

require(pvclust)
pv <- pvclust(loggedCounts, method.dist="euclidean", method.hclust="ward.D2", nboot=100)
plot(pv)

[Figure: bootstrapped dendrogram]

Principal components analysis
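No code was given for this one, so here is a minimal sketch using base R's `prcomp()` on a simulated stand-in for `loggedCounts`. The matrix, group labels and effect size are invented for illustration; with a DESeq2 transform object, `plotPCA(vsd, intgroup="condition")` would be the usual shortcut (the column name `condition` being an assumption about the sample sheet).

```r
# Simulated stand-in for loggedCounts: 500 genes x 6 samples in two groups.
set.seed(42)
n_genes <- 500
group <- rep(c("A", "B"), each = 3)
toy <- matrix(rnorm(n_genes * 6, mean = 8, sd = 0.5), nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0(group, 1:3)))
# Shift the first 100 genes in group B so that there is a group effect to find.
toy[1:100, group == "B"] <- toy[1:100, group == "B"] + 3

# prcomp() expects samples in rows, so the matrix is transposed.
pca <- prcomp(t(toy), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

plot(pca$x[, 1], pca$x[, 2],
     col = ifelse(group == "A", "dodgerblue", "red"), pch = 19,
     xlab = sprintf("PC1 (%.0f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.0f%%)", 100 * var_explained[2]))
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
```

Samples that sit far from the rest of their group on the first few components are the ones worth a closer look as potential outliers.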

Symmetrical sample heatmap

Using regularised log or variance stabilised counts:

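library(gplots)
library(RColorBrewer)

# hmcol was not defined in the original post; this palette is an assumption
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)

distsRL <- dist(t(loggedCounts))  # Euclidean distances between samples
mat <- as.matrix(distsRL)
# metadata: sample sheet with IDlist and condition columns, in the same order as the columns of loggedCounts
rownames(mat) <- colnames(mat) <- paste(metadata$IDlist, metadata$condition, sep=", ")
hc <- hclust(distsRL)
heatmap.2(mat, Rowv=as.dendrogram(hc), symm=TRUE, trace="none", col=rev(hmcol), cexRow=1.0, cexCol=1.0, margins=c(13, 13), key=FALSE)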

[Figure: symmetrical sample heatmap]

Kevin


Thank you very much. This is really going to help me.

One more thing I need to clarify. For box/violin plots, should the first quartile, third quartile, or median be at the same level for all samples? For the same data set I tried a different low-count filter, and afterwards the third quartile and the median are level across samples, but the first quartile is level only within group A; for group B it is much lower.


It is normal to exclude low counts, but which cut-off did you use?


I was experimenting with several like:

1) rowSums(count) > 0

2) rowMeans(count) > 1

3) rowSums(count>0)>10

4) apply(count, 1, function(x){all(x>0)})

The plots I provided above are from #4, which is from the DESeq2 manual (if I remember correctly). Initially it was good to see the number of genes being reduced, but after the DE analysis I realised I had lost some important genes only because they had a count of 0 in 1 out of 45 samples.

The results of #1,2 & 3 were similar.
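To make the difference between the four filters concrete, here is a small sketch on a hand-made count matrix. The 12 samples and the four gene names are invented for illustration; only the filter expressions come from the list above.

```r
# Toy count matrix (made up): 4 genes x 12 samples.
count <- rbind(
  all_zero   = rep(0, 12),            # never detected
  one_zero   = c(0, rep(100, 11)),    # well expressed, a single zero
  sparse_low = c(1, 2, rep(0, 10)),   # a few stray reads
  expressed  = rep(50, 12)            # detected everywhere
)
colnames(count) <- paste0("S", 1:12)

# The four filters from the post, each giving one TRUE/FALSE per gene.
filters <- list(
  f1 = rowSums(count) > 0,
  f2 = rowMeans(count) > 1,
  f3 = rowSums(count > 0) > 10,
  f4 = apply(count, 1, function(x) all(x > 0))
)

# Number of genes each filter keeps, and which genes those are.
kept <- sapply(filters, sum)
print(kept)
print(lapply(filters, function(keep) rownames(count)[keep]))
```

In this toy case #1 keeps three genes, #2 and #3 keep two, and #4 keeps only one: the well-expressed gene with a single zero is lost under #4 alone, which is the same effect described above for the 45-sample data set.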


Hey, well, you definitely should exclude anything with just 0 counts across all samples (#1). I can see why #1, #2, #3 give similar results. #4 may be too stringent, as I think that it should be expected that some samples will return a 0 count value (#4 requires that all samples have counts>0 ? ).


So what's your opinion on using these? Is there something else I should try? Apart from this, if I read correctly, DESeq2 performs further cleaning.


I think that any of your first three is valid; I have seen them used in various studies. #4 is too stringent: you would need a good reason for using such a threshold, for example a statistical test in which zero values are not permitted.


I think I need a bit more help with this. I have been going through the graphs generated by the first three methods and something seems to be wrong, especially in the box plot of normalized counts and the density plot. Please see the attached files. The plots are similar for all three, so I am showing only the ones generated by method #3. In the normalized count box plot, the median and upper quartile are level across samples but the lower quartile is not. And in the density plot I would have expected the curves to overlap.

[Attached plots: boxplot_normalized, density, Dispersion, ecdf, MA, pca, sample_heatmap1]
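One way to put numbers on what the box and density plots show is to tabulate the per-sample quartiles and overlay the densities. The sketch below uses simulated log-scale values, not the attached data: two "B" samples are deliberately given extra-low values in their bottom 30% of genes, which is an assumption chosen only to mimic the pattern described (median and upper quartile level, lower quartile not). The helper `make_sample` is hypothetical.

```r
# Simulated log2 normalised counts (NOT the attached data).
set.seed(7)
n_genes <- 1000
make_sample <- function(extra_low = FALSE) {
  x <- rnorm(n_genes, mean = 8, sd = 2)
  if (extra_low) {
    low <- x < quantile(x, 0.3)   # bottom 30% of genes in this sample
    x[low] <- x[low] - 3          # push them lower still
  }
  x
}
log_norm <- cbind(A1 = make_sample(), A2 = make_sample(),
                  B1 = make_sample(TRUE), B2 = make_sample(TRUE))

# Per-sample quartiles: rows are 25%, 50%, 75%; columns are samples.
quartiles <- apply(log_norm, 2, quantile, probs = c(0.25, 0.5, 0.75))
print(round(quartiles, 2))

# Overlaid density curves, one per sample.
dens <- lapply(colnames(log_norm), function(s) density(log_norm[, s]))
y_max <- max(sapply(dens, function(d) max(d$y)))
plot(dens[[1]], xlim = range(log_norm), ylim = c(0, y_max),
     main = "Per-sample densities", xlab = "log2 normalised count")
for (j in 2:length(dens)) lines(dens[[j]], col = j)
legend("topright", legend = colnames(log_norm), col = seq_along(dens), lty = 1)
```

In the simulated table the medians and third quartiles agree across all four samples while the first quartile drops for B1 and B2, and the density curves diverge only in the low-count tail. Seeing the same pattern in real data would point to a difference among lowly expressed genes rather than a failure of the scaling itself.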
