Question

Strategy For Rna Seq Normalization Across Different Tissue Samples With Different Cell Density.

4

10.2 years ago

k.nirmalraman ★ 1.1k

Hello All,

This is another question about RNA-Seq data normalization. I have often read papers that use ERCC spike-ins as controls for identifying experimental bias that may arise from RNA species length and concentration. Then I came across this paper: Revisiting Global Gene Expression Analysis, where the authors discuss "Transcriptional Amplification".

The key message is summarized in this figure:

[Figure from the paper; image not available]

They have demonstrated using cell lines that with the usual RNA-Seq experimental design and normalization, we may not detect differentially expressed genes effectively (?) in cases where there is transcriptional amplification. The proposed solution is to add ERCC spike-in standards in proportion to cell number and then normalize to them.
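To make the issue concrete, here is a minimal toy sketch in Python (NumPy). All numbers are made up for illustration and are not taken from the paper: four genes are uniformly doubled per cell, and the spike-in amount per cell is assumed to be identical in both conditions. Scaling to total endogenous signal hides the doubling, while scaling to the spike-in recovers it.

```python
import numpy as np

# Hypothetical per-cell transcript abundances for 4 genes (illustrative values only).
control_per_cell = np.array([10.0, 20.0, 30.0, 40.0])
amplified_per_cell = 2.0 * control_per_cell  # uniform 2x transcriptional amplification

# Assumption: spike-in is added in proportion to cell number,
# so the amount per cell is the same in both conditions.
spike_per_cell = 5.0

def library_fractions(genes, spike):
    """Simulate sequencing: each species gets a share of reads equal to its share of total RNA."""
    total = genes.sum() + spike
    return genes / total, spike / total

ctrl_genes, ctrl_spike = library_fractions(control_per_cell, spike_per_cell)
amp_genes, amp_spike = library_fractions(amplified_per_cell, spike_per_cell)

# Library-size style normalization: scale endogenous genes to the same total.
fc_library = (amp_genes / amp_genes.sum()) / (ctrl_genes / ctrl_genes.sum())

# Spike-in normalization: express each gene relative to the spike-in signal.
fc_spike = (amp_genes / amp_spike) / (ctrl_genes / ctrl_spike)

print("fold change, library-size normalized:", fc_library)
print("fold change, spike-in normalized:   ", fc_spike)
```

In this toy setup the library-size fold changes are all 1 and the spike-in fold changes are all 2. For tissue, the spike-in could not be matched to cell number, which is exactly the difficulty raised here.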

I am wondering how to handle this scenario when the experiment is performed on tissue samples, where we cannot determine the number of cells. We start with the same quantity of total RNA for library preparation and do not account for spatial gene expression patterns or transcriptional amplification.

Are there any controls or data handling procedures already in use? Any new strategy would be nice to discuss.

Maybe we can use ERCC spike-ins to normalize for tissues as well, but how?

rna-seq normalization • 7.2k views

ADD COMMENT • link updated 10.1 years ago by Charles Warden 8.2k • written 10.2 years ago by k.nirmalraman ★ 1.1k

1

I think this paper raises some very important issues. It has made us encourage all collaborators to do ERCC spike-ins by default now.

However, your question is very pertinent. I don't think there is any way to do it for tissue. The authors of the paper suggest DNA quantification as a surrogate for cell number, but I'm not sure how that would work in practice.

ADD REPLY • link 10.2 years ago by Chris Cole ▴ 800

1

I feel like this experimental design is, in a way, trying to answer two separate questions with one approach.

Usually, in a standard differential expression experiment where transcriptional amplification is not considered, you are trying to find a subset of genes that are differentially expressed, indicating, e.g., activation of a certain pathway. We will call that Question 1.

Using spike-ins like this tells you whether there is transcriptional amplification. We will call that Question 2.

It seems to me that a side effect of answering Question 2 like this is that you lose some information about Question 1. In at least the simple schema of the figure, all of the genes are going to be called differentially expressed because there is universal amplification.

But to get back at Question 1, I believe you would still have to do a second normalization of the data using a more conventional approach that normalizes the two conditions to the same level, under the assumption that you would still expect to see a proportional increase in certain genes if certain pathways were activated in a test condition, even if the cells themselves were bigger and had more RNA in them.
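One possible way to picture such a second normalization is sketched below in Python (NumPy). This is only an illustration, not a method taken from the paper: it assumes the data are already spike-in normalized, that most genes share the same global shift, and that the median log2 fold change is a reasonable estimate of that shift. The values are hypothetical.

```python
import numpy as np

# Hypothetical spike-in-normalized expression values (arbitrary per-cell units).
control = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Global 2x amplification, with the last gene induced a further 4x on top of it.
treated = control * 2.0
treated[4] *= 4.0

# Question 2: absolute change per cell, including the global amplification.
log2_fc_absolute = np.log2(treated / control)

# Assumption: the median log2 fold change estimates the global shift.
global_shift = np.median(log2_fc_absolute)

# Question 1: change relative to the bulk of the transcriptome.
log2_fc_relative = log2_fc_absolute - global_shift

print("absolute log2 FC:", log2_fc_absolute)
print("global shift:    ", global_shift)
print("relative log2 FC:", log2_fc_relative)
```

Here every gene shows an absolute log2 fold change of at least 1, but after removing the global shift only the last gene stands out. If a large fraction of genes changed in different directions, the median would no longer be a safe estimate of the global shift.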

I don't know a better solution, though.

ADD REPLY • link 10.2 years ago by Michele Busby ★ 2.2k

0

Certainly, in an RNA-Seq experiment of case vs control, we would like to capture both Q1 and Q2. If we conveniently ignore Q1 (or Q2), how relevant and accurate is our final set of genes to our experimental objective?

I can agree that we would have to perform a two-step normalization as you propose, but I wonder how this would transform the data...

ADD REPLY • link 10.2 years ago by k.nirmalraman ★ 1.1k

score 1 · Answer 1 · 2014-03-03

I think you've already got some good comments, but this is what I would say off the top of my head:

1) The y-axis in the first part of 1B and 1D is "counts". I would typically look at differential expression using RPKM or FPKM values. This normalizes for the total number of reads per sample (which might have been influenced by the total expression per sample). I'm guessing that this is what the 2nd plot is supposed to represent, but I haven't had a chance to read this particular paper.

2) In the example above, the fold-change values for the problematic sample are small and would have been ignored with a fold-change cutoff of 2 (which on a log2 scale would be 1) or even 1.5 (which on a log2 scale would be 0.58). Additionally, low-coverage genes can be problematic when it comes to fold-change values, so I add a small rounding factor to the RPKM values in order to effectively ignore those problematic situations.

In other words, this is the strategy that I would use:

Normalization: log2(RPKM + 0.1)

Differential Expression: |Fold-Change| > 1.5 and FDR < 0.05

This would filter out the genes shown in Figures 1C and 1D (based upon the specific fold-change values shown, which I think is a fair representation of where that problem is most likely to occur). If you are curious where I came up with that rounding factor, please check the following paper: http://bioinfo.aizeonpublishers.net/content/2013/6/285-292.html
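As a rough sketch of that strategy in Python (NumPy), with made-up counts, gene lengths, and FDR values: the function name `rpkm` and all inputs are hypothetical, and the FDR values are assumed to come from whatever statistical test you run separately.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase of transcript per million mapped reads (genes x samples)."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1e3
    millions_mapped = counts.sum(axis=0) / 1e6
    return counts / lengths_kb[:, None] / millions_mapped

# Hypothetical counts: rows are genes, columns are (control, treated).
counts = np.array([
    [100, 400],        # gene A
    [200, 220],        # gene B
    [999700, 999380],  # all remaining reads, lumped together
])
gene_lengths_bp = np.array([1000, 2000, 1000])

# Normalization: log2(RPKM + 0.1)
log_expr = np.log2(rpkm(counts, gene_lengths_bp) + 0.1)

# Differential expression: |fold-change| > 1.5 (about 0.58 on a log2 scale) and FDR < 0.05
log2_fc = log_expr[:, 1] - log_expr[:, 0]
fdr = np.array([0.01, 0.20, 0.90])  # assumed values from a separate test
significant = (np.abs(log2_fc) > np.log2(1.5)) & (fdr < 0.05)

print("log2 fold change:", np.round(log2_fc, 3))
print("called significant:", significant)
```

With these toy numbers only gene A passes both filters. Note that RPKM scales each sample to its own read total, so this approach addresses relative changes and would not by itself reveal a global amplification.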