Proper normalization method
2
2
Entering edit mode
9.4 years ago
Cytosine ▴ 460

Hey everybody,

I'm getting my hands dirty with comparing count data from a couple of RNA-Seq samples. The goal is pretty simple... Normalize the data and compare the expression levels (not looking for DE genes, just expression levels).

I have X total counts in sample 1 and Y total counts in sample 2. They are stored in a bed file of counts per 100 bp windows.

Is it enough to divide the amount of counts in each row of the sample bed files by the total amount of counts from that sample (e.g. for i in rows_in_file: X[i]/X and Y[i]/Y)? Does this make the obtained values comparable between samples?

Consider also that I do not have information on gene structure (e.g. length for RPKM/FPKM).

Would also appreciate hints on other suitable methods for this task. (statistics not being my strong side... yet :) )

Thanks!

RNA-Seq statistics normalization counts • 5.3k views
ADD COMMENT
0
Entering edit mode

I guess it depends on exactly what you mean by "compare the expression levels". Do you mean between genes while accounting for sample differences?

ADD REPLY
0
Entering edit mode

Yes, that is essentially my goal.

ADD REPLY
3
Entering edit mode
9.4 years ago
Asaf 10k

(not looking for DE genes, just expression levels)

Actually, getting DE genes is much easier than getting expression levels. The relation between counts per gene and the actual amount of mRNAs in the sample is not trivial. When you compare the counts per gene between two samples you assume that the change you see in the counts represents the change in mRNAs because the biases should be the same. Try taking the same example and preparing RNA-seq libraries using two different protocols (we did it), you'll see that some genes (or a lot of genes) will have different counts in the two libraries (and this difference will be reproducible) this means that counts doesn't truly represents the mRNAs.

EDIT:

Following the comments I'll add some more thoughts. I came across a similar situation, luckily for me the total number of reads in the two libraries was about the same so I could just compare the raw counts between the libraries. If the two libraries look very much the same (you can plot the base-by-base correlation) then dividing each count in the total (mapped) number of counts in the library is not a bad idea.

If the libraries are different (below 0.9 or 0.85 Spearman correlation) I wouldn't divide in the total number of reads and would prefer the DESeq normalization method. In a nutshell, you assume that most of the positions in the genome have the same expression levels in the two libraries. For each position you compute Xi/Yi and take the median of these quotients as your normalization factor (multiply Yj by the median for each j in the genome). You are now able to compare Xj with the normalized Yj.

ADD COMMENT
0
Entering edit mode

Well my goal is to compare expression levels between samples because I do not have structural information about genes. I do agree that the counts I get from a pileup do not represent the actual amount of mRNA in the sample, but for the sake of comparing between samples prepared with the same protocol, the similar systematic bias present in both samples and introduced by the RNA extraction protocol should be negated with between-samples normalization of their counts.

The actual levels of mRNA in the sample do not interest me and I believe to get them, a wet-lab approach would be more suitable than an in-silico one.

ADD REPLY
0
Entering edit mode

Can't you predict transcripts from the RNA-seq data?

ADD REPLY
0
Entering edit mode

I can, but that's not what I'm trying to achieve here.

ADD REPLY
0
Entering edit mode

I think I understand what you're trying to do now, I'll edit my answer.

ADD REPLY
0
Entering edit mode

Thank you for this one! My libraries differ quite a lot, so comparison of raw counts won't cut it for me. I'll try out the normalization method you mentioned. Just a detail though... Do you mean to multiply both Yj and Xj by the median for each j in the genome or just Yj?

ADD REPLY
0
Entering edit mode

The complete method is described in the DESeq manual/paper (for genes rather than positions but it should work as well). You should multiply only Yj in the median quotient so it will comparable to Xj (write it down, the equation will make sense)

ADD REPLY
2
Entering edit mode
9.4 years ago

The problem with total count normalization is that it's sensitive to any change in highly-expressed species (e.g., a difference in rRNA depletion will cause incorrect normalization). You have three popular options to handle this type of data: (1) the DESeq/DESeq2 normalization method, RLE; (2) the edgeR normalization method, TMM; and (3) quantile normalization. In most cases, these produce similar results. In your case, I'd personally try quantile normalization followed by a paired T-test. You'll still want to have a look at the results along side the raw data and see if it makes any sense at all.

ADD COMMENT
0
Entering edit mode

A step closer to getting in grips with biostatistics. Thanks Devon!

ADD REPLY
0
Entering edit mode

Happy to help! After you get the results, please post a comment and let me know how well that actually works in practice.

ADD REPLY
2
Entering edit mode

Following up on this topic... Devon, I tried your suggestion for a quantile normalization and the processed data seemed to act a bit strange (lot of zero values appearing). Then I also tried your and Asaf's suggestion about the DESeq normalization and it seemed to perform much better (very few something->nothing and vice-versa conversions), so I decided this is the one I'm going to use. Thumbs up to both of you for this suggestion!

ADD REPLY

Login before adding your answer.

Traffic: 2463 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6