Question

RNA-seq, why normalize for library size?

5.4 years ago

joselu ▴ 110

Hello. In an RNA-seq experiment, I do not understand why the number of reads for each sample should be normalized. Aren't the differences in read counts between samples due to differential expression of the genes? Why should we normalize this data? Thank you.

RNA-Seq • 18k views

ADD COMMENT • link updated 5.4 years ago by Bastien Hervé 5.3k • written 5.4 years ago by joselu ▴ 110

Please read: http://netiquette.wikia.com/wiki/Rule_number_2_-_Do_not_use_all_caps and edit your title

ADD REPLY • link 5.4 years ago by Ram 43k

Please do not write in ALL CAPS, there is no need to yell.

ADD REPLY • link 5.4 years ago by h.mon 35k

score 21 · Answer 1 · 2018-11-19

Several biases are at play in an RNA-seq experiment: library size, gene length, the RNA population composition of each condition, and gene GC content.

Two of these can be ignored if you only compare the same gene across conditions, because they are inherent to the gene: gene length and gene GC content.

Gene length: The raw counts of two genes cannot be compared directly if gene A is twice as long as gene B. Because of its length, the longer gene has a greater chance of being sequenced than the shorter one, so for the same expression level the longer gene will get more reads than the shorter one (pub1).
Gene GC content: I do not have the full explanation for this bias. For two genes with different GC content, the one whose GC content is closest to 40% will be sequenced more than the other (pub2).

The other biases are "technical biases", due to your samples and sequencing method.

Library size: the best-known bias. You create two libraries for two conditions with the same RNA composition. The second library works much better than the first: you get 12 000 000 reads for condition A and 36 000 000 reads for condition B. You will then have three times (36 000 000 / 12 000 000 = 3) more reads for each RNA in condition B than in condition A. (pub3)
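The library-size effect can be checked with a few lines of Python. This is only a minimal sketch: the three genes and their counts are made up for illustration, chosen so that the totals match the 12 000 000 and 36 000 000 reads of the example, and the scaling shown is plain counts-per-million, not what edgeR or DESeq2 do internally.

```python
# Hypothetical raw counts for the same three genes in two libraries.
# Both libraries have the same RNA composition; B was sequenced 3x deeper.
counts_a = {"gene1": 1_200_000, "gene2": 6_000_000, "gene3": 4_800_000}    # 12,000,000 reads
counts_b = {"gene1": 3_600_000, "gene2": 18_000_000, "gene3": 14_400_000}  # 36,000,000 reads


def counts_per_million(counts):
    """Scale raw counts so that every library sums to one million."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


cpm_a = counts_per_million(counts_a)
cpm_b = counts_per_million(counts_b)

for gene in counts_a:
    raw_ratio = counts_b[gene] / counts_a[gene]   # ratio before scaling
    cpm_ratio = cpm_b[gene] / cpm_a[gene]         # ratio after scaling
    print(gene, raw_ratio, cpm_ratio)
```

Before scaling, every gene looks three times higher in B; after scaling, the ratio is 1 for every gene, which is what we expect since the RNA composition is the same.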

RNA population composition of each condition: This one is trickier. Say you again have two conditions, A and B. For each condition, you want to study 4 genes and you sequence 90 reads (per condition).

Biologically, in condition A, 3 genes (Gene 1, Gene 2 and Gene 3) are expressed at the same level, an arbitrary unit of 2, and one gene (Gene 4) is at 24, which is 12 times more than each of the three others. In condition B, these 3 genes are again expressed at 2, but Gene 4 is not expressed at all.

In your design, you get 90 reads for each condition (A and B). Reads are spread out according to expression level. So in condition A, Gene 4 gets 72 reads and each of the 3 others gets 6, i.e. 12 times more reads on Gene 4 (72/6 = 12). The funny thing is that in condition B you also have 90 reads to spread, but this time Gene 4 is not expressed. The reads are spread over the three remaining genes (Gene 1, Gene 2 and Gene 3), 30 reads each.

You know that the true expression level of Gene 1 is the same in condition A and condition B. Yet its measured level in condition A (6 reads) is 5 times smaller than in condition B (30 reads), a bias caused by the absence of Gene 4.
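The four-gene example can be reproduced with a short Python sketch. The function name `allocate_reads` is made up for this illustration, and it assumes reads are distributed exactly in proportion to expression, with no sampling noise.

```python
def allocate_reads(expression, total_reads):
    """Spread a fixed number of reads over genes in proportion to expression."""
    total_expression = sum(expression.values())
    return {gene: total_reads * level / total_expression
            for gene, level in expression.items()}


# True expression levels (arbitrary units) from the example above.
condition_a = {"Gene1": 2, "Gene2": 2, "Gene3": 2, "Gene4": 24}
condition_b = {"Gene1": 2, "Gene2": 2, "Gene3": 2, "Gene4": 0}

reads_a = allocate_reads(condition_a, 90)
reads_b = allocate_reads(condition_b, 90)

print(reads_a)  # Gene1-3 get 6 reads each, Gene4 gets 72
print(reads_b)  # Gene1-3 get 30 reads each, Gene4 gets 0

# Gene1 has the same true expression in both conditions,
# yet it appears 5 times higher in B.
print(reads_b["Gene1"] / reads_a["Gene1"])
```

Note that both libraries have exactly 90 reads, so scaling by total count would not remove this bias.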

To reduce these biases, there are many methods for normalizing RNA-seq data.

Those I would call naive:

Total count
Upper Quartile
RPKM (Reads Per Kilobase per Million mapped reads, which is not robust enough for cross-condition experiments; pub4 & pub5)
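As an illustration of the RPKM idea, here is a minimal Python sketch. The gene lengths, read counts, and library size are invented for the example; it shows how dividing by gene length (in kilobases) and by library size (in millions of reads) makes a gene that is twice as long, with twice the reads, come out at the same value.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads."""
    length_kb = gene_length_bp / 1_000
    library_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * library_millions)


# Hypothetical genes with the same expression level: gene A is twice as
# long as gene B, so it collects twice as many reads.
rpkm_gene_a = rpkm(read_count=200, gene_length_bp=2_000, total_mapped_reads=1_000_000)
rpkm_gene_b = rpkm(read_count=100, gene_length_bp=1_000, total_mapped_reads=1_000_000)

print(rpkm_gene_a, rpkm_gene_b)
```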

Those with a statistical basis:

For differences between libraries (size and RNA composition)

RLE method (Relative Log Expression), as in DESeq2
TMM method (Trimmed Mean of M values), as in edgeR
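To give an idea of how an RLE-style normalization works, here is a rough sketch of the median-of-ratios approach in plain Python. It is not the DESeq2 code: the function name `size_factors` and the data layout are assumptions made for this illustration, and the counts are those of the four-gene example (6, 6, 6, 72 reads in A; 30, 30, 30, 0 reads in B).

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts,
            with genes in the same order in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Keep only genes with a non-zero count in every sample.
    usable = [i for i in range(n_genes)
              if all(counts[s][i] > 0 for s in samples)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_mean = {i: sum(math.log(counts[s][i]) for s in samples) / len(samples)
                    for i in usable}
    # Size factor = median over genes of (count / geometric mean).
    return {s: math.exp(median(math.log(counts[s][i]) - log_geo_mean[i]
                               for i in usable))
            for s in samples}


raw = {"A": [6, 6, 6, 72], "B": [30, 30, 30, 0]}
factors = size_factors(raw)
normalized = {s: [c / factors[s] for c in raw[s]] for s in raw}

print(factors)
print(normalized)
```

With these numbers, Gene 1 to Gene 3 end up with the same normalized value in A and B, while Gene 4 stays different. Because the factor is a median over genes, a single strongly changing gene does not drive it.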

Plus, the most widely used statistical model for gene counts:

negative binomial distribution (edgeR, DESeq2)

Add to that a multiple testing correction, so that only robustly differentially expressed genes are reported (DESeq2)

I would say that, for the same amount of money, you are better off adding replicates than increasing gene coverage.

This is my naive understanding of the subject; feel free to correct what I said here.

Useful links (one is in French, sorry): pub6, pub7

Other links : pub8, pub9, pub10, pub11, pub12

score 2 · Answer 2 · 2018-11-17

5.4 years ago

h.mon 35k

The differences between the number of readings of the samples is not due to the differential expression of the genes?

No, the differences in the number of reads are due to accidental variations in how much of each library is loaded into the flowcell and sequenced.

When loading a multiplexed RNA-seq experiment into the flowcell, one quantifies the amount of DNA (the initial RNA is reverse transcribed to DNA) for each library, "normalizes" the libraries (i.e., dilutes all libraries to the same DNA concentration), and loads the same amount of each library into the flowcell. In an ideal world, all libraries would have the same number of reads, and no library size normalization would be necessary for the analysis. This is rarely the case, though, and there is substantial variation in read numbers between libraries.

ADD COMMENT • link 5.4 years ago by h.mon 35k

OK. But there may also be differences in expression. But how do we know if the differences are due to accidental variations or to a true differential expression? Thank you.

ADD REPLY • link 5.4 years ago by joselu ▴ 110

Could you edit your title please?

ADD REPLY • link 5.4 years ago by Ram 43k

Read the edgeR User Guide: it provides both a good introduction to differential expression analysis and several references if you want to study the methods in detail.

ADD REPLY • link 5.4 years ago by h.mon 35k

score 2 · Answer 3 · 2018-11-19

Apart from the differences in library depth explained by h.mon, an additional problem is that RNA-seq libraries frequently contain different amounts of different RNA types.

A simple example: you have more rRNA in one sample than in another (let's say 1% vs 20%). If you do not take this into account, it would look like the majority of protein-coding genes were downregulated, simply because they would get a smaller fraction of the reads.

Such effects are handled by inter-library normalization, an analysis built into all the major DE tool workflows. You can read more about this problem here.
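The rRNA example can be put into numbers with a small Python sketch. The library size and the share of non-rRNA reads going to the gene are hypothetical values chosen for illustration; only the 1% and 20% rRNA fractions come from the example above.

```python
total_reads = 1_000_000                               # same depth in both samples (assumed)
rrna_fraction = {"sample1": 0.01, "sample2": 0.20}    # from the example above
gene_share = 0.001   # hypothetical: the gene's share of the non-rRNA reads, unchanged

# Reads the gene receives once rRNA has taken its part of the library.
gene_reads = {sample: total_reads * (1 - fraction) * gene_share
              for sample, fraction in rrna_fraction.items()}

apparent_fold_change = gene_reads["sample2"] / gene_reads["sample1"]
print(gene_reads)
print(apparent_fold_change)   # below 1 although the gene did not change
```

Every protein-coding gene gets the same apparent fold change of 0.80/0.99, which is the kind of global shift that inter-library normalization is meant to remove.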