Question

Gene-Specific Normalization In RNA-Seq

2

10.1 years ago

sarahmanderni ▴ 100

Hi,

In RNA-seq studies, the read count of a gene depends on both its length and its expression level. Some papers (e.g. http://genomebiology.com/2013/14/9/R95) refer to voom/limma and Cuffdiff as gene-specific normalization approaches. Does this mean that they use gene length as a normalization factor, or do gene-specific normalization approaches take other factors into account as well?

normalization rna-seq • 3.2k views

updated 10.1 years ago by Devon Ryan 104k • written 10.1 years ago by sarahmanderni ▴ 100

score 5 · Answer 1 · 2014-03-21

5

10.1 years ago

Devon Ryan 104k

Cuffdiff explicitly incorporates gene length, since it uses FPKMs/RPKMs. voom() from the limma package doesn't use gene length at all (any argument I can come up with that makes the voom method more or less gene-specific is pretty weak). There are also other normalization methods that can use gene length and GC content, such as CQN, though I really haven't seen a huge benefit from that in most datasets that I've used (your mileage may vary).
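To make the distinction concrete, here is a minimal Python sketch of the two scalings. It assumes the textbook FPKM definition and a plain log2 counts-per-million with voom-style offsets; the real voom() additionally estimates precision weights, which are left out. The function names, gene names, counts and lengths are all made up for illustration.

```python
import math

def fpkm(count, gene_length_bp, library_size):
    """Fragments per kilobase of gene per million mapped fragments."""
    return count / (gene_length_bp / 1e3) / (library_size / 1e6)

def log2_cpm(count, library_size):
    """log2 counts per million with small offsets; gene length is never used."""
    return math.log2((count + 0.5) / (library_size + 1.0) * 1e6)

# Hypothetical genes with the same transcript abundance: gene_long is
# 10x longer, so it collects 10x as many reads.
library_size = 20_000_000
genes = {
    "gene_short": {"count": 200, "length_bp": 1_000},
    "gene_long": {"count": 2_000, "length_bp": 10_000},
}

fpkm_values = {
    name: fpkm(g["count"], g["length_bp"], library_size) for name, g in genes.items()
}
logcpm_values = {
    name: log2_cpm(g["count"], library_size) for name, g in genes.items()
}

for name in genes:
    print(name, round(fpkm_values[name], 3), round(logcpm_values[name], 3))
```

Under these assumptions the two genes get the same FPKM, while their log-CPM values still differ by roughly log2(10), because the length term is never divided out.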

0

I have read somewhere (I do not remember the reference) that longer genes can bias the results, since they intrinsically produce more reads. voom in limma takes the log2 of the normalized counts, so the logarithm may reduce the effect of long genes having more reads by default. Do you think this could be the reason it has been counted as a gene-specific method?

10.1 years ago by sarahmanderni ▴ 100

1

I still wouldn't consider that to be gene-specific any more than all of the count-based methods. I should also note that for the most common cases, the whole "bias due to gene length" thing is really overblown. Yes, a longer gene will produce more reads at a given expression level than a shorter gene, but that's not really a bias, and trying to "correct" for it partly ignores the utility of having more evidence of the expression level of longer genes. Where this does become useful is when you have transcript switching that alters the effective transcript length. Methods like Cufflinks are better equipped to find those sorts of cases. But the thing we then have to ask ourselves is where we go from there. It can be difficult enough to come up with a coherent story by focusing on gene-level changes alone, so the job is that much more difficult if you concern yourself with transcript-level changes (particularly since we generally have no clue what different transcripts might be doing differently... finding changes here will typically not yield much, if any, insight, at least at this stage of the game).
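A small sketch of why gene length is not a bias for the usual between-condition comparison. It assumes an idealised model in which expected reads are proportional to abundance times gene length; the `reads_per_unit` constant, the gene names and all numbers are hypothetical.

```python
import math

def expected_count(abundance, length_bp, reads_per_unit=0.01):
    """Idealised model: expected reads scale with abundance * gene length."""
    return abundance * length_bp * reads_per_unit

def log2_fold_change(count_treated, count_control):
    """Per-gene log2 ratio between two conditions."""
    return math.log2(count_treated / count_control)

# Hypothetical genes that both double in abundance under treatment.
lengths = {"gene_short": 500, "gene_long": 10_000}
abundance = {"control": 10.0, "treated": 20.0}

results = {}
for gene, length in lengths.items():
    control = expected_count(abundance["control"], length)
    treated = expected_count(abundance["treated"], length)
    results[gene] = {
        "control": control,
        "treated": treated,
        # Length appears in numerator and denominator, so it cancels here.
        "lfc": log2_fold_change(treated, control),
        # Poisson-like counting noise: relative error shrinks as 1/sqrt(count).
        "rel_error": 1 / math.sqrt(control),
    }
    print(gene, results[gene])
```

In this toy model both genes show the same log fold change, and the longer gene simply has a smaller relative counting error, which is the "more evidence" point made above.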

10.1 years ago by Devon Ryan 104k

0

I agree with you regarding transcript-level analysis. But in gene-level analysis, we know that read counts, which represent expression level, are relative amounts that depend on the whole library size. Therefore, a gene getting more reads just because of its length, rather than its expression level, might affect the apparent expression level of shorter genes. What do you think about this point?

10.1 years ago by sarahmanderni ▴ 100

0

Yes, read counts are competitive, but trying to "account" for that is just throwing away information. Frankly, the normalization methods generally do a pretty good job of getting groups into a comparable state, so it's rarely the case that up-regulation of a few long genes is going to cause aberrant findings in a few short genes (this might not hold if the number of changes is very large, but most of the non-spike-in normalization methods end up breaking down at that point).
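For illustration, the competition effect and the way a robust normalization copes with it can be sketched as follows. The thread does not name a specific method, so this assumes a DESeq-style median-of-ratios size factor as a stand-in, and the counts are invented: only long_1 truly changes, but at a fixed sequencing depth it takes reads away from every other gene.

```python
import math
from statistics import median

def median_of_ratios_size_factors(samples):
    """Per-sample median of count / per-gene geometric mean across samples."""
    genes = list(samples[0])
    geo_means = {
        g: math.exp(sum(math.log(s[g]) for s in samples) / len(samples))
        for g in genes
    }
    return [median(s[g] / geo_means[g] for g in genes) for s in samples]

# Hypothetical counts, both libraries sequenced to 1200 reads in total.
control = {"long_1": 500, "long_2": 500,
           "short_1": 50, "short_2": 50, "short_3": 50, "short_4": 50}
treated = {"long_1": 850, "long_2": 250,
           "short_1": 25, "short_2": 25, "short_3": 25, "short_4": 25}

def log2_fc(gene, size_factors):
    """log2 fold change (treated vs control) after dividing by size factors."""
    sf_control, sf_treated = size_factors
    return math.log2((treated[gene] / sf_treated) / (control[gene] / sf_control))

# Total-count scaling: both libraries have the same depth, so nothing is corrected.
total_count_sf = [sum(control.values()) / 1200, sum(treated.values()) / 1200]
robust_sf = median_of_ratios_size_factors([control, treated])

naive_lfc = log2_fc("short_1", total_count_sf)
robust_lfc = log2_fc("short_1", robust_sf)
print("short_1 naive:", round(naive_lfc, 3), "robust:", round(robust_lfc, 3))
print("long_1 robust:", round(log2_fc("long_1", robust_sf), 3))
```

With total-count scaling the unchanged short gene looks two-fold down. The median-based factor is set by the majority of unchanged genes, so the artefact disappears. If most genes changed, that median would no longer track the unchanged genes, which is the breakdown case mentioned above.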

10.1 years ago by Devon Ryan 104k