Question

Re-counting RNA-seq with new gene-models - to remove or not to remove old overlaps for normalization?

8.6 years ago

Michael 54k

We now have experimentally validated transcripts and thereby new and improved gene models for some of our genes. I want to assign counts from our previous RNA-seq runs to those, so I will have to re-count. So far so good. However, I would like to keep the counts for the 'official' predicted models until the Ensembl models are updated. Do I have to remove the old predicted gene models that overlap the new ones? I have the following concerns:

If I do not keep the old models, I will not get counts for them, so I cannot display those counts or see whether the new models change much. So I should keep them all.
If I do not remove the old overlapping models, reads hitting the overlapping regions might have to be counted twice.
If reads are counted twice, am I messing with library size/composition normalization (calcNormFactors; doubling the number of mapped reads for some genes) and thereby defeating normalization?
I don't think this is comparable to alternative transcripts, because in the update case only one of those transcripts is likely to exist.

I am using easyRNAseq for counting and edgeR for CPM calculation for now.
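
To make the double-counting concern concrete, here is a minimal sketch in plain Python (it does not use easyRNAseq or edgeR). The gene names and counts are made up, and it uses simple total-count CPM with the column sum as library size. calcNormFactors (TMM) adds a scaling factor on top of that, which this toy example does not model.

```python
def cpm(counts):
    """Counts per million, using the plain sum of all counts as library size."""
    lib_size = sum(counts.values())
    return {gene: 1e6 * n / lib_size for gene, n in counts.items()}

# Hypothetical counts for one sample, old (predicted) models only.
old_only = {"geneA_old": 500, "geneB": 300, "geneC": 200}

# Same sample, but the new model for geneA is added without removing the old
# one, so reads in the shared region are assigned to both models.
old_plus_new = dict(old_only, geneA_new=480)

cpm_old = cpm(old_only)
cpm_both = cpm(old_plus_new)

# geneB's reads did not change, yet its CPM drops because the library size grew.
print("library size:", sum(old_only.values()), "->", sum(old_plus_new.values()))
print("geneB CPM:", round(cpm_old["geneB"]), "->", round(cpm_both["geneB"]))
```

In this toy case the apparent library size grows by the number of reads assigned a second time, and every unaffected gene's CPM shrinks accordingly.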

cpm normalization count RNA-seq • 2.1k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by Michael 54k

You're trying to get per-gene counts, yes? If you already have the old counts, why not just keep the overlapping models, ensure that the ID that you're summarizing over is the same for the old and new models, count with the resulting file and then subtract things to find out how your improved gene models changed things? That avoids double counting.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

Yes, per-gene counts plus per-gene normalized CPMs or TPMs. I am not sure I fully understood your proposal: are you suggesting to count only the new models (I agree this is more efficient for the counting step) and then compare them to their overlapping predicted models? How would this affect normalization? If I subtract the new counts from the old counts, most of the results would be negative. The new gene models contain both completely new genes and genes with added/deleted/changed exons.

ADD REPLY • link 8.6 years ago by Michael 54k

Ah, I didn't catch that you were deleting exons too. That does complicate life.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

Do the "new and improved" gene models completely supersede the old models? Or do they complement them, such that the old and new models are both valid? Does each new model overlap just one previous model, or more than one? In the simplest case, where one new model completely replaces one old model, you would only have to count the new model and replace the old counts.

I guess normalizing with both new and old counts would probably break the assumptions of the normalization methods, but if you have just a few new models this would probably not affect normalization heavily (note the "guess" and the two "probably").
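
As a rough way to put a number on "a few new models", the sketch below computes the factor by which the CPM of every unaffected gene is multiplied when extra (double-counted) read assignments are added to the library size. This assumes plain total-count scaling only, not TMM, and the read numbers are hypothetical.

```python
def cpm_shrink_factor(total_reads, double_counted_reads):
    """Multiplier applied to an unaffected gene's total-count CPM when
    `double_counted_reads` extra assignments inflate the library size."""
    return total_reads / (total_reads + double_counted_reads)

# Hypothetical library of 20 million assigned reads.
total = 20_000_000
for extra in (2_000, 200_000, 2_000_000):
    factor = cpm_shrink_factor(total, extra)
    print(f"{extra:>9} double-counted reads -> CPM multiplied by {factor:.4f}")
```

Under this simplification the distortion stays small as long as the duplicated models attract a small share of the reads, but the factor differs between samples whenever the affected genes are expressed at different levels.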

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.6 years ago by h.mon 35k

The 'new and improved' gene models are the result of 5' and 3' RACE-PCR plus consensus building, so I assume they are in general better than the predicted models. They mostly cover the predicted models fully and agree better with the RNA-seq peaks. The alignments are of high quality, too.

However, the predicted models are the 'official' Ensembl models (from before the release of the genome), and I am not sure yet when and how they will be updated. I guess that Ensembl will accept the RACE-based gene models in the future, because RACE is more or less the gold standard for determining a transcript.

Consequently, I assume it is a cleaner solution to just replace the overlapping models. Fortunately, all new models overlap only with predicted ones. I just needed some reassurance :)
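
For illustration, replacing the overlaps could be sketched as below. This is plain Python on a hypothetical minimal record type, not easyRNAseq's annotation handling. It assumes that overlap is judged on the whole gene span and only between models on the same chromosome and strand. The IDs and coordinates are made up.

```python
from collections import namedtuple

# Hypothetical minimal gene-model record; coordinates are 1-based, inclusive.
Model = namedtuple("Model", "gene_id chrom start end strand")

def overlaps(a, b):
    """True if two models share at least one base on the same chromosome and strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)

def replace_overlapping(old_models, new_models):
    """Keep every new model, plus only those old models that overlap no new model."""
    kept_old = [o for o in old_models
                if not any(overlaps(o, n) for n in new_models)]
    return kept_old + list(new_models)

old = [Model("pred1", "chr1", 100, 900, "+"),     # superseded by race1
       Model("pred2", "chr1", 2000, 2500, "-")]   # no new model here, so it stays
new = [Model("race1", "chr1", 50, 1000, "+")]

merged = replace_overlapping(old, new)
print([m.gene_id for m in merged])
```

Counting against the merged set means each read in a superseded region is assigned once, so the library size is not inflated.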

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.6 years ago by Michael 54k