Question

Normalizing for RNA abundance across replicates from a time course

7.1 years ago

Chloe • 0

Hi all,

I am trying to normalize my read counts for differential gene expression analysis with edgeR.

I have a set of 21 BAM files from aligning my reads to a genome, corresponding to 3 replicates at each of my 7 time points.

I would like to do DGE using edgeR, but first I need to normalize for RNA abundance between replicates.

I was told I might be able to use RSEM or edgeR to produce a normalized count matrix. The issue is that my reads were generated using the QuantSeq library prep kit, so only one fragment is produced per transcript (and therefore the read count should be a direct reflection of the number of transcripts). For this reason QuantSeq recommends using HTSeq to produce a count matrix.

Is there a way to produce a count matrix with HTSeq and then normalize across the replicates, without interfering with the fact that the read counts should be a direct reflection of the transcript counts? Can edgeR normalize the count matrix?

I think I have to avoid using FPKM (part of RSEM?), but I am not sure whether it is appropriate to use RPKM, TMM, upper quartile, etc. I don't know much about these kinds of counts other than that they exist.

I was trying to work it out with RSEM, but it doesn't seem to accept my BAM files because they were produced by aligning to a genome rather than a transcriptome.

Thanks, Chloe

RNA-Seq normalization RNA abundance edgeR RSEM • 2.1k views

updated 7.1 years ago by Jake Warner ▴ 830 • written 7.1 years ago by Chloe • 0

score 1 · Answer 1 · 2017-03-21

Hi Chloe, you can use HTSeq to generate a count table and then pass it to edgeR. In edgeR you can then group your samples by replicate, normalize (TMM), perform DE tests, etc. I assume you would compare each time point to the preceding one or to T0.

For example:

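# edgeR workflow (assumes `counts` is a gene-by-sample matrix of raw HTSeq counts,
# with columns ordered by time point, 3 replicates each):
library(edgeR)
group <- factor(rep(1:7, each = 3)) # group samples: 7 time points x 3 replicates
y <- DGEList(counts = counts, group = group)
mean(y$samples$lib.size) # mean library size
y <- calcNormFactors(y) # TMM normalization factors
z <- cpm(y, normalized.lib.sizes = TRUE) # TMM-normalized counts per million
y <- estimateDisp(y) # dispersion estimates; exactTest() fails without them
de_T1_T2 <- exactTest(y, pair = c(1, 2)) # DE testing: time point 2 vs. time point 1
# etc.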
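
The workflow above takes a `counts` matrix as given. As a rough sketch of how it might be assembled from per-sample htseq-count output, assuming each file is a two-column, tab-delimited table (gene ID, count) with no header: the helper name `read_htseq_counts` is hypothetical, not part of HTSeq or edgeR, and it uses base R only.

```r
# Hypothetical helper (not from HTSeq or edgeR): merge per-sample htseq-count
# outputs into one gene-by-sample matrix.
# Assumes each file has two tab-separated columns (gene ID, count), no header.
read_htseq_counts <- function(files, sample_names = basename(files)) {
  per_sample <- lapply(files, function(f) {
    tab <- read.delim(f, header = FALSE, col.names = c("gene", "count"),
                      stringsAsFactors = FALSE)
    # Drop htseq-count summary rows such as __no_feature and __ambiguous
    tab <- tab[!grepl("^__", tab$gene), ]
    setNames(tab$count, tab$gene)
  })
  genes <- names(per_sample[[1]])
  # Every file should list the same genes in the same order
  same_genes <- vapply(per_sample, function(x) identical(names(x), genes),
                       logical(1))
  stopifnot(all(same_genes))
  counts <- do.call(cbind, per_sample)
  rownames(counts) <- genes
  colnames(counts) <- sample_names
  counts
}
```

It would be called with the 21 count files in the same order as the `group` factor, for example `counts <- read_htseq_counts(files)`, where `files` is a character vector of paths you supply.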

There's a lot of good info in the edgeR vignette: https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
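
On the worry about QuantSeq counts: the `cpm()` step in the workflow only rescales each sample by its effective library size (library size times the TMM factor). There is no gene-length term, unlike RPKM/FPKM. A toy base-R sketch of that arithmetic, with made-up counts and placeholder normalization factors of 1 (in practice `calcNormFactors()` would estimate them):

```r
# Toy counts: 3 genes x 2 samples (made-up numbers, filled column by column)
counts <- matrix(c(100, 300, 600,
                   200, 600, 1200),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

lib_size <- colSums(counts)               # total counts per sample
norm_factors <- c(s1 = 1, s2 = 1)         # placeholder values, assumed for illustration
eff_lib_size <- lib_size * norm_factors   # effective library size

# Counts per million: divide each column by its effective library size
cpm_manual <- sweep(counts, 2, eff_lib_size, "/") * 1e6
cpm_manual
```

Here sample s2 was simply sequenced twice as deeply as s1, so the two columns come out identical after scaling. Note that for the DE tests themselves edgeR works on the raw counts and uses the normalization factors internally, so the CPM matrix is mainly useful for inspection and plotting.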