Question

Best way to address different batches of RNA-seq

0

Entering edit mode

7.0 years ago

tud55122 ▴ 20

I did two batches of RNA-seq upon same drug treatment but at different time points on the same cell line. For each time point, I have corresponding DMSO controls. All my Fold Change values are calculated over the corresponding DMSO control. I'm trying to trace the gene expression dynamics upon my drug treatment. I found that if I simply merge two datasets into one master table and plot the RPKM and Fold Change values, the numbers do not reflect the real dynamics I saw by q-PCR. For example: I knew one gene (and a few more) that can be induced highest upon 4-day treatment but the RPKM values and Fold Changes do not show the same trend. So my question is:

What's the best way to address this issue?

Appreciate your help.

Hang

sequence RNA-Seq RPKM Fold-Change batch-effect • 3.5k views

ADD COMMENT • link updated 21 days ago by Ram 43k • written 7.0 years ago by tud55122 ▴ 20

1

Entering edit mode

Below are the scripts I used.

samples=read.table("Samples.txt",sep="\t",header=T)
counts = readDGE(samples$countf)$counts
noint = rownames(counts) %in% c("__no_feature","__ambiguous","__too_low_aQual"                       ,"__not_aligned","__alignment_not_unique","no_feature","ambiguous","too_low_aQua l","not_aligned","alignment_not_unique")
counts = counts[!noint,]
dim(counts)
cpms = cpm(counts)
keep = rowSums(cpms>0)>=1
counts = counts[keep,]
dim(counts)
d = DGEList(counts=counts, group=samples$condition)
d = calcNormFactors(d)
design <- model.matrix(~0+samples$condition)
d <- estimateGLMCommonDisp(d,design)
d <- estimateGLMTrendedDisp(d,design) Loading required package: splines
d <- estimateGLMTagwiseDisp(d,design)
fit <- glmFit(d,design)
et <- glmLRT(fit, contrast=c(-1,1,0,0,0,0,0,0,0,0))
tt = topTags(et, n=nrow(d),sort.by="none")
write.table(tt,"EdgeR/test.txt",sep="\t",row.names=T, quote=F)

ADD REPLY • link updated 21 days ago by Ram 43k • written 7.0 years ago by tud55122 ▴ 20

0

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY • link 7.0 years ago by GenoMax 142k

0

Entering edit mode

Ok, then perhaps you could try to add the batch into the model/design. Although if you processed the DMSO/drug samples at the same time, it shouldn't be very impactful.

ADD REPLY • link 7.0 years ago by Carlo Yague 8.7k

0

Entering edit mode

Is your fold change also computed based on RPKM values ? RPKM is not a proper between sample normalization metric so I suggest that you try a different method such as edgeR/DESEQ.

ADD REPLY • link 7.0 years ago by Carlo Yague 8.7k

0

Entering edit mode

Thanks for the reply. My fold change is computed based on counts values. I used edgeR to calculate counts and generated RPKM values.

ADD REPLY • link 7.0 years ago by tud55122 ▴ 20

0

Entering edit mode

What was different between the two batches? In addition, as said above you shouldn't use RPKM, use statistics as in DESeq2

ADD REPLY • link 6.9 years ago by Asaf 10k

0

Entering edit mode

As mentioned by @tud55122 and @Asaf you shouldn't use RPKM values. I have used DESeq2 and there you have option to use batch+ condition information to make comparisons. I am not aware is edgeR ihas similar options. If you have counts data then you can easily follow DESeq2 vignette to see how to do that. I use the 'vst' based normalization which is better than RPKM based normalization. Hope it helps.

ADD REPLY • link 4.7 years ago by piyushjo ▴ 700

0

Entering edit mode

Seems you have two timepoints? Two conditions? and then a number of replicates per condition/timepoint - could you describe exactly which samples were sequenced together?

ADD REPLY • link 4.7 years ago by Kristoffer Vitting-Seerup ★ 4.0k