Hello everyone,
I am working on a RNA-seq project that includes samples from normal tissue, adenomas, adenocarcinomas, liver metastases and was done in five batches. As I have previously read, it is best if batch effect is included in the design matrix for linear model fitting- shown in limma guide and (http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html).
However, in the latter article, when they write contrasts, the lanes are not included, not accounted for if i understand correctly. A similar story is shown in limma guide chapter 17.1.5, where the DEGs are for the final coefficient.
I performed the analysis in two different ways, the first as such:
group2<- factor(c("N","A","N","A","N","A","N","A","N","A",
"C","N","C","N","C","N","C","N","C","N","C","N","C","N","C","N","C","N","C","N",
"C","L","N","C","L","N","C","L","N","C","C","C", "N","L","N","C","L","N","C","L","N","C","L","N","C","L","N","L","N","L","N","L","N","C","L","N","C","L","N","C","L","N","L","N","C","L","N","C","L","N","C","L","N","L","N","C","N","C","L",
"N","A","N","A","N","A","N","A","N","A","N","A","N","A","N","A"),levels = c("N","A","C","L"))
lane2 <- factor(c(1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5))
design <- model.matrix(~0+group2+lane2)
cont.matrix <- makeContrasts(NvsA=N-A,
NvsC=N-C,
NvsL=N-L,
AvsC=A-C,
AvsL=A-L,
CvsL=C-L,
levels=colnames(design))
The contrast matrix is then
Contrasts
Levels NvsA NvsC NvsL AvsC AvsL CvsL
N 1 1 1 0 0 0
A -1 0 0 1 1 0
C 0 -1 0 -1 0 1
L 0 0 -1 0 -1 -1
L2 0 0 0 0 0 0
L3 0 0 0 0 0 0
L4 0 0 0 0 0 0
L5 0 0 0 0 0 0
and the results include DEGs for NvsA NvsC NvsL AvsC AvsL CvsL. If I understand correctly, this does not account for batch effect and is similar to what the rnaseq-123 article from the link above did?
The other option is similar to the answer to this question https://support.bioconductor.org/p/69374/. I used an interaction type design and averaged the cases from which batch each sample came from and only included the interactions that had a sample inside (otherwise linear model did not work as coefficients were not calculable):
designy1 <- model.matrix(~0+group2:lane2)
colnames(designy1)
colnames(designy1) <- c("N1","A1","C1","L1","N2","A2","C2","L2","N3","A3","C3","L3","N4","A4","C4","L4","N5","A5","C5","L5")
designy2 <- designy1[,c(1:2,5,7,9,11:13,15:18)]
The contrast matrix then looked like this
contr.matrixy <- makeContrasts(NC= (N1+N2+N3+N4+N5)/5-(C2+C3+C4)/3,
nVSa= (N1+N2+N3+N4+N5)/5-(A1+A5)/2,
AVSC= (A1+A5)/2-(C2+C3+C4)/3,
NVL= (N1+N2+N3+N4+N5)/5-(L3+L4)/2,
CVL= (C2+C3+C4)/3-(L3+L4)/2,
AVL=(A1+A5)/2-(L3+L4)/2,
levels=colnames(designy2))
I then as standard used voom normalisation, linear model fitting, etc to get to DEGs.
vy <- voom(countsbindlist4I,design = designy2, plot=TRUE)
fity <- lmFit(vy, designy2)
fit2y <- contrasts.fit(fity, contr.matrixy)
efity <- eBayes(fit2y)
dt<- decideTests(efity)
summary(dt)
Is any of the solutions I used okay, or did I miss the mark with the second one? Thank you all for your help and insights.
Best regards
Dear swbarnes2,
thank you for your reply. In this case the experiment was done in different batches, the batches were included as "lanes". I could have clarified this better with another variable name, but the batch to batch variability is definitely there- visible with dendrograms and mds.
All help still wanted :)
Thank you all.