Batch Effect Removal in RNA-Seq Data for DGE & WGCNA. Should I omit any samples according to PCA plots?
1
0
Entering edit mode
7.6 years ago
gokce.ouz ▴ 70

Hi All,

I have 2 experimental condition and each condition has 18 samples. However, they are sequenced in 3 batches. 1 st batch 6 pair (disease & control), 2nd batch 6 pair (disease & control), 3rd batch 6 pair (disease & control).

Each batch done by different people over different times so I believe we have batch effect. As I am using DESeq2 for my analysis I set my model design as “Batch+Type”. I used removeBatchEffect from limma package and plotted PCA & did hierarchical clustering using spermann correlation to see which samples are outliers.(I attached the graphs below). I also used an alternative PCA &plotting (prcomp() & fviz_pca_ind) which actually confused me. The ggplot & fviz_pca_ind graphs look different from each other. Which one should I trust?

Another issue I am facing is 1st & 2nd batches are 76 bp paired end, however the last batch is 101 bp paired end.I did not add this into the design separately. Should I add it to the model design ? If so, how should I desing ? or Should I trim all fastq files to 76 bp to have same size ? What do you suggest?

I would like to do differential gene expression analysis and WGCA. So Should I remove any samples according to these PCA plots or just do analysis as I already included batch effect into the model design as "~Batch+Type"

I really appreciate if you give me some insight,

Thanks in advance,

DESeq2

ddsMat.pre<- DESeqDataSetFromMatrix(seMat.pre, sample.table.ERSvsPT.pre, design=~Batch+Type)
dds.pre<-DESeq(ddsMat.pre)
dds.pre <- dds.pre[ rowSums(counts(dds.pre)) > 1, ]

Normalization

rld.batch <- rlog(dds.pre, blind= FALSE)
rlogMat.batch <- assay(rld.batch)

Removing batch effect

design=model.matrix(~sample.table.ERSvsPT.pre$Type)
removedbatch=removeBatchEffect(rlogMat.batch, batch=sample.table.ERSvsPT.pre$Batch, design= design )
rld.removed<-rld.batch
assay(rld.removed)<-removedbatch

Plotting the PCA

before removing batch effect:

data.batch<-plotPCA(rld.batch, intgroup=c("Batch","Type"), returnData=TRUE)
data.batch$name<-colData(rld.batch)$Patient
percentVar.batch <- round(100 * attr(data.batch, "percentVar"))
ggplot(data.batch, aes(PC1, PC2, color=Batch, shape=Type, label= name)) + geom_text(size=6,hjust = 0, nudge_x = 0.05) + geom_point(size=2) + xlab(paste0("PC1: ",percentVar.batch[1],"% variance")) +   ylab(paste0("PC2: ",percentVar.batch[2],"% variance")) + ggtitle("PCA of RLD_Type+Batch ") + coord_fixed()

after removing batch effect:

data.removed<-plotPCA(rld.removed, intgroup=c("Batch","Type"), returnData=TRUE)
data.removed$name<-colData(rld.removed)$Patient
percentVar.removed <- round(100 * attr(data.removed, "percentVar"))
ggplot(data.removed, aes(PC1, PC2, color=Batch, shape=Type, label= name)) + geom_text(size=6,hjust = 0, nudge_x = 0.05) + geom_point(size=3) + xlab(paste0("PC1: ",percentVar.removed[1],"% variance")) +  ylab(paste0("PC2: ",percentVar.removed[2],"% variance")) + ggtitle("PCA after removeBatchEffect ") + coord_fixed()

Alternative PCA method : prcomp() & fviz_pca_ind

before.pca <- prcomp(t(rlogMat.batch), scale=T)
after.pca <- prcomp(t(removedbatch), scale=T)
quali.sup <- as.factor(colData(rld.readlength)$Type)
fviz_pca_ind(before.pca,habillage = quali.sup, addEllipses = TRUE, ellipse.level = 0.75) + theme_minimal()
fviz_pca_ind(after.pca,habillage = quali.sup, addEllipses = TRUE, ellipse.level = 0.68) + theme_minimal()

Before removing the batch effect:

image: Before removing the batch effect

After removing the batch effect:

image: After removing the batch effect

Alternative PCA plot for Before Batch effect removal:

image: Alternative PCA plot for Before Batch effect removal

Alternative PCA plot for After Batch effect removal:

image: Alternative PCA plot for After Batch effect removal

Batch-effect RNA-Seq R DESeq2 • 6.0k views
ADD COMMENT
1
Entering edit mode
7.6 years ago

The ggplot & fviz_pca_ind graphs look different from each other. Which one should I trust?

Those are different graphs (biplot vs graph of indidividuals) so one can expect different output. The first representation is used more often as far as I know and might be a safer pick.

Another issue I am facing is 1st & 2nd batches are 76 bp paired end, however the last batch is 101 bp paired end.I did not add this into the design separately. Should I add it to the model design ? If so, how should I desing ? or Should I trim all fastq files to 76 bp to have same size ? What do you suggest?

I don't think it is necessary to add this in the design since it is already taken into account in the batch factor. You definitely should NOT trim the 101 bp-reads fastq files.

I would like to do differential gene expression analysis and WGCA. So Should I remove any samples according to these PCA plots or just do analysis as I already included batch effect into the model design as "~Batch+Type"

For DEG analysis in DESeq, there is no need to explicitely remove the batch effect because it is already taken into account, as you said. For WGCA I don't know, I never did that kind of analysis.

ADD COMMENT
0
Entering edit mode

I really appreciate your insights. Thanks a lot Carlos.
So I do not need to remove the P5 &P6 which belong to the left hand side group in first two graphs.

ADD REPLY
1
Entering edit mode

I don't see a reason to remove them. Don't be too scared by the vertical distance between P5/P6 and the others points : the y-axis represents only 8% of total variance.

ADD REPLY
0
Entering edit mode

In general for WGCNA you should adjust for batch effects prior to performing the analysis (input should be a matrix containing normalized expression). In addition it is recommended to remove low count "genes" and apply a variance stabilizing transformation when using RNAseq as input (from WGCNA FAQ page: https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html).

ADD REPLY

Login before adding your answer.

Traffic: 2255 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6