Question

How to use Principal Component Analysis to find batch effects?


7.6 years ago

tolgaturant ▴ 20

I am going to profile a clinical RNA-seq study with 51 samples for differentially expressed genes. As described in the limma-voom vignettes, I have created a DGEList object:

y1 <- DGEList(counts = assays(summarizedExperiment1)$counts, genes = annotations1)

y2 <- calcNormFactors(y1)

Then, to explore how the samples cluster, I created PCA plots with plotMDS:

plotMDS(y2, labels=resp, top=50, col=ifelse(resp=="N", "red", "blue"), gene.selection="common", prior.count=5)

[Figure: plot of the first two PCs, labelled by response group]
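To check whether any recorded sample attribute tracks the split along PC1, the principal components can also be computed explicitly and tested against each covariate. Below is a minimal sketch on simulated counts, not the study data; `counts`, `sample_info`, and the covariate `hidden_group` are hypothetical stand-ins, and the rank test is just one possible choice of association test.

```r
set.seed(1)

# Simulated stand-in data: 1000 genes x 20 samples (hypothetical, not the study data)
n_genes <- 1000
n_samples <- 20
sample_info <- data.frame(
  resp = rep(c("N", "R"), times = n_samples / 2),
  hidden_group = rep(c("a", "b"), each = n_samples / 2)  # factor that drives PC1 here
)
mu <- matrix(100, n_genes, n_samples)
# The first 200 genes are 4x higher in hidden group "b"
mu[1:200, sample_info$hidden_group == "b"] <- 400
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 10),
                 n_genes, n_samples)

# Log2 counts per million with a small offset
# (with a DGEList, edgeR's cpm(y2, log = TRUE, prior.count = 5) plays this role)
lib_size <- colSums(counts)
logcpm <- log2(t(t(counts + 0.5) / (lib_size + 1)) * 1e6)

# PCA on samples: prcomp expects samples in rows
pca <- prcomp(t(logcpm), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
pc1 <- pca$x[, 1]

# Test every recorded covariate for association with the PC1 scores
pvals <- sapply(sample_info, function(v) kruskal.test(pc1, factor(v))$p.value)

print(round(var_explained[1:3], 3))
print(pvals)
```

A small p-value for a covariate suggests it is associated with PC1. In real data the same loop would run over whatever sample annotation is available (sequencing run, extraction date, site, sex, and so on).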

There is a clear separation of the samples along PC1, but I don't know which attribute correlates with it. Should I create an attribute, say batch_1, for the two groups on either side of PC1 and build a model matrix as:

mod1 <- model.matrix(~ batch_1 + resp)

or should I just model the comparison I am interested in:

mod2 <- model.matrix(~ resp)
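For illustration, here is a small sketch of what the two designs contain and a quick check that the batch term is not confounded with the response. The factor values are made up for eight hypothetical samples; they are not the study's annotation.

```r
# Hypothetical factors for 8 samples (not the study's real annotation)
resp    <- factor(c("N", "N", "R", "R", "N", "N", "R", "R"))
batch_1 <- factor(c("A", "A", "A", "A", "B", "B", "B", "B"))

mod1 <- model.matrix(~ batch_1 + resp)  # response effect adjusted for batch
mod2 <- model.matrix(~ resp)            # response effect only

print(colnames(mod1))
print(colnames(mod2))

# Cross-tabulate the factors: empty cells point to partial or full confounding
print(table(batch_1, resp))

# The batch-adjusted design is only estimable if it has full column rank
full_rank <- qr(mod1)$rank == ncol(mod1)
print(full_rank)
```

If every sample in one batch had the same response, `mod1` would lose rank and the response effect could not be separated from the batch effect.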

Any suggestion would be appreciated.

Tolga

RNA-Seq PCA limma voom • 2.5k views


Mmmh, in principle, adding a batch term would be the way to go.

But are you sure it's a batch effect (that is, something technical) and not something biological that you would want to look at and understand rather than discard? I'm asking because you say you don't know where the separation is coming from, and I would want to understand what I am about to throw out.
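One way to look at what drives the separation before modelling it away is to inspect the genes with the largest PC1 loadings. This is a rough sketch on simulated log-expression values; the matrix `logcpm`, the gene names, and the cutoff of 10 genes are all made up for illustration.

```r
set.seed(2)

# Simulated log-expression: 500 genes x 12 samples (hypothetical data)
logcpm <- matrix(rnorm(500 * 12, sd = 0.5), 500, 12,
                 dimnames = list(paste0("gene", 1:500), paste0("s", 1:12)))
# Genes 1-50 are shifted up in the last six samples, which creates a split along PC1
logcpm[1:50, 7:12] <- logcpm[1:50, 7:12] + 3

# PCA on samples (samples in rows)
pca <- prcomp(t(logcpm), center = TRUE, scale. = FALSE)

# Loadings give each gene's weight in PC1; rank genes by absolute loading
loadings_pc1 <- pca$rotation[, 1]
top_genes <- names(sort(abs(loadings_pc1), decreasing = TRUE))[1:10]
print(top_genes)
```

If the top-loading genes point to a recognisable biology (sex-linked genes, a tissue or cell-type signature, a known pathway), that argues for a biological source; a list dominated by features such as very short or very long transcripts may hint at something technical.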

7.6 years ago by Marge ▴ 320


Thank you for your answer. I agree that the separation along the PCs might well be biological. But there could also be a technical effect that we don't know about. I guess one cannot tell without additional information, so I ended up processing the study as is.

7.2 years ago by tolgaturant ▴ 20