Question

Batch effect, how to create a model matrix

3

Entering edit mode

4.9 years ago

Mozart ▴ 330

Hello there, hope all of you are fine. I do hope you are enjoying this weekend.

In my little experience, I always had to deal with samples coming from different batches (i.e. coming from different hospitals or experiments done in different days). One postdoc in my lab showed me how to deal with batch effect by using SVA package. I guess it is a brilliant idea to work with that, but if I don't go wrong this is not the best tool to cope with the batch effect as, in my cases, I may know the source of confounding (in other words, I know that samples are generated in different days/ come from different hospitals).

My first question is: at first glance, by looking at the PCA plot from this experiment, how can you determine (i.e. be absolutely sure) that your samples are biased or not. How can you be absolutely sure that your samples need a correction, if they cluster as expected? and if they don't cluster as you would expect how can you know that this is not due to real biology or not? what if you are over-correcting samples and removing relevant biological data that, in turn, make impossible to determine genes that are truly changing? how can you know that?

My second question is more practical and, basically, linked to my inexperience. Given the fact I have always used this SVA package by slavishly following postdoc recommendation (with her own doubts), I would like to deal with this problem in a definite way; hopefully, by designing a model matrix. I have zero ideas on where to start and if you could help me with some advice, that would be grand! I really need your help guys, cause I am absolutely alone now and don't know who to ask. I have found this link but it seems to be a bit advanced for myself.

RNA-Seq deseq2 batch-effect • 4.1k views

ADD COMMENT • link updated 2 days ago by Ram 43k • written 4.9 years ago by Mozart ▴ 330

0

Entering edit mode

Thanks Carlo for your reply and to all of them who will provide further help to this big question mark in my head!

so let's say we really want to know genes found to be differentially expressed in neutrophils_cancer_type1, neutrophils_cancer_type2, neutrophils_healthy.

                     factor_1                batch
sample1       neutrophils_cancer_type1         I     
sample2       neutrophils_cancer_type1         II    
sample3       neutrophils_cancer_type2         I      
sample4       neutrophils_cancer_type2         III       
sample5       neutrophils_cancer_type1         I       
sample6       healthy                          II
sample7       healthy                          II
sample8       healthy                          III
sample9       neutrophils_cancer_type2         II

I have not included 'factor2' as in Carlo's example because I don't think there is another variable here.

So, question number 1: Can you confirm this?

OK, let's go ahead. I am using Kallisto to perform pseudoaligment and then DESeq2 via tximport (i.e. to generate a txi.kallisto.tsv count matrix that will be used as input for DESeq2). So, once you've generated your Sample Table, if your samples come from the same batch you are ready to go with the following:

dds = DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTable, ~condition

But, in this case, should I consider something like that?:

dds$batch = factor(c("I","II","I","III","I","II","II","III","II"))

Then, create with DESeq2 (via limma package)

vsd = vst(dds)
plotPCA(vsd,"batch")
assay(vsd) = limma::removeBatchEffect(assay(vsd),vsd$batch)
plotPCA(vsd, "batch")

So, if I got it right, by doing this you would be able to take batch effect into account when doing differential expression analysis? So, by doing this from now on

dds=DESeq2(dds)

you can consider the latter dds free from batch effect, because you have take into account for this in the aforementioned code?

I think I am missing something, here.

ADD REPLY • link updated 4.9 years ago by Devon Ryan 104k • written 4.9 years ago by Mozart ▴ 330

0

Entering edit mode

The vsd stuff is just for the PCA, you need to change your design to ~batch+condition.

ADD REPLY • link 4.9 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks Devon, so, after having read here and here, it seems clear that I have to do:

dds = DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTable, ~batch+condition)
dds$batch = factor(c("I","II","I","III","I","II","II","III","II"))

and, by doing this,

dds <- DESeq(dds, full=design(dds), reduced = ~ batch)
res=results(dds)
...and so on

one should have taken into account the fact that in your experiment there's a batch effect and thus that genes that are differentially expressed are truly dependent by biological effect?

ADD REPLY • link 4.9 years ago by Mozart ▴ 330

1

Entering edit mode

Correct, this controls for the batch effect. It's not the same as what limma is doing, but you'll have to stick to limma if you really want to use limma:removeBatchEffect() for the DE step.

ADD REPLY • link 4.9 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for clarifying that. In terms of biological results(low false positives and likelihood to truly address biological question) what’s the difference between controlling for batch effect and removing it from the analysis? What’s the usefulness of controlling for a bias in your experiment rather than removing the source of its bias? Thanks

ADD REPLY • link 4.9 years ago by Mozart ▴ 330

2

Entering edit mode

I'll quote from limma's manual:

This function is not intended to be used prior to linear modelling. For linear modelling, it is better to include the batch factors in the linear model

So the limma authors suggest not using it for differential expression. I expect that in practice doing DE on batch-effect corrected data leads to higher false-positive and false-negative results. One would have to search the bioconductor site for discussion of this though.

ADD REPLY • link 4.9 years ago by Devon Ryan 104k

0

Entering edit mode

4.9 years ago

Mozart ▴ 330

Thanks both of you, guys. These suggestions really help myself to clarify this complex problem that I know it is kind of uncertain also amongst the advanced users. Thanks.

This is directly linked to both of your replies as I am really trying to understand how I can efficiently design a formula by including two confounding (ie day of experiment and hospitals). Should I merge these ones in a single variable or split these up in two?

My argument is:

dds = DESeqDataSetFromTximport(txi.kallisto.tsv, sampleTable, ~day+hospital+condition)
dds$day = factor(c("I","II","I","III","I","II","II","III","II"))
dds$hospital = factor(c("A","A","A","B","C","D","B","C","D"))

thus

dds <- DESeq(dds, full=design(dds), reduced = ~ day+hospital)
res=results(dds)

ADD COMMENT • link 4.9 years ago by Mozart ▴ 330

0

Entering edit mode

Hi Mozart, can. you tell me the rational of I,II,III in your batch? to me worked but I just added I,II ,III in my colData file without any rational ...

thank you

ADD REPLY • link 4.7 years ago by Morris_Chair ▴ 350

0

Entering edit mode

not sure I understood, sorry. my understanding is, you have to add columns in your sampleTable and then perform the differential expression analysis in order to control for batch effect or any source of variance you aim to control. Hopefully someone more experienced than myself would be able to help!

ADD REPLY • link 4.7 years ago by Mozart ▴ 330

0

Entering edit mode

yes, that's clear but the number I-II-III where do they come from? can I choose randomly? Thanks

ADD REPLY • link 4.7 years ago by Morris_Chair ▴ 350

0

Entering edit mode

no, you can't. you have to know when (or where) your samples were extracted or processed. "I" means to me "day1"; thus, all samples extracted on day1 of my experiment get their variable "I" but you could have also chosen "day 1" or simply "1". that's literally up to you

ADD REPLY • link 4.7 years ago by Mozart ▴ 330

score 3 · Accepted Answer · 2019-05-26

At first glance, by looking at the PCA plot from this experiment, how can you determine (i.e. be absolutely sure) that your samples are biased or not:

I think it is safe to assume that there is always some technical bias in an experiment. So rather than determining whether your samples are biased or not, a PCA allows qualitatively assess how big is this technical bias (batch effect) compared to the other factorial effects. Note that you need at least two samples from the same batch to see if your samples group by batch in the principal component space.

How can you be absolutely sure that your samples need a correction, if they cluster as expected? and if they don't cluster as you would expect how can you know that this is not due to real biology or not?

To interpret the PCA, always think in term of variability. The samples will always cluster depending on the major source of variability. Usually there would be at least three sources of variability: the controlled biological factors (for instance, healthy vs diseased sample), the batch effect (day, hospital), and the residual biological/technical variability:

If the samples cluster based on the controlled biological factors, that means that it is the main source of variability. In consequence, you are likely to find biological effects. I would still correct for batch effect though, because being a minor source of variability doesn't mean that there is no batch effect at all.
If the samples primarily cluster based on batch, it could mean that the variability caused by batch is very high or that the variability caused by the controlled biological factors is low. Either way, you'll need to correct for batch.
Finally, if sample are clustered seemingly randomly, it means that the residual biological/technical variability is the main source of variability. It is the worse case scenario because you can not really control for this kind of residual effect. Usually, it means that you will need a highly powered study in order to extract relevant biological information from that "mess".

I would like to deal with this problem in a definite way; hopefully, by designing a model matrix

The idea of linear model, which is a great way to take care of batch effects, is also linked to the sources of variability. The link you provided is a good start, but I'll try to make the basics clearer. In a linear model, you will express a measure, for instance gene expression, by the sum of modeled effects of biological/batch factors you know of:

 gene expression ~ factor_1 + factor_2 + factor_1:factor_2 + batch
("~" means  that what is on the left is explained by the formula on the right ; ":" means that you choose to also model the interaction factor_1 and factor_2)

The formula above is dependent of a design matrix that specify the levels of the factors for each sample. for instance:

       factor_1   factor_2  batch
sample1       A         X       I
sample2       A         Y      II
sample3       A         Y       I
sample4       B         X       I
sample5       B         Y       I
sample6       B         Y      II

Hope this helps,

Carlo