Question

Some questions regarding DESeq2

4.1 years ago

wangdp123 ▴ 340

Hi there,

I am reading through the manual of DESeq2 package and I have run into two questions about how to use this package properly.

1) In order to perform the variance stabilising transformation, there are two ways of doing this.

i) vsd <- vst(dds, blind=TRUE)
ii) vsd <- vst(dds, blind=FALSE)

I understand that blind=TRUE (the default) gives an unsupervised transformation, which is good for quality assurance of samples, whereas blind=FALSE uses the design formula when estimating dispersions, which is good for downstream analysis.

Which one is recommended, or considered more reasonable, if my aim is to make PCA plots and heatmaps showing the clustering of all samples for publication?

2) When paired samples are analysed for differential expression (e.g., the same subject before and after treatment), I realise that a "subject" term should be included in the design formula in addition to the "condition" term. However, which of the formulas below should be preferred, and why?

i) ~ subject + condition
ii) ~ subject + condition + subject:condition

In other words, under which circumstances should the interaction term ("subject:condition") be used?

Many thanks,

Tom

DESeq2 RNA-Seq • 2.2k views

updated 4.1 years ago by dsull ★ 5.9k • written 4.1 years ago by wangdp123 ▴ 340

score 2 · Answer 1 · 2020-03-22

1) Go with blind=FALSE for PCA plots and clustered heatmaps in publications:

"Therefore, for visualization, clustering, or machine learning applications, I tend to recommend blind=FALSE." - Michael Love on https://support.bioconductor.org/p/57940/
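
For reference, here is a minimal sketch of that workflow, assuming DESeq2 is installed. It uses the simulated dataset from `makeExampleDESeqDataSet()` in place of real counts, so the object names and sizes are illustrative only:

```r
library(DESeq2)

# Simulated counts stand in for a real experiment (assumption for illustration)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 8)  # default design is ~ condition

# blind = FALSE: the design formula is used when estimating dispersions
vsd <- vst(dds, blind = FALSE)

# PCA of the transformed values, coloured by the design variable
plotPCA(vsd, intgroup = "condition")

# Sample-to-sample distances, drawn as a clustered heatmap
sample_dists <- dist(t(assay(vsd)))
heatmap(as.matrix(sample_dists), symm = TRUE)
```

Switching to `blind = TRUE` only changes the `vst()` call; the plotting code stays the same, so it is easy to compare both versions on your own data.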

2) Just use ~ subject + condition to account for sample pairing

This is an additive model: each subject gets its own baseline expression level for each gene, and the condition effect is estimated on top of that baseline. In other words, subject-to-subject differences in expression are accounted for, but the effect of the condition or treatment is assumed to be the same for every subject (it isn't expected to cause a bigger expression change in one subject than in another).
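
A rough sketch of what the paired design looks like in practice, again on simulated data from `makeExampleDESeqDataSet()`. The `subject` labels are invented here purely to mimic four subjects each measured under conditions A and B:

```r
library(DESeq2)

# Simulated data: 8 samples, the first four are condition A, the last four B
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 8)

# Hypothetical pairing: subject s1..s4 each contributes one A and one B sample
dds$subject <- factor(rep(paste0("s", 1:4), times = 2))

# Subject goes first as the blocking factor, condition last as the effect of interest
design(dds) <- ~ subject + condition

dds <- DESeq(dds)

# Condition effect (B vs A), adjusted for subject-specific baselines
res <- results(dds, name = "condition_B_vs_A")
summary(res)
```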

You use interaction terms when you actually think the treatment's (condition's) effect on gene expression will differ depending on the other variable (e.g. the treatment changes subject A's gene expression differently than subject B's). I tend to use interaction terms when, say, I have two variables, treatment and sex, and my treatment affects males differently than females (i.e. there is an interaction between treatment and sex).
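
One practical point when weighing the two formulas: with a single before and after sample per subject, the interaction model has as many coefficients as there are samples. A base-R sketch with a hypothetical three-subject layout (the `coldata` table is made up for illustration) shows the count:

```r
# Hypothetical paired layout: 3 subjects, each sampled before and after treatment
coldata <- data.frame(
  subject   = factor(rep(c("s1", "s2", "s3"), each = 2)),
  condition = factor(rep(c("before", "after"), times = 3),
                     levels = c("before", "after"))
)

# Additive model: one baseline per subject plus a shared condition effect
mm_additive <- model.matrix(~ subject + condition, data = coldata)

# Interaction model: a separate condition effect for every subject
mm_interaction <- model.matrix(~ subject + condition + subject:condition,
                               data = coldata)

# Residual degrees of freedom = number of samples minus fitted coefficients
n_samples      <- nrow(coldata)
df_additive    <- n_samples - ncol(mm_additive)
df_interaction <- n_samples - ncol(mm_interaction)

print(c(additive = df_additive, interaction = df_interaction))
```

With this layout the additive model leaves 2 residual degrees of freedom and the interaction model leaves 0, so nothing remains to estimate dispersion from. That is why a subject:condition term generally needs replicate samples within each subject and condition.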