Hello, I'm having a hard time understanding the time course experimental work flow in DESeq2. I am using the F1000 paper as a template (Love MI, Anders S, Kim V and Huber W. RNA-Seq workflow: gene-level exploratory analysis and differential expression [version 1; referees: 2 approved]. F1000Research 2015, 4:1070 (doi: 10.12688/f1000research.7035.1))

Experiment: I have 2 genotypes, 3 timepoints per genotype. 4 replicates in each genotype-condition pair (3 in one). I have a counts table generated in HTSeq - 23 samples.

```
coldata <- data.frame(sample=colnames(data),
genotype=as.factor(c(rep("het",4), rep("wt",4), rep("het",4), rep("wt",4),rep("het",3), rep("wt",4))),
time=as.factor(c(rep("es",8), rep("np",8),rep("nsc",7))))
```

sample genotype time
1 es1 het es

2 es2 het es

3 es3 het es

4 es4 het es

5 es5 wt es

6 es6 wt es

7 es7 wt es

8 es8 wt es

9 np1 het np

10 np2 het np

11 np3 het np

12 np4 het np

13 np5 wt np

14 np6 wt np

15 np7 wt np

16 np8 wt np

17 nsc1 het nsc

18 nsc2 het nsc

19 nsc4 het nsc

20 nsc5 wt nsc

21 nsc6 wt nsc

22 nsc7 wt nsc

23 nsc8 wt nsc

I have tried it 2 different ways, with slightly different results. My questions are: 1. What is the difference between the two? I have used both elements as factors. 2. It's not clear how the relevel is worning under these conditions 3. Using the same results table, can I compare the intermediate timepoints, i.e, 1st and 2nd then 2nd and 3rd, generating a esults table for each, or would I have to separate the analysis entirely?

Thanks!

Here goes what I have done:

```
dds <- DESeqDataSetFromMatrix(countData=data, colData=coldata, design=~ time + genotype:time + genotype)
colData(dds)$genotype <- relevel(colData(dds)$genotype, "wt")
ddsTC <- DESeqDataSetFromMatrix(countData=data, colData=coldata,
design=~ time + genotype:time + genotype)
ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ time + genotype)
resTC <- results(ddsTC)
resTC$symbol <- mcols(ddsTC)$symbol
head(resTC[order(resTC$padj),],4)
```

Results: log2 fold change (MLE): timensc.genotypewt LRT p-value: '~ time + genotype:time + genotype' vs '~ time + genotype' DataFrame with 4 rows and 6 columns

```
summary(resTC)
```

out of 29372 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 9, 0.031% LFC < 0 (down) : 20, 0.068% outliers [1] : 6, 0.02% low counts [2] : 9434, 32% (mean count < 3) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results

```
resultsNames(ddsTC)
```

[1] "Intercept" "genotype_wt_vs_het" "condition_np_vs_es"

[4] "condition_nsc_vs_es" "genotypewt.conditionnp" "genotypewt.conditionnsc"

Second iteration:

```
resTC2 <- results(ddsTC, contrast=c("genotype","het","wt"))
head(resTC2[order(resTC2$padj),],4)
```

log2 fold change (MLE): genotype het vs wt LRT p-value: '~ time + genotype:time + genotype' vs '~ time + genotype' DataFrame with 4 rows and 6 columns

```
summary(resTC2)
```

out of 29372 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 16, 0.054% LFC < 0 (down) : 13, 0.044% outliers [1] : 6, 0.02% low counts [2] : 9434, 32% (mean count < 3)

```
> sessionInfo()
```

R version 3.2.2 (2015-08-14) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.10.5 (Yosemite)

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base

other attached packages:
[1] sva_3.18.0 genefilter_1.52.1

[3] mgcv_1.8-12 nlme_3.1-126

[5] PoiClaClu_1.0.2 hexbin_1.27.1

[7] RColorBrewer_1.1-2 vsn_3.38.0

[9] ggplot2_2.1.0 pheatmap_1.0.8

[11] DESeq2_1.10.1 RcppArmadillo_0.6.100.0.0
[13] Rcpp_0.12.4 SummarizedExperiment_1.0.2
[15] Biobase_2.30.0 GenomicRanges_1.22.4

[17] GenomeInfoDb_1.6.3 IRanges_2.4.8

[19] S4Vectors_0.8.11 BiocGenerics_0.16.1

[21] maptools_0.8-39 sp_1.2-2

DESeq2 and edgeR are both capable of analyzing complex designs, such as time series with interaction terms, leveraging the same base R functionality as limma: the formula class and the model.matrix function. Where did you pick up that DESeq2 or edgeR are only for two group comparisons? You can even include smoothing functions like splines for DESeq2, edgeR or limma equally.

DESeq and edgeR both use a negative binomial distribution to calculate p-values. Even in the example above, time is being defined as a factor and not a continuous, numeric variable.

If the goal was to identify genes that change as a function of time without the genotype information, this would be problematic. In this case, "time" is defined a little differently than I would expect (as "es", "np", and "esc"). If instead time was defined as a continuous variable (I don't know what is appropriate in this case, but if you timepoints of 1, 2, and 3 days and you wanted a single p-value to reflect the change in expression over time), then I would expect DEseq to give an error message. For example, I get this error message if I try to run analysis on just time (for a different set of samples):

If I go back to your tutorial (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4670015/#__sec31title ), the question that I am concerned about would be "Can you use DESeq2 to identify changes that occur as a function of time (in this case, minutes) in only the WT samples?"

However, I do apologize for being off-topic due to 1) the way time is defined here and 2) the fact that there are still different genotypes being compared. The tutorial is an excellent resource for people whose goal is to identify genes that change between groups at multiple time points.

Thanks for replying, and I hope I can clear some things up:

The fact that DESeq2 and edgeR use a NB and limma-voom uses a log linear model isn't really relevant here.

It's fine to use time as a continuous variable. For example, you could use a design of ~genotype + f(time) + genotype:f(time) where you use some smooth function for f, such as smoothing splines, to find genes where there is a genotype-specific change in the dependence of expression on time. Or you can use ~f(time) and compare to ~1, or whatever you like. There are many ways in base R to specific this f, ns(), bs(), poly() etc., and users need to be familiar with these approaches to produce reasonable results.

That's not an error. That's just a message.

The point of this message is, sometimes users will input a column called "treatment", with levels 1,2,3,4, etc. And this column can easily be a numeric type, using the default R functions for reading in the colData. However, that would be a big modeling mistake, treatment 4 is not (treatment 2)^2, and so on.

So this message says essentially, "are you sure that's what you want to do?". It just prints the message and let's you continue with your analysis.

In R there are 3 levels of communication with the user:

These look like: