Question

Help understanding Deseq2 '~group' contrasts

2

Entering edit mode

7.9 years ago

mbio.kyle ▴ 380

Okay, so I have been using DESeq2 for a few months now and I think I have the hang of it. I am attempting a more complicated analysis and was hoping for some guidance / to be told what I am doing is 100% wrong. I have a fairly complicated set of samples which can be summarized as the following:

I have a target gene of interest, which I can turn on and off using a drug and I have a base cell line, in which I knocked out either one or two other genes. So I have the single KO line and the double KO line. We have RNAseq data from these two with the +/- gene being on or off. Just to clarify, the drug I use to control the gene works kind of backwards, I stop giving the drug to the cells and the gene comes on. So in the first comparison the numerator or experimental condition is really the drug "negative". So we can think of the following two variables:

knockOutVar <- c("single", "double")
drugVar <- c("negative", "positive")

As a follow up experiment, we took the double knockout cell line and further knocked out the gene of interest, and prepared a complementation plasmid. Our gene of interested has 3 known binding partners (lets say bindA, bindB, bindC). So we prepared three mutant versions of our gene which are deficient in binding to one of the above genes. We deliver these genes via plasmid to the double knockout cell line in the drug negative condition (so our mutant interest gene can be active). We also have a wild type complementation plasmid with a none mutant gene of interest and the empty vector (calling it cherry since thats in on the vector). Things getting a bit more complicated now with the following variables (again we did RNAseq for each sample):

complementationVar <- c("m-bindA", "m-bindB", "m-bindC", "WT", "cherry")

Finally we have the batch effect since all of this happened over 3 sequencing runs (there is an issue here with mutant bindC being done in it's own batch but thats besides the point, I know from PCA/MDS I am gonna take it in the shorts on that one since I can't correct the batch. There are three batches:

batchVar <- c("first", "second", "third")

Okay softer reading through the DESeq2 vignette, seems like the ~group approach is the way to go. I will say that I am trying to do this all in one single ddsMF object so my final variables look like this:

knockOutVar <- c("single", "double")
drugVar <- c("negative", "positive")

# I add None here for samples from the original comparisons that aren't complemented
complementationVar <- c("m-bindA", "m-bindB", "m-bindC", "WT", "cherry", "None")
batchVar <- c("first", "second", "third")

Here is my starting code:

countMtx <- read.table("./data.mtx")
countMtx <- round(countMtx)
eNames <- colnames(countMtx)


# the 1st 28 samples are all the double KO, the last are the single
knockOutVar <- c(rep("double", 28), rep("single", 8))

# assign batch effects
batchVar <- c(rep("first",2), rep("third",2),
              rep("first",2), rep("third",2),
              rep("first",2), rep("second",2),
              rep("third",4),
              rep("first",2), rep("second",2),
              rep("first",2), rep("second",2),
              rep("first",2), rep("second",2),
              rep("first",2), rep("third",2),
              rep("first",2), rep("third",2))

complementationVar <- c(rep("None", 8),
             rep("m-bindA", 4),
             rep("m-bindC", 4),
             rep("m-bindB", 4),
             rep("Cherry", 4),
             rep("WT", 4),
             rep("None", 8))

drugVar <- c(rep("neg", 4), rep("pos", 4), rep("neg", 20), rep("neg", 4), rep("pos", 4))

## Check if correct
cbind(eNames, knockOutVar, batchVar, complementationVar, drugVar)

## Construct DEseq object
colData <- as.data.frame(cbind(knockOutVar, batchVar, complementationVar, drugVar))
colnames(colData) <- c("knockOutVar", "batchVar", "complementationVar", "drugVar")
countMtx.deseq <- DESeqDataSetFromMatrix(countData = countMtx, colData = colData, design = ~ complementationVar)

## Run DEseq (filter low count genes, require at least 2 reads)
ddsMF <- countMtx.deseq
ddsMF <- ddsMF[ rowSums(counts(ddsMF)) > 1, ]
ddsMF$knockOutVar <- factor(ddsMF$knockOutVar, levels=c("double", "single"))
ddsMF$batchVar <- factor(ddsMF$batchVar, levels=c("first", "second", "third"))
ddsMF$drugVar <- factor(ddsMF$drugVar, levels=c("pos","neg"))
ddsMF$complementationVar <- factor(ddsMF$complementationVar, levels=c("None", "m-bindA", "m-bindC", "m-bindB", "Cherry", "WT"))
ddsMF$group <- factor(paste(ddsMF$knockOutVar, ddsMF$batchVar, ddsMF$complementationVar, ddsMF$drugVar, sep="-"))
design(ddsMF) <- ~ group
ddsMF <- DESeq(ddsMF)

Things I am confused about here, I have to have a design to call DESeqDataSetFromMatrix, but the vingette has me building the group later so I just picked on variables for the first design and then make the group. Is that ok?

And the last big question comes down to making contrasts for comparisons. My goal is to do the following:

Find the genes that change when the interest gene is active. This comes from the first two comparisons drug +/- in the double and single knockouts. Remember that we are having drug negative as the numerator since it is the "treatment". My understanding about contrasts is this: resultsNames() defines a vector of groups + the intersect. I make a vector of ones and zeros for each type in the group and set ones in the correct places based on whats in resultsNames(). So for example lets define the vector corresponding to treatment negative single knockout.

resultsNames(ddsMF)
 [1] "Intercept"                      "groupdouble.first.Cherry.neg"  
 [3] "groupdouble.first.m.bindA.neg"  "groupdouble.first.m.bindB.neg" 
 [5] "groupdouble.first.None.neg"     "groupdouble.first.None.pos"    
 [7] "groupdouble.first.WT.neg"       "groupdouble.second.Cherry.neg" 
 [9] "groupdouble.second.m.bindA.neg" "groupdouble.second.m.bindB.neg"
[11] "groupdouble.second.WT.neg"      "groupdouble.third.m.bindC.neg" 
[13] "groupdouble.third.None.neg"     "groupdouble.third.None.pos"    
[15] "groupsingle.first.None.neg"     "groupsingle.first.None.pos"    
[17] "groupsingle.third.None.neg"     "groupsingle.third.None.pos"

That is awful to look at so I cleaned it up a bit for the sake of this post (removed "group" since its redundant)

Index   Group
0      Intercept
1   double.first.Cherry.neg
2   double.first.m.bindA.neg
3   double.first.m.bindB.neg
4   double.first.None.neg
5   double.first.None.pos
6   double.first.WT.neg
7   double.second.Cherry.neg
8   double.second.m.bindA.neg
9   double.second.m.bindB.neg
10  double.second.WT.neg
11  double.third.m.bindC.neg
12  double.third.None.neg
13  double.third.None.pos
14  single.first.None.neg
15  single.first.None.pos
16  single.third.None.neg
17  single.third.None.pos

Now lets make our vector for treatment negative single knockout which is single, None and neg corresponding to index 14 and index 16. So the vector:

single.neg <- c(1, rep(0, 13), 1, 0, 1, 0)
single.neg
[1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0

So everything gets a 1 for intercept and then a 1 for its group. Now we make single pos:

single.pos <- c(1, rep(0, 13), 0, 1, 0, 1)
single.pos
[1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1

Lastly, we make the contrast by subtracting numerator from denominator (remembering that negative is the numerator)

single.cont <- single.neg - single.pos
single.de <- subset(results(ddsMF, contrast=single.cont), padj < 0.05)

Is this all correct? Has anyone gotten this fair? Here's my session info.

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.2.3         RSQLite_1.0.0             
 [3] DBI_0.4                    AnnotationDbi_1.32.3      
 [5] RColorBrewer_1.1-2         vsn_3.38.0                
 [7] pheatmap_1.0.8             ggplot2_2.1.0             
 [9] DESeq2_1.10.1              RcppArmadillo_0.6.700.6.0 
[11] Rcpp_0.12.4                SummarizedExperiment_1.0.2
[13] Biobase_2.30.0             GenomicRanges_1.22.4      
[15] GenomeInfoDb_1.6.3         IRanges_2.4.8             
[17] S4Vectors_0.8.11           BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
 [1] BiocInstaller_1.20.2  futile.logger_1.4.1   plyr_1.8.3           
 [4] XVector_0.10.0        futile.options_1.0.0  zlibbioc_1.16.0      
 [7] rpart_4.1-10          preprocessCore_1.32.0 annotate_1.48.0      
[10] gtable_0.2.0          lattice_0.20-33       Matrix_1.2-6         
[13] gridExtra_2.2.1       genefilter_1.52.1     cluster_2.0.4        
[16] locfit_1.5-9.1        grid_3.2.2            nnet_7.3-12          
[19] data.table_1.9.6      XML_3.98-1.4          survival_2.39-3      
[22] BiocParallel_1.4.3    foreign_0.8-66        limma_3.26.9         
[25] latticeExtra_0.6-28   Formula_1.2-1         geneplotter_1.48.0   
[28] lambda.r_1.1.7        Hmisc_3.17-4          scales_0.4.0         
[31] splines_3.2.2         xtable_1.8-2          colorspace_1.2-6     
[34] affy_1.48.0           acepack_1.3-3.3       munsell_0.4.3        
[37] chron_2.3-47          affyio_1.40.0

RNA-Seq Deseq2 DE • 3.7k views

ADD COMMENT • link updated 3.9 years ago by Biostar 20 • written 7.9 years ago by mbio.kyle ▴ 380

score 1 · Answer 1 · 2016-05-11

Make your life easier and build group first and then make the DESeqDataset. You can certainly do things this way, but the more steps you have the more likely you are to screw something up.
Why are you rounding your counts? What you even inputting if they need to be rounded (please don't say FPKMs)?
Don't include the intercept in your contrasts. Your example contrast is more easily written c(rep(0, 14), 0.5, -0.5, 0.5, -0.5). That'll give you the same answer (or should, the coefficients in the model matrix will all be doubled in your case, but that'll cancel out), but skips a lot of typing and and is easier to debug.
You should really be collaborating with a statistician or bioinformatician for something like this.