Okay, so I have been using DESeq2 for a few months now and I think I have the hang of it. I am attempting a more complicated analysis and was hoping for some guidance / to be told what I am doing is 100% wrong. I have a fairly complicated set of samples which can be summarized as the following:
I have a target gene of interest, which I can turn on and off using a drug and I have a base cell line, in which I knocked out either one or two other genes. So I have the single KO line and the double KO line. We have RNAseq data from these two with the +/- gene being on or off. Just to clarify, the drug I use to control the gene works kind of backwards, I stop giving the drug to the cells and the gene comes on. So in the first comparison the numerator or experimental condition is really the drug "negative". So we can think of the following two variables:
knockOutVar <- c("single", "double")
drugVar <- c("negative", "positive")
As a follow up experiment, we took the double knockout cell line and further knocked out the gene of interest, and prepared a complementation plasmid. Our gene of interested has 3 known binding partners (lets say bindA
, bindB
, bindC
). So we prepared three mutant versions of our gene which are deficient in binding to one of the above genes. We deliver these genes via plasmid to the double knockout cell line in the drug negative condition (so our mutant interest gene can be active). We also have a wild type complementation plasmid with a none mutant gene of interest and the empty vector (calling it cherry
since thats in on the vector). Things getting a bit more complicated now with the following variables (again we did RNAseq for each sample):
complementationVar <- c("m-bindA", "m-bindB", "m-bindC", "WT", "cherry")
Finally we have the batch effect since all of this happened over 3 sequencing runs (there is an issue here with mutant bindC
being done in it's own batch but thats besides the point, I know from PCA/MDS I am gonna take it in the shorts on that one since I can't correct the batch. There are three batches:
batchVar <- c("first", "second", "third")
Okay softer reading through the DESeq2 vignette, seems like the ~group
approach is the way to go. I will say that I am trying to do this all in one single ddsMF object so my final variables look like this:
knockOutVar <- c("single", "double")
drugVar <- c("negative", "positive")
# I add None here for samples from the original comparisons that aren't complemented
complementationVar <- c("m-bindA", "m-bindB", "m-bindC", "WT", "cherry", "None")
batchVar <- c("first", "second", "third")
Here is my starting code:
countMtx <- read.table("./data.mtx")
countMtx <- round(countMtx)
eNames <- colnames(countMtx)
# the 1st 28 samples are all the double KO, the last are the single
knockOutVar <- c(rep("double", 28), rep("single", 8))
# assign batch effects
batchVar <- c(rep("first",2), rep("third",2),
rep("first",2), rep("third",2),
rep("first",2), rep("second",2),
rep("third",4),
rep("first",2), rep("second",2),
rep("first",2), rep("second",2),
rep("first",2), rep("second",2),
rep("first",2), rep("third",2),
rep("first",2), rep("third",2))
complementationVar <- c(rep("None", 8),
rep("m-bindA", 4),
rep("m-bindC", 4),
rep("m-bindB", 4),
rep("Cherry", 4),
rep("WT", 4),
rep("None", 8))
drugVar <- c(rep("neg", 4), rep("pos", 4), rep("neg", 20), rep("neg", 4), rep("pos", 4))
## Check if correct
cbind(eNames, knockOutVar, batchVar, complementationVar, drugVar)
## Construct DEseq object
colData <- as.data.frame(cbind(knockOutVar, batchVar, complementationVar, drugVar))
colnames(colData) <- c("knockOutVar", "batchVar", "complementationVar", "drugVar")
countMtx.deseq <- DESeqDataSetFromMatrix(countData = countMtx, colData = colData, design = ~ complementationVar)
## Run DEseq (filter low count genes, require at least 2 reads)
ddsMF <- countMtx.deseq
ddsMF <- ddsMF[ rowSums(counts(ddsMF)) > 1, ]
ddsMF$knockOutVar <- factor(ddsMF$knockOutVar, levels=c("double", "single"))
ddsMF$batchVar <- factor(ddsMF$batchVar, levels=c("first", "second", "third"))
ddsMF$drugVar <- factor(ddsMF$drugVar, levels=c("pos","neg"))
ddsMF$complementationVar <- factor(ddsMF$complementationVar, levels=c("None", "m-bindA", "m-bindC", "m-bindB", "Cherry", "WT"))
ddsMF$group <- factor(paste(ddsMF$knockOutVar, ddsMF$batchVar, ddsMF$complementationVar, ddsMF$drugVar, sep="-"))
design(ddsMF) <- ~ group
ddsMF <- DESeq(ddsMF)
Things I am confused about here, I have to have a design to call DESeqDataSetFromMatrix
, but the vingette has me building the group later so I just picked on variables for the first design and then make the group. Is that ok?
And the last big question comes down to making contrasts for comparisons. My goal is to do the following:
Find the genes that change when the interest gene is active. This comes from the first two comparisons drug +/- in the double and single knockouts. Remember that we are having drug negative as the numerator since it is the "treatment". My understanding about contrasts is this: resultsNames()
defines a vector of groups + the intersect. I make a vector of ones and zeros for each type in the group and set ones in the correct places based on whats in resultsNames()
. So for example lets define the vector corresponding to treatment negative single knockout.
resultsNames(ddsMF)
[1] "Intercept" "groupdouble.first.Cherry.neg"
[3] "groupdouble.first.m.bindA.neg" "groupdouble.first.m.bindB.neg"
[5] "groupdouble.first.None.neg" "groupdouble.first.None.pos"
[7] "groupdouble.first.WT.neg" "groupdouble.second.Cherry.neg"
[9] "groupdouble.second.m.bindA.neg" "groupdouble.second.m.bindB.neg"
[11] "groupdouble.second.WT.neg" "groupdouble.third.m.bindC.neg"
[13] "groupdouble.third.None.neg" "groupdouble.third.None.pos"
[15] "groupsingle.first.None.neg" "groupsingle.first.None.pos"
[17] "groupsingle.third.None.neg" "groupsingle.third.None.pos"
That is awful to look at so I cleaned it up a bit for the sake of this post (removed "group" since its redundant)
Index Group
0 Intercept
1 double.first.Cherry.neg
2 double.first.m.bindA.neg
3 double.first.m.bindB.neg
4 double.first.None.neg
5 double.first.None.pos
6 double.first.WT.neg
7 double.second.Cherry.neg
8 double.second.m.bindA.neg
9 double.second.m.bindB.neg
10 double.second.WT.neg
11 double.third.m.bindC.neg
12 double.third.None.neg
13 double.third.None.pos
14 single.first.None.neg
15 single.first.None.pos
16 single.third.None.neg
17 single.third.None.pos
Now lets make our vector for treatment negative single knockout which is single
, None
and neg
corresponding to index 14 and index 16. So the vector:
single.neg <- c(1, rep(0, 13), 1, 0, 1, 0)
single.neg
[1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
So everything gets a 1
for intercept and then a 1
for its group. Now we make single pos:
single.pos <- c(1, rep(0, 13), 0, 1, 0, 1)
single.pos
[1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
Lastly, we make the contrast by subtracting numerator from denominator (remembering that negative is the numerator)
single.cont <- single.neg - single.pos
single.de <- subset(results(ddsMF, contrast=single.cont), padj < 0.05)
Is this all correct? Has anyone gotten this fair? Here's my session info.
sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] org.Hs.eg.db_3.2.3 RSQLite_1.0.0
[3] DBI_0.4 AnnotationDbi_1.32.3
[5] RColorBrewer_1.1-2 vsn_3.38.0
[7] pheatmap_1.0.8 ggplot2_2.1.0
[9] DESeq2_1.10.1 RcppArmadillo_0.6.700.6.0
[11] Rcpp_0.12.4 SummarizedExperiment_1.0.2
[13] Biobase_2.30.0 GenomicRanges_1.22.4
[15] GenomeInfoDb_1.6.3 IRanges_2.4.8
[17] S4Vectors_0.8.11 BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] BiocInstaller_1.20.2 futile.logger_1.4.1 plyr_1.8.3
[4] XVector_0.10.0 futile.options_1.0.0 zlibbioc_1.16.0
[7] rpart_4.1-10 preprocessCore_1.32.0 annotate_1.48.0
[10] gtable_0.2.0 lattice_0.20-33 Matrix_1.2-6
[13] gridExtra_2.2.1 genefilter_1.52.1 cluster_2.0.4
[16] locfit_1.5-9.1 grid_3.2.2 nnet_7.3-12
[19] data.table_1.9.6 XML_3.98-1.4 survival_2.39-3
[22] BiocParallel_1.4.3 foreign_0.8-66 limma_3.26.9
[25] latticeExtra_0.6-28 Formula_1.2-1 geneplotter_1.48.0
[28] lambda.r_1.1.7 Hmisc_3.17-4 scales_0.4.0
[31] splines_3.2.2 xtable_1.8-2 colorspace_1.2-6
[34] affy_1.48.0 acepack_1.3-3.3 munsell_0.4.3
[37] chron_2.3-47 affyio_1.40.0