Multi-factor analysis with DESeq: which approach is correct?
1
1
Entering edit mode
5.7 years ago

Hello,

I'd appreciate some feedback on an approach for an experiment.

I am working on an RNA-Seq problem that is exploring RNA-Seq data from samples that vary on three dimensions:

  • Sex (M or F)
  • Timepoint (1 or 2)
  • Genotype (A or B)

My colleagues and I are curious a number of different hypotheses, e.g.:

  • (Q1.) What genes are DE because of Genotype? (like a "main effect" in an ANOVA)
  • (Q2.) What genes are DE through an interaction of Genotype and Timepoint ?
  • (Q3.) What genes are DE through an interaction of Genotype, Timepoint , and Sex?
  • (...and so on).

Additionally, this data likely has batch effects I am trying to account for using SVA.

#

I've gotten a variety of pieces of advice on how to approach this problem:

[Approach 1] Use limma (although be careful how you set up the design matrix)

[Approach 2] Use edgeR

[Approach 3] Use DESeq, but set the design matrix to something like this:

myDESeqObject$SV1 <- svseq$sv[,1] # add in the sole surrogate variable detected by/calculated in SVA to design matrix
design(myDESeqObject) <- ~ SV1 + Sex + Timepoint + Genotype + Sex:Timepoint + Timepoint:Genotype + Sex:Genotype + Sex:Genotype:Timepoint # design matrix
dds <- DESeq(myDESeqObject) # perform DESeq

# look for genes that are DE as a consequence of genotype alone, regardless of influence of other factors (answering Q1)
results_genotype <- results(dds, contrast=c("Genotype", "A", "B"))

# look for genes that are DE as a consequence of genotype alone, regardless of influence of other factors (answering Q)
# (not entirely sure what to put for the specific contrasts - what I put seems like it's just looking at effect of Genotype wrt a single timepoint, so not quite answering Q2?)
results_genotype_timepoint <- results(dds, contrast=c("Timepoint:Genotype", "1.A", "1.B"))

[Approach 4] Use DESeq, but create a new variable, 'group', with 8 different factors representing all possible combinations of individual variables: M_1_A, F_1_A, M_1_B, F_2_B, and so forth. Do DESeq as normal, but use the contrast function to ask what genes are DE between two specific groups, e.g. M_1_A, F_1_A.

In this way, you don't directly find the set of genes that satisfy Q1,Q2, or Q3...but to answer Q1, at least, you can look for the genes that are called in all of these analyses (e.g. an intersection of these DE gene lists):

M.1.A vs. M.1.B ("genes affected by Genotype in Males at Timepoint 1")

M.2.A vs. M.2.B ("genes affected by Genotype in Males at Timepoint 2")

F.1.A vs. F.1.B ("genes affected by Genotype in Females at Timepoint 1")

F.2.A vs. F.2.B ("genes affected by Genotype in Females at Timepoint 2")

[Approach 5] Use Approach 4 PLUS repeat analysis in edgeR - ask what genes are in common between the two analyses - focus on that subset when looking for pathway enrichments/biological functions/etc.

#

The end goal of this RNA-Seq experiment is to identify potential biological targets for validation in a wet lab environment, but I would like to do the best possible job with this bioinformatic portion. Which approach above do you think is the best approach, given the questions we would like to answer? Are there any approaches that stand out as obviously wrong? Any feedback to that end is appreciated.

Thanks for taking the time to read!

RNA-Seq DESeq2 R • 6.8k views
ADD COMMENT
2
Entering edit mode
5.7 years ago

Any of approach 1-3 are fine and you'd use the same design matrix in all of them (you'd also get very similar results). Whether you want to use contrasts like in your example or instead use coefficients is up to you.

Approaches 4 and 5 are obviously wrong given your goals.

ADD COMMENT
0
Entering edit mode

Thanks for your reply. Hmm, I was advised by some folks that the interpretation I offered of the model in Approach 3 (e.g. "Genotype" = "main effect of Genotype") was not entirely correct/that DESeq was not truly equipped to offer a study of interacting factor effects on DE Genes, and they advised doing Approach 4/5 as a workaround. It would be nice if I could use one of those first three approaches.

If I'm interested in retrieving a list of genes that are DE according to two or three factors (i.e., Q2 and Q3), does anyone have recommendations for how to declare the contrasts? For example - if I am interested in "all genes that are differentially expressed due to an interaction of Sex and Genotype", but pull out the contrasts for 1.A and 1.B, that leaves out the genes that might be differentially expressed as a result of the interaction of 2.A and 2.B, which I would also care about.

ADD REPLY
0
Entering edit mode

Who ever told you that DESeq2 wasn't equipped for what you want is woefully incorrect. You don't actually need a contrast in approach 3 (or 1 or 2), just use the coefficients directly.

If you're interested in "all genes that are differentially expressed due to an interaction of Sex and Genotype" then just test that coefficient. You only need to use contrasts when you want to ask questions like in approach 4.

ADD REPLY
0
Entering edit mode

Thank you!

For my own understanding, when you say "use the coefficients directly", you mean I'd implement something like:

# Get coefficient names
matrix(resultsNames(dds))

     [,1]                              
 [1,] "Intercept"                       
 [2,] "SV1"                             
 [3,] "Sex_F_vs_M"                      
 [4,] "Timepoint_1_vs_2"        
 [5,] "Genotype_A_vs_B"            
 [6,] "SexF.Timepoint2"             
 [7,] "TimepointE2.GenotypeB"     
 [8,] "SexF.GenotypeB"               
 [9,] "SexF.Timepoint2.GenotypeB"

# Use coefficients to pull out a differentially expressed gene list for Q1
results_genotype_method2 <- results(dds_sva, name=c("Sex_F_vs_M"))

And when you say "test that coefficient", for example in reference to the interaction of Sex:Genotype:Timepoint, I'd do something like:

# Use coefficients to answer Q3
results_genotype_method2 <- results(dds_sva, name=c("SexF.Timepoint2.GenotypeB"))

...and the resulting DE gene list would represent fold changes etc. when the Female/Timepoint2/GenotypeB samples are the control/reference samples. Am I getting that all correctly?

Thank you for your help - hope by spelling things out for myself here it can help future google-ers.

ADD REPLY
1
Entering edit mode

More or less, though SexF.Timepoint2.GenotypeB is the threeway interaction of Sex, Timepoint and Genotype.

ADD REPLY
0
Entering edit mode

Thank you very much!

ADD REPLY

Login before adding your answer.

Traffic: 2804 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6