Question

Make DESEq2 consider all levels in reduced model

0

Entering edit mode

2.2 years ago

totoroGirl • 0

I am trying to study the presence of a mutation in a gene and how it affects the RNA expression of various diagnosis of cancer.

The sample info file looks like:

sample diagnosis geneX_mutation_status

1 cancer_subtype1 wild_type

2 cancer_subtype2 wild_type

3 cancer_subtype3 wildtype

4 cancer_subtype1 mutated

5 cancer_subtype 2 mutated

6. cancer_subtype4 wildtype

Few things of note: total cancer subtypes: 6(levels)

Each subtype considered has both mutated and wild type representation.
The number of wild type is more within each category than the mutated thus cancer_subtype1 can have 3 samples with wild type genotype and only 1 of that mutated.

Number of levels of mutation_status: 2 (wildtype and mutant)

My design is as follows:

#first with the term:

dds <- DESeqDataSetFromMatrix(countData = countfile,
                              colData = coldata,
                              design = ~geneX_mutation_status+diagnosis+geneX_mutation_status:diagnosis)

#removing the term 

dds <- DESeq(dds, test="LRT", reduced =~geneX_mutation_status+diagnosis)

But when I try to check the results using resultnames. I can only see interaction with the wildtype genotype. The results are:

[1] "Intercept"                             
[2] "geneX_mutation_wildtpe_vs_mutant"                
[3] "cancer_subtype2_vs_cancer_subtype1"    
[4] "cancer_subtype3_vs_cancer_subtype1"     
[5] "cancer_subtype4_vs_cancer_subtype1"  
[6] "cancer_subtype5_vs_cancer_subtype1"  
[7] "cancer_subtype6_vs_cancer_subtype1"     
[8] "geneX_mutationwildtype.cancer_subtype2"  
[9] "geneX_mutationwildtypecancer_subtype3"   
[10] "geneX_mutationwildtype.cancer_subtype4"
[11] "geneX_mutationwildtype.cancer_subtype5"
[12] "geneX_mutationwildtype.cancer_subtype6"

So, first I am losing the geneX_mutationmutant category comparison. I am also seeing only a few comparisons between the disease categories in the diagnosis.

From what I understood the reduced model first creates the model with the factor you are trying to test the interaction and then without it. After which it calculates the likelihood model for checking the results being observed are indeed because of the factor we are testing.

Here I am trying to see the interaction between the mutation status on the diagnosis and hence it should consider the mutated samples as well which it isnt.

Is there something I am missing in the model, or the number of samples of mutated vs wild type in each diagnosis effecting this!

Kindly help!

Rnaseq • 707 views

ADD COMMENT • link 2.2 years ago by totoroGirl • 0

score 1 · Answer 1 · 2022-03-18

1

Entering edit mode

2.2 years ago

i.sudbery 19k

In this model, each of the results names represents the change from the baseline (or intercept). In this case, the baseline is mutatant, subtype 1 - so resultsName 2 is "geneX_mutation_wildtype_vs_mutant" or "How the wildtype affects expression compared to the mutation". resultsNames 3-7 compare each of subtypes 2-6 to subtype 1 or "How is expression in subtype2[3/4/5/6] different from subtype1?". Finally, the interaction terms in 8-12 are asking the question "How is the affect of wildtype on expression different in subtype2[2/3/4/5/6] to what the effect is in subtype1?"

Eitherway, as this is an LRT, the resultsNames don't really have a lot of meaning. The p-values will always be the same, irrepsective of which resultsName you use because the P-value fitted in an LRT asks does the full model (as a whole) better explain the variance significanly better than the reduced model (as a whole). The indevidual comparisons for the terms of the full model are not relevant to that question.

ADD COMMENT • link 2.2 years ago by i.sudbery 19k

0

Entering edit mode

Hi! thanks a lot for your answer,

but how is the baseline defined? because I would want to know how the mutant is affecting the expression compared to wild type not vice versa.

Also I do not understand "Eitherway, as this is an LRT, the resultsNames don't really have a lot of meaning. The p-values will always be the same, irrepsective of which resultsName you use because the P-value fitted in an LRT asks does the full model (as a whole) better explain the variance significanly better than the reduced model (as a whole)"

Thanks again!!

ADD REPLY • link 2.2 years ago by totoroGirl • 0

2

Entering edit mode

By default the reference level of any factor is the one that comes first alphabetically. To change this you need to either explicitly set the reference level for the factor, or explicitly define the order when asking for the contrast.

To explicitly set the order you might do:

> colData$geneX_mutation_status <- factor(colData$geneX_mutation_status, levels = c("wildtype", "mutant"))

and use this to create the dds.

Or relevel after the dds is created:

> dds$geneX_mutation_status <- relevel(dds$geneX_mutation_status, ref="wildtype")

See here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Also I do not understand "Eitherway,....

The LRT acts like an annova. It tells you whether the model full explains a significantly better fraction of the variance than the model reduced.

The full model is fit, so that one finds the best coeffs, such that:

count = g(coeffs[1]+ coeffs[2]*geneX_mutation_status + coeffs[3]*subtype2 + .... + coeffs[8]*subtype2*geneX_mutation_status + ....)

is as accurate as possible. The likelihood of the data under this fitted model is then calculated.

Next the reduced model is fit, so that one finds the best coeffs, such that:

count = g(coeffs[1]+ coeffs[2]*geneX_mutation_status + coeffs[3]*subtype2 + ....)

is as accurate as possible (note the lack of any interaction terms beta[i]*subtypeY*geneX_mutation_status).

The likelihood of the data is then calculated under this fitted model.

The ratio of the likelihood under the full model and the reduced model is then taken and a statistical test is performed on this ration (the Likelihood Ratio Test - LRT).

Note that at no point is a particular subtype or a particular mutation status selected to be reported on by the p-value, they are all looked at together. So all the p-value tells you is whether a model that includes all the interaction terms is sufficiently more accurate that one that includes none of them.