Question

edgeR differential expression analysis - switch intercept coefficient

0

Entering edit mode

3.0 years ago

Al Murphy ▴ 30

I am trying to use edgeR for differential expression analysis of RNA-Seq count dataset. My samples are split into case and controls and I would like to know the genes that are up or down regulated in case samples (i.e. those with the condition) versus controls. However, I am having an issue where currently genes' results are related to the control samples rather than case. I can reproduce the issue in R with fake data.

The fake data has lower expression in control than case samples so we would expect all genes to be up-regulated in case samples:

#First create the expression matrix
set.seed(101)
#create data so first 50 (the controls) have lower values than second 50 samples (those with the condition)
exprDat <- cbind(matrix(round((runif(500)/10)*100),ncol=50),
                    matrix(round((1-runif(500)/10)*100),ncol=50))
colnames(exprDat) <- paste0("sample_",1:100)
rownames(exprDat) <- paste0("gene_",1:10)

#Now create the annotation dataset
targets <- data.frame("group_sample"=colnames(exprDat),
                      "case_control"=as.factor(c(rep("Control",50),
                                                        rep("Case",50))))
#create the design matrix comparing case and control
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
                    group = targets[["case_control"]])
#normalise
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
#build linear model
fit <- edgeR::glmFit(y, design = design)
#test the comparison, coef=1 is the intercept
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table

My issue with the above is that the logFC all relate to the control samples rather than the case. We can see this in the design firstly since the column is case_controlControl:

> head(design)
  (Intercept) case_controlControl
1           1                   1
2           1                   1
3           1                   1
4           1                   1
5           1                   1
6           1                   1

and then the logFC are thus stating these genes are down-regulated in control samples compared to case:

> pvals
              logFC   logCPM        LR      PValue
gene_1  -0.14418015 16.69933 2.4281485 0.119173587
gene_2  -0.03421562 16.69108 0.1422319 0.706072179
gene_3  -0.12961726 16.69159 1.9632930 0.161161580
gene_4  -0.17710527 16.68963 3.5894597 0.058147147
gene_5  -0.14551401 16.69491 2.4641372 0.116471640
gene_6   0.17585301 16.70497 4.1366713 0.041963611
gene_7  -0.05396444 16.69328 0.3514909 0.553270396
gene_8  -0.15662395 16.69380 2.8394354 0.091976525
gene_9  -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511

At first I thought this was not a problem since I could just change the factor ordering in targets so the desing matrix would create a case_controlCase which would thus be the opposite comparison, meaning the p-values would be the same but the direction of the logFC would be flipped:

#reorder levels in target
levels(targets$case_control) <- sort(levels(targets$case_control),
                                        decreasing=TRUE)
design <- model.matrix(~case_control, data = targets)
y <- edgeR::DGEList(counts = exprDat,
                    group = targets[["case_control"]])
y <- edgeR::calcNormFactors(y,method = 'TMM')
y <- edgeR::estimateDisp(y, design)
fit <- edgeR::glmFit(y, design = design)
test <- edgeR::glmLRT(fit,coef=2)
pvals <- test$table

The design matrix updates as expected:

> head(design)
  (Intercept) case_controlCase
1           1                1
2           1                1
3           1                1
4           1                1
5           1                1
6           1                1

However, weirdly the genes are still related to controls:

> pvals
              logFC   logCPM        LR      PValue
gene_1  -0.14418015 16.69933 2.4281485 0.119173587
gene_2  -0.03421562 16.69108 0.1422319 0.706072179
gene_3  -0.12961726 16.69159 1.9632930 0.161161580
gene_4  -0.17710527 16.68963 3.5894597 0.058147147
gene_5  -0.14551401 16.69491 2.4641372 0.116471640
gene_6   0.17585301 16.70497 4.1366713 0.041963611
gene_7  -0.05396444 16.69328 0.3514909 0.553270396
gene_8  -0.15662395 16.69380 2.8394354 0.091976525
gene_9  -0.09823345 16.69603 1.1459499 0.284398595
gene_10 -0.30105913 16.68291 9.8090930 0.001736511

I have no clue why this is still happening since design has changed! If anyone has any clue that would be amazing as this has been wrecking my head for a while! Alternatively, if anyone has a different way to flip the logFC so it relates to the case samples rather than controls (i.e. ensure that control samples are taken as the intercept in the GLM), it would be great. Just to note, I know I could just swap the sign in the resulting table but this is something I really want to avoid and would much rather understand what is going wrong in my code above.

session info:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gridExtra_2.3               reshape2_1.4.4              data.table_1.14.0           Hmisc_4.5-0                
 [5] Formula_1.2-4               survival_3.2-9              lattice_0.20-38             ggrepel_0.9.1              
 [9] viridis_0.6.0               viridisLite_0.4.0           cowplot_1.1.1               ggplot2_3.3.3              
[13] qs_0.24.1                   edgeR_3.28.1                limma_3.42.2                purrr_0.3.4                
[17] magrittr_2.0.1              dplyr_1.0.6                 SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1
[21] DelayedArray_0.12.3         BiocParallel_1.20.1         matrixStats_0.58.0          Biobase_2.46.0             
[25] biomaRt_2.42.1              BSgenome_1.54.0             rtracklayer_1.46.0          Biostrings_2.54.0          
[29] XVector_0.26.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.1         IRanges_2.20.2             
[33] S4Vectors_0.24.4            BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] colorspace_2.0-1         ellipsis_0.3.2           htmlTable_2.1.0          base64enc_0.1-3         
 [5] rstudioapi_0.13          listenv_0.8.0            bit64_4.0.5              AnnotationDbi_1.48.0    
 [9] fansi_0.4.2              codetools_0.2-16         splines_3.6.0            cachem_1.0.4            
[13] knitr_1.33               Rsamtools_2.2.3          cluster_2.0.8            dbplyr_2.1.1            
[17] png_0.1-7                sctransform_0.3.2        BiocManager_1.30.12      compiler_3.6.0          
[21] httr_1.4.2               backports_1.2.1          assertthat_0.2.1         Matrix_1.2-17           
[25] fastmap_1.1.0            cli_2.5.0                htmltools_0.5.1.1        prettyunits_1.1.1       
[29] tools_3.6.0              gtable_0.3.0             glue_1.4.2               GenomeInfoDbData_1.2.2  
[33] rappdirs_0.3.3           Rcpp_1.0.6               vctrs_0.3.7              xfun_0.22               
[37] stringr_1.4.0            globals_0.14.0           lifecycle_1.0.0          pacman_0.5.1            
[41] XML_3.99-0.3             future_1.21.0            zlibbioc_1.32.0          MASS_7.3-51.4           
[45] scales_1.1.1             hms_1.0.0                RColorBrewer_1.1-2       yaml_2.2.1              
[49] curl_4.3.1               memoise_2.0.0            rpart_4.1-15             latticeExtra_0.6-29     
[53] stringi_1.5.3            RSQLite_2.2.4            checkmate_2.0.0          rlang_0.4.11            
[57] pkgconfig_2.0.3          bitops_1.0-7             evaluate_0.14            GenomicAlignments_1.22.1
[61] htmlwidgets_1.5.3        bit_4.0.4                tidyselect_1.1.1         parallelly_1.25.0       
[65] plyr_1.8.6               R6_2.5.0                 generics_0.1.0           DBI_1.1.1               
[69] pillar_1.6.0             foreign_0.8-71           withr_2.4.2              RCurl_1.98-1.3          
[73] nnet_7.3-12              tibble_3.1.1             future.apply_1.7.0       crayon_1.4.1            
[77] utf8_1.2.1               BiocFileCache_1.10.2     RApiSerialize_0.1.0      rmarkdown_2.7           
[81] jpeg_0.1-8.1             progress_1.2.2           locfit_1.5-9.4           grid_3.6.0              
[85] blob_1.2.1               infotheo_1.2.0           digest_0.6.27            openssl_1.4.4           
[89] RcppParallel_5.0.3       munsell_0.5.0            stringfish_0.15.0        askpass_1.1

logFC rnaseq edgeR • 1.3k views

ADD COMMENT • link 3.0 years ago by Al Murphy ▴ 30

score 0 · Answer 1 · 2021-05-09

0

Entering edit mode

3.0 years ago

Al Murphy ▴ 30

So I have found the answer I should have used the below:

targets$case_control <- relevel(targets$case_control, "Control")

to re-level the factors rather than renaming them like I did here:

levels(targets$case_control) <- sort(levels(targets$case_control), decreasing=TRUE)

ADD COMMENT • link 3.0 years ago by Al Murphy ▴ 30