Trying to understand the maths behind one Limma function
1
5
Entering edit mode
4.0 years ago
Mozart ▴ 330

Hi there,

After having read the vignette, I am really keen to understand in depth how the Limma functions 'removebatcheffect' actually works. Looked at the function:

#  removeBatchEffect.R

#  A refinement would be to empirical Bayes shrink
#  the batch effects before subtracting them.

removeBatchEffect <- function(x,batch=NULL,batch2=NULL,covariates=NULL,design=matrix(1,ncol(x),1),...)
#  Remove batch effects from matrix of expression data
#  Gordon Smyth and Carolyn de Graaf
#  Created 1 Aug 2008. Last revised 1 June 2014.
{
    if(is.null(batch) && is.null(batch2) && is.null(covariates)) return(as.matrix(x))
    if(!is.null(batch)) {
        batch <- as.factor(batch)
        contrasts(batch) <- contr.sum(levels(batch))
        batch <- model.matrix(~batch)[,-1,drop=FALSE]
    }
    if(!is.null(batch2)) {
        batch2 <- as.factor(batch2)
        contrasts(batch2) <- contr.sum(levels(batch2))
        batch2 <- model.matrix(~batch2)[,-1,drop=FALSE]
    }
    if(!is.null(covariates)) covariates <- as.matrix(covariates)
    X.batch <- cbind(batch,batch2,covariates)
    fit <- lmFit(x,cbind(design,X.batch),...)
    beta <- fit$coefficients[,-(1:ncol(design)),drop=FALSE]
    beta[is.na(beta)] <- 0
    as.matrix(x) - beta %*% t(X.batch)
}

My understanding is that when you have just one known source of bias (say, different days in which an experiment was performed), covariates is = null. Thus, the function will be the following:

batch <- as.factor(batch)
contrasts(batch) <- contr.sum(levels(batch))
batch <- model.matrix(~batch)[,-1,drop=FALSE]

once a design matrix has been created..

fit <- lmFit(x,cbind(design,X.batch),...)
beta <- fit$coefficients[,-(1:ncol(design)),drop=FALSE]
beta[is.na(beta)] <- 0
as.matrix(x) - beta %*% t(X.batch)

So that's the hard bit for a lay person... So once you have your 'dds' object (I am using DESeq2), design formula (es. ~condition), Limma's function adds the batch object just created and then it fits a linear model for each gene in the dataset.

  1. Is that correct?
  2. can anyone explain what is the beta object in this case? I guess the following strings will add the beta variable to the equation?

Sorry if my questions may be very naive for you but I am really trying to understand every single steps and in the meantime trying to study a bit of stats.

limma RNA-Seq batch-effect • 3.8k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
10
Entering edit mode
4.0 years ago

Obviously limma should not be used to model the raw or normalised counts from DESeq2 due to the fact that these counts will follow a negative binomial distribution, whereas limma's removeBatchEffect() function [and limma, generally speaking] will expect that your data is normalised and follows a normal distribution (to be specific: that the residuals after fitting the model follow a normal distribution).

In a practical setting, the following would occur:

  1. via whichever means, the user identifies a batch effect in their data, batch
  2. user includes batch in the design formula when using DESeq2, for example, ~ condition + batch
  3. when deriving test statistics for the main condition of interest, condition, DESeq2 will, internally, treat batch as a covariate and 'adjust' the test statistics for condition, including the p-values, obviously. We could just as easily derive test statistics for batch, and these would conversely be adjusted for condition. This is how covariate-adjustments work in regression models.
  4. user then wishes to use this data for downstream applications; so, transforms the normalised counts via rlog(..., blind = FALSE) or vst(..., blind = FALSE)

-----------

Now, we have performed our differential expression analysis using DESeq2 and adjusted for the effect of batch in the test statistics output only - no modification to our expression data has been made for batch, or, indeed, anything else in our design formula. However, for our downstream programs / methods, like clustering, PCA, and other cool stuff, we want to completely eliminate the batch effect directly from our expression data; so, we use:

limma::removeBatchEffect(assay(vst), batch = colData(dds)[,batch'])

Limma, via the code that you have posted above, will estimate the effect of batch for each gene by fitting a model of form ~ batch [for each gene], and then use these estimates ('beta coefficients') to directly modify the expression levels in your assay(vst) object. So, the 'beta' to which you inquire is a measure of the effect of batch for each gene. In a typical regression scenario, we would derive odds ratios from these beta coefficients - the odds ratio is the exponent of the beta coefficient.

Trust that this helps.

Kevin

ADD COMMENT
4
Entering edit mode

Specifically, a linear model is an equation of the form:

y = beta0 + beta1*x1 + beta2*x2 + beta3 * x3  + .... + epsilon

Lets say you have case where you have gene expression (as assay(vst), as Kevin says), so y is assay(vst). The xs represent the predcitors, so if formula is ~condition + batch, and batch can be "batch1", "batch2" or "batch3", then x1 will be 0 for WT and 1 for treated, x2 will be 1 if batch is "batch2" and x3 will be 1 if batch is "batch3". We often reffer to the x's in the vector form X.

The betas are reffered to as the coefficients. In the sort of gene expression models we are looking at here, where the output is log gene expression, the are equivalent to log fold changes.

episilon represents the random noise in the system - its the bit that isn't accounted for by any of our other predictors.

So we fit this model for each gene, including the batch effects. We then split it into three pieces:

 y = (Condition) + (Batch) + episilon

where

Condition = beta0 + beta1*x1
Batch = beta2*x2 + beta3 *x3

and say that if we only want to only get the condition dependent part of the gene expression, is can do:

Condition + epilison = y - (Batch) = y - beta2 * x2 + beta3*x3

We say that this is the "residual" after regressing out the effect of batch.

this is what is happening in this lines:

    beta <- fit$coefficients[,-(1:ncol(design)),drop=FALSE]

coefficients is a matrix that contains all the betas (beta0, beta1, beta2, beta3) for all the genes. This gets only those columns associated with batch.

then in this line:

as.matrix(x) - beta %*% t(X.batch)

We multiple the betas by the xs (%*% is matrix multiplication in R, so we multiple the 20,000 x 2 beta matrix by the 2 x nt(X.batch)` matrix to get a 20,000 x n matrix batch effects for each of the 20,000 genes in each of the n samples, and then subtract them from the expressions to leave only the Condition effects.

ADD REPLY
0
Entering edit mode

Thanks Kevin and i.sudbery for the big help in explaining me the nitty gritty details of this code. Sorry if this sounded like somewhat an offence for someone; for sure, didn't intend to open a debate.

Thanks you all for providing best explanation on the topic.

ADD REPLY
4
Entering edit mode

No problem, trying to understand the math is good. I was a worried earlier that you might get distracted by focusing on code steps you didn't understand to the detriment of understanding the concept and the math (because I see that happen often). But the answers have included concepts so it's a good outcome.

Don't forget that help is available for very function step. E.g., ?lmFit would explain what fit$coefficients is.

ADD REPLY
0
Entering edit mode

Thanks a lot, then. Have a good one.

ADD REPLY
2
Entering edit mode

Sorry if this sounded like somewhat an offence for someone

Don't worry. It is a perfectly valid question and an good example for a question where a user invested time and effort. Nothing wrong with it.

ADD REPLY
0
Entering edit mode

Good to know! happy to hear that..Thanks a lot.

ADD REPLY

Login before adding your answer.

Traffic: 2662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6