Tutorial:removeBatchEffect explained using base R linear models
0
5
Entering edit mode
12 days ago
nhaus ▴ 360

Hi all,

I always had a some difficulties to understand what removing the effect of a covariates actually means and what happens "under the hood". So I decided to make a small R example that shows what limma::removeBatchEffect does, by "implementing" the method just using base R linear models. It really helped me to understand so I though I would share it:

Batch correction while preserving a biological effect

library(limma)

# Parameters
num_genes <- 100
num_samples <- 12

# Simulate some data: 100 genes and 12 samples
# Base expression levels
base_expression <- matrix(rnorm(num_genes * num_samples, mean = 10, sd = 3), nrow = num_genes, ncol = num_samples)
rownames(base_expression) <- paste("Gene", 1:num_genes)
colnames(base_expression) <- paste("Sample", 1:num_samples)

# Create batch factors
batch <- factor(rep(c("Batch1", "Batch2", "Batch3"), 4))

# Create a biological factor (treatment)
treatment <- factor(rep(c("Control", "Treatment"), 6))

# Apply consistent batch effects
batch_effects <- c(5, -4, 9) # Different consistent effects for each batch
batch_matrix <- matrix(batch_effects[as.integer(batch)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

# Apply consistent treatment effects
treatment_effects <- c(0, 15) # No effect for control, positive effect for treatment
treatment_matrix <- matrix(treatment_effects[as.integer(treatment)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

expression <- base_expression + batch_matrix + treatment_matrix

# Manual adjustment using lm() preserving biological condition
adjusted_expression_manual <- apply(expression, 1, function(gene_expression) {
    full_model <- lm(gene_expression ~ batch + treatment)
    reduced_model <- lm(gene_expression ~ treatment)
    # isolate the component of the gene expression predictions that is due to the batch effect.
    # Is the portion of gene expression that can be attributed solely to the batch, independent of the biological variable (treatment).
    batch_effect <- full_model$fitted.values - reduced_model$fitted.values
    adjusted_expression <- gene_expression - batch_effect # remove the batch effect
    return(adjusted_expression)
})


# Convert adjusted expression to a matrix for easier handling
adjusted_expression_matrix_manual <- matrix(unlist(adjusted_expression_manual), nrow = num_genes, byrow = TRUE)
colnames(adjusted_expression_matrix_manual) <- colnames(expression)
rownames(adjusted_expression_matrix_manual) <- rownames(expression)

# Adjust using limma's removeBatchEffect while preserving biological effects
design_matrix <- model.matrix(~treatment)
adjusted_expression_limma <- removeBatchEffect(expression, batch = batch, design = design_matrix)

# Comparing the two approaches
comparison <- adjusted_expression_matrix_manual - adjusted_expression_limma
print("Difference between Manual and limma Approaches:")
print(head(comparison))

Batch correction if you do not want to preserve anything

# Parameters
num_genes <- 100
num_samples <- 12

# Simulate some data: 100 genes and 12 samples
# Base expression levels
base_expression <- matrix(rnorm(num_genes * num_samples, mean = 10, sd = 3), nrow = num_genes, ncol = num_samples)
rownames(base_expression) <- paste("Gene", 1:num_genes)
colnames(base_expression) <- paste("Sample", 1:num_samples)

# Create batch factors
batch <- factor(rep(c("Batch1", "Batch2", "Batch3"), 4))

# Create a biological factor (treatment)
treatment <- factor(rep(c("Control", "Treatment"), 6))

# Apply consistent batch effects
batch_effects <- c(5, -4, 9) # Different consistent effects for each batch
batch_matrix <- matrix(batch_effects[as.integer(batch)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

# Apply consistent treatment effects
treatment_effects <- c(0, 15) # No effect for control, positive effect for treatment
treatment_matrix <- matrix(treatment_effects[as.integer(treatment)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

expression <- base_expression + batch_matrix + treatment_matrix
# Manual adjustment using lm() preserving biological condition
adjusted_expression_manual <- apply(expression, 1, function(gene_expression) {
    model <- lm(gene_expression ~ batch)
    adjusted_expression <- residuals(model) + mean(gene_expression) # Adding back the mean
    return(adjusted_expression)
})

# Convert adjusted expression to a matrix for easier handling
adjusted_expression_matrix_manual <- matrix(unlist(adjusted_expression_manual), nrow = num_genes, byrow = TRUE)
colnames(adjusted_expression_matrix_manual) <- colnames(expression)
rownames(adjusted_expression_matrix_manual) <- rownames(expression)

# Adjust using limma's removeBatchEffect while preserving biological effects
design_matrix <- model.matrix(~treatment)
adjusted_expression_limma <- removeBatchEffect(expression,
    batch = batch,
    design = matrix(1, ncol(expression), 1) # essentially says, there are no experimental conditions, all belong to same category
)


# Print the results
print("Adjusted Expression Matrix (Manual):")
print(head(adjusted_expression_matrix_manual))

print("Adjusted Expression Matrix (limma):")
print(head(adjusted_expression_limma))

# Comparing the two approaches
comparison <- adjusted_expression_matrix_manual - adjusted_expression_limma
print("Difference between Manual and limma Approaches:")
print(sum(comparison))
limma effects batch removebatcheffects • 200 views
ADD COMMENT
2
Entering edit mode

(+1) "implementing" the method just using base R linear models.: I find implementing from basic starting point to be one of the best, if not the best, way to learn what a procedure does.

ADD REPLY

Login before adding your answer.

Traffic: 1459 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6