Question

RNAseq sva-seq drops number of DE genes?

0

Entering edit mode

6.0 years ago

brismiller ▴ 50

Hey everybody,

I have a question about the Surrogate Variables Analysis(SVA) implementation. I have an RNA-seq dataset counts matrix that I am piping into DESeq2 to identify differentially expressed genes. When the RNA libraries were prepared the cells were grown in batches with one wild-type strain per batch. I have included this variable into the DESeq2 ~Design as Sample_Match. To essentially explore and see if there were any other batch effects I implemented svaseq on the counts matrix and included 2 of the significant surrogate variables found to see if the results changed.

My assumption was that the surrogate variables found by svaseq would capture variations in my dataset not explained by biological variations, and thus the number of differentially expressed genes would increase. When I add two surrogate variables to the DESeq2 analysis two of the three genotypes have more differentially expressed genes while the third had less (~200 of the original ~1800 genes).

In this case is svaseq capturing biological variation? And am I implementing it correctly?

Relevant code below:

#  Dependencies
require(Rsamtools)
require(DESeq2)
require(GenomicFeatures)
require(GenomicAlignments)
require(genefilter)
require(pheatmap)
require(RColorBrewer)
require(ggplot2)
require(sva)
require(VennDiagram)

# Import & pre-process Count Data for DESeq2----------------------------------------------------------------------------------------

# Load sample information csv
csvfile <- file.path("sample_Info.csv")
sampleTable <- read.csv(csvfile) #  read csv

# Load count data (from featureCounts) and convert to matrix
countdata <- read.table("counts.txt", header=TRUE, row.names=1)
countdata <- as.matrix(countdata)

# SVA: Unsupervised calculation of surrogate variables for batch correction------------------------------------------------------
# get counts from counts file
# keep only genes with > 1 counts for all samples
countdata_filtered  <- countdata[rowMeans(countdata) > 1, ]  

# svaseq
mod1  <- model.matrix(~ sampleTable$Sample_Match + sampleTable$Strain)  # full matrix
mod0 <- model.matrix(~ sampleTable$Sample_Match)  # null matrix
# calculate number of surrogate variables (6 were found)
num_sv <- num.sv(countdata_filtered, mod1, method ="leek")  
print(c("Calculated Number of Significant Surrogate Variables = ", num_sv))
svseq <- svaseq(countdata_filtered, mod1, mod0, n.sv = 2)

# DE Analysis with DESeq-------------------------------------------------------------------------------------------

coldata <- data.frame(row.names=colnames(countdata), svseq$sv[,1], svseq$sv[,2], 
                      factor(sampleTable$Sample_Match), sampleTable$Strain)
colnames(coldata) <- c("unsupervised_sv1","unsupervised_sv2", "Sample_Match", "Strain")
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, 
                              design= ~ unsupervised_sv1 + unsupervised_sv2 
                              + Sample_Match + Strain)
dds <- DESeq(dds)

# Generating Contrast and Results Tables from DESeq2-----------------------------------------------------------------------
#  Removes any gene in the count matrix with 0 or 1 counts
dds <- dds[ rowSums(counts(dds)) > 1, ] 

##  ΔRSP1 vs wt  ##
res_P1 <- results(dds, contrast=c("Strain","ΔRSP1","wt")) 
#summary(res_P1,alpha=0.05)

##  ΔRDF2 vs wt  ##
res_F2 <- results(dds, contrast=c("Strain","ΔRDF2","wt"))
table(res_F2$padj < 0.05)  #total genes with padj<0.05

##  ΔRDN2 vs wt  ##
res_N2 <- results(dds, contrast=c("Strain","ΔRDN2","wt"))
table(res_N2$padj < 0.05)  #total genes with padj<0.05

Here is the sample information from the csv file:

Sample,Strain,Sample_Match
F21,ΔRDF2,1
F22,ΔRDF2,2
F23,ΔRDF2,3
N22,ΔRDN2,4
N23,ΔRDN2,5
P11,ΔRSP1,1
P12,ΔRSP1,2
P13,ΔRSP1,3
SB1,wt,1
SB2,wt,2
SB3,wt,3
SB4,wt,4
SB5,wt,5

RNA-Seq SVA svaseq R • 2.7k views

ADD COMMENT • link updated 6.0 years ago by Kevin Blighe 87k • written 6.0 years ago by brismiller ▴ 50

score 3 · Accepted Answer · 2018-04-19

Hello,

My own take on this: do not do any adjustments using SVA-seq unless you are certain that there are effects in the data that are not related to biological effects and that are unidentified, i.e., do not use SVA-seq just for the sake of it or because someone said that you have to use it. By using it incorrectly, you run the risk of introducing more bias into your data than there originally was. Based on your wording, you do not appear confident that there are any effects other than the batch effect, which is already known.

In your situation, just include the batch variable in your DESeq2 design model, and that should be sufficient.

Kevin