Question

DESeq2 unequal sample sizes


6.6 years ago

glados ▴ 40

Hi,

I'm using DESeq2 to look for differentially expressed genes. I have three groups with different individuals: Treatment A, Treatment B, and Control.

Sample sizes (individuals) differ between the groups: A=9, B=5, Control=5.

If I want to test A directly against B, the sample size difference is not a problem; I understand that (it was already asked in this question: https://www.biostars.org/p/90421/).

But when I want to test A vs Control and see how that differs from B vs Control, I worry about the unequal sample sizes. I expect the A vs Control comparison to find slightly more significant genes simply because of the more accurate dispersion estimates.

How should I deal with this problem? Is 9 vs 5 not a big enough difference to worry about, or should I randomly subsample 5 individuals from treatment A (i.e. discard 4) and only use those in the comparison with the Controls?
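The subsampling option described above could be sketched roughly as follows. This is only an illustration in base R: the sample table, the IDs `S1`..`S19`, and the column names `id` and `group` are hypothetical placeholders, not taken from the actual experiment.

```r
# Hypothetical sample table: 9 x Treatment A, 5 x Treatment B, 5 x Control
samples <- data.frame(
  id    = paste0("S", 1:19),
  group = rep(c("A", "B", "Control"), times = c(9, 5, 5)),
  stringsAsFactors = FALSE
)

# Fix the seed so the random draw is reproducible
set.seed(1)

# Randomly keep 5 of the 9 Treatment A individuals (i.e. discard 4)
aIds  <- samples$id[samples$group == "A"]
keepA <- sample(aIds, size = 5)

# Balanced subset: the 5 retained A samples plus all B and Control samples
balanced <- samples[samples$group != "A" | samples$id %in% keepA, ]

# Every group should now contain 5 individuals
print(table(balanced$group))
```

Because the result depends on which 4 individuals happen to be dropped, one would probably want to repeat the draw with several seeds and check how stable the gene list is.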

Grateful for any advice.

R deseq deseq2 • 4.7k views

updated 6.6 years ago by Kevin Blighe 87k • written 6.6 years ago by glados ▴ 40

score 5 · Answer 1 · 2017-09-19

Yes, when the numbers are unequal in this way, you will see many more genes passing your FDR Q-value threshold. I faced this situation 2 years ago, but with even more unequal sample distributions. I think that it's reasonable to use different thresholds for each comparison.

Here is some code that lets you visually check how the FDR affects different comparisons by showing how many significant variables remain at each FDR level. It also gives the nominal P-value equivalent at each FDR. In this way, you can choose a good cut-off.

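# Avoid scientific notation when printing small P values
options(scipen=999)

# AVsControl is assumed to be the DESeq2 results object for A versus Control
topT <- as.data.frame(AVsControl)
AVsControl.pvalues <- topT$pvalue[!is.na(topT$pvalue)]

# Largest nominal P value that still passes 5% FDR
alpha <- 0.05
AVsControl.cutoff0.05 <- max(topT$pvalue[topT$padj<alpha], na.rm=TRUE)

# Largest nominal P value that still passes 1% FDR
alpha <- 0.01
AVsControl.cutoff0.01 <- max(topT$pvalue[topT$padj<alpha], na.rm=TRUE)

# et cetera for any other FDR levels or comparisons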

# Plot the ranked P values for A versus Control, one panel per FDR level
pdf("RankPValue.pdf")
par(mar=c(5,8,5,8), cex=0.8, cex.main=0.8, cex.lab=0.8, cex.axis=0.8, mfrow=c(1,2))

# A versus Control: sort once and reuse for both panels
sortedP <- sort(AVsControl.pvalues)

# Panel 1: genes with nominal P <= 0.05
alpha <- 0.05
showInPlot <- (sortedP <= alpha)
if (any(showInPlot))
{
    plot(
       seq_along(which(showInPlot)),
       sortedP[showInPlot],
       type="l",
       main=paste("Significance cut-off @ 5% FDR = \n", round(AVsControl.cutoff0.05,9)),
       xlab=expression(paste("Rank ", italic("P"), " value (num. sig. genes)")),
       ylab=expression(paste(italic("P"), " value")))
    # Benjamini-Hochberg reference line: P = rank * alpha / (number of P values).
    # Approximate if DESeq2's independent filtering set some padj values to NA.
    abline(a=0, b=alpha/length(AVsControl.pvalues), col="red3", lwd=1)
}

# Panel 2: genes with nominal P <= 0.01
alpha <- 0.01
showInPlot <- (sortedP <= alpha)
if (any(showInPlot))
{
    plot(
       seq_along(which(showInPlot)),
       sortedP[showInPlot],
       type="l",
       main=paste("Significance cut-off @ 1% FDR = \n", round(AVsControl.cutoff0.01,9)),
       xlab=expression(paste("Rank ", italic("P"), " value (num. sig. genes)")),
       ylab=expression(paste(italic("P"), " value")))
    abline(a=0, b=alpha/length(AVsControl.pvalues), col="red3", lwd=1)
}
dev.off()
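
As a numeric companion to the plot, the same information (how many genes remain at each FDR level, and the nominal P-value equivalent) could also be tabulated. The following is a minimal sketch in base R only. The P values are simulated purely so that the example runs on its own; with real data, `topT` would instead be `as.data.frame(AVsControl)` as above. The names `fdrLevels` and `fdrSummary` are made up for this illustration.

```r
# Simulated stand-in for a DESeq2 results table (NOT real data):
# 200 genes with mostly small P values plus 1800 uniform (null) P values
set.seed(42)
pvalue <- c(rbeta(200, 0.2, 5), runif(1800))
topT <- data.frame(pvalue = pvalue,
                   padj   = p.adjust(pvalue, method = "BH"))

# FDR levels to compare (illustrative choice)
fdrLevels <- c(0.10, 0.05, 0.01)

fdrSummary <- data.frame(
  FDR = fdrLevels,
  # Number of genes whose adjusted P value passes each FDR level
  nSignificant = sapply(fdrLevels, function(a) sum(topT$padj < a, na.rm = TRUE)),
  # Largest nominal P value among the passing genes (NA if none pass)
  nominalP = sapply(fdrLevels, function(a) {
    passing <- topT$pvalue[which(topT$padj < a)]
    if (length(passing) == 0) NA_real_ else max(passing)
  })
)

print(fdrSummary)
```

Building one such table per comparison (A vs Control, B vs Control) and placing them side by side gives a quick view of how differently the same FDR level behaves in each.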
