Biostar Beta. Not for public use.
How to filter for samples that show good concordance across replicates in DESeq2 count data from RNA-seq?
2
16 months ago
nancy • 70

I have some RNA-seq data - n=2, treated vs control. I have carried out alignment & quantification via HISAT2 > StringTie > prepDE.py > DESeq2

I would like to filter out D.E. genes that show high variance between the two replicates. But I don't seem to have an objective way of doing this.

In the screenshot shown, I'd like to retain data that looks like the top panel, while filtering out data that looks like the bottom panel. As you can see, genes with very poor concordance between reps still show up with good p-values and adjusted p-values, and I was wondering why that is.

Thanks

1

n=2, treated vs control

With only two samples, any comparison you make is entirely invalid and all the time you spend is wasted. You have no way to determine whether the variance you observe is biologically relevant variance, biological noise, or a technical issue.

The absolute minimum number of replicates per group is 3, but ideally, if budget permits, the quality of your data would be much better with 5 or more replicates per group.

0

Hi WouterDeCoster,

Thanks for showing me how to add figures to the main post. Very useful!

I quite disagree with a view as extreme as "any comparison with n=2 is entirely invalid and all the time you spend is wasted". Often, we do an exploratory n=2 to figure out things like the most appropriate timepoint, dosage, etc. There are a couple of hundred genes that show excellent concordance with n=2 and significant D.E. (and indeed are biologically meaningful), and to me that is a valid and useful takeaway from the n=2 experiment. Of course there is no doubt n>>3 would improve all measurements.

Yes, one cannot infer anything from the data points showing extreme variance when n=2, but my question was more about why such data points are not being flagged in some way with a high p-value, and are in fact assigned p-values as low as those of well-replicated data points.

6
13 months ago
i.sudbery 4.7k
Sheffield, UK

To see why these genes get called differential, you have to understand DESeq2's approach to estimating per-gene dispersion.

The raison d'être of DESeq2 and edgeR is to try to find accurate estimates of each gene's dispersion with only a small number of replicates. They do this by sharing information between genes, on the understanding that genes of a similar expression level are, on average, going to have a similar dispersion. Thus they estimate the dispersion for all the genes and fit a smoothed function to the mean-expression vs dispersion estimates; this represents the expected dispersion. Then, for each gene, they "shrink" that gene's dispersion estimate towards the expected dispersion using a Bayesian procedure. This means that genes with a small dispersion will have their dispersion increased and are therefore less likely to be significant, while genes with a large dispersion, like the ones here, will have their dispersion decreased, making it more likely they will be significant.
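You can see this shrinkage directly in your own data with DESeq2's built-in dispersion plot - a sketch, assuming `dds` is your DESeqDataSet after running `DESeq()`:

```r
library(DESeq2)

# Black points: per-gene (gene-wise) dispersion estimates
# Red curve:   fitted mean-dispersion trend (the "expected" dispersion)
# Blue points: final shrunken estimates; circled points are outliers
#              that kept their gene-wise estimate
plotDispEsts(dds)

# The individual estimates are also stored in the row metadata:
head(mcols(dds)[, c("dispGeneEst", "dispFit", "dispersion", "dispOutlier")])
```

Genes with `dispOutlier == TRUE` are the ones whose gene-wise estimate was far enough above the curve that shrinkage was not applied.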

In their paper, Love et al. state:

If, on the other hand, an individual gene’s dispersion is far above the distribution of the gene-wise dispersion estimates of other genes, then the shrinkage would lead to a greatly reduced final estimate of dispersion. We reasoned that in many cases, the reason for extraordinarily high dispersion of a gene is that it does not obey our modeling assumptions; some genes may show much higher variability than others for biological or technical reasons, even though they have the same average expression levels. In these cases, inference based on the shrunken dispersion estimates could lead to undesirable false positive calls. DESeq2 handles these cases by using the gene-wise estimate instead of the shrunken estimate when the former is more than 2 residual standard deviations above the curve.

My guess is that this safety-valve is "failing" in your case. This is probably because there is a large standard deviation in the distances of the gene-wise dispersion estimates from the expected dispersion (due to only having two reps); thus the large dispersions seen here fall less than 2 residual standard deviations from the expected curve, and the shrunken value for dispersion is used.

If you want to avoid this, you could use DESeq (rather than DESeq2) and set sharingMode="maximum". Here the maximum of the trend and the per-gene dispersion estimate is used, which is a more conservative approach (DESeq was far more conservative than the other methods of its time).
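A sketch of what that would look like with the original DESeq package (assuming a `countData` matrix and a `conditions` factor from your own pipeline; DESeq has long been superseded by DESeq2, so treat this as illustrative):

```r
library(DESeq)  # the original package, not DESeq2

cds <- newCountDataSet(countData, conditions)
cds <- estimateSizeFactors(cds)

# sharingMode = "maximum" keeps the larger of the fitted and the
# per-gene dispersion, so high-variance genes are never shrunk down
cds <- estimateDispersions(cds, sharingMode = "maximum")

res <- nbinomTest(cds, "control", "treated")
```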

0

Thanks for your detailed and well-explained response - I actually understood it, and it makes sense to me now. And thanks for cleaning up my original post; I learned something there too!

2
13 months ago
London

Check out the DESeq2 manual for Cook's distance here. Cook's distance is not applied to datasets with fewer than three replicates, but you can look at Mike Love's code sample here for an idea of how to apply it to your dataset.

maxCooks <- apply(assays(dds)[["cooks"]], 1, max)


From the Manual:

The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates. The p values and adjusted p values for these genes are set to NA. At least 3 replicates are required for flagging, as it is difficult to judge which sample might be an outlier with only 2 replicates. This filtering can be turned off with results(dds, cooksCutoff=FALSE)
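With n=2 that automatic flagging is switched off, but nothing stops you inspecting the Cook's distances yourself. A sketch, continuing from a `dds` that has been run through `DESeq()` (the 0.4 cutoff is the ad-hoc rule of thumb mentioned in this thread, not a DESeq2 default):

```r
# Per-gene maximum Cook's distance across samples
maxCooks <- apply(assays(dds)[["cooks"]], 1, max)

# Attach to the results table, inspect the distribution, and
# filter manually with your chosen cutoff
res$maxCooks <- maxCooks
summary(res$maxCooks)
resFiltered <- res[which(res$maxCooks < 0.4), ]
```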

0

Thanks Andrew, I will check this function. However, it still strikes me as odd that a p-value as low as p < 10^-5 is ascribed to a D.E. event where the two replicates show so little concordance. Since a low p-value indicates the event did not occur by chance, it is a bit counter-intuitive in these cases.

0

The variability would indeed be high for those genes, so it's difficult for the stats to make calls there. Having more replicates would certainly help, as would looking at Cook's distance to check for outliers, like Andrew mentioned. I was just curious, though: do you not believe that it's a genuine call? Are you expecting every single gene to go down (not up) in the treated samples? Regarding DESeq2, due to the way that the data is normalised (which is regarded as among the best methods for RNA-seq, currently), one usually finds a roughly even split between genes going up and down in an experiment, in terms of fold-changes.
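One quick way to eyeball that up/down split (and the noisy low-count genes) is an MA plot - a sketch, assuming `res` comes from `results(dds)`:

```r
# Significant genes are highlighted; well-normalised data usually shows
# a roughly symmetric cloud around log2 fold change = 0
plotMA(res, ylim = c(-5, 5))
```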

Another thing is to ensure that you filter out low-count genes before normalisation, although those particular genes would still pass most low-count filters that I know. The cases that you list are also 'burdened' by low counts, which can make comparisons difficult. For example, 11 and 2 is an effect-size difference of >5 between those replicates, whereas 4 and 14 is an effect-size difference of >3.

Out of curiosity, how did you generate the stats tables, and which DESeq2 version were you using? They introduced changes relatively recently, in particular relating to log2 fold change shrinkage.

Edit: what was Cook's Distance for these genes, out of curiosity? 0.4 is the typical cutoff to call outliers.

0

Thanks Kevin,

1. No, I do not believe every gene goes down in treated - sorry, I grabbed two different sets of data (one from the UP sheet and one from the DOWN sheet) as representative examples of high variance vs good replicates.

2. I didn't generate Cook's distance (I have only 2 reps?), but let me tinker with it a bit. I will update.

3. We did filter for counts >10 in at least 1 sample

4. We generated the stats table using the script below. We start with FPKMs from StringTie, run prepDE.py to generate countData, and then run the code below. DESeq2 version 1.16.1

code below:

#colData <- read.csv(PHENO_DATA, header = TRUE, sep = ",")
rownames(colData) <- colData$SAMPLE

cat("Check all sample IDs in colData are also in CountData and match their orders:")
all(rownames(colData) %in% colnames(countData))
cat("\n")
countData <- countData[, rownames(colData)]
cat("all(rownames(colData) == colnames(countData)):")
all(rownames(colData) == colnames(countData))

#Create a DESeqDataSet from count matrix and labels
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = colData,
                              design = ~ condition)

#pre-filtering
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

#Run the default analysis for DESeq2 and generate results table
dds <- DESeq(dds)
resultsNames(dds)
as.data.frame(colData(dds))
res <- results(dds)

#Sort by adjusted p-value and display
(resOrdered <- res[order(res$padj), ])


Then, we output resOrdered to a file with fold changes, etc.
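For that last step, a minimal sketch (the filename is just an example):

```r
# Coerce the DESeqResults object to a data frame and write it out
write.csv(as.data.frame(resOrdered), file = "DESeq2_results.csv")
```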