Entering edit mode
6.4 years ago
vikram
▴
10
Hi,
I'm new to the field, and I'm trying to do a differential gene expression on the GTex dataset. My aim is to identify sets of genes which (with some confidence) identify each of the 50 odd tissue types in the said dataset. The dataset is (bulk) RNA-seq ~50k genes and ~12k samples. The resource I have at hand has ~50 CPU, each with 12 cores and plenty of RAM.
I have
- browsed through the DESeq2 vignettes and I feel it may be a good fit.
- Removed housekeeping genes, in the hope that it makes the task of the software a little easier.
- Put the code to run
I was wondering if
- My choice of algorithm is advisable, and
- anyone has an estimate of how much time it may take the code to run
I'd be glad to give more details, if you need it.
Thanks for reading through. :-)
Removal of house keeping genes violates the assumptions of DESeq2, as it assumes that "most genes do not change". More specifically, DESeq2 assumes that "the median ratio will capture the size relationship" (original quote from Michael Love) between the samples, as this is the basis for the geomatric mean normalization. Intuitively I would therefore assume that whatever the result of the analysis will be, it will not be accurate. You might reconsider this step.
Thanks for the reply! The only reason I removed HK genes was because the matrix seemed to be too large. (50k x 12k). It's taken several hours, and it still hasn't completed.
On which step is it getting 'stuck'? If you are trying the regularised log transformation (
rld()
), then it will take a very long time and, for large datasets, the variance stabilising transformation (vst()
) is recommended.As ATPoint says, you should not remove housekeeping genes. Most housekeeping genes don't even behave as we think the behave (i.e., their expression does alter very much between different tissues and different stages of cell activity).
So, I am doing a
DESeqDataSetFromMatrix()
, followed byDESeq()
, along the lines of this tutorialThe output that I can see thus far is:
estimating size factors
estimating dispersions
gene-wise dispersion estimates
I may be mistaken here, but I think DESeq2 requires raw values for differential expression.
I also tried running DESeq, where it is stuck in the
nbinomTest()
stage.Yes, you should be supplying raw counts to DESeq2, not FPKM/RPKM, or logged data. It should be raw integer counts.
It is quite common for it to take a while at that step, though. I'm not worried about the ~50,000 genes but more the ~12,000 samples! - that will take a very long time to process. You should just leave it overnight, even with 12 CPU cores. Just ensure that parallel is TRUE in the
DESeq()
function when you execute it?On my laptop (4 CPU cores), running that part of the process for a dataset of ~30,000 genes and 550 samples takes ~4 hours. I will hold you in high esteem if you manage to normalise 12,000 samples!
Kevin
I guess I'll wait. I have set
parallel = TRUE
, but it seems like the net CPU utilisation does not change by much, and there seems to be too much swapping because memory utlisation is close to 100%.As your case is rather specific, I would ask for advice in the DESeq2 forum at Bioconductor. Michael Love is outstandingly responsive and is for sure the best person to ask for speed optimization.