Question

DESeq2 for differential gene expression on GTEx dataset

1

Entering edit mode

6.4 years ago

vikram ▴ 10

Hi,

I'm new to the field, and I'm trying to do a differential gene expression on the GTex dataset. My aim is to identify sets of genes which (with some confidence) identify each of the 50 odd tissue types in the said dataset. The dataset is (bulk) RNA-seq ~50k genes and ~12k samples. The resource I have at hand has ~50 CPU, each with 12 cores and plenty of RAM.

I have

browsed through the DESeq2 vignettes and I feel it may be a good fit.
Removed housekeeping genes, in the hope that it makes the task of the software a little easier.
Put the code to run

I was wondering if

My choice of algorithm is advisable, and
anyone has an estimate of how much time it may take the code to run

I'd be glad to give more details, if you need it.

Thanks for reading through. :-)

differential-gene-expression RNA-Seq GTex DESeq2 • 2.7k views

ADD COMMENT • link updated 8 days ago by Ram 43k • written 6.4 years ago by vikram ▴ 10

1

Entering edit mode

Removal of house keeping genes violates the assumptions of DESeq2, as it assumes that "most genes do not change". More specifically, DESeq2 assumes that "the median ratio will capture the size relationship" (original quote from Michael Love) between the samples, as this is the basis for the geomatric mean normalization. Intuitively I would therefore assume that whatever the result of the analysis will be, it will not be accurate. You might reconsider this step.

ADD REPLY • link 6.4 years ago by ATpoint 81k

0

Entering edit mode

Thanks for the reply! The only reason I removed HK genes was because the matrix seemed to be too large. (50k x 12k). It's taken several hours, and it still hasn't completed.

ADD REPLY • link 6.4 years ago by vikram ▴ 10

0

Entering edit mode

On which step is it getting 'stuck'? If you are trying the regularised log transformation (rld()), then it will take a very long time and, for large datasets, the variance stabilising transformation (vst()) is recommended.

As ATPoint says, you should not remove housekeeping genes. Most housekeeping genes don't even behave as we think the behave (i.e., their expression does alter very much between different tissues and different stages of cell activity).

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

So, I am doing a DESeqDataSetFromMatrix(), followed by DESeq(), along the lines of this tutorial

The output that I can see thus far is:

estimating size factors

estimating dispersions

gene-wise dispersion estimates

I may be mistaken here, but I think DESeq2 requires raw values for differential expression.

I also tried running DESeq, where it is stuck in the nbinomTest() stage.

ADD REPLY • link 6.4 years ago by vikram ▴ 10

1

Entering edit mode

Yes, you should be supplying raw counts to DESeq2, not FPKM/RPKM, or logged data. It should be raw integer counts.

It is quite common for it to take a while at that step, though. I'm not worried about the ~50,000 genes but more the ~12,000 samples! - that will take a very long time to process. You should just leave it overnight, even with 12 CPU cores. Just ensure that parallel is TRUE in the DESeq() function when you execute it?

On my laptop (4 CPU cores), running that part of the process for a dataset of ~30,000 genes and 550 samples takes ~4 hours. I will hold you in high esteem if you manage to normalise 12,000 samples!

Kevin

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

I guess I'll wait. I have set parallel = TRUE, but it seems like the net CPU utilisation does not change by much, and there seems to be too much swapping because memory utlisation is close to 100%.

ADD REPLY • link 6.4 years ago by vikram ▴ 10

2

Entering edit mode

As your case is rather specific, I would ask for advice in the DESeq2 forum at Bioconductor. Michael Love is outstandingly responsive and is for sure the best person to ask for speed optimization.

ADD REPLY • link 6.4 years ago by ATpoint 81k