DESeq2: unable to allocate memory error
elizabeth.h ▴ 10 · 7.7 years ago

I'm running DESeq2 on small RNA sequencing data. I constructed a CSV containing the raw count of each unique small RNA sequence across all my datasets. In the past I've successfully used DESeq2 on similar data, but this time my file is bigger: the CSV is >2 GB. I'm running into memory errors using 64-bit RStudio on a Windows machine with 64 GB of RAM.

This is all I'm trying to do right now:

library(DESeq2)
# raw count matrix: unique small RNA sequences (rows) x samples (columns)
sRNA <- read.csv("deseq_input_all.txt", header=T, row.names=1)
# sample information table with the experimental grouping used in the design
coldata <- read.csv("deseq2/coldata.csv", header=T, row.names=1)
# build the DESeq2 object and run the standard analysis
dds <- DESeqDataSetFromMatrix(countData = sRNA, colData = coldata, design = ~ group)
dds <- DESeq(dds)

However, at the DESeq() step RStudio maxes out the memory and stops. Am I doing something silly in the code above, or is my data simply too big for this analysis? Would it be worthwhile to run it on a Linux machine, or to run R separately from RStudio?

Any tips or advice is appreciated.

EDIT: There are 34 million rows of data.

> dim(sRNA)
[1] 34760467       21
Tags: R • software error • RNA-Seq

I'd tend to agree with @Carlo Yague's first point: 2 GB of raw counts seems odd to me. Can you show the output of dim(sRNA)?


Unless you have thousands of samples, the counts table should not be 2 GB.


There are 34 million rows (unique sequences) in the count table.


I have a feeling that you "counted" unique reads in a fastq file. That's not going to be useful for you. Align those to a genome, generate counts with featureCounts or htseq-count on the resulting alignments and then use the counts from that. You'll suddenly find that you only have a few tens of thousands of rows, which makes rather more biological sense.
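If it helps, here is roughly what that workflow looks like from within R (a minimal sketch only: Rsubread is one package that provides featureCounts, and the genome, annotation, and FASTQ file names below are placeholders, not anything from this thread; for small RNAs you would point the annotation at the appropriate small RNA/miRNA features):

library(Rsubread)

# build an alignment index from the reference genome (one-time step)
buildindex(basename = "genome_index", reference = "genome.fa")

# align each sample's reads; align() writes one BAM file per FASTQ
fastqs <- c("sample1.fastq.gz", "sample2.fastq.gz")
bams   <- sub("\\.fastq\\.gz$", ".bam", fastqs)
align(index = "genome_index", readfile1 = fastqs, output_file = bams)

# summarise the alignments into per-feature counts using a GTF annotation
fc <- featureCounts(files = bams,
                    annot.ext = "annotation.gtf",
                    isGTFAnnotationFile = TRUE,
                    GTF.featureType = "exon",
                    GTF.attrType = "gene_id")

# fc$counts is a features x samples matrix suitable for DESeqDataSetFromMatrix()
head(fc$counts)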


Exactly. Just to clarify, the counts table should have samples as columns and genes as rows, so 20k-50k rows depending on annotation (for human/mouse).

ADD REPLY
0
Entering edit mode

Correct. For small RNAs the number of "genes" might be a bit different, but that's the gist. BTW, if you're mostly interested in a single type of small RNA, there are dedicated programs for most of them (e.g., mirDeep).

Carlo Yague · 7.7 years ago

I have two suggestions:

1) >2 GB is really big. Are you sure your data is what you think it is?

2) Before calling DESeq(), filter out rows with low expression to reduce the size of your dataset (see also the sketch below). For instance:

dds <- dds[ rowSums(counts(dds)) > 1, ]

See this tutorial for more information.
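Building on point 2: filtering the raw matrix before DESeqDataSetFromMatrix() is even called may help more with the memory problem, since the 34-million-row object is then never built. A minimal sketch, assuming the sRNA and coldata objects (and DESeq2) from the question:

# drop sequences with essentially no counts before building the DESeq2 object
keep <- rowSums(sRNA) > 1
sRNA_filtered <- sRNA[keep, ]

dds <- DESeqDataSetFromMatrix(countData = sRNA_filtered,
                              colData = coldata,
                              design = ~ group)
dds <- DESeq(dds)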


Thanks for that! I'll try it out.

This data represents all 34 million unique sequences present in at least one dataset, and it hasn't been filtered by read count yet.

I've run this analysis in the past using the criterion that a sequence had to be present at a count of >=10 reads to be included in the count table. However, I now want to compare the size factors between the >=10 and >1 count tables to make sure that they're roughly equivalent.
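For that size-factor comparison, something along these lines might work (a minimal sketch, assuming dds was built from the unfiltered matrix; estimateSizeFactors() gives the size factors without running the full DESeq() pipeline, and rowSums() is just one way to express the two thresholds, so adjust it to match your exact filtering criterion):

# size factors for the lightly filtered table (total count > 1)
dds_low <- dds[rowSums(counts(dds)) > 1, ]
dds_low <- estimateSizeFactors(dds_low)

# size factors for the stricter table (total count >= 10)
dds_high <- dds[rowSums(counts(dds)) >= 10, ]
dds_high <- estimateSizeFactors(dds_high)

# per-sample size factors from the two filtering schemes, side by side
cbind(low = sizeFactors(dds_low), high = sizeFactors(dds_high))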

7.7 years ago

If you truly have 2 GB of integer count data, firstly I'd make sure that what you're counting is gene-like, and secondly, I'd use limma-voom, as DESeq2 won't scale up to large sample sizes very well.
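For reference, a limma-voom run on a count matrix looks roughly like this (a minimal sketch, assuming the sRNA matrix and coldata$group from the question; edgeR supplies the DGEList, filtering, and normalisation steps):

library(edgeR)
library(limma)

# counts: features x samples; group: factor of sample conditions
dge <- DGEList(counts = sRNA, group = coldata$group)

# drop very low-count features and compute normalisation factors
keep <- filterByExpr(dge)
dge <- dge[keep, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)

# voom transformation followed by the standard limma linear-model fit
design <- model.matrix(~ group, data = coldata)
v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)
topTable(fit, coef = 2)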

edit: 21 samples... What exactly are the features you're counting?


I'm attempting to use DESeq2 to identify DE small non-coding RNAs (e.g. miRNAs) between samples; the samples are 3 biological replicates of each of 7 tissues/conditions.

From what I understand based on questions I've found on Biostars about DESeq2, it's possible to use miRNA count data (for example) to identify DE miRNAs. In this case, I'm running a count table with all small RNA sequences in the dataset, and then identifying DE miRNAs from that. The counts represent the number of reads for each unique sequence in the dataset.

The process works when I use a smaller dataset (where I filter by count >=10) and produces reasonable-looking results. I now wanted to test whether I get similar results without filtering by count, which is why my count table is so large in this case.


You first need to assign all reads to specific miRNAs.

There are a few tools you can use to automate the process, such as:


The first 2 links are broken.


Yes, link rot is a problem. Luckily, Google still exists.
