I am trying to assemble Illumina MiSeq paired-end reads from two Plasmodium falciparum samples. The genome is only about 20 Mb. The two samples have quite similar numbers of sequence reads, each with about 900,000 read pairs. I ran SPAdes on a machine with 250 GB of memory; one of the two samples assembled successfully, while the other consistently crashes because it runs out of memory. I'm trying to understand why they behave so differently and what I can do about it. Here is the command line, which is the same for both samples, just with different FASTQ files, of course:
spades.py -t 8 -o assembly_sample1b -1 fastq/1-1.fastq -2 fastq/1-2.fastq
I'm running this on a cluster node with 250 GB of memory. For the successful job, the LSF job manager output shows that the maximum memory used was 33927 MB.
The job that keeps failing shows that 378014 MB was required, i.e. about 128 GB more than the node has available. Consequently, the job was terminated by LSF, and this keeps happening whenever I rerun it.
How can the memory requirements for two similar sized samples of reads from the same genome be so different? Can I reduce the memory requirements somehow? Speed isn't of the essence for me. Thanks!
CORRECTION The numbers of reads are actually 5.2 million for sample 1 (the one that works) and 5.6 million for sample 2. So it is a bit more than I thought, but they are still quite similar numbers, which makes me wonder why one requires so much more memory than the other. Thanks!
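For a rough sense of the coverage these read counts imply, here is a back-of-the-envelope estimate. The 2x250 bp read length is my assumption (a typical MiSeq run, not stated in the question), so substitute your actual read length:

```python
# Rough coverage estimate for the failing sample.
# ASSUMPTION: 2x250 bp MiSeq reads; adjust read_length to your data.
read_pairs = 5_600_000     # sample 2, from the correction above
read_length = 250          # bp per read (assumed)
genome_size = 20_000_000   # bp, ~20 Mb P. falciparum genome

coverage = read_pairs * 2 * read_length / genome_size
print(f"Estimated coverage: {coverage:.0f}x")  # ~140x with these numbers
```

At well over 100x, both samples are deep enough that normalization (suggested below in the thread) is worth considering.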
Have you considered normalizing the data before doing the assemblies? You may have an overabundance of coverage here.
No, I wasn't aware of that. I haven't done a lot of assemblies and the last one was years ago. So from reading the manual you are referring to, I gather that I should aim for about 100x coverage. Is that a general recommendation or does it depend on the genome in question and the assembler?
It is a general recommendation and every genome is going to be different. So you will need to experiment some.
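As a sketch of what normalization could look like, here is a BBNorm (from BBTools) invocation targeting ~100x coverage. The input/output file names are placeholders for the failing sample, and target=100 / min=5 are common starting values, not tuned recommendations:

```shell
# Digital normalization with BBNorm: cap k-mer coverage at ~100x
# and drop reads whose depth is below 5.
# File names are placeholders; point these at the failing sample.
bbnorm.sh in1=fastq/2-1.fastq in2=fastq/2-2.fastq \
    out1=fastq/2-1.norm.fastq out2=fastq/2-2.norm.fastq \
    target=100 min=5
```

Normalization typically shrinks the k-mer table substantially, which is exactly what drives SPAdes memory use.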
You can set the memory limit manually (use the -m option). The default is 250 GB. Pay attention to the tmp directory as well. What is the maximum k-mer size you are using in the assembly?

I tried various settings for -m, but whatever I use, it always tries to use more than is available. And of course the above command was run on a machine that did have 250 GB RAM, yet it ended up requesting almost 380 GB despite -m being 250 by default. Max k was (automatically) set to 77.
See if you can run KmerGenie on your raw data to find the optimal k-mer size. If it is below 77, you might want to reduce the k-mer size so that the memory requirements are lower. Usually, the higher the k-mer size, the more RAM is required. If SPAdes keeps failing, you can use MEGAHIT as an alternative; in my usage, MEGAHIT is less RAM-hungry than SPAdes.
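A hedged sketch of both suggestions, with placeholder file names for the failing sample:

```shell
# Estimate the best k from the raw reads with KmerGenie
# (it accepts a text file listing the FASTQ files).
ls fastq/2-1.fastq fastq/2-2.fastq > reads.list
kmergenie reads.list

# Alternatively, assemble with MEGAHIT, capping it at 80% of
# the machine's total memory (-m as a fraction between 0 and 1).
megahit -1 fastq/2-1.fastq -2 fastq/2-2.fastq \
    -o megahit_sample2 -t 8 -m 0.8
```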
Thanks! Wasn't aware of that tool either, will try it out!
By the way, apparently the '-m' option is only a precaution. Please follow this official thread on the out-of-memory issue: https://github.com/ablab/spades/issues/19