I am trying to assemble Illumina MiSeq paired-end reads from two Plasmodium falciparum samples. The genome is only about 20 Mb. The two samples have quite similar numbers of sequence reads, each with about 900,000 read pairs. I ran SPAdes on a machine with 250 GB of memory; one of the two samples assembled successfully, while the other consistently crashes because it runs out of memory. I'm trying to understand why they behave so differently and what I can do about it. Here is the command line, which is the same for both samples, just with different FASTQ files, of course:
spades.py -t 8 -o assembly_sample1b -1 fastq/1-1.fastq -2 fastq/1-2.fastq
I'm running this on a cluster node with 250 GB of memory. For the successful job, the LSF job manager output shows that the maximum memory used was 33927 MB.
The job that keeps failing shows that 378014 MB was required, i.e. about 128 GB more than the node has available. Consequently, the job was terminated by LSF, and this keeps happening whenever I rerun it.
How can the memory requirements for two similar sized samples of reads from the same genome be so different? Can I reduce the memory requirements somehow? Speed isn't of the essence for me. Thanks!
CORRECTION The numbers of reads are actually 5.2 million for sample 1 (the one that works) and 5.6 million for sample 2. So it is a bit more than I thought, but they are still quite similar numbers, which makes me wonder why one requires so much more memory than the other. Thanks!
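For a rough sense of the coverage these read counts imply, here is a back-of-the-envelope estimate. The 2x250 bp read length is my assumption (a typical MiSeq run, not stated in the question), so substitute your actual read length:

```python
# Rough coverage estimate for the failing sample.
# ASSUMPTION: 2x250 bp MiSeq reads; adjust read_length to your data.
read_pairs = 5_600_000     # sample 2, from the correction above
read_length = 250          # bp per read (assumed)
genome_size = 20_000_000   # bp, ~20 Mb P. falciparum genome

coverage = read_pairs * 2 * read_length / genome_size
print(f"Estimated coverage: {coverage:.0f}x")  # ~140x with these numbers
```

At well over 100x, both samples are deep enough that normalization (suggested below in the thread) is worth considering.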
Have you considered normalizing the data before doing the assemblies? You may have an overabundance of coverage here.
No, I wasn't aware of that. I haven't done a lot of assemblies and the last one was years ago. So from reading the manual you are referring to, I gather that I should aim for about 100x coverage. Is that a general recommendation or does it depend on the genome in question and the assembler?
It is a general recommendation and every genome is going to be different. So you will need to experiment some.
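As a sketch of what normalization could look like, here is a BBNorm (from BBTools) invocation targeting ~100x coverage. The input/output file names are placeholders for the failing sample, and target=100 / min=5 are common starting values, not tuned recommendations:

```shell
# Digital normalization with BBNorm: cap k-mer coverage at ~100x
# and drop reads whose depth is below 5.
# File names are placeholders; point these at the failing sample.
bbnorm.sh in1=fastq/2-1.fastq in2=fastq/2-2.fastq \
    out1=fastq/2-1.norm.fastq out2=fastq/2-2.norm.fastq \
    target=100 min=5
```

Normalization typically shrinks the k-mer table substantially, which is exactly what drives SPAdes memory use.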
You can set the memory limit manually (use the -m option). The default is 250 GB. Pay attention to the tmp directory as well. What is the maximum k-mer size you are using in the assembly?

I tried various settings for -m, but whatever I use, it always tries to use more than is available. And of course the above command was run on a machine that did have 250 GB RAM, yet it ended up requesting almost 380 GB despite -m being 250 by default. Max k was (automatically) set to 77.
See if you can run KmerGenie on your raw data to find the optimal k-mer size. If it is below 77, you might want to reduce the k-mer size so that the memory requirements are lower. Usually, the higher the k-mer size, the more RAM is required. If SPAdes keeps failing, you can use MEGAHIT as an alternative; in my usage, MEGAHIT is less RAM-hungry than SPAdes.
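A hedged sketch of both suggestions, with placeholder file names for the failing sample:

```shell
# Estimate the best k from the raw reads with KmerGenie
# (it accepts a text file listing the FASTQ files).
ls fastq/2-1.fastq fastq/2-2.fastq > reads.list
kmergenie reads.list

# Alternatively, assemble with MEGAHIT, capping it at 80% of
# the machine's total memory (-m as a fraction between 0 and 1).
megahit -1 fastq/2-1.fastq -2 fastq/2-2.fastq \
    -o megahit_sample2 -t 8 -m 0.8
```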
Thanks! Wasn't aware of that tool either, will try it out!
By the way, apparently the '-m' option is only a precaution. Please follow this official thread on the out-of-memory issue: https://github.com/ablab/spades/issues/19