Memory requirements of the Velvet tool (de novo assembly)
9.4 years ago

Hi folks, I'm using Velvet for genome assembly, but a problem arose. The species has a genome of approximately 200 Mb, and I have 4 lanes of data, each with 35 million 76 bp PE read pairs (~100x coverage in total). When I ran Velvet with k=31 and default parameters, the 64 GB RAM server used almost 100% of its memory and was barely reachable over ssh. So here is the question: how much RAM does a server need to run Velvet or SOAPdenovo2?
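For reference, a default k=31 Velvet run of the kind described looks roughly like this (a sketch; directory and file names are placeholders, and older Velvet builds may not support -separate, in which case the pairs must be interleaved):

    velveth asm_k31 31 -shortPaired -fastq -separate lane1_R1.fq lane1_R2.fq
    velvetg asm_k31

velvetg, which builds and resolves the de Bruijn graph, is typically the memory-intensive step.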

denovo-assembly
9.4 years ago
rtliu ★ 2.2k

Using the Velvet memory calculator (note this is for a single lane, i.e. 35 million pairs = 70 million reads, hence 26.6x rather than the full 100x):

Memory Usage and Coverage

Memory      32 GB
Coverage    26.6x
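For reference, the calculator is based on a published regression (the coefficients below are assumed from Simon Gladman's Monash page, which the calculator implements; treat the output as a rough estimate of peak velvetg RAM in kB):

    # read length 76, genome size 200 Mb, 70 million reads (35M pairs), k = 31
    echo $(( -109635 + 18977*76 + 86326*200 + 233353*70 - 51092*31 ))
    # -> about 33,000,000 kB, i.e. roughly the 32 GB shown above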

As Brian said, sequencing errors, adapter contamination, a heterozygous genome, etc. will all increase the memory requirement.

SOAPdenovo2 needs a lot less RAM than Velvet; the most RAM-efficient way to run it is the sparse_pregraph command, which constructs a sparse kmer graph.
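A hedged sketch of that mode (flag names follow the sparse_pregraph usage message, where -z is the estimated genome size in bp and soap.config is a standard SOAPdenovo config file; check the exact options in your build):

    SOAPdenovo-63mer sparse_pregraph -s soap.config -K 31 -z 200000000 -p 8 -o sparse_asm
    SOAPdenovo-63mer contig -g sparse_asm

After the sparse pregraph, the usual contig/map/scaff steps follow.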

9.4 years ago

The memory requirement tends to increase with the number of unique kmers, so the more data, the bigger the genome, and the higher the error rate, the more memory will be needed. (A single mid-read sequencing error creates up to k spurious kmers, which is why the error rate matters so much.)

Thus, you can reduce the memory requirements (and often get a better result) by quality-trimming or filtering, contaminant removal (both synthetic and natural, such as human contamination), and adapter-trimming. After that, you can further decrease memory requirements by error-correction, and by subsampling or normalizing the input data to a much lower level. And, ultimately, you will probably get a much better assembly with a kmer longer than 31; perhaps around 41-49 with high-coverage 76 bp reads. Sometimes it's also useful to split out the ribosomal, mitochondrial, and chloroplast parts of the genome (which may have a much higher coverage than the rest) and assemble them separately; this is often possible by depth-binning.
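As a rough illustration of the trimming steps (file names and the adapter reference are placeholders; the flags are standard BBDuk options, but verify them against your version's usage message):

    # Adapter-trim from the right (ktrim=r), using pair overlap (tbo/tpe),
    # then quality-trim both ends to Q10 and drop reads under 40 bp.
    bbduk.sh in1=reads_R1.fq in2=reads_R2.fq out1=trim_R1.fq out2=trim_R2.fq \
        ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe qtrim=rl trimq=10 minlen=40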

Sometimes you can see contamination peaks in the insert-size histogram (for synthetic contaminants) or the GC histogram (for genomic contaminants). BLASTing a few thousand reads against nt can often tell you which contaminants may be present. If your reads overlap, you can generate an insert-size histogram with BBMerge and look for very sharp peaks, which are typically synthetic contaminants.
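Generating such a histogram looks roughly like this (ihist= is per the BBMerge documentation; file names are placeholders):

    # ihist.txt gets the insert-size distribution inferred from pair overlaps;
    # very sharp spikes usually indicate synthetic contaminants.
    bbmerge.sh in1=trim_R1.fq in2=trim_R2.fq ihist=ihist.txt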

You can do quality-trimming, filtering, contaminant removal, adapter-trimming, subsampling, and GC histogram generation with BBDuk. For human removal (or other genomic contamination from large genomes with references), I suggest BBMap instead, as it has higher specificity. After trimming and contaminant removal, you can do error-correction and normalization with BBNorm to reduce coverage and selectively concentrate real genomic kmers; or simply subsample. If you normalize, a target depth of 30x to maybe 60x is probably optimal for Velvet, though it depends on the kmer size you use for assembly (a bigger kmer needs more coverage) and on whether the genome is diploid.
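A hedged sketch of those two steps (the BBMap settings follow the commonly posted high-specificity human-removal recipe; hg19.fa and the file names are placeholders, so verify the flags against your BBTools version):

    # Map against the human reference and keep only unmapped pairs (outu).
    bbmap.sh ref=hg19.fa in=trim_R1.fq in2=trim_R2.fq \
        outu=clean_R1.fq outu2=clean_R2.fq \
        minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2

    # Error-correct (ecc) and normalize to ~40x; kmers seen fewer than
    # 5 times are treated as likely sequencing errors.
    bbnorm.sh in=clean_R1.fq in2=clean_R2.fq out=norm_R1.fq out2=norm_R2.fq \
        target=40 min=5 ecc=t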

These are all part of BBTools, and each has a shell script (bbduk.sh, bbmap.sh, bbmerge.sh, and bbnorm.sh) that will display usage information when run without arguments.

