I'm attempting to call variants using a reference graph. I generated the graph first with minigraph:

./minigraph -xggs -t16 f1.fa f2.fa f3.fa f4.fa f5.fa > PG_1.gfa

and then converted it with vg:

./vg convert -g -a PG_1.gfa > PG_2.vg

When I try to index, I'm prompted to chop the nodes to a maximum length of 256 bp:

./vg mod -X 256 PG_2.vg > PG_3.vg

but indexing stalls when I run:

./vg index -x -p PG_3.xg -g PG_3.gcsa PG_3.vg
Building XG index
Saving XG index to Ldec_vg.xg
Generating kmer files...
Building the GCSA2 index...
InputGraph::InputGraph(): 2193921420 kmers in 1 file(s)
The PG_3.vg input is 1.3 GB in size. How much memory and how many CPUs should I allocate to generate the index and then call variants from FASTQ reads totalling approximately 1.5-2 GB?
Assuming that the graph is not too complex locally (within a 256 bp window), ~2 billion initial kmers in a single graph file should require 100-200 GB of memory and 200-300 GB of disk space in $TMPDIR.
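One practical consequence of the disk requirement: GCSA2's temporary files go to $TMPDIR, which on many clusters defaults to a small node-local disk. A hedged sketch of redirecting them to a larger scratch volume before indexing; `/scratch/$USER` is a placeholder path for your site's scratch area, and the `-b/--temp-dir` option is taken from recent vg versions, so check `vg index --help` on yours:

```shell
# Point GCSA2's temporary files at a large scratch volume.
# /scratch/$USER is a placeholder; substitute your cluster's scratch area.
export TMPDIR=/scratch/$USER/vg_tmp
mkdir -p "$TMPDIR"

# -b/--temp-dir (if your vg version has it) makes the choice explicit;
# otherwise vg falls back to $TMPDIR.
./vg index -p -x PG_3.xg -g PG_3.gcsa -b "$TMPDIR" PG_3.vg
```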
GCSA construction uses a semi-external algorithm that works best when the graph is partitioned (e.g. by chromosome) into multiple .vg files; it can then reduce memory usage significantly by loading kmers from one graph file at a time.
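The partitioned workflow could be sketched roughly as follows. The chromosome names and file layout are illustrative (how you obtain per-chromosome .vg files depends on your construction pipeline), and passing several .vg files to a single `vg index` invocation follows the pattern documented in the vg wiki:

```shell
# Chop each per-chromosome graph to 256 bp nodes, as for the whole graph.
# chr1..chr3 are placeholders for however many components you have.
for chr in chr1 chr2 chr3; do
    ./vg mod -X 256 "${chr}.vg" > "${chr}.chopped.vg"
done

# One xg over all components, and one GCSA2 index built from the
# per-chromosome files -- kmers are loaded one file at a time,
# which is where the memory saving comes from.
./vg index -p -x all.xg chr*.chopped.vg
./vg index -p -g all.gcsa chr*.chopped.vg
```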