Biostar Test Site


Issues calling variants with vg when using other tools upstream
8 months ago

I have a graph of 10 different assemblies of the 5 Mbp salmonella genome created with pggb. As far as I can tell, the smoothed GFAs created from that tool look as expected, and aren't corrupted or anything.

I want to use vg map or GraphAligner to map some Illumina short reads (fastq files) from an SRA accession to the graph, and call variants in the SRA sample against the graph (with or without augmenting - I don't mind just genotyping the sample). However, I'm running into several roadblocks that I can't figure out.

My pipeline goes like this:

($gfa is my GFA file from PGGB's output, $fq1 and $fq2 are my SRA fastq paired read files.)

vg view -Fv $gfa > salm.vg
vg convert -g -t 38 -v $gfa > salm.vg
vg index -t 38 -x salm.xg salm.vg
GraphAligner --verbose -x vg -t 38 -g salm.vg -f $fq1 -f $fq2 -a salm.gam

vg pack -t 38 -x salm.xg -g salm.gam -Q 5 -o salm.gam.pack
vg call -d 1 -t 38 -m 1,2 -k salm.gam.pack -s SRR13061905 -a salm.xg > salm.xg.gam.vcf


There are a few issues I'm having here, in order:

1. For the first two steps, I have tried both vg view and vg convert as you can see. However, these give wildly different outputs - the "vg convert" result is over 2x larger in filesize for example. Is this intended, and if so, which is preferred when converting GFA -> vg?

2. When I attempt to create a GCSA index in the indexing step so that I can use vg map, it fails due to enormous disk usage - over 2 terabytes, which is more than my cluster allows. I understand this is expected for GCSA, since the docs state these files can be 100x or more larger than the graph. However, setting a temporary file size limit of e.g. 500G with "vg index -Z [...]" just makes it fail because the temp files still exceed the limit. Is there any way around this? I've been forced to fall back on GraphAligner until I can figure this out, which leads to my next issue...

3. vg pack seems to work fine, but "vg call" always produces a VCF that is either empty (without -a) or contains nothing but rejected no-calls at every single position, all failing the "lowad" low-depth filter. The GAM output from GraphAligner seems valid and successfully maps almost all the reads to unique positions. The stats for the GAM look good, yet vg behaves as though the GAM contains no usable alignments at all and rejects every single read. What's going on?
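For reference, this is roughly how I've been trying to narrow it down - the idea that the -Q 5 filter might be the culprit is just a guess on my part, in case GraphAligner reports low or zero mapping qualities:

```shell
# Alignment stats for the GAM (aligned read counts, identity, mapq)
vg stats -a salm.gam salm.xg

# Re-run pack WITHOUT the -Q 5 mapping-quality filter, in case
# that filter is what discards every read before calling
vg pack -t 38 -x salm.xg -g salm.gam -o salm.noQ.pack
vg call -t 38 -m 1,2 -k salm.noQ.pack -s SRR13061905 salm.xg > salm.noQ.vcf
```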

Any help with any of this is appreciated, as I'd really like to use these tools. We really want to use a variation graph instead of a reference-based approach for our project.

7 months ago
Jouni Sirén

It looks like vg convert outputs the graph in uncompressed form, while vg view compresses it with gzip. In any case, both formats are now obsolete. The default graph format is now HashGraph, which is faster and requires less memory than the old vg format. You can convert the GFA into the HashGraph format with vg convert -g -a graph.gfa > graph.vg. Any vg subcommand that expects a graph should be able to read the HashGraph format.
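In practice the conversion step then reduces to something like this (file names are placeholders; `vg stats -z` is just one way to confirm the graph was read in correctly):

```shell
# GFA -> HashGraph, the current default graph format
vg convert -g -a graph.gfa > graph.vg

# Sanity check: print node and edge counts
vg stats -z graph.vg
```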

GCSA construction may require excessive time and space if the graph is too complex. The normal solution is to simplify the graph with the vg prune subcommand. Ideally, if you have the assemblies as paths in the GFA file, you can use -u / --unfold-paths option without losing information. This is described in the "Indexing a small graph" -> "Complex graph" -> "With many paths" section of the index construction wiki page. Otherwise you will have to follow the "Without a reference or haplotypes" section, which simply removes the complex regions. If the pruning itself is too expensive, you may have to add -M 32 or even -M 16 to the vg prune command to remove high-degree nodes.
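In outline, the path-preserving construction looks like this; adjust the names to your files, and check the wiki page for the exact current flags. The GBWT step assumes your assemblies are embedded as paths in the graph:

```shell
# Build a GBWT from the embedded assembly paths
vg gbwt -x graph.vg -E -o graph.gbwt

# Prune with path unfolding; the node mapping is needed later
# to translate the GCSA back to the original graph's node ids
vg prune -u -g graph.gbwt -m node_mapping graph.vg > graph.pruned.vg

# Build the GCSA index from the pruned graph
vg index -g graph.gcsa -f node_mapping graph.pruned.vg
```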

Unfortunately I can't help with the third issue.