Hi all,
I am not sure I understand correctly how we are supposed to use the vg pipeline when we don't have a variant set yet (in vcf format).
My objective is to use vg to investigate SVs and SNPs of mitochondrial sequences in a species complex. In the end I am interested in the coverage of each variant (ossibility of multiple mitochondria in those species). Maybe using graph genomes for mitochondria is not the best option either, but I wanted to avoid assembly issues (for example chimeras or misassembly of repetitive regions) when several mitochondria are found in one individual.
What I did so far from what I could grasp:
- Get mitochondrial sequences from genbank and produce a starting graph using pggb (this will capture some existing variants, but not all) + index this graph
- map PE reads with vg giraffe for hundreds of individuals
- augment the graph for each individual
- vg call for each individual producing an individual vcf
From there, I thought I could merge the vcfs, build a new graph containing all variation, map a second time and finally call a second time with all variation (then merge again all vcfs). However there does not seem to be a way to augment the reference graph (.vg format) using the merged vcf (vg construct only takes fasta sequences).
Am I misled in thinking I need to remap everything a second time? Is the first call enough to capture all variation in all individuals?
For example the answer to this issue suggests this is the way to proceed. Will I be missing variants by stopping at merging the first called vcfs. How do I get the same set of variants from all individuals from that point? I also need coverage of de novo variants identified in some individuals in the other individuals, which I don't get by stopping at the first call.
I will gladly take any advice or reference on this, thanks !