Biostar Test Site

This site is used for testing only. Visit https://www.biostars.org to ask a question.

Best way to obtain reference and non-reference k-mer
12 months ago

Hi everyone,

I'm wondering what would be the best strategy to obtain the list of all k-mers in the graph, with each k-mer labeled as reference (i.e., from a specific path) or non-reference. A very nice-to-have extra feature would be to have, for each k-mer, the index(es) of its occurrence(s) in the reference (linear) space.

vg kmers ...
vg find -k ...


Both strategies have pros and cons. The former produces more succinct output, but I haven't found a way to tell whether a k-mer belongs to a path (nor have I found an easy way to obtain the index). The latter gives more detailed output, but that output is also much larger.

Thanks for any suggestion, Michele S.

vg • 332 views
12 months ago
Jouni Sirén ▴ 130

The easiest way is probably doing it outside vg:

1. Extract all kmers (kmer occurrences) with vg kmers.
2. Extract the selected paths in FASTA format with vg paths -v graph.vg -F -p path-names.txt > output.fa.
3. Determine the kmers in the paths using an external tool.
4. Compare the kmer sets with an external tool.
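
Steps 3 and 4 could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the graph k-mers have already been reduced to a collection of plain k-mer strings (the exact output format of vg kmers is not parsed here), only the forward strand of each path is considered, and the function names are hypothetical helpers, not part of vg.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def path_kmer_positions(records, k):
    """Map each k-mer to its (path_name, 0-based offset) occurrences."""
    positions = defaultdict(list)
    for name, seq in records:
        for i in range(len(seq) - k + 1):
            positions[seq[i:i + k]].append((name, i))
    return positions


def label_kmers(graph_kmers, positions):
    """Return {kmer: occurrences}; an empty list marks a non-reference k-mer."""
    return {kmer: list(positions.get(kmer, [])) for kmer in graph_kmers}


if __name__ == "__main__":
    # Toy data standing in for output.fa and for the k-mers from vg kmers.
    fasta_lines = [">ref", "ACGT", "AC"]
    graph_kmers = ["ACG", "GTT"]
    positions = path_kmer_positions(read_fasta(fasta_lines), k=3)
    for kmer, hits in label_kmers(graph_kmers, positions).items():
        label = "reference" if hits else "non-reference"
        print(kmer, label, hits)
```

For real graphs the dictionary of path k-mers can become large, so a dedicated k-mer counting tool may be a better fit; the sketch only shows the shape of the comparison and how the linear offsets fall out of it.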

The path positions for non-path kmers are not always well-defined. There is some machinery for determining them (for alignments, not kmers), but even then, we often have to make arbitrary choices or give up trying.


Thank you very much for the prompt and detailed reply. I was considering using external tools, but I wanted to be sure that there was no better way built into the toolkit.