Bug in JellyFish and DSK k-mer counting tool?
0
0
Entering edit mode
7.4 years ago
scchess ▴ 640

I have some k-mers that neither Jellyfish nor DSK was able to count. Is that a bug in the programs?

This is my script for JellyFish

  1. wget https://s3.amazonaws.com/sequins/others/ABCD.fq

  2. grep TTACATAACACCCATTGTGGCGGCTGCAAGT ABCD.fq | wc

    41 41 5166

    The 31 length k-mer is in the sample file and there are many of them.

  3. jellyfish count -m 31 -s 100M -C ABCD.fq -o 1.jf

  4. jellyfish dump 1.jf -c > 1.txt
  5. grep TTACATAACACCCATTGTGGCGGCTGCAAGT 1.txt

    returns nothing

Then I tried DSK,

  1. dsk -kmer-size 31 -abundance-min 0 -file ABCD.fq -out-dir ABCD -out ABCD.h5
  2. dsk2ascii -file ABCD.h5 -out ABCD.txt
  3. grep TTACATAACACCCATTGTGGCGGCTGCAAGT *.txt

    still returns nothing

As you can see, the 31 k-mer is in the input file (ABCD.fq) and there are many of them not being filtered away. I don't see a problem in the input file - it came out from a simulator. It is a valid FASTQ file. Why both programs weren't able to count my k-mer?

dsk jellyfish k-mer • 3.1k views
ADD COMMENT
2
Entering edit mode

Have you tried grepping for the reverse-complement?

Also, maybe you should try the BBMap package's KmerCountExact tool. It will, coincidentally, report that specific kmer by default, because it stores the higher, rather than lower, of each kmer/reverse-kmer pair. The other tools store the lower one.

ADD REPLY
2
Entering edit mode

DSK indeed will use a canonical form of the kmer that corresponds to the minimum value of the kmer and its reverse-complement. So in the example one should grep ACTTGCAGCCGCCACAATGGGTGTTATGTAA.

ADD REPLY
0
Entering edit mode

Brian. Thanks for the tip. I will give a try. I will try the BBMap tool and report back.

ADD REPLY
0
Entering edit mode

Why would both JellyFish and DSK count on reverse-complement?

ADD REPLY
3
Entering edit mode

It's faster and uses less memory to store either a kmer OR its reverse-complement instead of both, and for most purposes, it makes no difference because if a kmer exists in a genome, its reverse-complement must also exist because DNA is double-stranded. So, counting only one or the other cuts memory use in half.

ADD REPLY
2
Entering edit mode

Counting 'canonical' k-mers makes sense if your input genome assembly is incomplete..

jellyfish count --help
..
 -C, --canonical                          Count both strand, canonical representation (false)
..
ADD REPLY

Login before adding your answer.

Traffic: 1800 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6