Hi everyone!
I've been running Kraken2/Bracken with the SILVA database on a couple of clinically related samples (metagenomes). As a sanity check, I ran the same workflow on a mock community (SRR8073716) using different confidence values (0.1, 0.3 and 0.5). The classification was 90% spot on, with only one missing genus, but I noticed a fair amount of Klebsiella, which was not supposed to be there. The abundance of Klebsiella was also high in my previous analyses, but it didn't stand out because it made sense given the nature of the sample. My question is: has anyone experienced false positive calls for Klebsiella? I couldn't find anything in the literature related to either Kraken or the database.
Thanks!
Hi Philipp,
Thanks for your feedback. I found that the 0.1 output includes most of the species the authors claim to be in their mock dataset, but it also includes some that are not supposed to be there. On the other hand, 0.2 gives fewer taxa with fewer correct calls. In both cases I get Klebsiella. I understand the problem you mention, but there are two things that I can't fully wrap my head around:
1) According to the authors, there should be no enterobacteria in the dataset; in the absence of true hits in the database, I would expect the species listed there to be classified as other taxa.
2) I've used this database with another approach (DADA2) on 16S sequences and it gives proper classification of multiple enterobacteria, so I don't see why in this case I'm getting Klebsiella false positives on very different samples.
The database is supposed to be pretty well curated and complete (I'm talking about the SILVA 138 Ref NR99). I'm running the analysis again using a --confidence of 0.15 to see if something better comes out.
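To compare the runs side by side, something like the following could pull the Klebsiella line out of each Kraken2 report. This is only a minimal sketch: it assumes the standard 6-column, tab-separated report (percentage, clade reads, taxon reads, rank code, taxid, indented name), and the function name, the example report and the taxid 9999 are made up for illustration.

```python
def genus_fraction(report_text, genus):
    """Return (clade_reads, percent) for a genus in a Kraken2-style report.

    Assumes the standard 6-column tab-separated report layout.
    Returns (0, 0.0) if the genus is not present.
    """
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, clade_reads, _taxon_reads, rank, _taxid, name = fields[:6]
        # Names are indented with spaces to show taxonomic depth, so strip them.
        if rank.strip() == "G" and name.strip() == genus:
            return int(clade_reads), float(percent)
    return 0, 0.0


# Hypothetical report content, for illustration only (taxid 9999 is invented).
EXAMPLE_REPORT = (
    " 10.00\t100\t100\tU\t0\tunclassified\n"
    " 90.00\t900\t5\tR\t1\troot\n"
    " 12.00\t120\t120\tG\t9999\t        Klebsiella\n"
)

reads, pct = genus_fraction(EXAMPLE_REPORT, "Klebsiella")
print(f"Klebsiella: {reads} reads ({pct}%)")
```

Running this over the report from each --confidence setting would show whether the Klebsiella fraction drops as the threshold goes up or stays roughly constant.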
The false-positive Klebsiella hits you're seeing: are they to several distinct Klebsiella genomes/isolates, or to just a few? If contamination in a reference Klebsiella genome is the explanation, then you should see hits to only that one genome, not to all of them. In that case the solution would be to remove the contaminated Klebsiella genome.
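One possible first step for checking this is to pull out the IDs of the reads that Kraken2 assigned to Klebsiella, so they can be aligned against individual reference sequences afterwards. A rough sketch, assuming the standard per-read Kraken2 output (C/U, read ID, taxid, length, LCA mapping) produced without --use-names; the function name, example lines and taxid 9999 are placeholders, not values from any real database.

```python
def reads_assigned_to(kraken_output_text, taxid):
    """Return IDs of classified reads whose assigned taxid matches `taxid`.

    Assumes the standard tab-separated per-read Kraken2 output, where
    column 1 is C/U, column 2 the read ID and column 3 the numeric taxid.
    """
    wanted = str(taxid)
    read_ids = []
    for line in kraken_output_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, assigned = fields[0], fields[1], fields[2]
        if status == "C" and assigned.strip() == wanted:
            read_ids.append(read_id)
    return read_ids


# Hypothetical per-read output, for illustration only (taxids are invented).
EXAMPLE_OUTPUT = (
    "C\tread1\t9999\t250\t9999:100 0:116\n"
    "U\tread2\t0\t250\t0:216\n"
    "C\tread3\t1234\t250\t1234:216\n"
)

print(reads_assigned_to(EXAMPLE_OUTPUT, 9999))
```

The resulting read IDs could then be used to subset the FASTQ and compare those reads against the individual Klebsiella reference sequences.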
SILVA reportedly has some issues with species-level curation:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009581
That paper goes into SILVA curation a bit. There's also a QIIME2 tutorial, https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/15494, that might help remove your false positives.
Hi Philipp,
Thanks for the suggestions. As to which genomes/isolates are giving the hits, I'm pretty unsure, since the classifier only goes down to the genus level. Even if that weren't the case, I would restrict it to genus level anyway, since I'm aware that species-level classification is sometimes flaky. I'll try the QIIME reformatting and see if that helps. Many thanks again!