Question

Kraken2 is falsely assigning Klebsiella

0

7 months ago

SushiRoll ▴ 120

Hi everyone!

I've been running Kraken2/Bracken with the SILVA database on a couple of clinically related samples (metagenomes). As a sanity check, I ran the same workflow on a mock community (SRR8073716) using different confidence values (0.1, 0.3 and 0.5). The classification was 90% spot on, with only one genus missing, but I noticed a fair amount of Klebsiella that was not supposed to be there. The abundance of Klebsiella was also high in my previous analyses, but it didn't stand out there because it made sense given the nature of the samples. My question is: has anyone experienced false-positive calls for Klebsiella? I couldn't find anything in the literature related to either Kraken or the database.

Thanks!

taxonomy Kraken2 • 958 views

score 2 · Answer 1 · 2023-10-04

2

7 months ago

Philipp Bayer 8.5k

I've also had false-positive hits with Kraken; a --confidence of 0.1 usually gets rid of most of them.
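For anyone wondering what that threshold measures: as I understand the Kraken2 manual, the score for a label is the number of k-mers whose LCA falls in the clade rooted at that label, divided by the number of k-mers without an ambiguous base, and a read whose score is below --confidence gets moved up the tree or left unclassified. Below is a rough Python sketch that recomputes this fraction from the last column of the per-read output. The function name, the example string and the taxids are made up for illustration ("570" and "543" are only placeholders for a genus and its family; a SILVA-based database uses its own IDs), and the clade is passed in as a plain set instead of being looked up in a taxonomy.

```python
def confidence_score(lca_field, clade_taxids):
    """Fraction of unambiguous k-mers whose LCA lies in the given clade.

    lca_field    -- last column of a Kraken2 per-read line, e.g. "570:6 543:20 0:4 A:3"
    clade_taxids -- set of taxid strings forming the clade (assumed to be supplied by the caller)
    """
    in_clade = 0
    total = 0
    for token in lca_field.split():
        if token == "|:|":   # separator between the two mates of a paired read
            continue
        taxid, count = token.rsplit(":", 1)
        if taxid == "A":     # k-mers with an ambiguous base are left out of the total
            continue
        total += int(count)  # taxid "0" (no hit) still counts towards the total
        if taxid in clade_taxids:
            in_clade += int(count)
    return in_clade / total if total else 0.0


# Hypothetical read: 6 k-mers hit the genus, 20 hit only the family, 4 hit nothing.
example = "570:6 543:20 0:4 A:3"
print(confidence_score(example, {"570"}))         # genus-level label
print(confidence_score(example, {"570", "543"}))  # family-level label (family plus the genus below it)
```

In this made-up case only 6 of the 30 unambiguous k-mers support the genus, so the genus call would not survive a threshold of 0.3, while the family-level label would.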

This paper deals a bit with the fallout from such an issue: https://www.biorxiv.org/content/10.1101/2023.07.28.550993v1.full

The TCGA read data were analyzed with the Kraken program (11), a very fast algorithm that assigns reads to a taxon using exact matches of 31 basepairs (bp) or longer. The Kraken program is highly accurate, but it depends critically on the database of genomes to which it compares each read. Poore et al. used a database containing 59,974 microbial genomes, of which 5,503 were viruses and 54,471 were bacteria or archaea, including many draft genomes. Notably, their Kraken database did not include the human genome, nor did it include common vector sequences. This dramatically increased the odds for human DNA sequences present in the TCGA reads to be falsely reported as matching microbial genomes. This problem can be mitigated by including the human genome and by using only complete bacterial genomes in the Kraken database.

Kraken2 likes to assign as many reads as possible, and it will go to close relatives if 'your' species is not in the reference database. You might have some other Enterobacteriaceae in your sample that are not in your database, so Kraken2 assigns these reads to Klebsiella, with Klebsiella being the 'next best thing' in the database. Bracken might help you: it's designed to recalculate read counts for hits, so you might get lucky and Bracken removes most of the reads from your Klebsiella hits, letting you drop Klebsiella. Otherwise you might need to expand your reference database, if you can, or check the specific Klebsiella entries in your reference database: do they come from a few incomplete Klebsiella genomes that you could remove?
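To look at the Klebsiella reads themselves, one option is to pull their IDs out of the per-read Kraken2 output (the file written with --output). This is only a sketch: it assumes the default tab-separated layout (status, read ID, taxid, length, k-mer string) without --use-names, and the function name, read names and taxid "570" are placeholders, not values from your database.

```python
def reads_assigned_to(kraken_lines, taxid):
    """Return the IDs of classified reads whose assigned taxid equals `taxid`."""
    ids = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 1 is "C" (classified) or "U" (unclassified).
        if len(fields) < 3 or fields[0] != "C":
            continue
        if fields[2] == taxid:   # column 3 holds the assigned taxid
            ids.append(fields[1])
    return ids


# Made-up lines in the per-read output layout.
demo = [
    "C\tread_1\t570\t150\t570:6 543:20 0:4",
    "U\tread_2\t0\t150\t0:116",
    "C\tread_3\t1279\t150\t1279:90 0:26",
    "C\tread_4\t570\t150\t570:80 0:36",
]
print(reads_assigned_to(demo, "570"))
```

With a real run you would pass an open file handle instead of the `demo` list, then use the returned IDs to extract those reads and compare them directly against the reference sequences.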

1

Hi Philipp,

Thanks for your feedback. I found that the 0.1 output includes most of the species the authors claim to be in their mock dataset, but it also includes some that are not supposed to be there. On the other hand, 0.2 gives fewer taxa with fewer correct calls. In both cases I get Klebsiella. I understand the problem you mention, but there are two things that I can't fully wrap my head around:

1) According to the authors there should be no enterobacteria in the dataset; the species listed there would be better classified as other taxa in the absence of true hits in the database.

2) I've used this database with another approach (dada2) on 16S sequences and it classifies multiple enterobacteria properly, so I don't see why in this case I'm getting Klebsiella false positives on very different samples.

The database is supposed to be pretty well curated and complete (I'm talking about the SILVA 138 Ref NR99). I'm running the analysis again using a --confidence of 0.15 to see if something better comes out.

ADD REPLY • link 7 months ago by SushiRoll ▴ 120

1

The false-positive Klebsiella hits you're seeing, are they with several distinct Klebsiella genomes/isolates or are they with just a few? If you have contamination in the reference Klebsiella genome as the explanation then you should see hits with only one genome, not all of them. In that case the solution would be to remove that contaminated Klebsiella genome.
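One rough way to check this, assuming you have aligned the Klebsiella-labelled reads against the reference sequences with BLAST in tabular format (-outfmt 6, where the first two columns are the read ID and the reference ID), is to count how many reads have their top hit on each reference entry. The function name and the accessions below are invented for illustration, and the sketch assumes the first line listed for each read is its best hit.

```python
from collections import Counter


def best_hit_counts(blast_lines):
    """Count reads per reference ID, using only the first hit listed for each read."""
    seen = set()
    counts = Counter()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        read_id, ref_id = fields[0], fields[1]
        if read_id in seen:   # later lines for the same read are lower-ranked hits
            continue
        seen.add(read_id)
        counts[ref_id] += 1
    return counts


# Made-up tabular hits: read ID, reference ID, percent identity (remaining columns omitted).
demo = [
    "read_1\tREF_A\t99.3",
    "read_1\tREF_B\t97.0",
    "read_2\tREF_A\t98.7",
    "read_3\tREF_A\t99.0",
    "read_4\tREF_C\t96.5",
]
counts = best_hit_counts(demo)
top_ref, top_n = counts.most_common(1)[0]
print(counts)
print(top_ref, top_n / sum(counts.values()))
```

If nearly all reads pile up on one or two reference entries, that points towards a problematic reference sequence; hits spread over many entries point more towards a related organism that is missing from the database.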

SILVA is reported to have some issues with species-level curation:

For example, the higher proportion of species-level classifications with SILVA is encouraging, but does not necessarily indicate that this is a better result—indeed, the simulated classification results (Fig 3D) suggest that the species-level classifications achieved with SILVA have lower accuracy than the other database

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009581

That paper goes a bit into SILVA curation. There's also a QIIME2 tutorial, https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/15494, that might help remove your false positives.

ADD REPLY • link 7 months ago by Philipp Bayer 8.5k

0

Hi Philipp,

Thanks for the suggestions. As to which genomes/isolates are giving the hits, I'm pretty unsure, since the classifier only goes to the genus level. Even if that weren't the case, I would set it that way, since I'm aware that species-level classification is sometimes flaky. I'll try the QIIME2 reformatting and see if that helps. Many thanks again!

ADD REPLY • link 7 months ago by SushiRoll ▴ 120