Question

Capping the number of reads that align to a single k-mer

0

Entering edit mode

6.6 years ago

Morgan Ivy • 0

I am analyzing NGS (WGS) data, and I am looking for possible ways to reduce the level of "noise" and effectively filter false positives and low level DNA contamination. What I have noticed on alignment and sequence assembly is that small DNA fragments seem to be "over-amplified" and are skewing the overall read-count and identification of the true microbial community. I would like to add a setting to the script that caps the number of reads that may align to a single k-mer, with both values (# of reads and length of k) user manipulative. I'm not sure if there is a feature of BBMap that allows me to do this or one that I am not aware of. Any input is greatly appreciated.

sequencing alignment • 1.2k views

ADD COMMENT • link 6.6 years ago by Morgan Ivy • 0

2

Entering edit mode

I'm not sure if this addresses your issue, but you could do this:

1) Gather a set of kmers that are problematic. If this is because the count is too high, you can do so with kmercountexact.sh, which will print all kmers occurring in your dataset above some count threshold.

2) Filter out the reads containing those kmers with bbduk.sh, or else mask them in the reference so nothing will map to them.

Alternatively you can normalize the data to strictly downsample reads with an apparent depth above a certain value, based on kmer-counts, without actually getting rid of all of them. For example, it could leave reads with depth up to 100x untouched, but reduce things with 2000x coverage down to 100x (e.g. bbnorm.sh in=reads.fq out=normalized.fq target=100 mindepth=0)

Both of these approaches are kind of brute-force-ish; it would be better if you could identify specific artifacts and remove them instead. Could this be synthetic sequence, or a virus, or ribosomal DNA?

ADD REPLY • link 6.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks so much for the suggestions.

I will look into both of these - I am familiar with kmercountexact.sh, but I have not yet attempted to filter them.

I do not believe these are synthetic sequences or viruses - though I have yet to pull the reads and manually BLAST them. Adapter trimming has been applied and the taxonomic ID software and database I am currently using appear to sufficiently call synthetic construct and viruses as well as estimated %rRNA. I will revisit, though.

Again, thank you!

ADD REPLY • link 6.6 years ago by Morgan Ivy • 0