Extracting contaminated reads from the sequenced data
0
18 months ago

I have sequencing data for an organism, but it contains three 16S rRNA sequences that belong to three different organisms, so I suspect the sample is contaminated. How can I extract the contigs belonging to each organism present in the data?

1

Hello,

bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq ref=phix.fa k=31 hdist=1 stats=stats.txt

This will remove all reads that have a 31-mer match to PhiX (a common Illumina spike-in, which is included in /bbmap/resources/), allowing one mismatch. The “outm” stream will catch reads that matched a reference k-mer. This allows you to split a set of reads based on the presence of something. “stats” will produce a report of which contaminant sequences were seen, and how many reads had them.


In the ref parameter you can define more than one reference. Have a look at bbduk.sh --help for more options.
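
As a minimal sketch of the multi-reference case: ref= takes a comma-delimited list of files. The file names orgA.fa, orgB.fa and orgC.fa below are placeholders I made up, so swap in your own references. The snippet only prints the command so you can check it before running it.

```shell
# Hypothetical reference files, one per suspected organism (placeholders).
refs="orgA.fa,orgB.fa,orgC.fa"

# Same options as the PhiX example above, but with a comma-delimited ref= list.
cmd="bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq ref=${refs} k=31 hdist=1 stats=stats.txt"

# Print the assembled command for inspection.
echo "$cmd"
# eval "$cmd"   # uncomment to run once the input files exist
```

Note that matched.fq then holds reads matching any of the listed references together; stats.txt reports which reference sequences were hit.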

fin swimmer

0

Thanks a lot. Let me check and let you know.

0

Hi Finswimmer,

I observed that increasing the k-mer value decreases the number of matched reads. What is the ideal k-mer size for paired reads of length 150 bp? How does the interpretation of the matched reads change with k-mer size?

1

Because the k= value is used to find the initial match, setting it too high means the BBMap tools will find fewer initial matches, or none at all. So no surprise there. Generally, setting k= to something between 20 and 30 is fine for most applications. Smaller values require more memory.
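
A rough back-of-the-envelope illustration of why a larger k gives fewer matches, assuming exact k-mer matching (no hdist tolerance) and a single mismatch at position 75 of a 150 bp read. The helper name intact_kmers is made up for this sketch and is not part of BBTools.

```shell
# Count the k-mers of a read that do NOT overlap a single mismatch.
# Arguments: read length, k, 1-based position of the mismatch.
intact_kmers() {
  len=$1; ksize=$2; pos=$3
  total=$(( len - ksize + 1 ))          # number of k-mers in the read
  if [ "$total" -lt 1 ]; then echo 0; return; fi
  first=$(( pos - ksize + 1 ))          # first k-mer start covering the mismatch
  if [ "$first" -lt 1 ]; then first=1; fi
  last=$pos                             # last k-mer start covering the mismatch
  if [ "$last" -gt "$total" ]; then last=$total; fi
  echo $(( total - (last - first + 1) ))
}

# Compare a few k values for a 150 bp read with a mismatch in the middle.
for k in 21 25 31; do
  echo "k=$k total=$(( 150 - k + 1 )) intact=$(intact_kmers 150 "$k" 75)"
done
```

Under these assumptions the read keeps 109, 101 and 89 error-free k-mers for k=21, 25 and 31, so each increase in k leaves fewer chances for an exact hit, and every additional mismatch removes more of them.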

0

Thank you, this too worked for me.

2
24 days ago
genomax 68k
United States

If you truly feel that there are three organisms, then you can use bbsplit.sh (from the BBMap suite) to bin your reads into the respective organismal pools. This will generally work well as long as the bacteria are distinct enough. You can decide what to do with reads that multi-map (map to all three reference genomes), e.g. keep them in all bins, toss them, etc.

Use the answer here and ask if you have any questions: A: Tool to separate human and mouse rna seq reads

Since you have bacterial data, you could turn off the search for long indels by setting maxindel=0.
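
A minimal sketch of what such a bbsplit.sh call might look like for paired reads. The input and reference file names are placeholders I made up, and the choice of ambiguous2=toss is only one option (ambiguous2=all would keep multi-mapping reads in every bin). The snippet prints the command rather than running it.

```shell
# Hypothetical reference genomes, one FASTA per suspected organism (placeholders).
refs="orgA.fa,orgB.fa,orgC.fa"

# How to treat reads that map to more than one reference:
#   toss = discard them, all = write them to every matching bin.
ambig="toss"

# In basename, % is replaced by the reference name and # by the read number (1 or 2).
cmd="bbsplit.sh in1=reads_R1.fq in2=reads_R2.fq ref=${refs} basename=out_%_#.fq outu1=unmapped_R1.fq outu2=unmapped_R2.fq ambiguous2=${ambig} maxindel=0"

# Print the assembled command for inspection.
echo "$cmd"
# eval "$cmd"   # uncomment to run once the input files exist
```

Reads that map to none of the references end up in the outu1/outu2 files, which is worth checking in case there is a fourth, unexpected organism.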