Question

Extracting contaminated reads from the sequenced data

5.3 years ago

manjumoorthy95 ▴ 60

I have sequencing data from a single organism, but it contains three 16S rRNA sequences that belong to three different organisms, so I suspect the sample is contaminated. How can I extract the contigs belonging to each organism present in the sequence data?

genome sequencing extraction illumina • 1.8k views

ADD COMMENT • link updated 5.3 years ago by GenoMax 141k • written 5.3 years ago by manjumoorthy95 ▴ 60

Hello,

bbduk might help you. From the web manual:

bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq ref=phix.fa k=31 hdist=1 stats=stats.txt

This will remove all reads that have a 31-mer match to PhiX (a common Illumina spike-in, which is included in /bbmap/resources/), allowing one mismatch. The “outm” stream will catch reads that matched a reference k-mer. This allows you to split a set of reads based on the presence of something. “stats” will produce a report of which contaminant sequences were seen, and how many reads had them.

In the ref parameter you can define more than one reference. Have a look at bbduk.sh --help for more options.
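As a rough sketch of the multi-reference case for paired reads (all file names here are hypothetical placeholders, and the in2/out2/outm2 flags are as I recall them from the BBDuk help, so please check bbduk.sh --help before relying on them):

```shell
# Hypothetical file names; replace with your own reads and references.
refs="organismA.fa,organismB.fa"   # comma-separated list passed to ref=

# Build the command as a plain string so it can be inspected first.
cmd="bbduk.sh in=reads_R1.fq in2=reads_R2.fq out=unmatched_R1.fq out2=unmatched_R2.fq outm=matched_R1.fq outm2=matched_R2.fq ref=$refs k=31 hdist=1 stats=stats.txt"
echo "$cmd"

# Run only if BBTools is on PATH and the (hypothetical) input exists.
if command -v bbduk.sh >/dev/null 2>&1 && [ -f reads_R1.fq ]; then
    $cmd
fi
```

Note that this only separates matched from unmatched reads. It does not say which of the references a read matched, apart from the summary in stats.txt.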

fin swimmer

ADD REPLY • link 5.3 years ago by finswimmer 16k

Thanks a lot. Let me check and let you know.

ADD REPLY • link 5.3 years ago by manjumoorthy95 ▴ 60

Hi Finswimmer,

I observed that increasing the k-mer value decreases the number of matched reads. What is the ideal k-mer size for paired reads of length 150 bp? How does the interpretation of the matched reads change with k-mer size?

ADD REPLY • link 5.3 years ago by kspata ▴ 80

Because the k= value is used to find the initial match, if you set it too high the BBMap tools will find fewer initial matches, or none at all. So no surprise there. Generally, setting k= to something between 20 and 30 is fine for most applications. Smaller values require more memory.
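To see why a larger k gives fewer matches, here is a toy illustration. It is not how BBDuk is implemented (it ignores reverse complements and hdist), and the two sequences and the count_kmer_hits helper are made up for the example. It counts how many k-mers of a read occur exactly in a reference when the read carries a single mismatch:

```shell
# count_kmer_hits READ REF K
# Prints how many K-mers of READ occur exactly somewhere in REF.
count_kmer_hits() {
    awk -v r="$1" -v g="$2" -v k="$3" 'BEGIN {
        hits = 0
        # Slide a window of width k along the read.
        for (i = 1; i + k - 1 <= length(r); i++)
            if (index(g, substr(r, i, k)) > 0) hits++
        print hits
    }'
}

ref="ACGTTGCATGCCAGTACGGA"    # made-up 20 bp reference
read="ACGTTGCATTCCAGTACGGA"   # same sequence with one mismatch at position 10

echo "k=5:  $(count_kmer_hits "$read" "$ref" 5)"    # prints "k=5:  11"
echo "k=11: $(count_kmer_hits "$read" "$ref" 11)"   # prints "k=11: 0"
```

With k=5 only the five windows overlapping the mismatch fail, so 11 of 16 k-mers still hit. With k=11 every window of this short read overlaps the mismatch, so nothing hits. Real 150 bp reads behave the same way on a larger scale: the longer the k-mer, the more likely each one contains a sequencing error or a true variant.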

ADD REPLY • link 5.3 years ago by GenoMax 141k

Thank you, this too worked for me.

ADD REPLY • link 5.3 years ago by manjumoorthy95 ▴ 60

score 2 · Accepted Answer · 2019-01-11

5.3 years ago

GenoMax 141k

If you truly feel that there are three organisms, you can use bbsplit.sh (from the BBMap suite) to bin your reads into their respective organismal pools. This will generally work well as long as the bacteria are distinct enough. You can decide what to do with reads that multi-map (map to all three reference genomes), e.g. keep them in all bins, toss them, etc.

Use the answer here and ask if you have any questions: A: Tool to separate human and mouse rna seq reads

Since you have bacterial data you could turn off maxindel=0.
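A minimal sketch of what such a bbsplit.sh run could look like for three references. The file names are hypothetical, and the basename, outu1/outu2 and ambiguous2 flags are as I recall them from the BBSplit help, so verify with bbsplit.sh --help:

```shell
# Hypothetical reference FASTA files, one per suspected organism.
refs="organismA.fa,organismB.fa,organismC.fa"

# % is replaced by the reference name and # by 1 or 2 for the read pair.
# ambiguous2=toss drops reads that map equally well to several references;
# ambiguous2=all would instead write them to every matching bin.
cmd="bbsplit.sh in=reads_R1.fq in2=reads_R2.fq ref=$refs basename=out_%_#.fq outu1=unbinned_R1.fq outu2=unbinned_R2.fq ambiguous2=toss"
echo "$cmd"

# Run only if BBTools is on PATH and the (hypothetical) input exists.
if command -v bbsplit.sh >/dev/null 2>&1 && [ -f reads_R1.fq ]; then
    $cmd
fi
```

Each bin can then be assembled separately, which tends to be cleaner than trying to sort contigs from a mixed assembly afterwards.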

ADD COMMENT • link 5.3 years ago by GenoMax 141k

Yes, thank you. I was able to separate the reads mapping to each reference genome, but I still have one doubt. When I use Bowtie or Bowtie2, the paired-end reads from my data do not map to the reference genome, even though the 16S sequence of that reference genome is present in my data. Why could that happen?

ADD REPLY • link 5.3 years ago by manjumoorthy95 ▴ 60