Question

Rna-Seq Contaminants Assembling Likelihood


13.1 years ago

Lythimus ▴ 210

Assuming contaminants are introduced into my sample (say, some of my skin sloughs off) and I use RNA-Seq to sequence my samples, if I isolated those reads via taxonomic binning, what is the likelihood they would assemble using a de novo assembler such as ABySS or Velvet?

Similarly, if reads are mis-binned into an incorrect taxonomic group by a similarity-based classifier such as MEGAN using BLASTN, what is the likelihood that those isolated reads assemble?

Obviously I'm not looking for exact numbers; I'm just not aware of how likely it is to get the kind of coverage and similarity necessary for these things to assemble.

assembly blast • 3.3k views

updated 13.1 years ago by Michael 54k • written 13.1 years ago by Lythimus ▴ 210

Ram · Answer 1 · 2011-04-06

No offense, but if you contaminate your samples with your own DNA by making a stupid mistake, you deserve to be punished ;) In that case you had better repeat the experiment. But seriously: in this question I found a really good cause of contamination that is unavoidable. The conclusion: it all depends on the sequence similarity between the sample and the contaminant.

I can't give you exact figures, but if you contaminate a sample with material from a closely related species, the contamination will be hard to filter out prior to assembly. On the other hand, a bacterial sample might be easy to clean of human DNA/RNA contamination.

You ask whether the contaminant reads will assemble, but that is maybe not the main concern. Of course they will assemble into contigs if you leave them in the mix and they have sufficiently large regions of overlap, which is likely. In that case you will be able to detect them easily, because the assembled contigs match the contaminant genome.
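To get a rough feel for "sufficiently large regions of overlap", here is a minimal sketch based on the classic Lander-Waterman model. It assumes reads are sampled uniformly along a single target sequence, which is a strong simplification for RNA-Seq, where coverage of a transcript depends on its expression level. The function names and the example numbers are made up for illustration and are not taken from ABySS or Velvet.

```python
import math

def expected_coverage(n_reads, read_len, target_len):
    """Mean depth c = N * L / G for N reads of length L on a target of length G."""
    return n_reads * read_len / target_len

def fraction_covered(coverage):
    """Expected fraction of target bases hit by at least one read: 1 - e^(-c)."""
    return 1.0 - math.exp(-coverage)

def expected_contigs(n_reads, read_len, target_len, min_overlap):
    """Lander-Waterman expected number of contigs: N * e^(-c * sigma),
    where sigma = 1 - T / L and T is the minimum overlap needed to join reads."""
    c = expected_coverage(n_reads, read_len, target_len)
    sigma = 1.0 - min_overlap / read_len
    return n_reads * math.exp(-c * sigma)

# Hypothetical example: 100 contaminant reads of 100 bp from a 2 kb transcript.
c = expected_coverage(100, 100, 2000)
print(c, fraction_covered(c), expected_contigs(100, 100, 2000, 20))
```

Under this model, a contaminant transcript with only a handful of reads (depth well below 1) stays fragmented, while a highly expressed one at 5x or more is mostly covered and collapses into a few contigs.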

The real problem will be 'chimeric' contigs which contain a mixed sequence of sample and contaminant. The higher the nucleotide sequence identity of regions in the genomes of sample and contaminant, the higher the probability of chimeric assemblies. Those will be harder to detect.

For chimeric contigs to form, I would guess that the sequence similarity must be very high, such that the differences between sample and contaminant become indistinguishable from sequencing errors. The sequencing error rate depends on the technology, and thus I would estimate that the assembly becomes problematic once the local sequence divergence between contaminant and sample drops to about the same level as the error rate expected from sequencing accuracy alone.
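That rule of thumb can be written down as a back-of-the-envelope check. The sketch below assumes mismatches are independent and evenly spread, which real genomes do not obey (conserved regions cluster), and it assumes a de Bruijn graph assembler that joins reads on exact k-mers. The identity, error rate, and k values are illustrative assumptions, not measured figures.

```python
def expected_mismatches(read_len, rate):
    """Expected mismatches in a read for a given per-base mismatch rate."""
    return read_len * rate

def shared_kmer_probability(identity, k):
    """Chance that a k-mer is identical in sample and contaminant,
    assuming independent per-base identity."""
    return identity ** k

def divergence_resembles_error(identity, error_rate):
    """True when per-base divergence is no larger than the sequencing error rate."""
    return (1.0 - identity) <= error_rate

# Hypothetical values: 1% sequencing error, k = 31, 100 bp reads.
for identity in (0.90, 0.97, 0.995):
    print(
        identity,
        expected_mismatches(100, 1.0 - identity),
        shared_kmer_probability(identity, 31),
        divergence_resembles_error(identity, 0.01),
    )
```

The closer the local identity gets to 1, the more exact k-mers the two sources share, and shared k-mers are the points where reads from both sources can be joined into one contig.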

Did this answer your question?