Question

Removing duplicates from 16S rRNA data


9.5 years ago

Bioinfguy ▴ 30

Hello All,

I would like to set up a pipeline to analyse 16S rRNA data, but I am not sure whether to include a duplicate-removal step. The aim of metagenomics here is to compare samples and see how the microbiota differs between conditions/time points.

After aligning, each sequence will match a sequence from a database, which will then be mapped to a bacterium and assigned a taxonomy. The aim is to find how the microbial content changes within each sample. If I remove duplicates, each sequence will be represented once, so the taxonomy for that organism will be assigned only once! I would then be losing data, namely which bacteria are in a sample and the quantity of each one.

I am not sure whether I am understanding this correctly. I would appreciate your comments.

Regards,
Bioinfguy

metagenomics • 3.6k views

updated 3.0 years ago by Ram 43k • written 9.5 years ago by Bioinfguy ▴ 30

Ram · Answer 1 · 2014-10-28


9.5 years ago

marina.v.yurieva ▴ 570

If I removed duplicates, each sequence will be represented once, then taxonomy to this organism will be assigned once!

I think you answered your question :)

Here is the mothur pipeline, http://www.mothur.org/wiki/MiSeq_SOP, where they don't remove duplicates. QIIME doesn't remove them either.

updated 3.0 years ago by Ram 43k • written 9.5 years ago by marina.v.yurieva ▴ 570

Thanks for the reply. However, I found that the mothur pipeline does remove duplicates! If you search for "unique.seqs" in the link you sent, you will find it. I am now confused about why they do this, given that it would make us lose important data!

updated 3.0 years ago by Ram 43k • written 9.5 years ago by Bioinfguy ▴ 30

I don't think so. They still count the total number of reads. They remove the duplicates to make the algorithm run faster, but the counts stay the same. Look at the later steps; they always report these:

# of unique seqs:
total # of seqs:

updated 3.0 years ago by Ram 43k • written 9.5 years ago by marina.v.yurieva ▴ 570

Yes, they do keep the total number of reads in the statistics, but the output fasta file will contain only one representative sequence for all duplicates. This will align only once with the reference gene, right?

updated 3.0 years ago by Ram 43k • written 9.5 years ago by Bioinfguy ▴ 30

Yep. The output fasta has only unique sequences, but the count file keeps track of how many reads each one represents, so at the end the duplicates are still counted.
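
A minimal sketch of that idea, not mothur's actual implementation: identical reads are collapsed to one representative each, while a separate table records how many reads each representative stands for. The function name `dereplicate` and the toy reads are made up for illustration.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads.

    Returns (unique_seqs, counts): unique_seqs keeps first-seen order,
    and counts maps each unique sequence to the number of reads it
    represents.
    """
    counts = Counter(reads)
    unique_seqs = list(counts)  # Counter keeps insertion order (Python 3.7+)
    return unique_seqs, counts


# Toy reads (hypothetical data, not from a real run)
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC", "TTGA"]
unique_seqs, counts = dereplicate(reads)

# Only the unique sequences need to be aligned/classified,
# but the totals are still recoverable from the counts.
print("# of unique seqs:", len(unique_seqs))
print("total # of seqs:", sum(counts.values()))
```

The expensive steps run once per unique sequence, and the abundance information lives in `counts` rather than in repeated fasta records.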

updated 3.0 years ago by Ram 43k • written 9.5 years ago by marina.v.yurieva ▴ 570

So in this case I would have to continue aligning with mothur. I was planning to use my own pipeline, where I take the output fasta file of unique.seqs, BLAST it using a self-made tool, and then proceed with the rest of the pipeline. In that case I have to take the count file into consideration to preserve abundance.
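
One way to carry the counts through a custom pipeline is sketched below, assuming the classifier (BLAST or otherwise) returns one taxon per unique sequence. The dictionaries `taxon_of` and `read_counts`, their IDs, and their values are hypothetical stand-ins for the tool's output and the count file; this does not parse mothur's actual file format.

```python
from collections import defaultdict


def taxon_abundance(taxon_of, read_counts):
    """Sum read counts per taxon.

    Each unique sequence is weighted by the number of reads it
    represents, so abundance is preserved after dereplication.
    Raises KeyError if a classified sequence has no count entry.
    """
    abundance = defaultdict(int)
    for seq_id, taxon in taxon_of.items():
        abundance[taxon] += read_counts[seq_id]
    return dict(abundance)


# Hypothetical classifier output: unique sequence ID -> assigned taxon
taxon_of = {"seq1": "Bacteroides", "seq2": "Lactobacillus", "seq3": "Bacteroides"}
# Hypothetical count table: unique sequence ID -> number of reads
read_counts = {"seq1": 120, "seq2": 30, "seq3": 5}

abundance = taxon_abundance(taxon_of, read_counts)
print(abundance)
```

Without the weighting step, each taxon would be counted once per unique sequence instead of once per read, which is exactly the loss of abundance data raised in the question.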

updated 3.0 years ago by Ram 43k • written 9.5 years ago by Bioinfguy ▴ 30

You don't have to use mothur; that was just an example. You can use QIIME, which has many ready-made pipelines that you can also modify. Keep in mind that clustering and aligning in QIIME are different from mothur. I don't know how big your data is, but BLASTing it might not be such a good idea ;)

updated 3.0 years ago by Ram 43k • written 9.5 years ago by marina.v.yurieva ▴ 570

I have heard about QIIME, but I had problems trying to install it. Is there an easy way to install it other than using virtual machines?

updated 3.0 years ago by Ram 43k • written 9.5 years ago by Bioinfguy ▴ 30

Download "Anaconda" and install QIIME using pip; it's easy.

updated 3.0 years ago by Ram 43k • written 9.5 years ago by Quak ▴ 490