Question

Removing Duplicates before aligning

0

Entering edit mode

5.8 years ago

newbinf • 0

Hi,

I'm trying to get variants from amplicon-based sequencing reads. These reads have: primer adapters and barcodes on both ends. I'm looking into the GATK pipeline and Samtools/VarScan pipelines.

I was able to remove the primer sequences on both sides using cutadapt.

Next, I aligned my reads using BWA-mem. Then, I removed duplicate reads (to remove PCR duplicates) using SamTools' markdup. However, aligning removed the barcodes on both ends and deduplicating removed most of my reads. I'm looking into Picard's MarkDuplicates, but that also does not seem to be applicable to amplicon-based reads because it's based on the start position of the reads and would delete a majority of my reads.

Is there any way to remove identical sequences for amplicon-based reads? Furthermore, I want the barcode identifiers to remain after aligning. How would I do that?

Thank you!

DNA-Seq next-gen GATK • 4.7k views

ADD COMMENT • link updated 5.8 years ago by gb ★ 2.2k • written 5.8 years ago by newbinf • 0

1

Entering edit mode

Do not remove duplicates with amplicon data, your reads are, by definition, all duplicates. Aligning with BWA should not remove the barcodes, they should have been soft-clipped, but should still be there.

Is there any way to remove identical sequences for amplicon-based reads? Furthermore, I want the barcode identifiers to remain after aligning. How would I do that?

You want to keep one correct read, and lots of reads with errors? Why remove identical reads?

ADD REPLY • link 5.8 years ago by h.mon 35k

0

Entering edit mode

I want to delete reads with the exact same sequences (including barcodes) so that I can eliminate any PCR duplicates. I want to make sure that my future variant analysis is not biased because of PCR duplicates.

ADD REPLY • link 5.8 years ago by newbinf • 0

1

Entering edit mode

You have amplicon-based reads. By definition, all reads you see are PCR duplicates.

ADD REPLY • link 5.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Hello newbinf,

Don't forget to follow up on your threads.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLY • link 5.8 years ago by h.mon 35k

2

Entering edit mode

5.8 years ago

Dattatray Mongad ▴ 370

If you are familiar with QIIME pipeline, it has clustering process in its workflow which clusters all sequences on basis of sequence similarity and then followed by picking representatives from each cluster. If you want to remove duplicates (i.e 100% similar), then use pick_otus.py using usearch method with sequence similarity 100%. After that pick representatives using pick_rep_set.py.

Ideally, your final outut of pick_rep_set.py should contain reads without duplicates.

Good luck I have tried this.

ADD COMMENT • link 5.8 years ago by Dattatray Mongad ▴ 370

score 5 · Accepted Answer · 2018-07-17

5

Entering edit mode

5.8 years ago

gb ★ 2.2k

You should not just delete the same sequences, it is information that other tools can use. You should use USEARCH or VSEARCH dereplication.

USEARCH: https://www.drive5.com/usearch/manual/cmd_fastx_uniques.html

VSEARCH:

vsearch --derep_fulllength [input.fa] --output [output.fa] -sizeout -uc uc_out

With those tools you do "remove" the duplicates. If a sequence has 30 duplicates is keeps one and places a "size:30" in the fasta header. The abundance of these duplicates is usefull. Because you amplify DNA you suppose to have more then one sequence per amplicon. So reads with an abundance of one is most of the time junk.

Specifically in your case you want to check variants. So I would think that you should only use reads that are present at least a X number of times. If you have sequence A and B and A has an abundance of 300 and B of 2. If the difference between A and B is only one base it is probably a sequence error. If you have an abundance of A=300 and B=300 and they only differ one base it is probably a variant.

ADD COMMENT • link 5.8 years ago by gb ★ 2.2k

0

Entering edit mode

Wow, thanks for your answer. I also have some amplicon-based data and I didn't do mark duplicates job because fastp shows the duplication rates is more than 96%. Now I know I should use vsearch to mark duplicates.

ADD REPLY • link 5.8 years ago by MatthewP ★ 1.4k

0

Entering edit mode

Sorry, the comment was meant to newbinf. I moved it to his post now. But it also serves as a tip to you: if you found an answer helpful, and it solved your problem, you may also upvote it.

ADD REPLY • link 5.8 years ago by h.mon 35k

0

Entering edit mode

Hi, I know this is an old post. But what happen if the amplicons (this amplicon is a 312 bp sequence that correspond to the hipervariable V3 sequence of the HIV and the virus use it to enter the cell) are from viral variants and you want to know the abundance of the quasiespecies that use a specific sequence to enter the cell. how can you differenciate a duplicate from a real virus variant? Do I have to remove duplicates from the data? Because I could lose more than 50% of the reads. Can anyone help me? or give some tips to evaluate abundance of a viral population in a amplicon sequencing data thanks in advance

ADD REPLY • link 3.6 years ago by Jonsi • 0

2

Entering edit mode

You need to consider adding unique molecular indexes (UMI) to your amplicons during initial PCR (design primers with UMI). That way you would be tagging individual fragments as they are generated. Then use the UMI's later to identify true duplicates (if any) or use them for analysis you need.

ADD REPLY • link 3.6 years ago by GenoMax 141k

0

Entering edit mode

thanks genomax, that's what I found in one paper recently. I'm really a newbie in bioinformatics, so the problem is how I'm going to analize the data that I have. Do I really have to remove the duplicates? Can I skip that process? Do you think if I'm not remove the duplicates, that would be a big bias for the analysis?

ADD REPLY • link 3.6 years ago by Jonsi • 0

1

Entering edit mode

You already have data in hand? I assume no UMI's then. If you have exact sequence duplicates those are not going to give you any concrete information since you don't know if they are PCR duplicates. You could use one of the tools mentioned above (or clumpify.sh from BBMap suite and keep counts of sequences seen multiple times while keeping one sequence. See: C: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remov ).

ADD REPLY • link 3.6 years ago by GenoMax 141k