Remove duplicates from reads: best practices?
7.3 years ago

Hey there,

I am curious about the deduplication step when processing sequencing reads. I have done it a handful of times and it has always helped in the end, but I am aware there is a debate about whether it is actually biologically correct.

What I usually do is map the reads, get the BAM file, and run Picard MarkDuplicates on it.

I have three questions:

  1. How do people deduplicate by mapping position using a PSL file?
  2. When would you say that deduplication is too risky?
  3. I have developed a tool (though some already exist) to remove duplicates by sequence identity. Without going into the details of the algorithm, I can tell you that the overlap between the reads removed by Picard and by my script is 99% (not 100%, though; some reads differ). Is this approach theoretically sound?
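Regarding question 1, here is a minimal sketch of position-based deduplication from a PSL file. It assumes the standard 21-column PSL layout (strand in column 9, query name in 10, target name in 14, target start in 16) and a simple keep-first policy; real pipelines typically keep the highest-quality copy instead:

```python
# Toy position-based deduplication from PSL alignments (illustrative only):
# keep one read per (target, start, strand) key, first occurrence wins.
def dedup_psl(lines):
    seen = set()
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21:          # skip PSL headers / malformed rows
            continue
        strand, qname = fields[8], fields[9]
        tname, tstart = fields[13], fields[15]
        key = (tname, int(tstart), strand)
        if key not in seen:           # first read seen at this position
            seen.add(key)
            kept.append(qname)
    return kept
```

Paired-end data would additionally need the mate's position in the key, which this sketch omits.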
RNA-Seq sequence alignment deduplication next-gen
7.3 years ago
GenoMax 141k

It depends on what kind of sequencer your data comes from. There is a known issue where one may end up with optical duplicates on patterned flowcells (HiSeq 3000/4000), depending on insert size and loading concentration. Even if you don't remove them, you want to know how many are present, since they are artifacts of the technology.

@Brian recently added mark/remove-duplicate functionality to a tool from the BBMap suite called clumpify.sh. The advantage is that one does not need to align the data (to an external reference) to look for, mark, or remove duplicates of all types, and you can allow for errors in a controlled way.
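To illustrate the general idea of alignment-free duplicate detection with an error allowance, here is a toy sketch (this is not clumpify.sh's actual algorithm; the seed length and mismatch threshold are made-up parameters): reads are bucketed by a central seed substring, and within a bucket a read within `subs` mismatches of an earlier read is marked as a duplicate.

```python
# Toy alignment-free duplicate marking with a mismatch allowance.
# Limitation: a mismatch falling inside the seed itself splits the bucket,
# so such duplicates are missed; real tools handle this more carefully.
def mark_duplicates(reads, seed_len=8, subs=1):
    def hamming(a, b):
        if len(a) != len(b):
            return max(len(a), len(b))
        return sum(x != y for x, y in zip(a, b))

    buckets = {}                          # central seed -> representative indices
    duplicate = [False] * len(reads)
    for i, read in enumerate(reads):
        mid = len(read) // 2
        seed = read[mid - seed_len // 2 : mid + seed_len // 2]
        reps = buckets.setdefault(seed, [])
        if any(hamming(read, reads[j]) <= subs for j in reps):
            duplicate[i] = True           # near-identical to an earlier read
        else:
            reps.append(i)                # new representative for this bucket
    return duplicate
```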


To add to this:

I recommend optical duplicate removal for all HiSeq platforms, for any kind of project in which you expect high library complexity (such as WGS). By optical duplicate, I mean removal of duplicates with very close coordinates on the flow cell. And by duplicate removal, I mean removing all duplicate copies except one. Whether you should remove non-positionally-correlated duplicates (such as PCR duplicates) is more experiment-specific. And whether you should do any form of duplicate removal on low-complexity libraries is also experiment-specific, as you'll get false positives even when restricting duplicate detection to nearby clusters.
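The "very close coordinates on the flow cell" criterion can be sketched as follows. This toy version assumes the common Illumina read-name layout `instrument:run:flowcell:lane:tile:x:y` and uses an illustrative pixel threshold (2500 is the value often suggested for patterned flowcells, but check your tool's documentation):

```python
import math

# Toy optical-duplicate check: among reads with identical sequence, flag
# those on the same lane and tile within max_dist pixels of an earlier read.
def optical_duplicates(records, max_dist=2500):
    """records: list of (read_name, sequence); returns names flagged optical."""
    by_seq = {}
    flagged = []
    for name, seq in records:
        lane, tile, x, y = name.split(":")[3:7]
        loc = (lane, tile, float(x), float(y))
        for kept in by_seq.setdefault(seq, []):
            close = math.hypot(kept[2] - loc[2], kept[3] - loc[3]) <= max_dist
            if kept[:2] == loc[:2] and close:
                flagged.append(name)      # close neighbour with same sequence
                break
        else:
            by_seq[seq].append(loc)       # no nearby copy seen; keep this one
    return flagged
```

A real implementation compares aligned reads rather than raw sequences, but the coordinate logic is the same.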

  1. May I ask for a link where I can read up on this known flowcell issue?
  2. If it doesn't use alignment to identify duplicates, what does it use? My script uses sequence identity from the centre of the read.

http://core-genomics.blogspot.com/2016/05/increased-read-duplication-on-patterned.html and this recent thread: Duplicates on Illumina

I meant alignment to an external reference (corrected above). The mark-duplicates functionality is an extension of the Clumpify algorithm (details are in this thread: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.), which identifies similar sequences in a file. Those are rearranged to sit near each other, which makes compression of the data files more efficient, saving ~25% of the space. Optical duplicates are marked by taking into account the x,y coordinates of the read clusters and their positional neighbourhood.
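The compression benefit of grouping similar reads can be demonstrated with a small experiment (a toy illustration of the principle only, not Clumpify's method; sizes depend on the data): duplicates scattered farther apart than zlib's 32 KB window are missed by the compressor, while adjacent duplicates compress almost for free.

```python
import random
import zlib

# Toy demo: clustering duplicate reads next to each other improves
# gzip/zlib compression (the principle behind Clumpify's reordering).
random.seed(0)
templates = ["".join(random.choice("ACGT") for _ in range(100))
             for _ in range(2000)]
reads = templates * 2                  # each read has exactly one duplicate
random.shuffle(reads)                  # duplicates scattered across the file

shuffled_blob = "\n".join(reads).encode()
clustered_blob = "\n".join(sorted(reads)).encode()  # duplicates now adjacent

shuffled_size = len(zlib.compress(shuffled_blob))
clustered_size = len(zlib.compress(clustered_blob))
# shuffled_size > clustered_size here, because most duplicate pairs in the
# shuffled file are farther apart than zlib's 32 KB match window.
```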


Thank you very much!

7.3 years ago
agata88 ▴ 870

In my opinion the best practice is NOT to remove duplicates from reads. I've done a lot of comparisons in which some information was lost because I had performed the MarkDuplicates step. In the end it turned out that without this step all of the variants were detected and Sanger-confirmed.


I know, and I partially agree, but in genome-wide studies you cannot verify every variant by Sanger (we're talking about millions), so you accept losing some data for the sake of noise reduction, in order to get a list of variants that may be incomplete but that you really, really trust. That's simply why I'm doing it right now, but I'm aware of the drawbacks.


I was just saying that my comparison was confirmed by Sanger. Of course you won't check every variant in genome-wide studies; the only option there is to compare against known variants for the data. I don't see what it is all for... if I get good results without removing reads.


Yes, I always try without it first, so I agree with you. In this case there was just too much noise, and deduplication actually helped a lot.
