Question

Analysis of sequencing duplicates

0

6.4 years ago

DVA ▴ 630

I wonder whether there is existing software that analyzes duplicates in an NGS bam file. I understand that Picard and samtools can both mark/remove duplicates, but what if I want to count the frequency of each duplicated fragment?

I could do some string comparison in Python and group the duplicates (or their aligned locations) into dictionaries, but I am not sure this is optimal. Thank you in advance for reading.
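A minimal sketch of the dictionary approach described above, assuming the alignments are available as SAM text lines (for example from `samtools view in.bam`). Grouping by reference name, leftmost position and strand is a simplifying assumption made here for illustration; dedicated duplicate-marking tools use stricter criteria. The names `count_by_position` and `duplicated_only` are hypothetical.

```python
from collections import Counter

def count_by_position(sam_lines):
    """Tally mapped reads per (reference, leftmost position, strand)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:                # unmapped read: no location to group by
            continue
        strand = "-" if flag & 0x10 else "+"
        counts[(fields[2], int(fields[3]), strand)] += 1
    return counts

def duplicated_only(counts):
    """Keep only the locations seen more than once."""
    return {key: n for key, n in counts.items() if n > 1}

# Toy records (made up for illustration, not real data)
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tchr2\t50\t60\t4M\t*\t0\t0\tTTTT\tIIII",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\tGGGG\tIIII",
]
example_counts = count_by_position(example)
```

Passing `example_counts` to `duplicated_only` leaves just the locations shared by two or more reads, which is the frequency table the question asks about.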

duplicate • 2.1k views

ADD COMMENT • link updated 6.4 years ago by GenoMax 141k • written 6.4 years ago by DVA ▴ 630

1

Maybe dupRadar?

ADD REPLY • link 6.4 years ago by h.mon 35k

0

but what if I want to count the frequency of each duplicated region or fragment?

just use samtools view -f 1024 -c in.bam "chr1:234-567" ?
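For context, `-f 1024` keeps only reads whose FLAG has the duplicate bit (0x400) set, and `-c` prints the count instead of the records, so the command only helps once a tool such as Picard or samtools has marked the duplicates. A rough Python equivalent of that bit test on SAM text lines might look like the sketch below; the function names are hypothetical.

```python
DUPLICATE_FLAG = 0x400  # decimal 1024: "PCR or optical duplicate" in the SAM spec

def is_marked_duplicate(flag):
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)

def count_marked_duplicates(sam_lines):
    """Count alignment records whose FLAG carries the duplicate bit."""
    total = 0
    for line in sam_lines:
        if line.startswith("@"):   # skip SAM header lines
            continue
        flag = int(line.split("\t")[1])
        if is_marked_duplicate(flag):
            total += 1
    return total
```

Because the test is a bitwise AND, a reverse-strand duplicate (FLAG 1040 = 1024 + 16) is counted as well.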

ADD REPLY • link 6.4 years ago by Pierre Lindenbaum 161k

0

Sorry, I guess I was not very clear in the post. I want to count the frequency of the duplicated reads, not regions. I do not know which regions to look at yet, unless I scan through the bam file first. Thank you so much for the reply.

ADD REPLY • link 6.4 years ago by DVA ▴ 630

0

~~so what about sorting the reads by names and counting the number of duplicate for the same name ? (I think this data is provided by picard)~~

ADD REPLY • link 6.4 years ago by Pierre Lindenbaum 161k

0

Thank you very much for the reply. I'm not sure what you mean by "name" - I thought every read has a unique name at the very start (starting with @)?

ADD REPLY • link 6.4 years ago by DVA ▴ 630

0

ah yes, sorry, I was wrong.

ADD REPLY • link 6.4 years ago by Pierre Lindenbaum 161k

score 1 · Answer 1 · 2017-12-06

1

6.4 years ago

GenoMax 141k

You can use clumpify for this if you have the original read data available (or you could convert the bam back to fastq using reformat.sh from the BBMap suite). Use the addcount option to have the copy counts added to the fastq headers.

ADD COMMENT • link 6.4 years ago by GenoMax 141k

0

Thank you for the reply. I'm trying to read more about "addcount" in clumpify, but could not find it. If you are one of the authors, do you know whether this option looks for perfect matches among the reads, or does it allow a small number of mismatches?

ADD REPLY • link 6.4 years ago by DVA ▴ 630

0

By default clumpify allows two errors/substitutions in the reads when doing "clumping". You can use dupesubs=0 to allow only perfect matches. addcount appends a copies=N message to the fastq header. BTW: I am a user of the BBMap suite, not an author. That would be @Brian Bushnell. If you just run clumpify.sh on the command line you will see the extensive in-line help.
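Once the headers carry a copies=N message, the per-read frequencies can be pulled out with a few lines of Python. This is only a sketch under stated assumptions: standard 4-line fastq records, the count appearing literally as `copies=N` somewhere in the header line, and a read without the tag being treated as a single copy. The exact header layout should be checked against real clumpify output; `copy_counts` and `copies_histogram` are hypothetical names.

```python
import re
from collections import Counter

# Assumed pattern for the count that addcount writes into the header
COPIES_RE = re.compile(r"copies=(\d+)")

def copy_counts(fastq_lines):
    """Yield (read_name, copies) for each 4-line fastq record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:               # only the first line of a record is a header
            continue
        header = line.rstrip("\n")
        match = COPIES_RE.search(header)
        copies = int(match.group(1)) if match else 1  # assumption: no tag means 1 copy
        name = header[1:].split()[0]  # drop the leading "@", keep the first token
        yield name, copies

def copies_histogram(fastq_lines):
    """Map each copy number to how many reads have it."""
    return Counter(copies for _, copies in copy_counts(fastq_lines))
```

The histogram gives a quick overview (how many reads were seen once, twice, and so on), while `copy_counts` keeps the per-read detail.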

ADD REPLY • link 6.4 years ago by GenoMax 141k