Hello,
I would like to know whether there are any tools available to find the exact percentage of duplication in FastQ files.
Currently I am using FastQC. However, FastQC only gives an estimate. From the FastQC manual:
To cut down on the memory requirements for this module only sequences which occur in the first 200,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file.
If you have any tools in mind, I would highly appreciate hearing about them.
Thanks in advance!
You can try CD-HIT and use its summary statistics. Also try the seqkit rmdup function with the -D option.
PicardTools? There is the MarkDuplicates tool, which marks duplicate reads; I am not sure, but it could very well write out a summary of the number of duplicates found. EDIT: not a valid approach here, as it works on aligned BAM files, as pointed out below.
Thanks for the answer, but MarkDuplicates takes BAM or SAM files as input, not FastQ.
It should be possible to convert the FastQ file into an unaligned SAM or BAM if the alignment information itself is not used by Picard.
Picard uses the alignment info rather than the sequence info to calculate duplication.
Yep, I checked it as well ... so scratch that from the possible approaches in this post :)
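If a sequence-based count over the whole file is what is wanted, a short script can also do it without any sampling. The following is only a minimal sketch, not what FastQC or seqkit do internally. It assumes a standard four-line-per-record FASTQ (optionally gzip-compressed), treats two reads as duplicates only when their sequences are exactly identical, and defines the duplication percentage as the share of reads beyond the first copy of each sequence. The function names are made up for illustration. Note that it keeps every distinct sequence in memory, which is exactly the cost FastQC avoids by sampling.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def duplication_stats(path):
    """Return (total reads, distinct sequences, percent duplicated reads).

    Assumes four lines per record; the second line of each record is
    taken as the read sequence. Exact string identity defines a duplicate.
    """
    counts = Counter()
    with open_fastq(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # sequence line of the current record
                counts[line.strip()] += 1

    total = sum(counts.values())
    unique = len(counts)
    # Reads beyond the first copy of each sequence count as duplicates.
    pct_dup = 100.0 * (total - unique) / total if total else 0.0
    return total, unique, pct_dup
```

Calling duplication_stats on a file path (for example a hypothetical reads.fastq.gz) returns the three values as a tuple. For very large files, memory use could be reduced by storing a hash of each sequence instead of the sequence itself, at the price of a small collision risk.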