Remove specific number of identical reads from fastq or bam files
5.5 years ago

Hi, I am currently working with ChIP-seq data generated from a very low number of cells. The data look good so far, but I noticed that some loci were heavily amplified during library preparation, which I guess is a consequence of working with low amounts of material. I am now looking for a tool to cap the number of identical reads per locus at, for example, 3 (e.g. if I have 10 identical reads, 7 will be removed and 3 remain). As far as I have read, both Picard tools and samtools remove duplicates in an all-or-nothing manner. Does anybody have a handy solution for me? (I am a biologist :p)

Thanks, Flo

ChIP-Seq duplicates • 1.0k views

I am not immediately aware of such a tool. What is special about the requirement of leaving three instead of just one? You could use clumpify.sh (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."), which has an option to add a count field to the fastq header after deduplicating the data; you could use that to keep track of how many duplicates there were originally.


Because it's ChIP-seq, and I would expect to have some duplicates simply because we reduce genomic complexity drastically, especially when using just a few cells (with additional loss of complexity from losing some DNA fragments after shearing). I am not sure yet which exact number I will allow; it's just to play around a bit, but removing all of them may be too harsh in my case.
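
For experimenting with different caps, the idea from the question (keep at most N reads per locus, drop the rest) could be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not an existing tool: it assumes single-end reads given as SAM text lines (for example the output of `samtools view -h`), and it treats reads as identical when they share reference name, leftmost mapping position and strand. That key is a simplification; it ignores soft clipping and mate positions, so paired-end data would need a different key. The function name `cap_duplicates` is made up for illustration.

```python
from collections import Counter


def cap_duplicates(sam_lines, max_copies=3):
    """Yield SAM lines, keeping at most `max_copies` reads per locus.

    Assumption: a "locus" is (reference name, leftmost position, strand).
    Header lines and unmapped reads are passed through unchanged.
    """
    seen = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # SAM header line
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:                    # unmapped read: no locus to compare
            yield line
            continue
        strand = "-" if flag & 0x10 else "+"
        key = (fields[2], int(fields[3]), strand)
        seen[key] += 1
        if seen[key] <= max_copies:       # keep the first N, drop the rest
            yield line


# Toy example matching the question: 10 identical reads, cap of 3.
reads = [f"read{i}\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n" for i in range(10)]
kept = list(cap_duplicates(reads, max_copies=3))
print(len(kept))  # 3
```

Because the function works on any iterable of lines, it could in principle be fed from standard input and written to standard output, sitting between `samtools view -h in.bam` and `samtools view -b -o out.bam -`. Note that it keeps the first N reads it encounters at each locus rather than the best-quality ones.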


prinseq can remove duplicated sequences. If you have high levels of read duplication you may consider removing them; if not, I think that using arbitrary filters may strongly bias the analysis.
