Question about deduplication in a highly repetitive genome
3.4 years ago
varunorama ▴ 80

Hello Biostars,

I have been analyzing WGBS data for our organism, which has a highly repetitive genome. I am using the Bismark pipeline with bowtie2 as the aligner. The pipeline recommends deduplicating WGBS datasets to remove PCR duplicates, so I ran the deduplication step (deduplicate_bismark) and then extracted methylation statistics from both the deduplicated and the non-deduplicated data to see how they differed.

Initial findings show that deduplication removes ~40% of the reads, i.e. these alignments are being flagged as PCR duplicates. The overall coverage of CpG sites is also greatly reduced: from an average of 4x (non-deduplicated) to about 1x (deduplicated).
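For anyone wanting to reproduce this comparison, here is a minimal sketch of how the per-CpG depth can be computed from Bismark coverage output. It assumes the tab-separated six-column `.cov` layout written by the methylation extractor (chrom, start, end, %methylation, count methylated, count unmethylated); the file names in the usage note are hypothetical.

```python
def mean_cpg_coverage(lines):
    """Mean read depth over the CpGs present in a Bismark .cov file.

    Assumed columns (tab-separated):
    chrom  start  end  %methylation  count_methylated  count_unmethylated
    Depth at a CpG = count_methylated + count_unmethylated.
    """
    total = n = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        total += int(fields[4]) + int(fields[5])
        n += 1
    return total / n if n else 0.0

# Hypothetical file names -- compare before and after deduplication:
# with open("sample.cov") as raw, open("sample.deduplicated.cov") as dedup:
#     print(mean_cpg_coverage(raw), mean_cpg_coverage(dedup))
```

Note that this averages only over CpGs that appear in the file (i.e. covered positions), so the drop from 4x to 1x understates the loss if many CpGs fall out of the file entirely after deduplication.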

Given the large reduction in data and coverage, and that I am working with an organism with a highly repetitive genome, I am wondering whether deduplication should be applied in this case. Additionally, if there are any QC steps that would help make a more informed decision, I would like to hear them!
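One QC step that can inform the decision: estimate how many "duplicates" you would expect purely by chance, i.e. independent fragments that happen to share a start coordinate. If the observed duplication rate (~40% here) is far above that baseline, PCR duplication is the more plausible explanation; if not, deduplication may be discarding real data. The sketch below uses a standard Poisson approximation and assumes fragment starts are uniform over the possible (chromosome, position, strand) sites, which a repetitive genome will violate, so treat it as a lower bound on coincidental duplicates.

```python
import math

def expected_chance_dup_fraction(n_reads, n_sites):
    """Expected fraction of reads flagged as duplicates if fragment start
    sites were drawn uniformly at random from n_sites possible
    (chromosome, position, strand) coordinates (Poisson approximation).

    Expected unique sites = n_sites * (1 - exp(-n_reads / n_sites)),
    so the duplicate fraction is 1 - unique / n_reads.
    """
    expected_unique = n_sites * (1.0 - math.exp(-n_reads / n_sites))
    return 1.0 - expected_unique / n_reads

# Illustrative numbers (not from the post): 40 M read pairs over a 1 Gb
# genome with two strands (~2e9 possible start sites):
# expected_chance_dup_fraction(4e7, 2e9)  -> roughly 0.01 (about 1%)
```

A coincidental-duplicate baseline around 1% against an observed 40% would argue that most flagged reads really are PCR duplicates; the picture changes at higher coverage or if the effective number of distinct start sites is much smaller than the genome size.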

Thank you!

sequencing WGBS
3.4 years ago
varunorama ▴ 80

Felix Krueger posted a very insightful and thoughtful response to this question on the Bismark GitHub page:

https://github.com/FelixKrueger/Bismark/issues/400

