Question

Should the reads of biological replicates be combined for ChIP-seq peak calling?

4


9.8 years ago

Ian 6.0k

A philosophical question, but one that has been posed to me by two independent PIs in the last month. If there are two biological replicates for the same ChIP sample of a ChIP-seq experiment, and one of them performs "better" than the other (in terms of number and quality of detected regions), should they be combined?

My own personal take on this is that they should not be combined: peak calling should be performed on each replicate separately, and then the intersection of the two peak sets taken. But I would be interested to see if anyone can see merit in combining biological replicates.

Thanks.

ChIP-Seq replicates • 7.9k views

Updated 2.5 years ago by Ram 43k • written 9.8 years ago by Ian 6.0k

0


9.8 years ago

Ming Tommy Tang ★ 3.9k

There is no single "correct" way to analyze the data. I think you can combine the reads and then call peaks. At the same time, you can call peaks for each data set separately. Then you can intersect the peaks to see how many overlap.

The bottom line is whether the data make sense in relation to the specific biological questions you are asking.
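As a rough illustration of that overlap check, here is a minimal Python sketch. It assumes each peak set has already been loaded as `(chrom, start, end)` tuples with BED-style half-open coordinates; the function name and the toy coordinates are made up for the example, and in practice a tool such as `bedtools intersect` does the same job far more efficiently.

```python
from collections import defaultdict


def count_overlapping_peaks(peaks_a, peaks_b):
    """Count peaks in peaks_a that overlap at least one peak in peaks_b.

    Peaks are (chrom, start, end) tuples with half-open coordinates (BED style).
    This is a simple quadratic scan per chromosome, fine for a sanity check.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    n_overlapping = 0
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < b_end and b_start < end
               for b_start, b_end in by_chrom.get(chrom, [])):
            n_overlapping += 1
    return n_overlapping


# Hypothetical toy peak sets for two replicates.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
rep2 = [("chr1", 150, 250), ("chr1", 590, 700), ("chr2", 150, 300)]

shared = count_overlapping_peaks(rep1, rep2)
print(f"{shared} of {len(rep1)} rep1 peaks overlap a rep2 peak")
```

Note that the count is not symmetric: the number of rep1 peaks hitting rep2 can differ from the number of rep2 peaks hitting rep1, so it is worth reporting both directions.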

I just gave my 5 cents.

Updated 2.5 years ago by Ram 43k • written 9.8 years ago by Ming Tommy Tang ★ 3.9k

Ram · Accepted Answer · 2014-07-04

The first thing I do prior to any of this, post-alignment, is run some QC using the ENCODE tools, in particular the phantompeakqualtools strand cross-correlation analysis (NSC, RSC, and to some degree QTag) and the PCR bottleneck metrics. If you have biological reps then IDR is a very good idea, as it checks consistency between replicates. CHANCE is also a highly recommended tool for QC. Note: IDR was designed for TF (narrow-peak) data and is not really developed for broad peaks (histone modifications), so keep that in mind.
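To make the PCR bottleneck metric concrete, here is a small Python sketch of the usual definition: the number of genomic positions covered by exactly one uniquely mapped read, divided by the number of distinct positions covered by at least one. The function name and the toy reads are assumptions for illustration only; the ENCODE pipelines compute this from the alignment files directly.

```python
from collections import Counter


def pcr_bottleneck_coefficient(read_positions):
    """Return N1 / Nd for an iterable of (chrom, position, strand) tuples.

    N1 = distinct positions with exactly one read,
    Nd = distinct positions with at least one read.
    Values near 1 suggest a complex library; low values suggest heavy duplication.
    """
    counts = Counter(read_positions)
    n_distinct = len(counts)
    if n_distinct == 0:
        raise ValueError("no reads supplied")
    n_single = sum(1 for c in counts.values() if c == 1)
    return n_single / n_distinct


# Hypothetical toy input: the chr1:100 (+) position is hit twice.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
]
pbc = pcr_bottleneck_coefficient(reads)
print(f"PBC = {pbc:.2f}")
```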

Do these on all your samples, including inputs; you should see no cross-correlation in inputs and decent cross-correlation in your IP. The level of the latter depends on whether the target is narrow- or broad-peak; if you are doing TFs then use the criteria outlined on the phantompeakqualtools page. We have seen some cases where inputs show some cross-correlation, possibly indicative of biases in fragmentation (open chromatin fragments more easily than closed, for instance), and these will influence your peak calls. If you have access to a Covaris I highly recommend it, as we have found the best overall results with samples prepped on those, but they're quite pricey.

I also recommend that, if you are performing ChIP-Seq runs that ENCODE has previously performed (same organism, same factor), you run a comparison against theirs to get an idea of how your samples come out (there is an online spreadsheet on the ENCODE site with this information).
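For anyone wondering what the NSC and RSC numbers actually are, they are simple ratios taken from the strand cross-correlation profile. The Python sketch below assumes you already have that profile as a mapping from strand shift to correlation, plus the shift at the estimated fragment length and the shift at the read length (the "phantom" peak). The function name, argument names, and toy values are all made up for illustration; the tool itself estimates these shifts and reports the ratios for you.

```python
def nsc_rsc(cc_by_shift, fragment_shift, read_length_shift):
    """Compute (NSC, RSC) from a strand cross-correlation profile.

    NSC = cc(fragment length) / min(cc)
    RSC = (cc(fragment length) - min(cc)) / (cc(read length) - min(cc))
    """
    cc_min = min(cc_by_shift.values())
    cc_frag = cc_by_shift[fragment_shift]
    cc_phantom = cc_by_shift[read_length_shift]
    if cc_min == 0 or cc_phantom == cc_min:
        raise ValueError("profile is too flat to compute the ratios")
    nsc = cc_frag / cc_min
    rsc = (cc_frag - cc_min) / (cc_phantom - cc_min)
    return nsc, rsc


# Hypothetical profile: phantom peak at shift 36 (read length),
# fragment-length peak at shift 200.
profile = {0: 0.10, 36: 0.12, 100: 0.14, 200: 0.18, 400: 0.10}
nsc, rsc = nsc_rsc(profile, fragment_shift=200, read_length_shift=36)
print(f"NSC = {nsc:.2f}, RSC = {rsc:.2f}")
```

An input sample with essentially no fragment-length peak would give an NSC close to 1, which is the "no cross-correlation" case described above.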

Very useful: Use IGV and calculate coverage tracks (TDF). Run some preliminary peak calls on your best sample and visually check the peaks. Use ENCODE data sets from IGV to compare against if possible!

After all that, if the samples are quite comparable based on IDR and peak metrics, then combining them is probably fine for peak calling. If they are different, you are limited by your worst sample and would likely be introducing noise (e.g. false positives) into the peak calling.