The MACS algorithm article (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-9-r137) for ChIP-seq says:
"Given a sonication size (bandwidth) and a high-confidence
fold-enrichment (mfold), MACS slides 2bandwidth windows
across the genome to find regions with tags more than mfold
enriched relative to a random tag genome distribution. MACS
randomly samples 1,000 of these high-quality peaks, separates
their Watson and Crick tags, and aligns them by the
midpoint between their Watson and Crick tag centers (Figure
1a) if the Watson tag center is to the left of the Crick tag
center. The distance between the modes of the Watson and
Crick peaks in the alignment is defined as 'd', and MACS shifts
all the tags by d/2 toward the 3' ends to the most likely protein-
DNA interaction sites."
But shouldn't the sonication size (in other words, the fragment size) be the same as 'd'? Why does MACS calculate the fragment size to shift the ChIP-seq tags if the size is already given?
Thanks, but I still have some difficulty visualising this... Can the Watson and Crick tags of the same fragment have different lengths?
So what you are saying is that during ChIP-seq the DNA sequence where the transcription factor interacts isn't sequenced, only the flanking regions are, and that is why we have to move the tags towards the center?
Not exactly. The shifting is meant to narrow down the binding site of the protein. ChIP-seq is typically sequenced single-end. That means only one end of each fragment is read, from either the Watson or the Crick strand (not both!). Which strand is read is determined randomly during the library preparation step, when you ligate the Illumina adapters to the DNA after sonication. Because this is random, in a high-quality data set you should not see either Watson or Crick overrepresented (= strand bias). Also, in a high-quality library, for a given genomic region where the protein binds, you have plenty of independent reads covering that region, originating from both Watson and Crick. That means you will see local enrichments of reads at the edges of the binding events, Watson reads on the left and Crick reads on the right, as illustrated here. Shifting these local peaks toward each other should therefore give you the center of the binding event. I hope that is clear. If not, please feel free to ask for further details.
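To make the picture above a bit more concrete, here is a small toy simulation in Python. It is only a sketch under simplifying assumptions, not how MACS itself is implemented: one binding site, fragments that always cover it, normally distributed fragment lengths, and the difference between the strand means used as a simple stand-in for the distance between the modes that the paper describes. All function names and parameter values are made up for illustration.

```python
import random
from statistics import mean

def simulate_tags(site=5000, n_fragments=20000, mean_len=200, sd_len=20, seed=0):
    """Toy model: every fragment covers one binding site at position `site`."""
    rng = random.Random(seed)
    watson, crick = [], []
    for _ in range(n_fragments):
        length = max(50, int(rng.gauss(mean_len, sd_len)))
        # Place the fragment so that the binding site falls somewhere inside it.
        start = site - rng.randint(0, length)
        end = start + length
        # Single-end sequencing: only one 5' end per fragment is read,
        # and the strand is picked at random.
        if rng.random() < 0.5:
            watson.append(start)  # 5' end of a Watson (forward) tag
        else:
            crick.append(end)     # 5' end of a Crick (reverse) tag
    return watson, crick

def estimate_d(watson, crick):
    """Assumption: distance between strand means as a proxy for the mode distance."""
    return round(mean(crick) - mean(watson))

def shift_tags(watson, crick, d):
    """Shift every tag by d/2 toward its 3' end: Watson right, Crick left."""
    half = d // 2
    return [p + half for p in watson], [p - half for p in crick]

watson, crick = simulate_tags()
d = estimate_d(watson, crick)
shifted_w, shifted_c = shift_tags(watson, crick, d)

print("estimated d:", d)
print("Watson mean before/after shift:", round(mean(watson)), round(mean(shifted_w)))
print("Crick mean before/after shift:", round(mean(crick)), round(mean(shifted_c)))
```

In this toy setup the Watson tags pile up to the left of the site and the Crick tags to the right, and after shifting both groups end up centered close to the simulated binding site.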
I understand better now, thank you very much. But now I have two more related questions :) - In that case, why doesn't MACS also shift the tags from control samples? - Do we also need to shift tags in ATAC-seq, given that there is no protein binding there?
Yes, in ATAC-seq you should use --nomodel because, as you said, there is no protein target. Instead, the cutting events in ATAC-seq happen all over the accessible chromatin region, so shifting reads does not really make sense. Back to ChIP-seq: you do not shift the reads in controls, because you do not have a protein target in the control. The control is a sample that was treated the same way as the ChIP sample, but with an unspecific antibody for the immunoprecipitation. The idea is to assess which regions come down due to unspecific binding and would therefore create or inflate false-positive signal in the actual ChIP sample. MACS calls the peaks in the ChIP sample and then uses the control to calculate the fold change and p-value using the Poisson distribution. Essentially, it corrects what it calls in the ChIP sample by the background.
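As a rough illustration of the fold change and Poisson p-value idea, here is a minimal Python sketch. It assumes the control count in a window can be used directly as the Poisson rate, which is a simplification (the background estimate MACS actually uses is more involved), and the tag counts and the function name are hypothetical.

```python
import math

def poisson_upper_tail(k, lam, max_terms=1000):
    """P(X >= k) for X ~ Poisson(lam), summing P(X = k), P(X = k + 1), ...

    Assumes lam > 0 and lam much smaller than max_terms.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    # P(X = k), computed in log space to avoid overflow in lam**k and k!.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    for i in range(k + 1, k + 1 + max_terms):
        total += term
        term *= lam / i  # turns P(X = i - 1) into P(X = i)
    return min(total, 1.0)

# Hypothetical tag counts in one candidate peak window.
chip_tags = 30     # tags observed in the ChIP sample
control_tags = 8   # tags expected from the (scaled) control

fold_change = chip_tags / control_tags
p_value = poisson_upper_tail(chip_tags, control_tags)

print("fold change:", fold_change)
print("p-value:", p_value)
```

The smaller the p-value, the less likely it is that the ChIP tag count in that window comes from background alone.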
Got it. Thank you very much!