Question

Simulated background and peak smoothing in MACS2

4.9 years ago

9606 ▴ 330

I quote from the wiki page of the ENCODE Imputation Challenge:

We use the MACSv2 peak caller to compute the fold-enrichment and statistical significance of smoothed counts (150 bp smoothing window) of read-starts (5' ends of reads) at each position in the genome relative to the expected number of reads from a local Poisson simulated background distribution of reads at each nucleotide in the genome.

I have basic knowledge of what calling peaks means, but I find it hard to interpret the details of this short description, in particular:

What does it mean to smooth the counts?
What is a local Poisson simulated background? From what is it computed?

Any explanation or reference is much appreciated.

macs2 peakcalling background • 1.6k views

updated 4.9 years ago by ATpoint 81k • written 4.9 years ago by 9606 ▴ 330

score 2 · Accepted Answer · 2019-05-30

What they appear to count across the genome is the 5' position (the start) of every read, so each read contributes a single nucleotide. This typically gives a very uneven coverage profile (see below), which makes it hard to define the summit of a peak, that is, the position within the peak where most reads accumulate. Smoothing means extending these 5' positions in each direction to create a more even coverage profile, which then helps in finding summits.

[Screenshot: the same dataset shown as two coverage tracks, smoothed (top) and 5' positions only (bottom)]

In this screenshot you see exactly the same dataset twice: the top track is smoothed by 75 bp in each direction from the 5' end of each read, while the bottom track shows only the 5' position. Technical detail: this is ATAC-seq data, where the read end represents the position where the "ATAC-seq enzyme" (the proper term is "transposase") integrates into the genome.
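To make the smoothing idea concrete, here is a minimal sketch in Python with NumPy. It is not how MACS2 does it internally. It assumes that "smoothing" simply means summing 5' end counts in a window centred on each base, which has the same effect as extending every 5' end by 75 bp in both directions. The function names and the toy read positions are made up for illustration.

```python
import numpy as np

def five_prime_counts(starts, chrom_len):
    """Count read 5' ends at each position (each read contributes 1 bp)."""
    counts = np.zeros(chrom_len, dtype=np.int64)
    # np.add.at handles repeated positions correctly (unlike counts[starts] += 1).
    np.add.at(counts, np.asarray(starts), 1)
    return counts

def smooth_counts(counts, half_window=75):
    """Sum the counts within +/- half_window bp of every position.

    Assumption: a flat (box) window; this mimics extending each 5' end
    by half_window bp in both directions.
    """
    kernel = np.ones(2 * half_window + 1, dtype=np.int64)
    return np.convolve(counts, kernel, mode="same")

# Hypothetical toy data: read starts clustered around position 500.
starts = [480, 495, 500, 500, 510, 530, 900]
raw = five_prime_counts(starts, chrom_len=1000)
smoothed = smooth_counts(raw, half_window=75)

print("raw max:", raw.max(), "smoothed max:", smoothed.max())
```

The raw track is a handful of isolated spikes, while the smoothed track forms one broad plateau over the cluster, which is what makes a summit easier to locate.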

For the Poisson question, you should read the original MACS paper, which describes the underlying concepts of how backgrounds are estimated. In a nutshell (and from what I understand), given a candidate peak, the peak caller checks the regions up- and downstream of that candidate within a certain window (I think around 10 kb; check the paper for details). It then uses Poisson statistics, taking the total genome size and the read count into account, to test whether the enrichment in the peak is significant over the signal in the flanking region. In this way it separates significant enrichments (peaks) from local noise.
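As a rough illustration of the local Poisson idea, the sketch below estimates the expected read count in a candidate peak and asks how surprising the observed count is. It is a simplification under stated assumptions, not the MACS implementation. It assumes the expected rate is the largest of the genome-wide rate and the rates in a few windows around the peak (my reading of the MACS paper). The window sizes, read counts and function names are invented for the example.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summing the upper tail directly."""
    if k <= 0:
        return 1.0
    total, i = 0.0, k
    while True:
        # Poisson probability of exactly i reads, computed in log space.
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Past the mode the terms shrink; stop once they no longer matter.
        if i > lam and term <= total * 1e-16:
            break
        i += 1
    return min(total, 1.0)

def local_lambda(genome_rate, flank_counts, peak_width):
    """Expected reads in the peak under the background model.

    flank_counts maps window size (bp) -> reads observed in that window.
    Assumption: take the highest per-bp rate so that locally noisy
    regions need more reads to be called significant.
    """
    rates = [genome_rate] + [n / w for w, n in flank_counts.items()]
    return max(rates) * peak_width

# Hypothetical numbers, chosen only for illustration.
total_reads, genome_size = 1_000_000, 100_000_000
genome_rate = total_reads / genome_size           # reads per bp genome-wide
flank_counts = {1_000: 30, 5_000: 60, 10_000: 100}
peak_width, observed = 200, 25

lam = local_lambda(genome_rate, flank_counts, peak_width)
p_value = poisson_sf(observed, lam)
print(f"expected {lam:.1f} reads, observed {observed}, p = {p_value:.2e}")
```

Taking the maximum rate is the conservative choice here: a peak sitting in a generally noisy neighbourhood is compared against that local noise rather than against the quieter genome-wide average.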