Question

Different mappability/effective genome fraction with paired end reads? (ChIP-Seq)

2

8.0 years ago

endrebak ▴ 960

I want to add support for paired-end reads in my ChIP-Seq software, which in itself is straightforward, but I came to think of a complication: the mappability should be much higher for PE reads. How do I deal with this? (MACS is another piece of software that handles paired-end reads, and I cannot see them mentioning it anywhere in their docs.)

It should be higher obviously, but how much?

Edit:

These people claim to have come up with a solution:

Third, we investigate how the use of paired-end sequencing or mate-pair libraries relates to mappability. To this end, we predict and quantify how many of the pairs obtained from a typical DNASeq experiment can be rescued by taking advantage of the uniqueness of one of its reads and of the distance information for the pair.

I will have to look at their paper, "Fast Computation and Applications of Genome Mappability".

ChIP-Seq • 2.1k views

ADD COMMENT • link updated 7.5 years ago by Biostar 20 • written 8.0 years ago by endrebak ▴ 960

1

How much higher will depend on the insert size (median and variation) as well as the read length. I had rather hoped that you had come up with an elegant way to do this in epic :)

ADD REPLY • link 8.0 years ago by Devon Ryan 104k

1

I'm quoting from memory/gut feeling. Mappability doesn't increase that much past a read length of ~40 bp. In my opinion the main advantage of PE reads is in detecting PCR duplicates rather than in increasing the number of high-MAPQ reads.

ADD REPLY • link 8.0 years ago by dariober 14k

2

Thanks, both of you. Someone in my group said the same thing, namely that the regular EGS (effective genome size) is a good (if somewhat conservative) approximation. Probably not something I should think too much about for now.

ADD REPLY • link 8.0 years ago by endrebak ▴ 960

1

For ChIP, your 'signal' is the whole fragment pulled down by the IP, right? So for single-end reads you have to extend your reads somehow. You don't have to edit the BAM file or BigWig file (although maybe you should modify the latter), but you have to, in memory at least, treat SE reads as longer than they really are if you want comparable results and the same algorithm for PE and SE reads. The worst thing you can do is treat PE reads like SE reads and standardize for that.

How you extend depends on how much computational effort you're willing to put into it. One popular method is to take the mean/median insert size and extend all reads to that length. This is very crude and results in heavily over-smoothed data. A better approach is to take the insert size distribution, ideally from pre-sequencing experiments or alternatively inferred from the SE reads themselves after sequencing, and extend each read to a length drawn from that distribution, so that each length is chosen with the probability the distribution assigns to it. This results in data that looks a lot more like PE data, but is still far from perfect. The best method is to get local insert size estimates: call peaks on the forward-strand reads only, then on the reverse-strand reads only, then find the peak offset between the two, which should be roughly the same as the median insert size (or even better, use the whole peak distributions to estimate the theoretical local distribution), and then extend reads by that much in that area.
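As a rough illustration of the first two approaches, here is a minimal Python sketch. It assumes 0-based, half-open coordinates and reads given as plain `(start, end, strand)` tuples; the names `extend_read` and `sample_fragment_lengths`, the toy reads, and the insert sizes are all made up for the example and are not taken from any particular tool.

```python
import random


def extend_read(start, end, strand, fragment_length, chrom_length):
    """Extend a single-end read to an assumed fragment length.

    Coordinates are 0-based, half-open. The read's 5' end stays fixed
    and the interval grows in the 3' direction, clipped to the chromosome.
    """
    read_length = end - start
    # Never shrink a read below its sequenced length.
    length = max(fragment_length, read_length)
    if strand == "+":
        return start, min(start + length, chrom_length)
    # Reverse strand: the 5' end is the rightmost coordinate.
    return max(end - length, 0), end


def sample_fragment_lengths(observed_lengths, n, seed=0):
    """Draw n fragment lengths from an empirical insert-size distribution."""
    rng = random.Random(seed)
    return rng.choices(observed_lengths, k=n)


# Toy data (hypothetical): three 36 bp single-end reads.
reads = [(1000, 1036, "+"), (1200, 1236, "-"), (10, 46, "-")]
# Hypothetical insert sizes, e.g. from a pre-sequencing size profile.
observed = [150, 180, 200, 200, 220, 250]

# Crude method: every read is extended to the same (median-like) length.
fixed = [extend_read(s, e, strand, 200, chrom_length=5000)
         for s, e, strand in reads]

# Better method: each read gets a length sampled from the distribution.
lengths = sample_fragment_lengths(observed, len(reads))
sampled = [extend_read(s, e, strand, length, chrom_length=5000)
           for (s, e, strand), length in zip(reads, lengths)]
```

The local, strand-offset-based estimate would replace the global `observed` list with a per-region value, which this sketch does not attempt.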

Basically, with SE data your first step should be to make it look like PE data, then run both PE and SE data through the same processes. Whilst mappability may not change much between long SE reads and PE reads, the big difference is that ChIP signal is the fragment, not the start of the fragment. In fact, I would even go as far as to say that ChIP signal is a normal distribution around the center of the sequenced fragment.
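To make the "signal is centered on the fragment" idea concrete, here is a small sketch that reduces both PE fragments and SE reads to one estimated fragment midpoint, so that downstream code can treat them identically. The function names and the 200 bp fragment length are assumptions for illustration only; coordinates are 0-based, half-open.

```python
def pe_midpoint(start, end):
    """Midpoint of a paired-end fragment spanning [start, end)."""
    return (start + end) // 2


def se_midpoint(start, end, strand, fragment_length):
    """Estimated fragment midpoint for a single-end read.

    The read's 5' end is taken as the fragment's 5' end, and the midpoint
    lies half an assumed fragment length downstream of it.
    """
    half = fragment_length // 2
    if strand == "+":
        return start + half
    # Reverse strand: the 5' end is the rightmost coordinate.
    return end - half


# A 200 bp fragment at [1000, 1200), seen as a pair or as either single read.
from_pair = pe_midpoint(1000, 1200)
from_forward_read = se_midpoint(1000, 1036, "+", 200)
from_reverse_read = se_midpoint(1164, 1200, "-", 200)
```

With a correct fragment length all three estimates coincide. With SE data the true length is unknown, which is why the quality of the insert size estimate matters so much.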

ADD REPLY • link 8.0 years ago by John 13k