Question

What is the meaning of the true diagonal in Hi-C?

1

7.7 years ago

ankitplusplus ▴ 40

Hi everyone, in my Hi-C contact matrix I see non-zero values on the diagonal. What is the meaning of a locus contacting itself? Should I ignore the diagonal in further calculations?

Thanks

Hi-C Chia-pet 3C 4C 5C • 4.4k views

PoGibas · Answer 1 · 2016-08-28

5

7.7 years ago

Anton Goloborodko ▴ 280

The true diagonal contains contacts between loci separated by distances below the size of a bin. The 2nd diagonal has such contacts too, i.e. contacts between pairs of loci located close to a bin boundary, but on either side of it.
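To make the bin-boundary point concrete, here is a minimal sketch (the helper name `diagonal_offset` and the 40 kb bin size are illustrative assumptions, not part of any Hi-C library). It shows two pairs of loci that are both 2 kb apart, one landing on the main diagonal and the other on the 2nd diagonal:

```python
def diagonal_offset(pos_a, pos_b, bin_size):
    """Return the diagonal a contact lands on (0 = main diagonal).

    Hypothetical helper: positions are base-pair coordinates on the
    same chromosome, and bins are fixed-size and start at 0.
    """
    return abs(pos_a // bin_size - pos_b // bin_size)


bin_size = 40_000  # assumed bin size, in bp

# Two loci 2 kb apart, both inside bin 0: main diagonal
same_bin = diagonal_offset(10_000, 12_000, bin_size)

# Two loci 2 kb apart, straddling the boundary at 40 kb: 2nd diagonal
across_boundary = diagonal_offset(39_000, 41_000, bin_size)

print(same_bin, across_boundary)  # prints: 0 1
```

So a very short-range pair can end up on either of the first two diagonals, depending only on where the bin boundary happens to fall.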

Our lab's usual recommendation is to discard the first two diagonals, because they are contaminated by non-informative artifacts of the Hi-C procedure: unligated and self-ligated molecules. The former are just pieces of undigested and unligated DNA; the latter form when the two ends of the same molecule get ligated and the resulting circle is then cleaved elsewhere. Both types of molecules look like short-distance contacts. Unligated DNA pieces look like contacts with a separation of a few hundred bp and with the sequencing directions pointing toward each other along the genome. Self-ligated molecules usually have a separation of a few kb to 10 kb and have sequencing directions pointing away from each other. Neither contains information about spatial organization.

Because unligated DNA and self-circles cannot be distinguished from "true" contacts formed by two distinct ligated molecules, their presence essentially invalidates all statistics on short-distance contacts. For this reason, we usually discard the first two diagonals of Hi-C matrices at high resolutions (bins up to a few tens of kb) or only the first diagonal for low-resolution datasets (100 kb+).
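A minimal sketch of discarding the near diagonals, assuming the contact map is a dense, square NumPy array (the function name `mask_near_diagonals` and its parameters are hypothetical, and real pipelines often work on sparse matrices instead):

```python
import numpy as np


def mask_near_diagonals(matrix, n_diags=2, fill=np.nan):
    """Return a copy with diagonals 0..n_diags-1 (both sides) set to `fill`.

    n_diags=2 masks the main diagonal and the first off-diagonals;
    n_diags=1 masks the main diagonal only.
    """
    out = np.array(matrix, dtype=float)  # copy, so the input is untouched
    i, j = np.indices(out.shape)
    out[np.abs(i - j) < n_diags] = fill
    return out


# Toy symmetric 5x5 "contact map"
counts = np.arange(25, dtype=float).reshape(5, 5)
counts = counts + counts.T

masked = mask_near_diagonals(counts, n_diags=2)
print(masked)
```

Filling with NaN rather than 0 is a design choice in this sketch: a zero reads as "no contacts observed", while NaN marks the bin pair as excluded, so NaN-aware functions such as `np.nanmean` skip it.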

0

Is this recommendation on diagonal bias (i.e., data processing workflow) published?

7.6 years ago by PoGibas 5.1k

1

uuuggghh, not really! :) The 4DN consortium is currently working on Hi-C data analysis standards, which will address this issue, among others, though it will take some time to produce a document. Meanwhile, I'd recommend my all-time favorite guideline to Hi-C data analysis, by Noam Kaplan and Bryan Lajoie from Dekker's group. It discusses the non-informative artifacts of Hi-C, though it doesn't suggest any particular threshold.

updated 7.6 years ago by PoGibas 5.1k • written 7.6 years ago by Anton Goloborodko ▴ 280

0

Thank you for your ideas!
I remembered your suggestion to remove diagonals #1 and #2, which sounds very logical to me. But still, shouldn't we, in an ideal world, remove diagonal bias while normalising interaction matrices (i.e., observed / expected normalisation should remove super- and sub-diagonal bias)? This is not a question, just a random observation that should be tested :)
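For anyone who wants to test this, a minimal sketch of a per-diagonal observed / expected (O / E) normalisation, assuming a dense square NumPy matrix and taking "expected" to be the plain mean of each diagonal (the function name `observed_over_expected` is hypothetical, and published pipelines typically use more careful expected-value estimates):

```python
import numpy as np


def observed_over_expected(matrix):
    """Divide every diagonal by its own mean (simple distance-decay model)."""
    obs = np.asarray(matrix, dtype=float)
    n = obs.shape[0]
    i, j = np.indices(obs.shape)
    dist = np.abs(i - j)            # diagonal index of every cell
    oe = np.full_like(obs, np.nan)  # diagonals with zero mean stay NaN
    for d in range(n):
        sel = dist == d
        expected = obs[sel].mean()
        if expected > 0:
            oe[sel] = obs[sel] / expected
    return oe


toy = np.array([[10, 4, 1],
                [4, 20, 2],
                [1, 2, 30]])
print(observed_over_expected(toy))
```

In this sketch each diagonal is rescaled to a mean of 1, so a uniform inflation of the first diagonals divides out. The artifact reads are still part of both the observed and the expected counts there, so whether the resulting ratios are informative is exactly the thing that would need testing.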

7.5 years ago by PoGibas 5.1k

score 0 · Answer 2 · 2016-08-24

0

7.7 years ago

Nibua ▴ 70

What's the size of the bins in your matrix? Let's say the bins are 5 kb long. The values on the diagonal mean that you see interactions at very short distances (less than 5 kb). To me, these values should be the highest values.

0

Thanks for the answer. My bin size is 40 kb. You are correct that the diagonal should hold the highest values. But for finding the true contact pairs, should I use the diagonal values as they are, or should I set the diagonal to 0? I am confused about that.

7.7 years ago by ankitplusplus ▴ 40

0

What is the method you will use to find the best contact pairs?

7.7 years ago by Nibua ▴ 70

0

I am calculating the contact probability of each pair using the method of Zhang and Wolynes (PNAS, 2015):

p(i,j) = min(1, C(i,j) / min(n_i, n_j))

where n_i = max[ C(i-4,i), C(i-3,i), ..., C(i,i+2), C(i,i+3) ] and C(i,j) is the contact frequency of the pair (i,j). So my doubt is: what value of C(i,i) should I use when calculating the probability p(i,j)?
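A minimal sketch of the formula as written in this post, not a reference implementation of the Zhang and Wolynes method. It assumes a dense symmetric NumPy matrix and a window of offsets -4..+3 read off the post. The `include_self` switch is a hypothetical parameter added only to show how much the choice of C(i,i) matters:

```python
import numpy as np


def contact_probability(C, offsets=range(-4, 4), include_self=True):
    """p(i,j) = min(1, C(i,j) / min(n_i, n_j)), with n_i a local maximum.

    n_i is the largest C(i, i+k) over the given offsets; the k = 0 term,
    i.e. C(i,i), is used only when include_self is True.
    """
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    norm = np.zeros(n)
    for i in range(n):
        vals = [C[i, i + k] for k in offsets
                if 0 <= i + k < n and (k != 0 or include_self)]
        norm[i] = max(vals) if vals else 0.0
    denom = np.minimum.outer(norm, norm)  # min(n_i, n_j) for every pair
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(denom > 0, C / denom, 0.0)
    return np.minimum(1.0, ratio)


toy = np.array([[100, 10, 2],
                [10, 100, 10],
                [2, 10, 100]])
p_with_diag = contact_probability(toy, include_self=True)
p_without_diag = contact_probability(toy, include_self=False)
print(p_with_diag[0, 1], p_without_diag[0, 1])  # prints: 0.1 1.0
```

On this toy matrix, keeping C(i,i) in the window makes the large diagonal counts set the normalisation, so the nearest-neighbour probability drops to 0.1. Excluding it lets the nearest-neighbour count set the scale instead.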

7.7 years ago by ankitplusplus ▴ 40

score 0 · Answer 3 · 2016-08-25

So you are estimating contacts between fragments. But fragments that are already very close together along the sequence (i.e. in the same bin) are, obviously, highly likely to be in close spatial proximity to each other. And that's probably not what you are looking for.

You probably want to look for higher-order structural interactions such as looping, not for interactions that arise simply from proximity along the primary sequence.