Question

Read depth and Hi-C resolution

8.3 years ago

int11ap1 ▴ 470

Dear all,

I keep reading that the finest resolution one can obtain with Hi-C data depends on the number of reads, or the read depth, of the dataset. According to a review by Ferhat Ay and William S. Noble, "Analysis methods for studying the 3D architecture of the genome":

Even though there are no clear guidelines yet, a recent study suggests using a bin size that results in at least 80 % of all possible bins having more than 1,000 contacts [18]. According to this criterion, approximately 300 million mapped reads are needed to achieve 10 kb resolution for the human genome, assuming that all reads are uniformly distributed across the genome. However, this criterion suggests a linear relationship between resolution and sequencing depth, which does not hold for two-dimensional Hi-C data. An alternative would be to use a similar cutoff-based measure on the density of either the cis- or the trans-contact matrices instead of total contact counts per locus.

Let's say I have a dataset with X reads for a genome of length Y. How would I determine the finest resolution I can obtain? I don't really understand the reasoning.

read depth coverage hi-c • 9.1k views

updated 8.1 years ago by Anton Goloborodko ▴ 280 • written 8.3 years ago by int11ap1 ▴ 470

score 3 · Answer 1 · 2016-03-24

Hi,

The resolution of a Hi-C dataset (a) depends heavily on the details of the protocol and (b) is, in fact, a poorly defined, though real, measure.

(a) The number of reads is not the only variable that determines the quality of a Hi-C dataset.

First, if the total number of unique Hi-C molecules in your sample is low (as they say, "the library has low complexity"), nothing good can come out of extra sequencing.

Second, informative Hi-C molecules (i.e. molecules formed by ligation between two loci that are spatially close in a nucleus) represent only a fraction of any Hi-C library; any library also contains molecules formed by random ligation in solution, fragments of unligated DNA and pieces of self-circularized DNA. Thus, depending on the protocol and its particular execution, two sequenced libraries with the same number of reads may contain different numbers of informative interactions. Even worse, randomly ligated molecules cannot be computationally removed from a library and, if present in large numbers, may completely mask true interactions, especially at large distances or in trans.

Third, no Hi-C experiment can achieve a resolution finer than a few average restriction fragments. In particular, the commonly used HindIII is a 6bp-cutter and on average cuts every ~3kb; thus, any experiment using HindIII has a maximal resolution of ~10kb.
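The restriction-fragment limit can be sketched with a bit of arithmetic. This is only a rough illustration: the function names are made up for this example, the 4^n spacing assumes a random sequence with equal base frequencies (real genomes deviate, which is one reason the observed HindIII spacing is not exactly 4096 bp), and "a few fragments" is arbitrarily taken as 3.

```python
def expected_cut_spacing(site_length: int) -> int:
    """Expected distance between cut sites for an enzyme with a
    recognition site of `site_length` bases.

    Assumption: random sequence, all four bases equally likely,
    so a given site occurs once every 4**site_length positions.
    """
    return 4 ** site_length


def resolution_floor(mean_fragment_bp: float, fragments_per_bin: int = 3) -> float:
    """Rough lower bound on bin size: a few average fragments per bin.

    `fragments_per_bin=3` is an illustrative choice, not a standard.
    """
    return mean_fragment_bp * fragments_per_bin


print(expected_cut_spacing(6))   # 6bp-cutter such as HindIII: 4096
print(expected_cut_spacing(4))   # 4bp-cutter: 256
print(resolution_floor(3000))    # ~3kb fragments: 9000, i.e. roughly 10kb
```

By the same reasoning, a 4bp-cutter allows a much finer floor than a 6bp-cutter, provided the library complexity and sequencing depth can support it.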

See a more detailed discussion in this review by Dekker's lab.

(b) The issue is complicated by the fact that, AFAIK, there is no published, widely accepted definition of the resolution of a Hi-C dataset/protocol. This is not to say that all datasets are created equal - the original Hi-C dataset from 2009 could not really distinguish TADs, and it took a high read depth dataset produced using a modified protocol by Lieberman-Aiden's lab in late 2014 to clearly distinguish individual loop interactions at corners of TADs. Now, the classic definition of resolution is the minimal size of a feature that the method can distinguish. For Hi-C, it's not entirely clear what specific fine features are expected to be seen in any specific sample; without knowing such features, one cannot tell the resolution of a chosen dataset. Obviously, it is not an unsolvable problem, but officially it has not been solved yet. There are some standardization efforts currently undertaken by the freshly formed 4D Nucleome consortium, so hopefully in two to three years the definition of resolution in Hi-C will be clearer.

Ram · Answer 2 · 2016-01-20

I think the text is saying the following: divide the human genome (roughly 3.1 billion bp) into non-overlapping 10kb windows, of which there are roughly 310,000. To get at least 1,000 reads (contacts) in each window, you would then need about 310,000 × 1,000 ≈ 300 million reads. This assumes that the reads are evenly distributed over the genome, which they probably will not be, due to the underlying sequence.
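As a minimal sketch of that arithmetic, the snippet below goes both ways: reads needed for a given bin size, and the finest bin size for a given number of reads. It assumes perfectly uniform coverage, counts each read as one contact in one bin, and uses ~3.1 billion bp as a round figure for the human genome; the function names are hypothetical. The last function checks the quoted "80% of bins with at least 1,000 contacts" criterion on real per-bin counts, which is the more honest test because coverage is never uniform.

```python
import math


def reads_needed(genome_bp: int, bin_bp: int, min_contacts: int = 1000) -> int:
    """Reads required so every bin gets `min_contacts`, assuming uniform coverage."""
    n_bins = math.ceil(genome_bp / bin_bp)
    return n_bins * min_contacts


def finest_bin_size(genome_bp: int, n_reads: int, min_contacts: int = 1000) -> int:
    """Smallest bin size (bp) supported by `n_reads` under the same assumption."""
    return math.ceil(genome_bp * min_contacts / n_reads)


def passes_criterion(contacts_per_bin, min_contacts: int = 1000,
                     min_fraction: float = 0.8) -> bool:
    """True if at least `min_fraction` of bins have >= `min_contacts` contacts."""
    n_ok = sum(1 for c in contacts_per_bin if c >= min_contacts)
    return n_ok / len(contacts_per_bin) >= min_fraction


HUMAN_GENOME_BP = 3_100_000_000  # assumed round figure

print(reads_needed(HUMAN_GENOME_BP, 10_000))          # 310000000
print(finest_bin_size(HUMAN_GENOME_BP, 300_000_000))  # 10334
print(passes_criterion([1500, 1200, 900, 2000, 1100]))  # True (4 of 5 bins pass)
```

So with X reads and a genome of length Y, the uniform-coverage estimate is a bin size of about Y × 1,000 / X; in practice one would bin the data at a few candidate sizes and run the empirical check on the observed per-bin counts.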