Question

D_JS Algorithm for Identifying Genomic Domains of Distinct Nucleotide Composition

11.2 years ago

Daniel Standage 4.1k

The D_JS algorithm was first proposed by Cohen et al.^[1] as a method to identify large regions of distinct nucleotide composition (isochores) within eukaryotic genomes. This algorithm performed well against alternative isochore prediction methods in subsequent comparisons^[2].

I'm implementing the D_JS algorithm from the description provided by Cohen et al.^[1] (second subsection under Methods).

In this procedure, the chromosomes are recursively segmented by maximizing the difference in GC content between adjacent subsequences. The process of segmentation is terminated when the difference in GC content between two neighboring segments is no longer statistically significant.

Briefly, a chromosome of length L, GC content F_GC, and AT content F_AT = 1 − F_GC is divided into two contiguous segments (i = 1, 2) of length l_i, GC content fⁱ_GC and AT content fⁱ_AT. These segments are chosen to maximize the Jensen-Shannon entropic divergence measure, D_JS, defined as the difference between the overall Shannon entropy H^tot and the sum of segment Shannon entropies Hⁱ:

D_JS = H^tot − Σ_{i=1,2} (l_i / L) Hⁱ

where Hⁱ = −fⁱ_GC log2 fⁱ_GC − fⁱ_AT log2 fⁱ_AT and H^tot = −F_GC log2 F_GC − F_AT log2 F_AT. The segmentation is then repeated recursively for each segment until a halting criterion D_JS ≥ D_C is met for all segments.
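To make the definition concrete, here is a minimal Python sketch of D_JS for a single candidate split point. It assumes each segment entropy is weighted by its relative length l_i / L (the usual Jensen-Shannon form) and that the sequence contains only A, C, G, and T. The function names are my own and are not taken from the paper.

```python
import math


def shannon_entropy(f_gc):
    """Binary Shannon entropy (in bits) for a GC fraction f_gc."""
    # By convention 0 * log2(0) = 0, so pure-GC or pure-AT sequences have H = 0.
    if f_gc <= 0.0 or f_gc >= 1.0:
        return 0.0
    f_at = 1.0 - f_gc
    return -f_gc * math.log2(f_gc) - f_at * math.log2(f_at)


def gc_fraction(seq):
    """Fraction of G/C bases in seq (assumes only A, C, G, T are present)."""
    return sum(base in "GC" for base in seq) / len(seq)


def d_js(seq, split):
    """D_JS for cutting seq into seq[:split] and seq[split:].

    Assumption: segment entropies are weighted by l_i / L.
    """
    total_len = len(seq)
    h_tot = shannon_entropy(gc_fraction(seq))
    weighted = sum(
        len(part) / total_len * shannon_entropy(gc_fraction(part))
        for part in (seq[:split], seq[split:])
    )
    return h_tot - weighted


print(d_js("GGGGAAAA", 4))  # 1.0: the two halves are perfectly distinct
print(d_js("GAGAGAGA", 4))  # 0.0: both halves match the overall composition
```

Maximizing D_JS over a sequence then amounts to evaluating `d_js` at every interior split point and keeping the largest value.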

I fear I might have misinterpreted the intent of this section in my initial implementation. On each recursive call to the segmentation procedure, I am recalculating H^tot for that particular segment, and then selecting two subsegments to maximize D_JS within the scope of that segment. When I first read "halting criterion D_JS ≥ D_C is met for all segments", this is what I envisioned. However, perhaps "for all segments" means that a single D_JS value is calculated across all segments, rather than a distinct D_JS value being calculated for each segment separately. The use of summation notation definitely supports this idea.
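For concreteness, the per-segment interpretation described above can be sketched roughly as follows. This is a simplified illustration and not my actual code. It assumes segment entropies are weighted by relative length, that a split is accepted only when the best D_JS for that segment is at least the threshold, and that the input contains only A, C, G, and T. The names `segment` and `d_c` are placeholders.

```python
import math
from itertools import accumulate


def entropy(gc, n):
    """Binary Shannon entropy (bits) for gc G/C bases out of n total."""
    if gc == 0 or gc == n:
        return 0.0
    f = gc / n
    return -f * math.log2(f) - (1 - f) * math.log2(1 - f)


def segment(seq, d_c, start=0, end=None, prefix=None):
    """Return sorted breakpoints from recursive, per-segment splitting.

    H^tot is recomputed for each segment [start, end), which is the
    interpretation in question; d_c is a hypothetical threshold value.
    """
    if prefix is None:
        # prefix[k] = number of G/C bases in seq[:k]
        prefix = [0] + list(accumulate(int(b in "GC") for b in seq))
    if end is None:
        end = len(seq)
    n = end - start
    if n < 2:
        return []

    total_gc = prefix[end] - prefix[start]
    h_tot = entropy(total_gc, n)  # local to this segment only

    best_d, best_cut = -1.0, None
    for cut in range(start + 1, end):
        l1, l2 = cut - start, end - cut
        gc1 = prefix[cut] - prefix[start]
        gc2 = total_gc - gc1
        d = h_tot - (l1 / n) * entropy(gc1, l1) - (l2 / n) * entropy(gc2, l2)
        if d > best_d:
            best_d, best_cut = d, cut

    # Assumed halting rule: stop splitting once the best D_JS falls below d_c.
    if best_d < d_c:
        return []
    left = segment(seq, d_c, start, best_cut, prefix)
    right = segment(seq, d_c, best_cut, end, prefix)
    return left + [best_cut] + right


print(segment("G" * 50 + "A" * 50, d_c=0.5))  # [50]
```

Under this scheme every segment gets its own H^tot and its own D_JS; nothing is ever summed across the whole chromosome after the first split. The scan over every cut is quadratic overall and the recursion is plain Python, so this sketch is only suitable for short test sequences.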

Should I be calculating H^tot once for the entire sequence and then calculating a single D_JS value after each round of segmentation?

algorithm dna • 2.3k views
