15 months ago
Seattle, WA USA
In the case of your two snippets,
NANs will be returned because there are no overlaps between elements in the reference and map sets.
More generally, however, the
--mean operator is simply applied on the score data in map elements, without consideration for the length of the overlapping elements.
Consider the following ad-hoc example
chr1 100 200
chr1 200 300
chr1 300 400
And the example
map.bed with two elements, one which is 15 bases long, and another 30 bases:
chr1 120 135 . 100
chr1 150 180 . 50
We can use
--echo-map-score to clearly show how the statistic is calculated:
$ bedmap --echo --mean --echo-map-score ref.bed map.bed
chr1 100 200|75.000000|100.000000;50.000000
chr1 200 300|NAN|
chr1 300 400|NAN|
There are two scores reported back over the reference element
chr1:100-200, which are
50, referring back to the elements in the map set. The average ("arithmetic mean") of those two scores is
75. The lengths of overlaps are not used to calculate the statistic.
The other two reference elements report a mean of
NAN, as there are no elements in
map.bed that overlap those reference elements. You can use
--skip-unmapped to limit the result to only include reporting statistics where there are overlaps:
$ bedmap --skip-unmapped --echo --mean ref.bed map.bed
chr1 100 200|75.000000
If you need to take overlap length into account, one approach is to use the
--wmean option to apply a weighted arithmetic mean ("weighted mean").
Weights are derived from the extent of overlap of a map element with the reference element, relative to the overlaps of other map elements.
Let's go back to the demo, adding the
$ bedmap --echo --mean --wmean --echo-map-score ref.bed map.bed
chr1 100 200|75.000000|66.666667|100.000000;50.000000
chr1 200 300|NAN|NAN|
chr1 300 400|NAN|NAN|
The first reference element returns a weighted mean of
66.667. The other two reference elements return unweighted and weighted arithmetic means of
NANs (as expected).
If we go back to the lengths of overlaps between map and reference elements, we have a 15-base overlap with a score of 100, and a 30-base overlap with a score of 50.
The 30-base overlap is twice as long as the 15-base overlap, so the 30-base score is given twice the "weight" of the 15-base score:
(100 + 2*50) / (1 + 2*(1)) = (100 + 50 + 50) / 3 = 66.667
--help for a brief description of this option, as well as other statistics (
The correct statistic to use will, of course, depend on your input sets, the signal you are measuring, and whether you need to account for the length of overlaps between reference and map sets.
Another option is to chop up the map and reference sets into bins of equal size, and to interpolate or "zero" signal in the map set, where there are gaps. Tools like
bedops --chop and
awk can help here. The goal of using bins of equal size, regardless of the original element lengths, would be to have the mean signal (or whatever statistic) over all reference elements be calculated from vectors of equal lengths.
These and other normalization approaches can be useful for making sure that a calculated statistic doesn't unnecessarily derive from bias to one or more subsets, but this depends what you're trying to measure, ultimately.