Question

BEDOPS bedmap tool usage

0

Entering edit mode

5.3 years ago

bioinfo456 ▴ 150

How is the mean of signal intensity being calculated in a context where the first snippet is reference bed file and the second is the map bed file?

First snippet

chr1    1067800 1069801
chr1    1129419 1131420
chr1    1129716 1131717
chr1    1131195 1133196
chr1    1131481 1133482
chr1    1132502 1134503
chr1    1132965 1134966
chr1    1133221 1135222
chr1    1136117 1138118
chr1    1136870 1138871

Second snippet

chr1    777667  778084  Rank_36817  5.15510
chr1    819849  820212  Rank_107033 3.86633
chr1    823366  823645  Rank_201032 2.61252
chr1    825522  826055  Rank_83146  4.18852
chr1    829517  829733  Rank_177209 3.22194
chr1    905386  905631  Rank_206299 2.75724
chr1    906392  906695  Rank_115312 3.50921
chr1    920392  921210  Rank_19919  4.81348
chr1    921271  921830  Rank_53292  3.67739
chr1    923107  923763  Rank_23351  5.48918

To be more specific, is it alright to use bedmap in the context where the ranges are not continuous? If so, how exactly is the mean being calculated?

Thanks.

bedops bedmap dna sequence • 3.2k views

ADD COMMENT • link updated 5.3 years ago by Alex Reynolds 35k • written 5.3 years ago by bioinfo456 ▴ 150

score 4 · Accepted Answer · 2018-12-22

In the case of your two snippets, NANs will be returned because there are no overlaps between elements in the reference and map sets.

More generally, however, the --mean operator is simply applied on the score data in map elements, without consideration for the length of the overlapping elements.

Consider the following ad-hoc example reference.bed:

chr1    100     200
chr1    200     300
chr1    300     400

And the example map.bed with two elements, one which is 15 bases long, and another 30 bases:

chr1    120 135 .   100
chr1    150 180 .   50

We can use --mean with --echo-map-score to clearly show how the statistic is calculated:

$ bedmap --echo --mean --echo-map-score ref.bed map.bed
chr1    100 200|75.000000|100.000000;50.000000
chr1    200 300|NAN|
chr1    300 400|NAN|

There are two scores reported back over the reference element chr1:100-200, which are 100 and 50, referring back to the elements in the map set. The average ("arithmetic mean") of those two scores is 75. The lengths of overlaps are not used to calculate the statistic.

The other two reference elements report a mean of NAN, as there are no elements in map.bed that overlap those reference elements. You can use --skip-unmapped to limit the result to only include reporting statistics where there are overlaps:

$ bedmap --skip-unmapped --echo --mean ref.bed map.bed
chr1    100 200|75.000000

If you need to take overlap length into account, one approach is to use the --wmean option to apply a weighted arithmetic mean ("weighted mean").

Weights are derived from the extent of overlap of a map element with the reference element, relative to the overlaps of other map elements.

Let's go back to the demo, adding the --wmean operator:

$ bedmap --echo --mean --wmean --echo-map-score ref.bed map.bed
chr1    100 200|75.000000|66.666667|100.000000;50.000000
chr1    200 300|NAN|NAN|
chr1    300 400|NAN|NAN|

The first reference element returns a weighted mean of 66.667. The other two reference elements return unweighted and weighted arithmetic means of NANs (as expected).

If we go back to the lengths of overlaps between map and reference elements, we have a 15-base overlap with a score of 100, and a 30-base overlap with a score of 50.

The 30-base overlap is twice as long as the 15-base overlap, so the 30-base score is given twice the "weight" of the 15-base score:

(100 + 2*50) / (1 + 2*(1)) = (100 + 50 + 50) / 3 = 66.667

See --help for a brief description of this option, as well as other statistics (--median, --stdev, --min, --max, etc.).

The correct statistic to use will, of course, depend on your input sets, the signal you are measuring, and whether you need to account for the length of overlaps between reference and map sets.

Another option is to chop up the map and reference sets into bins of equal size, and to interpolate or "zero" signal in the map set, where there are gaps. Tools like bedops --chop and awk can help here. The goal of using bins of equal size, regardless of the original element lengths, would be to have the mean signal (or whatever statistic) over all reference elements be calculated from vectors of equal lengths.

These and other normalization approaches can be useful for making sure that a calculated statistic doesn't unnecessarily derive from bias to one or more subsets, but this depends what you're trying to measure, ultimately.