15 months ago

Seattle, WA USA

In the case of your two snippets, `NAN`

s will be returned because there are no overlaps between elements in the reference and map sets.

More generally, however, the `--mean`

operator is simply applied on the score data in map elements, without consideration for the length of the overlapping elements.

Consider the following ad-hoc example `reference.bed`

:

```
chr1 100 200
chr1 200 300
chr1 300 400
```

And the example `map.bed`

with two elements, one which is 15 bases long, and another 30 bases:

```
chr1 120 135 . 100
chr1 150 180 . 50
```

We can use `--mean`

with `--echo-map-score`

to clearly show how the statistic is calculated:

```
$ bedmap --echo --mean --echo-map-score ref.bed map.bed
chr1 100 200|75.000000|100.000000;50.000000
chr1 200 300|NAN|
chr1 300 400|NAN|
```

There are two scores reported back over the reference element `chr1:100-200`

, which are `100`

and `50`

, referring back to the elements in the map set. The average ("arithmetic mean") of those two scores is `75`

. The lengths of overlaps are not used to calculate the statistic.

The other two reference elements report a mean of `NAN`

, as there are no elements in `map.bed`

that overlap those reference elements. You can use `--skip-unmapped`

to limit the result to only include reporting statistics where there are overlaps:

```
$ bedmap --skip-unmapped --echo --mean ref.bed map.bed
chr1 100 200|75.000000
```

If you need to take overlap length into account, one approach is to use the `--wmean`

option to apply a weighted arithmetic mean ("weighted mean").

Weights are derived from the extent of overlap of a map element with the reference element, relative to the overlaps of other map elements.

Let's go back to the demo, adding the `--wmean`

operator:

```
$ bedmap --echo --mean --wmean --echo-map-score ref.bed map.bed
chr1 100 200|75.000000|66.666667|100.000000;50.000000
chr1 200 300|NAN|NAN|
chr1 300 400|NAN|NAN|
```

The first reference element returns a weighted mean of `66.667`

. The other two reference elements return unweighted and weighted arithmetic means of `NAN`

s (as expected).

If we go back to the lengths of overlaps between map and reference elements, we have a 15-base overlap with a score of 100, and a 30-base overlap with a score of 50.

The 30-base overlap is twice as long as the 15-base overlap, so the 30-base score is given twice the "weight" of the 15-base score:

```
(100 + 2*50) / (1 + 2*(1)) = (100 + 50 + 50) / 3 = 66.667
```

See `--help`

for a brief description of this option, as well as other statistics (`--median`

, `--stdev`

, `--min`

, `--max`

, etc.).

The correct statistic to use will, of course, depend on your input sets, the signal you are measuring, and whether you need to account for the length of overlaps between reference and map sets.

Another option is to chop up the map and reference sets into bins of equal size, and to interpolate or "zero" signal in the map set, where there are gaps. Tools like `bedops --chop`

and `awk`

can help here. The goal of using bins of equal size, regardless of the original element lengths, would be to have the mean signal (or whatever statistic) over all reference elements be calculated from vectors of equal lengths.

These and other normalization approaches can be useful for making sure that a calculated statistic doesn't unnecessarily derive from bias to one or more subsets, but this depends what you're trying to measure, ultimately.

Thanks a lot for such detailed explanation :).