Question

Sampling protein coding genes of the same length distribution as another set of elements using R (GRanges)

8.4 years ago

Dimitris Polychronopoulos • 0

Hello,

I have a set of elements with the following distribution of lengths:

summary(width(positivelincrnas))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    470    4164    9872   18940   20790  152600

and another dataset with the following distribution:

summary(width(positivegeneshg19))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     20    5558   20460   59880   58360 4829000

I would like to sample elements from the second dataset (genes) so that their length distribution matches that of the first set of elements (lincRNAs). Both objects are GRanges objects.

Any suggestions?

Thanks a lot,
Dimitris

R GRanges • 1.9k views

updated 20 months ago by Ram • written 8.4 years ago by Dimitris Polychronopoulos

See also https://support.bioconductor.org/p/74583/

8.4 years ago by Julian Gehring

score 1 · Answer 1 · 2015-11-15

To match the length distributions, you can compute a density estimate from the first dataset and sample from the second dataset according to that density. Let's assume we have two GRanges objects: gr1 (positivelincrnas) and gr2 (positivegeneshg19). The trick is to use a weighted sampling scheme in which each gene's sampling probability is derived from the length distribution of the first dataset.

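bins = seq(0, max(width(gr1)) + 1000, by = 1000) ## must span all widths in gr1; choose according to your dataset
h = hist(width(gr1), bins, plot = FALSE)
idx = cut(width(gr2), bins, labels = FALSE) ## NA for widths outside the bins
w = h$density[idx]
w[is.na(w)] = 0 ## elements longer than anything in gr1 are never drawn
gr2matched = gr2[sample.int(length(gr2), final_size, prob = w)] ## set final_size; adjust the 'size' and 'replace' arguments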
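
One caveat with weighting by the gr1 density alone: length bins that contain many genes are still drawn more often, simply because there are more candidates in them. A possible refinement, sketched here as an assumption rather than as part of the original answer, is to weight each gene by the ratio of the two per-bin proportions. The example runs on simulated widths so that it is self-contained; w1 and w2 stand in for width(gr1) and width(gr2), and the log-normal parameters, the log-spaced bins and final_size are illustrative choices only.

```r
set.seed(1)
## Simulated widths standing in for width(gr1) and width(gr2) (assumption)
w1 <- round(rlnorm(2000, meanlog = 9, sdlog = 1))      # target set, e.g. lincRNAs
w2 <- round(rlnorm(20000, meanlog = 10, sdlog = 1.5))  # pool to sample from, e.g. genes

## Log-spaced bins spanning both sets, since widths are strongly right-skewed
rng  <- range(c(w1, w2))
bins <- 10^seq(log10(rng[1]) - 0.01, log10(rng[2]) + 0.01, length.out = 41)

## Bin membership and per-bin proportions in each set
b1 <- cut(w1, bins, labels = FALSE)
b2 <- cut(w2, bins, labels = FALSE)
p1 <- tabulate(b1, nbins = length(bins) - 1) / length(w1)
p2 <- tabulate(b2, nbins = length(bins) - 1) / length(w2)

## Weight each pool element by (target proportion) / (pool proportion) of its bin
wt <- p1[b2] / p2[b2]

## Sample without replacement; on real data the result would be gr2[keep]
final_size <- 1000
keep <- sample.int(length(w2), final_size, prob = wt)
matched <- w2[keep]

## Compare the target and the matched sample
summary(w1)
summary(matched)
```

Sampling without replacement can exhaust bins in which the pool has few elements, so it is worth comparing the two summaries (or histograms) afterwards and lowering final_size or coarsening the bins if they diverge.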