Question: Sampling protein coding genes of the same length distribution as another set of elements using R (GRanges)

Hello,

I have a set of elements with the following distribution of lengths:

summary(width(positivelincrnas))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    470    4164    9872   18940   20790  152600 

and another dataset with the following distribution:

summary(width(positivegeneshg19))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     20    5558   20460   59880   58360 4829000 

I would like to sample elements from the second dataset (genes) so that their lengths follow the same distribution as those of the first set (lincRNAs). Both are GRanges objects.

Any suggestions?

Thanks a lot,

Dimitris

Asked 4.3 years ago by Dimitris Polychronopoulos • updated 4.3 years ago by Julian Gehring

To match the length distributions, you can compute a density estimate (here, a histogram) from the first dataset and sample from the second dataset according to that density. Let's assume we have two GRanges objects: gr1 (positivelincrnas) and gr2 (positivegeneshg19). The trick is to use a weighted sampling scheme in which each element of gr2 is drawn with a probability derived from the length distribution of gr1.

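## a bin width of 1000 is a starting point; the breaks must span all widths in both sets
bins = seq(0, max(width(gr1), width(gr2)) + 1000, by = 1000)
h = hist(width(gr1), bins, plot = FALSE)    ## target length distribution (gr1)
idx = cut(width(gr2), bins, labels = FALSE) ## bin index of each element of gr2
## weight = target density / number of gr2 elements in the same bin, so crowded bins are not over-sampled
w = h$density[idx] / tabulate(idx, nbins = length(bins) - 1)[idx]
final_size = length(gr1) ## placeholder: adjust the size, and consider the 'replace' argument
gr2matched = sample(gr2, final_size, prob = w)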
Posted 4.3 years ago by Julian Gehring
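
To make the weighted-sampling idea concrete, here is a minimal sketch in base R on simulated data. Everything in it is an assumption for illustration: the log-normal widths stand in for width(gr1) and width(gr2), the log-spaced bins and the sample size of 1000 are arbitrary choices, and the names w1, w2, wts, sel and matched are made up. One detail worth noting is that each element's weight is the target bin count divided by the number of pool elements in the same bin; without that division, bins where the gene set is crowded would be drawn too often.

```r
set.seed(1)

## Simulated stand-ins for width(gr1) and width(gr2) (assumed, not real data)
w1 <- round(rlnorm(2000,  meanlog = 9.2, sdlog = 1.0)) + 1  # target widths
w2 <- round(rlnorm(20000, meanlog = 9.9, sdlog = 1.5)) + 1  # pool to sample from

## Log-spaced breaks that span every width in both sets
bins <- 10^seq(0, log10(max(w1, w2)) + 0.1, by = 0.1)

## Per-bin counts of the target set and bin index of each pool element
h   <- hist(w1, breaks = bins, plot = FALSE)
idx <- cut(w2, breaks = bins, labels = FALSE, include.lowest = TRUE)

## Weight = target count in the bin / number of pool elements in that bin
n2  <- tabulate(idx, nbins = length(bins) - 1)
wts <- h$counts[idx] / n2[idx]

## Weighted draw without replacement; sel holds indices into the pool
sel     <- sample(length(w2), size = 1000, prob = wts)
matched <- w2[sel]

## Compare quantiles of the target, the full pool and the matched sample
print(rbind(target  = quantile(w1),
            pool    = quantile(w2),
            matched = quantile(matched)))
```

With real data, the same indices would be applied to the GRanges object directly, i.e. w1 = width(gr1), w2 = width(gr2) and gr2[sel] as the matched set. The match can only be as good as the overlap allows: bins that contain lincRNAs but no genes cannot be filled, and sampling without replacement drifts away from the target once a bin runs out of candidates.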
