Question

bedtools shuffle vs. random

7.1 years ago

blur ▴ 280

Hi! I want to create a random set of BED intervals and use it to test whether the intersection between my dataset and a dataset from a paper I read is significant. I am not sure which tool makes more sense for this: shuffle or random? I want to use the ends of genes. Is there a big difference between building a custom genome file for random and passing an -incl file to shuffle?

Thank you for your help!

bedtools • 5.9k views

updated 7.1 years ago by bernatgel ★ 3.4k • written 7.1 years ago by blur ▴ 280

score 3 · Answer 1 · 2017-03-09

random creates random locations of a single, fixed length, whereas shuffle relocates the intervals of an input BED file, so the shuffled locations are length-matched to that file. One downside of random is that strands are also assigned at random, which means that if your experimentally derived data has a strand bias, you might get significant differences that are not real.
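To make the distinction concrete, here is a toy Python sketch of the two ideas. It is not how bedtools is implemented: the genome sizes are made up, the function names are hypothetical, and chromosomes are picked uniformly just to keep the example short.

```python
import random

# Toy genome: chromosome name -> length (made-up values, for illustration only).
GENOME = {"chr1": 10_000, "chr2": 5_000}

def random_regions(n, length, genome, rng):
    """Idea behind `random`: n regions of one fixed length, strand drawn at random."""
    chroms = [c for c in genome if genome[c] >= length]
    out = []
    for _ in range(n):
        chrom = rng.choice(chroms)
        start = rng.randint(0, genome[chrom] - length)
        out.append((chrom, start, start + length, rng.choice("+-")))
    return out

def shuffle_regions(regions, genome, rng):
    """Idea behind `shuffle`: relocate each input region, keeping its length and strand."""
    out = []
    for _chrom, start, end, strand in regions:
        length = end - start
        chroms = [c for c in genome if genome[c] >= length]
        new_chrom = rng.choice(chroms)
        new_start = rng.randint(0, genome[new_chrom] - length)
        out.append((new_chrom, new_start, new_start + length, strand))
    return out

rng = random.Random(0)  # fixed seed so the run is reproducible
observed = [("chr1", 100, 350, "+"), ("chr2", 10, 60, "+")]
print(shuffle_regions(observed, GENOME, rng))    # lengths 250 and 50, strands stay "+"
print(random_regions(2, 100, GENOME, rng))       # both length 100, strands are random
```

With a strand-biased input like the one above (all "+"), only the shuffle-style control keeps that bias.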

"I want to use the end of genes"

Will these be defined as regions X bp from the transcription termination site, so that they all have the same size? If not, and in view of the strand issue, my advice would be to use shuffle. It will alleviate strand bias and give you more control over what your control regions look like, that is, they will be better matched to your locations.
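If the gene ends are defined as a fixed window around the termination site, the strand decides which coordinate counts as the "end". A minimal sketch, assuming BED-style 0-based half-open coordinates; the function name and the `flank` parameter are hypothetical:

```python
def gene_end_window(chrom, start, end, strand, flank, chrom_len):
    """Return a window of +/- `flank` bp around the 3' end of a gene.

    Coordinates are assumed to be BED-style (0-based, half-open).
    """
    # On the + strand the gene ends at `end`; on the - strand it ends at `start`.
    tts = end if strand == "+" else start
    # Clip the window so it never runs off either end of the chromosome.
    win_start = max(0, tts - flank)
    win_end = min(chrom_len, tts + flank)
    return (chrom, win_start, win_end, strand)

print(gene_end_window("chr1", 1000, 5000, "+", 200, 10_000))
print(gene_end_window("chr1", 1000, 5000, "-", 200, 10_000))
```

Windows built this way all have the same size unless they are clipped at a chromosome boundary.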

Also, generate the control set multiple times (say 1000) and repeat the comparison for each one. You can then calculate the mean and standard deviation over those 1000 permutations. This will show whether the effect you see (or do not see) is stable.
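Once you have one overlap count per control set, the summary is a few lines of Python. This is only a sketch: the function name is hypothetical, and the z-score and the one-sided empirical p-value with a +1 correction are common conventions that I am adding here, not something computed by bedtools itself.

```python
import statistics

def summarize_permutations(observed, permuted):
    """Summarize overlap counts from permuted control sets against the observed count."""
    mean = statistics.mean(permuted)
    sd = statistics.stdev(permuted)  # sample standard deviation, needs >= 2 values
    # z-score: how many standard deviations the observed count lies from the mean.
    z = (observed - mean) / sd if sd > 0 else float("nan")
    # One-sided empirical p-value: fraction of permutations at least as extreme,
    # with +1 in numerator and denominator so p is never exactly 0 (assumed convention).
    n_extreme = sum(1 for x in permuted if x >= observed)
    p = (n_extreme + 1) / (len(permuted) + 1)
    return {"mean": mean, "sd": sd, "z": z, "p": p}

# Made-up counts for illustration; in practice `permuted` would hold ~1000 values.
print(summarize_permutations(20, [10, 12, 14, 12, 12]))
```

Note that with N permutations the smallest p-value this can report is 1/(N+1), which is one reason to run on the order of 1000.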

score 0 · Answer 2 · 2017-03-09

7.1 years ago

bernatgel ★ 3.4k

If you can use R, you could use regioneR to test whether there is a significant overlap between your dataset and the one from the publication. It will perform the whole process explained by @fridaymeetssunday: repeating the randomization a number of times (1000), computing the overlaps, the mean and the standard deviation, and finally reporting a p-value and a z-score (and a plot if you need it).

The package has different options and parameters and you should select the randomization strategy according to your needs (if working with genes, probably resampling instead of randomizing completely, or restricting the randomization space with a stringent mask). In the package vignette you can find more information and examples.

NOTE: right now regioneR's randomization is not strand specific, so you should take this into account if you need strand specific random regions.

I was not aware of this package. From a (very) brief read, it looks very useful. Thanks.

Edit: the creation of random regions appears to be strand-agnostic. Is this correct?

7.1 years ago by A. Domingues ★ 2.7k

Yes, I forgot to add this to my answer. Strand specific randomization is in the pipeline but not ready yet.

It is possible to do it in a strand-specific way right now by defining a custom randomization function that internally randomizes each strand separately. If you think strand-specific randomization would be an important feature for you, please contact me and we'll try to speed it up.
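The pattern is simply "split by strand, randomize each group, merge". Here it is as a language-neutral sketch in Python rather than R; it does not use the regioneR API, and `relocate` is a hypothetical stand-in for whatever randomizer you would plug in.

```python
import random

def randomize_by_strand(regions, randomize, rng):
    """Split regions by strand, randomize each group separately, then merge.

    Regions are (chrom, start, end, strand) tuples. Regions whose strand is
    neither "+" nor "-" are ignored in this sketch.
    """
    out = []
    for strand in ("+", "-"):
        group = [r for r in regions if r[3] == strand]
        out.extend(randomize(group, strand, rng))
    return out

def relocate(group, strand, rng, chrom_len=10_000):
    """Hypothetical randomizer: move each region to a random start on its own
    chromosome (assumed length `chrom_len`), keeping its length and strand."""
    result = []
    for chrom, start, end, _ in group:
        length = end - start
        new_start = rng.randint(0, chrom_len - length)
        result.append((chrom, new_start, new_start + length, strand))
    return result

rng = random.Random(1)  # fixed seed for reproducibility
regions = [("chr1", 0, 100, "+"), ("chr1", 500, 800, "-"), ("chr1", 50, 90, "+")]
print(randomize_by_strand(regions, relocate, rng))
```

Because the randomizer receives the strand, it could also restrict each group to a strand-specific set of allowed intervals.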

7.1 years ago by bernatgel ★ 3.4k

Sorry for the delay in answering. I do not need strand-specific randomization for any particular purpose at the moment; I just thought it was an important feature missing from the package. For instance, the OP's data is strand biased, and in my experience this is not uncommon.

7.1 years ago by A. Domingues ★ 2.7k