Question

Probability of overlap between sequences?

9.3 years ago

Rich ▴ 40

Hi,

I'm sorry if this question is too naive, but maybe you can advise me on the right way to solve a problem.

I have a 50 kb region on chromosome 6 (length 171,115,067 bp). It turns out that this region completely overlaps one specific site (5 kb). However, there are about 250 such sites across the whole of chr 6.

How can I calculate the chance that such an overlap between my 50 kb region and one of these 5 kb sites happens just randomly? Is there a simple formula? Or do I need to randomly generate 10 or 100 thousand 50 kb regions from chr 6 and then count how many of them overlap the 5 kb sites?

And if the latter is the right way, is there any tool that could generate such regions from a given human chromosome?

Thank you!

sequencing probability • 3.0k views

updated 2.1 years ago by Ram 44k • written 9.3 years ago by Rich ▴ 40

Random sampling would be a good way to generate your "background" or expected rate of overlap. Be careful how you define your sample space: you may want to exclude certain areas (unmappable regions, say) from sampling.

updated 2.1 years ago by Ram 44k • written 9.3 years ago by Alex Reynolds 35k

Thank you, Alex! Do you know of any tool (maybe in BioPerl or similar) that could generate such regions from real human chromosomes?

updated 2.1 years ago by Ram 44k • written 9.3 years ago by Rich ▴ 40

9.3 years ago

dariober 14k

Hi, have a look at these Q&As:

And this paper: "The dilemma of choosing the ideal permutation strategy while estimating statistical significance of genome-wide enrichment".

updated 2.1 years ago by Ram 44k • written 9.3 years ago by dariober 14k

Thank you for the helpful links!

updated 2.1 years ago by Ram 44k • written 9.3 years ago by Rich ▴ 40

Ram · Accepted Answer · 2015-02-12

To get bounds for, say, hg19, you could do something like:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/chromInfo.txt.gz | gunzip -c - | awk '{print $1"\t0\t"$2}' - | sort-bed - > hg19.bounds.bed

You could use BEDOPS to subtract out unmapped regions:

$ wget -qO- https://gist.githubusercontent.com/alexpreynolds/6126976/raw/4bfdd68c0cc45c4c1b0643ea16c8630364d2f316/hg19.gaps.bed > hg19.gaps.bed
$ bedops --difference hg19.bounds.bed hg19.gaps.bed > hg19.mappable.bed

A sampler that goes through hg19.mappable.bed could be written in a scripting language of your choice. It should not be too hard to sample starting points uniformly (or from some other distribution) within the mappable genomic spaces and add 50 kb to each start to get a random interval, taking care that the interval does not run past the end of its mappable region. You could use read.table() in R, for instance, to bring in the mappable spaces and do the sampling there. There are lots of options, too many to enumerate here.
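As one possible version of such a sampler, here is a minimal Python sketch. The function names (read_bed, sample_intervals, write_bed) are made up for illustration, and it assumes uniform sampling over every start position where a full-length interval still fits inside a single mappable region; regions shorter than the interval are skipped.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def read_bed(path):
    """Read (chrom, start, end) triples from the first three BED columns."""
    regions = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 3 and not line.startswith(("#", "track", "browser")):
                regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


def sample_intervals(regions, length, n, rng=random):
    """Draw n intervals of the given length, uniform over all valid starts."""
    # A region of size s offers s - length + 1 valid starts (half-open BED coordinates).
    usable = [(c, s, e) for c, s, e in regions if e - s >= length]
    if not usable:
        raise ValueError("no region is long enough for the requested interval")
    cumulative = list(accumulate(e - s - length + 1 for _, s, e in usable))
    intervals = []
    for _ in range(n):
        pick = rng.randrange(cumulative[-1])   # index into all valid starts
        idx = bisect_right(cumulative, pick)   # region that owns this start
        chrom, start, _ = usable[idx]
        offset = pick - (cumulative[idx - 1] if idx else 0)
        intervals.append((chrom, start + offset, start + offset + length))
    return intervals


def write_bed(intervals, path):
    """Write intervals sorted by chromosome name, then start, then end."""
    with open(path, "w") as handle:
        for chrom, start, end in sorted(intervals):
            handle.write(f"{chrom}\t{start}\t{end}\n")
```

For example, write_bed(sample_intervals(read_bed("hg19.mappable.bed"), 50_000, 100_000), "random_intervals.bed") would produce 100K random 50 kb intervals. To restrict the background to chr 6, filter the regions to those whose chromosome is "chr6" before sampling. Piping the output through sort-bed is the safest way to be sure the ordering is what BEDOPS expects.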

Your script determines how many random intervals you sample (10K, 100K, whatever). Spit all these random intervals out to a sorted BED file, and then do something like bedops --element-of 1 random_intervals.bed 5kb_sites.bed | wc -l to get the count of random intervals that overlap a site.
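If you would rather keep the whole loop in one script, the same count can be approximated in memory. The sketch below is not BEDOPS itself: build_index and count_overlapping are hypothetical helpers, and they assume half-open BED coordinates and the same "at least one base of overlap" criterion as --element-of 1.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import accumulate


def build_index(sites):
    """Per chromosome: sorted site starts plus the running maximum of site ends."""
    by_chrom = defaultdict(list)
    for chrom, start, end in sites:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        starts = [s for s, _ in spans]
        max_ends = list(accumulate((e for _, e in spans), max))
        index[chrom] = (starts, max_ends)
    return index


def count_overlapping(intervals, index):
    """Count intervals sharing at least one base with any indexed site."""
    count = 0
    for chrom, start, end in intervals:
        if chrom not in index:
            continue
        starts, max_ends = index[chrom]
        i = bisect_left(starts, end)  # number of sites that start before the interval ends
        # An overlap exists if any of those sites ends after the interval starts.
        if i and max_ends[i - 1] > start:
            count += 1
    return count
```

Dividing count_overlapping(intervals, index) by len(intervals) gives the overlap rate for one batch of random intervals. If you need complete containment of the 5 kb site rather than any overlap, the test inside the loop would have to change.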

Divide the overlap count by the number of random intervals you sampled to get an expected rate of overlap. Repeat this to build a population of expected or background rates to compare against your observed rate. You might generate a z-score, for example, to indicate how near or far away your observed rate is from what is expected.
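The z-score step could look something like the sketch below. The function name overlap_z_score is hypothetical, and it assumes you already have a list of background rates, one per repeated sampling run, and that the sample standard deviation is an acceptable measure of spread.

```python
from statistics import mean, stdev


def overlap_z_score(observed_rate, background_rates):
    """Standardize the observed rate against the background distribution."""
    mu = mean(background_rates)
    sigma = stdev(background_rates)  # sample standard deviation, needs >= 2 values
    if sigma == 0:
        raise ValueError("background rates have zero variance; z-score is undefined")
    return (observed_rate - mu) / sigma
```

A large positive value suggests the observed overlap is higher than the random background would predict. How to interpret it still depends on how well the background rates follow a roughly normal distribution.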