Filter for regions containing large stretches of missing data
2
0
Entering edit mode
6.9 years ago
mmmmcandrew ▴ 190

Hey all-

I have bigwig files containing scores for each base pair of the genome where I have sequencing coverage. I have pretty good coverage, but I have recently made some heatmaps aligning this data around TSS (+/- 3kb) and set the missing data color to yellow. For the vast majority of these 6 kb regions, I have very little or no yellow (missing) data. However, there are a few cases where I have large stretches of missing data within these 6 kb regions.

Is there a quick and easy way to apply a threshold for the tolerance of missing data? For instance, I would like to say "If there is a 500 bp stretch of missing data, remove this region from consideration." Any ideas?

Thanks!

deeptools computematrix plotheatmap missing data • 1.7k views
ADD COMMENT
2
Entering edit mode
6.9 years ago

We don't have a method to do that built into deepTools. One would need to program another script to filter this, which wouldn't be trivial without a for loop.

ADD COMMENT
2
Entering edit mode
6.9 years ago

Merge the per-base signal (scores):

$ bedops --merge per-base-signal.bed > merged-pbs.bed

Invert the merged regions to get the "mirror" image. This "mirror" is the set of gaps in between merged stretches of signal.

Assuming hg38, say:

$ fetchChromSizes hg38 | awk '{ print $1"\t0\t"$2; }' > hg38.bed
$ bedops --difference hg38.bed merged-pbs.bed > gaps-pbs.bed

This file gaps-pbs.bed contains the gaps between merged stretches of signal, as measured over the bounds of reference genome hg38.

Now map your TSS windows to those gaps with bedmap, using the --echo-map-size option to print out all the sizes of mapped elements (gaps). Pass this result to awk to test if any of the gap sizes are larger than 500 bases.

If any such gap qualifies, just skip the current line and go to the next line. If no such gap is found, then the TSS window is "okay" and gets written to standard output:

$ bedmap --echo --echo-map-size --fraction-map 1 tss-windows.bed gaps-pbs.bed | awk -F'|' '{ flag=0; n=split($2, a, ";"); for(i=1; i<=n; i++) { if (a[i] >= 500) { flag=1; break; } } if (flag==0) { print $1; } }' > filtered-tss-windows.bed

Using --fraction-map 1 ensures that any overlapping gap will be entirely contained within the TSS window. Gaps which fall off the edge of the window are not considered. This setting can be adjusted to be more lenient, if desired.

The file filtered-tss-windows.bed should now contain all TSS windows that do not entirely contain a gap greater than or equal to 500 bases in length.

ADD COMMENT
0
Entering edit mode

The input file is a gzipped matrix with a json header, I don't think bedops will work for that.

ADD REPLY
0
Entering edit mode

I'm just explaining one way to do things, if your inputs can be set up correctly. You may need to convert files to sorted BED with bigWigToWig, wig2bed, jq, sort-bed etc. to do set operations.

ADD REPLY
0
Entering edit mode

Thanks very much guys!

ADD REPLY

Login before adding your answer.

Traffic: 2254 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6