Output bed file region sizes as histogram
3
0
Entering edit mode
7.0 years ago
mmmmcandrew ▴ 190

Hey all-

I have a bed file containing millions of regions that range in size from 1 bp - ~3300 bp, but I would like to know the frequency distribution of these sizes. Note that the sizes themselves are not included in the bed file, it simply contains the coordinates.

Is there a tool that could take this bed file as the input and output a histogram (preferably an image file, though I'm not terribly picky) of size frequencies? I don't care whether binning is performed or not.

Thanks!

bed bedtools bedops histogram • 3.8k views
ADD COMMENT
2
Entering edit mode
7.0 years ago

using gnuplot

$ awk '{printf("%d\n",100*int((int($3)-int($2))/100.0));}' in.bed | \
   sort -n | uniq -c  |\
   awk 'BEGIN {printf("set terminal png; set output \"out.png\"; set xtics rotate; set yrange [0:]; set style data histogram; plot \"-\" using 1:xtic(2) title \"\"\n");} {print}' |\
   gnuplot
ADD COMMENT
0
Entering edit mode

Thanks Pierre! This works very nicely, but I think I am going to have to bin things a bit differently actually. At the moment, the data basically looks like a very single large bar over 0, with a very small bar at 100 and no visible bars from 200 - 3300. I've been trying to get a hold of some of the gnuplot documentation, but if you have a quick fix so that I could make my bin size smaller, that would be great.

ADD REPLY
0
Entering edit mode

like a very single large bar over 0,

in the awk , change '100' to a lower binning value... e.g:

10*int((int($3)-int($2))/10.0)
ADD REPLY
0
Entering edit mode

Ah, excellent, I'm not very familiar with awk so I wasn't sure where the binning value was. Thanks a lot!

ADD REPLY
0
Entering edit mode

Is there any way to change the scale of the x-axis? Say I want to focus in on sizes ranging from 1-300 bp or so. Again, apologies, I'm just unfamiliar with awk.

ADD REPLY
2
Entering edit mode
7.0 years ago
Ryan Dale 5.0k

If you have enough RAM and are OK with Python:

import pybedtools
from matplotlib import pyplot as plt
sizes = [len(i) for i in pybedtools.BedTool('my.bed')]
plt.hist(sizes)
plt.savefig('hist.png')
plt.show()
ADD COMMENT
0
Entering edit mode

At the "plt.hist(sizes)" step, this gives me the following error:

Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 2954, in hist ax = gca() File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 936, in gca return gcf().gca(*kwargs) File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 586, in gcf return figure() File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 535, in figure *kwargs) File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/backends/backend_tkagg.py", line 84, in new_figure_manager return new_figure_manager_given_figure(num, figure) File "/mnt/home/mcandr21/anaconda2/lib/python2.7/site-packages/matplotlib/backends/backend_tkagg.py", line 92, in new_figure_manager_given_figure window = Tk.Tk() File "/mnt/home/mcandr21/anaconda2/lib/python2.7/lib-tk/Tkinter.py", line 1820, in __init__ self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use) _tkinter.TclError: no display name and no $DISPLAY environment variable

I'm not great with python. Any ideas?

ADD REPLY
1
Entering edit mode

Sounds like you're running on a headless server with no X11 windows. In that case add these two lines to the very beginning:

import matplotilb
matplotlib.use('agg')

see http://stackoverflow.com/questions/2801882/generating-a-png-with-matplotlib-when-display-is-undefined for more details and https://matplotlib.org/faq/howto_faq.html#matplotlib-in-a-web-application-server for background

ADD REPLY
0
Entering edit mode

Thanks! This works great now.

ADD REPLY
2
Entering edit mode
7.0 years ago

From start to finish, this approach renders a plot for each chromosome, across all chromosomes in your regions of interest. Presumably it could be more interesting to know how overlaps are distributed within a chromosome.

As an example, we'll use reference genome hg38, along with the Kent tools kit, the BEDOPS toolkit, and R with ggplot2:

$ fetchChromSizes hg38 | awk '{print $1"\t0\t"$2;}' | sort-bed - > hg38.bed
$ bedmap --echo --echo-overlap-size --skip-unmapped hg38.bed myRegions.bed | cut -f1,4 | awk '{ n=split($2, a, ";"); for(i=1; i <= n; i++){ print $1"\t"a[i]; } }' > overlapSizes.txt
$ plotHistogram.Rscript --inFn=overlapSizes.txt --outFn=overlaps.pdf

The script plotHistogram.Rscript could use ggplot2 to render a column of per-chromosome histograms, and might look like this:

#!/usr/bin/env Rscript
library("optparse")
option_list = list(
  make_option(c("-i", "--inFn"), type="character", default=NULL, help="input file name", metavar="character"),
  make_option(c("-o", "--outFn"), type="character", default=NULL, help="output file name", metavar="character")
)
opt_parser = OptionParser(option_list=option_list)
opt = parse_args(opt_parser)
d <- read.table(opt$inFn, header=F, col.names=c("chr", "overlap"), sep="\t")
d$chr <- as.factor(d$chr)
library(ggplot2)
p <- ggplot(data = df, aes(x=overlap))
p <- p + geom_histogram(aes(fill=chr))
p <- p + scale_fill_brewer(palette="Set3")
p <- p + facet_wrap( ~ chr, ncol=1)
pdf(opt$outFn, width=5, height=30)
print(p)
dev.off()
ADD COMMENT

Login before adding your answer.

Traffic: 2483 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6