Converting a BED file to heatmap data
9.5 years ago
arnstrm ★ 1.8k

Hi,

I was wondering if there are any utilities that can convert a BED file to heatmap data, which I could then use for plotting.

For example, if I have a bed file:

10      3843    4059
10      6964    7383
10      17670   17855
10      21897   22762
10      71880   72558
10      73454   74055
10      82310   82654
10      82796   85155
10      85306   86100
10      160810  161076

I want to convert it to the number of bases covered per fixed-size interval, like:

10    1       500000  10.8223
10    500001  1000000 22.0333
10    1000001 1500000 44.4082
10    1500001 2000000 49.7916
10    2000001 2500000 53.1821
10    2500001 3000000 39.134
10    3000001 3500000 33.7936
10    3500001 4000000 12.4694
10    4000001 4500000 57.7986
10    4500001 5000000 66.1606

where the 4th field is the percentage of bases covered per 500,000 bp window (or another window size).

Thanks for any suggestions!

bed heatmap genome-coverage

Have a look at bedtools.

I actually looked at it, but it bundles several tools. Can you give me an overview/idea of how to use bedtools for this? Thanks!

How do you want to handle the case where an input range straddles the boundary between two adjacent 500 kb intervals?

Thanks for the reply! I simply want to count the total number of bases that fall within each 500,000 bp interval and calculate the percentage. Any bases extending beyond that 500,000 bp interval should be added to the subsequent interval (for example, an interval spanning 499,900-500,100 would contribute 100 bases to the first 500 kb window and 100 bases to the next one).

9.5 years ago

Make 500 kb "windows" tiling the genome (shown here as an example; it looks like you already have your windows file ready):

bedtools makewindows -w 500000 -g genomes/human.hg19.genome > hg19.500kb.bed

Create a random BED file of 1,000,000 intervals, each 100 bp in size (again, just an example; you would use your own BED file here):

bedtools random -n 1000000 -l 100 -g genomes/human.hg19.genome > random.bed

Calculate the coverage of the random BED records on the 500 kb windows. The first three columns are the window itself, the fourth is the count of BED records that overlapped the window, the fifth is the total number of base pairs in the window with coverage, the sixth is the size of the window, and the seventh is what you are after: the fraction of bases in the window with coverage.

bedtools coverage -a random.bed -b hg19.500kb.bed \
  | sort -k1,1 -k2,2n \
  | awk '$1=="chr10"' \
  | head

chr10    0    500000    162    15947    500000    0.0318940
chr10    500000    1000000    143    14136    500000    0.0282720
chr10    1000000    1500000    147    14487    500000    0.0289740
chr10    1500000    2000000    157    15473    500000    0.0309460
chr10    2000000    2500000    164    16252    500000    0.0325040
chr10    2500000    3000000    160    15501    500000    0.0310020
chr10    3000000    3500000    164    16322    500000    0.0326440
chr10    3500000    4000000    165    16178    500000    0.0323560
chr10    4000000    4500000    150    14931    500000    0.0298620
chr10    4500000    5000000    169    16542    500000    0.0330840
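
Note that the -a/-b convention for bedtools coverage changed in later bedtools releases (2.24.0 and onwards, if I remember right): newer versions report coverage for the intervals in -a, so with a recent bedtools you would swap the two files to keep the windows in the output:

bedtools coverage -a hg19.500kb.bed -b random.bed \
  | sort -k1,1 -k2,2n \
  | awk '$1=="chr10"' \
  | head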

This is even better! Thanks a lot!! Love the BEDTools! Thanks.

9.5 years ago

BEDOPS would work well for this, along with standard UNIX utilities.

Let's start with an extents file that defines the bounds of our chromosomes, say for hg19:
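
As a rough sketch, such a file could be put together from UCSC chromosome sizes (this assumes the UCSC fetchChromSizes utility is installed, and the hg19.chrom.sizes intermediate name is just for illustration):

$ fetchChromSizes hg19 > hg19.chrom.sizes                 # two columns: chrom, length
$ awk -v OFS='\t' '{ print $1, 0, $2 }' hg19.chrom.sizes \
    | sort -k1,1 -k2,2n \
    > hg19.extents.bed                                    # three columns: chrom, 0, length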

We'll grab chromosome 10 and reformat the chromosome name to match your chromosome naming scheme:

$ grep -w chr10 hg19.extents.bed | awk '{print "10\t"$2"\t"$3}' > chr10.extents.bed

Next, we'll make a BED file of windows or bins from this chromosome, every 500 kb:

$ awk ' \
    BEGIN \
    { \
        binSize = 500000; \
        binIdx = 0; \
    } \
    { \
        chr = $1; \
        start = $2; \
        stop = $3; \
        for (binStart = start; binStart <= (stop - binSize); binStart += binSize) { \
            print chr"\t"binStart"\t"(binStart + binSize)"\tbin-"binIdx; \
            binIdx++; \
        } \
    }' chr10.extents.bed \
    > chr10.bins.bed

Next, we use the BEDOPS bedops --partition operation to split up any of your input regions (regions.bed) that span across bins. We can then chain together a series of BEDOPS operations to calculate the number of overlapping bases across the now-correctly partitioned regions:

$ bedops --partition chr10.bins.bed regions.bed \
    | bedmap --echo --indicator --delim '\t' - regions.bed \
    | grep "1$" - \
    | cut -f1-3 - \
    | bedmap --echo --bases --delim '\t' chr10.bins.bed - \
    | awk -v binSize=500000 '{print $0"\t"($5/binSize)}' - \
    > final_answer.bed

The file final_answer.bed contains all the bins in BED format, with six columns: bin chromosome, bin start, bin stop, bin index, the number of bases from regions.bed that overlap the bin, and the fraction of the bin that is covered (multiply by 100 if you want a percentage).

The chain of commands shown above might look a little overwhelming, but you can use UNIX tee in between steps so you can examine the output being passed to the next command. This is a great way to learn the command line and particularly standard input and output management, which I think you'll find is a critical part of doing bioinformatics.
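
For example, a quick sketch of peeking at the partition step might look like this (partitioned.bed is just a scratch file name for this illustration): tee writes a copy of the partitioned intervals to partitioned.bed while the rest of the pipeline keeps running, and head shows the first few lines handed to the next step.

$ bedops --partition chr10.bins.bed regions.bed \
    | tee partitioned.bed \
    | bedmap --echo --indicator --delim '\t' - regions.bed \
    | head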

BEDOPS is really useful for this task - you can pass in standard input and route standard output between BEDOPS tools like bedops and bedmap, and UNIX utilities like grep, cut and awk. After you use BEDOPS for a while, it almost becomes an expressive language for dealing with set operations of this kind.

Finally, the --partition option in bedops is a powerful piece of kit, which lets you quickly cut input regions that span bins, among other uses.
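
For instance, one quick way to see what --partition does is to feed it a couple of bins and a region that straddles their boundary (the coordinates and the bins.tmp.bed/region.tmp.bed file names here are made up purely for illustration):

$ printf "10\t0\t500000\n10\t500000\t1000000\n" > bins.tmp.bed
$ printf "10\t499900\t500100\n" > region.tmp.bed
$ bedops --partition bins.tmp.bed region.tmp.bed

This should split everything at the region's edges and the bin boundary, yielding four disjoint intervals with cut points at 499,900, 500,000, and 500,100.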

Just this morning I was thinking about how little I use awk, and you've just shown me the next step to take, as well as the next tool to get conversant with (bedops). Thank you!

Wow! Thanks a lot!! I think you just saved a week of my time (I'm not a good programmer!). I greatly appreciate your detailed answer. Thanks!

Hi Alex,

I noticed this approach and decided that the way the bins are split up here is more appropriate for the distribution heat maps I want to plot. I ran it on the whole mm10 genome and, after running the bedops --partition pipeline, weirdly got: awk: (FILENAME=- FNR=1) fatal: division by zero attempted

One other thing I wanted to mention is that the hg19.extents.bed file shown above is actually in the format of the hg19 file pulled with fetchChromSizes.

9.5 years ago
Ram 43k

This level of customization in the requirements is best addressed by a custom Python script. The logic to build such a script:

  1. Set start = 1
  2. Set end = block_size
  3. Set sum = fringe (if fringe exists and is > 0)
  4. Set fringe = 0
  5. For each BED line
    • read the BED line into array
    • if array[1] > end, set perc[end] = sum/block_size, then start = end+1 and end = end+block_size, re-apply steps 3-4 for the new block, and re-check this line against the new block
    • else if array[2] > end, set sum += end - (array[1] - 1) and fringe = array[2] - (end - 1), then start = end+1 and end = end+block_size, re-applying steps 3-4
    • else (array[2] <= end), set sum += array[2] - (array[1] - 1)

In the end, the perc dictionary will have the percentages you need.
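
For what it's worth, here is a rough awk sketch of the same idea, in the spirit of the shell-based answers above rather than a Python script. It assumes a sorted, non-overlapping BED file for a single chromosome (called regions.bed here only for illustration) and is meant to show the logic, not to be a polished program:

$ awk -v block=500000 '
# sketch: percent of bases covered per fixed-size block,
# carrying any overhang ("fringe") into the next block
BEGIN { OFS = "\t"; binStart = 0; covered = 0 }
{
    start = $2; stop = $3
    # print any (possibly empty) blocks that end before this interval starts
    while (start >= binStart + block) {
        print $1, binStart + 1, binStart + block, 100 * covered / block
        binStart += block
        covered = 0
    }
    # clip the interval to the current block; carry the remainder forward
    while (stop > binStart + block) {
        covered += (binStart + block) - start
        print $1, binStart + 1, binStart + block, 100 * covered / block
        binStart += block
        covered = 0
        start = binStart
    }
    covered += stop - start
}
# flush the block the last interval ended in
END { print $1, binStart + 1, binStart + block, 100 * covered / block }
' regions.bed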

This of course assumes that each block has some data in the BED file. You might want to address scenarios where an entire block is skipped; you can do that by adding a condition that checks whether array[1] > end && array[1] < end + block_size.

I'd do it but I don't wanna end up writing the program for you :-)
