Question

bedtools coverage bed

6.3 years ago

bioguy24 ▴ 230

The bedtools 2.25 command I use for coverageBed is:

coverageBed -d -g genomefile.txt -a target.bed -b input.bam > output.txt

That works for the vast majority of BAM files; however, when a BAM file exceeds 60 million mapped reads (~60 GB), the run fails with an error. I have been reading about recent memory leak fixes and plan to upgrade to bedtools 2.27. Is there a better command to use, or is upgrading and re-running enough? Thank you :).

NGS bedtools • 5.4k views

updated 6.3 years ago by Alex Reynolds 35k • written 6.3 years ago by bioguy24 ▴ 230

My approach in this case, assuming I didn't have better hardware available, would be to extract the target regions from the BAM using the BED file and samtools view -L, and then supply that extracted BAM to bedtools. That gives you a much smaller BAM and may help avoid the memory issues you're seeing. I would spot-check a few of the regions in IGV to be 100% sure I'm not losing reads at the interval boundaries.
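A rough sketch of that two-step approach, in case it helps. The function name and the ".targets.bam" suffix are made up for illustration, and the coverageBed call is simply the one from the question pointed at the smaller BAM; I have not run this on a 60 GB file.

```shell
# Sketch only: subset_then_cover is a hypothetical helper, file names are placeholders.
subset_then_cover() {
    bam=$1; bed=$2; genome=$3; out=$4
    subset="${bam%.bam}.targets.bam"
    # Keep only alignments that overlap the target intervals (-b writes BAM output).
    samtools view -b -L "$bed" -o "$subset" "$bam" || return 1
    # Same coverageBed call as in the question, but on the much smaller BAM.
    coverageBed -d -g "$genome" -a "$bed" -b "$subset" > "$out"
}
# Example: subset_then_cover input.bam target.bed genomefile.txt output.txt
```

Reads that only partly overlap a target are kept whole by samtools view -L, which is why the IGV spot-check at interval boundaries is still worth doing.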

6.3 years ago by Dan D 7.4k

One "hack" (I hesitate to call it a solution) is to use samtools depth. This command prints the number of reads covering each nucleotide, either genome-wide or within a given region (samtools depth -r chr1:1-1000000 in.bam). samtools depth does not give you the same output as coverageBed, but it will give you coverage estimates.
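To turn the per-position output into a summary, something along these lines might work. The rows below are toy values standing in for real samtools depth output (chromosome, 1-based position, depth), and summarising by chromosome is just one choice for illustration.

```shell
# Toy stand-in for samtools depth output. On real data the rows would come from
# something like: samtools depth -a -b target.bed input.bam
depth_rows=$(printf 'chr1\t101\t3\nchr1\t102\t5\nchr1\t103\t4\nchr2\t50\t0\nchr2\t51\t2\n')

# Per chromosome: number of positions reported and mean depth over them.
summary=$(printf '%s\n' "$depth_rows" | awk -F'\t' '
    { sum[$1] += $3; n[$1]++ }
    END { for (c in sum) printf "%s\t%d\t%.2f\n", c, n[c], sum[c] / n[c] }
' | sort)
printf '%s\n' "$summary"
```

Without -a, samtools depth skips positions with zero depth, so the mean would then be taken over covered positions only.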

6.3 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.5k

score 0 · Answer 1 · 2018-01-09

If target.bed is three-column, you could do something like:

$ bedmap --echo --count --bases --echo-ref-size --delim '\t' target.bed <(bam2bed < input.bam) | awk '{ print $0"\t"($5/$6); }' > answer.bed

The --echo option reports the elements in target.bed.

The --count option reports the number of features in the map file (the BAM file) that overlap the reference file (the target.bed file) by one or more bases.

The --bases option reports the total number of bases that overlap between the target and the BAM reads, summed over all overlapping reads.

The --echo-ref-size option reports the length of each target element.

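Piping to awk divides the overlapping bases (column 5) by the target length (column 6), similar to the fraction that coverageBed reports. This relies on the bedmap output being tab-delimited; by default bedmap separates the results of its options with a pipe character.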
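To see what the awk step does in isolation, here is a small sketch on made-up rows. It assumes the bedmap output is tab-delimited (for example via --delim '\t') and that target.bed has three columns, so the count, bases, and target length land in columns 4 to 6; the numbers are invented.

```shell
# Hypothetical bedmap-style rows, tab-delimited:
# chrom, start, end, read count (--count), overlapping bases (--bases), target length (--echo-ref-size)
rows=$(printf 'chr1\t100\t200\t4\t150\t100\nchr1\t500\t700\t1\t50\t200\n')

# Append overlapping bases divided by target length as a seventh column.
ratios=$(printf '%s\n' "$rows" | awk -F'\t' -v OFS='\t' '{ print $0, $5 / $6 }')
printf '%s\n' "$ratios"
```

The first row comes out above 1 because --bases sums overlap across reads. If you want the fraction of the target covered by at least one read instead, bedmap also has --bases-uniq, which counts distinct covered bases.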

This should scale up to whole-genome work.

To make things faster, if the BAM file is indexed (and therefore coordinate-sorted), you can add the --do-not-sort and --reduced options to bam2bed to bypass sorting with sort-bed and to print only a subset of columns as BED.