Question

extract reads from BAM whose fragments overlap a specified region

0

Entering edit mode

6.2 years ago

Richard ▴ 590

Hi all,

What method do you recommend for extracting from a BAM file all the fragments whose insert overlaps a desired coordinate range?

Using samtools/sambamba view I can easily extract all the sequenced reads that overlap a region. I'm looking for a command that will do the same, but also report any reads/fragments with long inserts that overlap the desired region, regardless of whether the sequenced ends of the fragment overlap the query region. In other words I want to capture the reads that would represent a deletion that spans my regions of interest.

I'm trying Bedtools but so far it seems quite as though it might be scanning the entire BAM. I'm looking for something that runs can query a region for overlapping fragments in a matter of a few seconds.

thanks RIchard

EDIT: It looks like BEDTools intersectBed will only report the sequenced reads that overlap a region, so reads spanning the region (that I'm looking for) are not reported

alignment • 3.9k views

ADD COMMENT • link updated 6.2 years ago by Pierre Lindenbaum 161k • written 6.2 years ago by Richard ▴ 590

score 0 · Answer 1 · 2018-03-23

Operations on whole-genome scale BED and BAM files can often be trivially parallelized by chromosome.

BEDOPS tools include routines for restricting work to one chromosome of records in a BED file, which can save a lot of time when used in conjunction with a compute cluster.

Here's one way to build up a pipeline for doing things one chromosome at a time with BEDOPS.

First, we convert the BAM file to BED. This conversion script is part of BEDOPS and uses a SLURM-based cluster, but BEDOPS includes equivalent scripts for SGE and GNU Parallel conversion:

$ bam2bed_slurm reads.bam reads.bed

Parallelization reduces conversion time to the largest chromosome (usually chr1).

Once parallelized conversion finishes, the following skeleton can be parallelized easily for each chromosome ${chr}:

$ chr="chr1"
$ bedmap --chrom ${chr} --echo --echo-map --skip-unmapped regions.bed reads.bed > answer.${chr}.bed

Using --chrom ${chr} with bedmap restricts the set operation work to that chromosome, instead of reading through the entire file.

This is going to be great for what you're doing, because sequencing reads do not span chromosomes, and so we can map with bedmap on individual chromosomes.

Via bash, this for loop would process the work in serial, one chromosome at a time:

$ for chr in `bedextract --list-chr reads.bed`; do echo ${chr}; bedmap --chrom ${chr} --echo --echo-map --skip-unmapped regions.bed reads.bed > answer.${chr}.bed; done

The bedextract --list-chr tool uses a binary search to delimit byte bounds between chromosomes in the input BED file. This application takes a fraction of a second to build a list of chromosomes. We can use that list in a for loop to submit per-chromosome jobs to a compute cluster.

For instance, via a SLURM job scheduler, we can fire off one job per chromosome:

$ for chr in `bedextract --list-chr reads.bed`; do echo ${chr}; sbatch --parsable --partition="some_queue" --mem-per-cpu=1g --job-name="overlaps_${chr}" --wrap="module add bedops && bedmap --chrom ${chr} --echo --echo-map --skip-unmapped regions.bed reads.bed > answer.${chr}.bed"; done

The time taken for all operations is reduced considerably, to basically only that needed by the largest chromosome.

To generate a collated result, we can use sbatch --parsable to get back the job IDs, and set up a collate job at the end with --dependency=afterok:<jobs> to run a bedops --everything union step at the end.

score 0 · Answer 2 · 2018-03-23

here is a solution using samjdk http://lindenb.github.io/jvarkit/SamJdk.html

The bed 'input.bed' is read at the first read, we check that the read intersects the [bed.start- extend, bed.start+extend] and the [bed.end- extend, bed.end+extend] . Feel free to change this criteria. For example, it could be

return readInterval.getContig().equals(I.getContig()) &&
       readInterval.getStart() < I.getStart() &&
       readInterval.getEnd() > I.getEnd()
       ;

and of course you can parallelise for each region. Using samtools upstream, it could be

echo -e "chr1\t1234\t456" > input.bed && samtools view -bu  input.bam "chr1:1234-456" | java -jar dist/samjdk.jar -f biostars305526.code --body