Extract Alignment From Very Large Bam File
6
5
Entering edit mode
11.9 years ago
Plantae ▴ 390

I have bam files ~50Gb, which is read alignments onto a reference genome. now I want to extract some regions of alignment, but I found methods like samtools view -L region.bed all.bam

is too slow, usually taken more than 2 hours (even if very small region are submit)

Are there any other methods that can do this job faster?

bam • 30k views
ADD COMMENT
0
Entering edit mode

Can you paste the header of your bam file (with samtools view -H file.bam)

ADD REPLY
6
Entering edit mode
11.9 years ago
Bpow ▴ 280

Extracting small regions from a bam file should be sub-seconds fast if the files are properly indexed. You don't say what version of samtools you are using (and I don't know if this is still an issue), but I recall that at least some point there was an issue where specifying the region on the command line

samtools view -h reads.bam 1:1042000-1042010

would use the index while specifying the -L option and providing a bed file of intervals would result in the entire file being read. You may try specifying an interval at the command line to see if that helps.

ADD COMMENT
0
Entering edit mode

+1. Exactly what I thought as well. That's why I asked for the header to see if its coordinate sorted. Does samtools run on bam files that are not coordinate sorted? If so, probably the index is not of help?

ADD REPLY
0
Entering edit mode

I indexed the bam file, then use your method. It is really very fast, much faster than bedtool

ADD REPLY
0
Entering edit mode

Plantae, in that case, do you mind accepting the answer? :)

ADD REPLY
0
Entering edit mode

For only one position, it's the fastest way to do!

ADD REPLY
5
Entering edit mode
11.9 years ago

BEDtools is a good tip, but I would recommend intersectbed here:

intersectBed -abam reads.bam -b regions.bed > reads.regions.bam

or, if you want output in BED format

intersectBed -abam reads.bam -b regions -bed > reads.regions.bed
ADD COMMENT
0
Entering edit mode

intersectBed -abam reads.bam -b regions.bed > reads.regions.bam will report original reads right ? How can i clip this reads to get just the part that overlap my intervals in the bed file ? thanks

ADD REPLY
4
Entering edit mode
9.6 years ago

See also this question & answer where @matted and myself compared samtools view -L with samtools view <region>. The latter is much faster. If you have several regions in file references.txt use (credit @matted):

cat references.txt | xargs samtools view -b input.bam > output.bam
ADD COMMENT
3
Entering edit mode
9.6 years ago
Hypotheses ▴ 90

My two cents:

samtools view -h reads.bam 1:1042000-1042010

If that doesn't work. try this:

samtools view -h reads.bam chr1:10420000-10421000

To output to another bam file format

samtools view -b reads.bam chr1:10420000-10421000 > subset.bam
ADD COMMENT
0
Entering edit mode
11.9 years ago

Use Bedtools, it accepts BAM files!

bamToBed -i reads.bam > reads.bed

http://code.google.com/p/bedtools/wiki/Usage

ADD COMMENT
0
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 1674 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6