Question

Total length covered by BAM alignment (or BED file)

1

Entering edit mode

9.7 years ago

rwn ▴ 590

Hello,

I'm just wondering if there is a quick and simple way of simply getting the 'length' of a BAM alignment? By length, I just mean the number of nucleotides of the reference genome that are covered by an aligned read. Maybe there is something in BedTools that might be able to do it from the BED perspective?

Thought it might be worth asking before writing some perl that could do it...

Cheers!

Perl Bed Bam • 7.4k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by rwn ▴ 590

Ram · Answer 1 · 2014-07-22

2

Entering edit mode

9.7 years ago

John 13k

"samtools depth" or "samtools mpileup" is what you need :)

Then count all rows with a value of 0, and divide by the count of all rows to get a coverage percentage :]

But before you ask what exactly are you counting per-base: I have no idea. I just posted that question 5 seconds ago, so you might want to watch it ;)

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by John 13k

Ram · Answer 2 · 2014-07-22

2

Entering edit mode

9.7 years ago

Bert Overduin ★ 3.7k

bedtools genomecov is what you need:

bedtools genomecov -ibam youralignments.bam -bga -g yourgenome.txt

yourgenome.txt should contain the lengths of the chromosomes of your genome.

This will give you the number of bases for each chromosome, but also for the whole genome, that are 0,1,2,3 etc. times covered by an alignment in your BAM file.

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by Bert Overduin ★ 3.7k

Ram · Answer 3 · 2014-07-22

1

Entering edit mode

9.7 years ago

Devon Ryan 104k

There's no method within samtools that will give you the length of a read covering the reference. This is rather unsurprising since I can't think of a situation where this would be needed. You can just parse the CIGAR string and get this information, though, so it'll be a simple enough perl program.

BTW, regarding mpileup and depth, neither of those will do what you want.

ADD COMMENT • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

Sorry, my question is a bit ambiguous. I mean the number of nucleotides of the reference covered by any read in the bam alignment. Basically, I have a bam alignment created by filtering all reads below a certain mapping quality threshold (bamtools filter -in in.bam -out out.bam -mapQuality "<=someNumber"). I can convert it to a bed file with bamtobed etc, and I would like to know how much of my reference is covered by such reads. Hope that's a bit clearer... I'm using perl to try and calculate it from the bed file but its a bit tricky because of all overlapping regions and multiple regions per chromosome. Thanks!

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by rwn ▴ 590

4

Entering edit mode

Ah, yes, your question was rather poorly worded then. You can just samtools depth foo.bam | awk '{if($3>0) total+=1}END{print total}'.

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

That'll do it - cheers :)

ADD REPLY • link 9.7 years ago by rwn ▴ 590

0

Entering edit mode

I guess I get no points for understanding poorly-worded questions ;)

After seeing Pierre's code dump for samtool's depth, I guess our depth/awk solution wont work because depth doesn't print nucleotides where depth is 0. This is a really good example of why that if statement i mentioned over there should be togglable ^_^

ADD REPLY • link 9.7 years ago by John 13k

0

Entering edit mode

Points for everyone! Except me for hastily writing poorly worded questions. But, since I only want those sites in the reference genome that are covered at least 1 times, the fact that samtools depth doesn't count zero-covered sites shouldn't matter. No?

ADD REPLY • link 9.7 years ago by rwn ▴ 590

0

Entering edit mode

Ah, well, I imagined you want [bases with at least 1 read] / [total bases that could have had 1 read]

Note that [total bases that could have had 1 read] is not the same as [genome size], since a genome has gaps and "filler". There is no base 9999 in chromosome 1 for example, since the first 10000 are all "N". The sum of the N's probably won't affect your outcome all that much, but it should get you thinking about your mapping strategy and how that affects coverage. What did/do you do with reads that map to multiple locations? It's important, because if you can't map reads to repeats, should your coverage statistics be counting repeat-region DNA?

I feel like i'm not being very helpful - in short, calculating absolute coverage is messy business. Picking one coverage calculating method, applying it to all your samples, and seeing which maps the most is much easier. For this, Devon's command will work perfectly :)

ADD REPLY • link 9.7 years ago by John 13k

1

Entering edit mode

No no, it's all extremely helpful I think! I'm doing variant calling in bacterial genomes, and have excluded any putative SNP that falls into a region of low-qual mapping (ie a repeat region). I found that applying this filter got rid of a lot of false positive / rubbish I was seeing in the output VCF. My question stemmed from wanting to know what proportion of my reference assembly I was excluding at certain MapQual thresholds - here, I have approximated genome size with assembly length, since you are right, true genome size is unknowable especially for unclosed genomes. But a ballpark figure is fine for me :)

Also, I found that both Devon's and Bert's approaches:

samtools depth foo.bam | awk '{if($3>0) total+=1}END{print total}'

or

bedtools genomecov -ibam foo.bam -bga -g genome.txt | awk '{if($4>0) total += ($3-$2)}END{print total}'

gave the same answer, so take your pick! Thanks again everyone :)

ADD REPLY • link 9.7 years ago by rwn ▴ 590