Biostar Beta. Not for public use.
Tools To Calculate Average Coverage For A Bam File?
56
Entering edit mode
20 months ago
Biomed 4.5k
Bethesda, MD, USA

I would like to get the average coverage of all the captured bases in a bam file. What would be the best way to do this? What I am looking is a simple one number like 40X. Given that there may be millions of bases sequenced in a next gen study I would like to get the overall average coverage for all these bases as a single number.

Thanks

ADD COMMENTlink
2
Entering edit mode

People seem to be equally devided between GATK, Samtools and BedTools.

ADD REPLYlink
1
Entering edit mode

Why is it wrong to take the number of sequenced bases and divide by the genome size?

ADD REPLYlink
4
Entering edit mode

because you cannot align all the bases on the genome, some reads are optical duplicates, etc...

ADD REPLYlink
1
Entering edit mode

Okay, contamination is a fair point, but even if some reads fail to match, that could still be a sign of genetic divergence, and should still count towards genome coverage. Do any of the other suggestions deal with duplicates implicitly?

ADD REPLYlink
1
Entering edit mode

Most answers seems to be very old and hence would like to have updated suggestions.

I have 6 bam files and I have used samtools depth to calculate chromosome wise depth for all chromosomes and then plotted in R. While looking at the plots at 2-3 places, depth shows upto 200-3500 and hence I would like to calculate average read depth of each chromosome from each bam file. For e.g.: 1) average depth of chr1 from bam1, average depth of chr1 from bam2, ...until bam6. 2) average depth of chr1 from all 6 bam together.

Kindly guide.

Thanks.

ADD REPLYlink
1
Entering edit mode

Try the solutions if they don't work then you need to reframe the question with your commands and write a new query in a new post. This is advised. And yes paired end should work.

ADD REPLYlink
0
Entering edit mode

in addition, if you have already plotted depths it means that you have got a temporal R data.frame (or similar type) in there containing chromosome, position and depth information. splitting such data.frame by chromosome and calculating average values should be straight-forward. I would just suggest to take into account empty values not reported by samtools depth dividing each chromosome depths' sum by the size of the chromosome, and not by the number of depth values.

ADD REPLYlink
0
Entering edit mode

Qualimap is very fast and excellent tool. thanks

ADD REPLYlink
0
Entering edit mode

Hi,

I have multiple paired-end bam files from RNA-Seq data, already aligned and computed depth with samtools depth > bam.depth. Would any one please guide for any straight forward/latest way to compute coverage plots, avg. coverage plots and other metrics from depth files?

Picard CollectRNAseqmetrics, RSeQC and QoRTs have not been helpful. :-(

Thanks.

ADD REPLYlink
0
Entering edit mode

Have you run any of the other tools mentioned in this thread (e.g. Qualimap, pileup from BBMap)?

ADD REPLYlink
0
Entering edit mode

@genomax My bam files are very big and Qualimap has not been helpful, so as bbmap which is Java-based (memory issues like Picard and QoRTs) and hence I used samtools depth. My depth files are also in GB.

ADD REPLYlink
0
Entering edit mode

Both programs will happily accept larger allocations of RAM and can open very large files. If you don't have access to a machine with more RAM then that is a different problem.

ADD REPLYlink
33
Entering edit mode
20 months ago
Athens, GA, USA

Try the genomeCoverageBed tool in the BEDtools package, which takes a BAM or BED file as input and:

computes a histogram of feature coverage (e.g., aligned sequences) for a given genome. Optionally, by using the –d option, it will report the depth of coverage at each base on each chromosome in the genome file (-g).

The way we run this is to first make a tab-delimited genome.txt file with "contig_name contig_size" information for all contigs in your genome and then run genomeCoverageBed on a sorted BAM file:

$ genomeCoverageBed -ibam sortedBamFile.bam -g genome.txt > coverage.txt

This will give you a summary histogram of coverage across each contig and for the entire genome, from which you can obtain the modal or mean value (the single fold-coverage value you are looking for).

ADD COMMENTlink
0
Entering edit mode

With your command, I have different value for the same contig. For instance:

jcf7180001864843    2   4   137928  2.90006e-05
jcf7180001864843    3   28  137928  0.000203004
jcf7180001864843    4   10  137928  7.25016e-05

What does it mean ?. How can I have one value for each contig.

ADD REPLYlink
3
Entering edit mode

That's because it reports a histogram. From the documentation: The values in the output are:

  1. chromosome (or entire genome) [eg jcf7180001864843]
  2. depth of coverage from features in input file [eg. 2]
  3. number of bases on chromosome (or genome) with depth equal to column 2. [eg. 4]
  4. size of chromosome (or entire genome) in base pairs [eg. 137928]
  5. fraction of bases on chromosome (or entire genome) with depth equal to column 2. [2.90006e-05]

[all values taken from your first row]

If you want to get the average coverage: add up the product of bases per coverage [2* 4+3* 28+4* 10+...] and divide by the total number of bases [137928].

ADD REPLYlink
0
Entering edit mode

Hello Ronald,

I have a question about what you are saying. So for the average coverage you are stating, that would be per scafffold correct? I attempted to open a topic recently and was referred to this one :)

Cheers

ADD REPLYlink
1
Entering edit mode

Is your BAM file sorted (as noted on the help doc page linked by @student-t above)?

ADD REPLYlink
31
Entering edit mode
6.3 years ago
russdurrett • 310

if you have samtools and awk, you can keep it simple with this:

samtools depth  *bamfile*  |  awk '{sum+=$3} END { print "Average = ",sum/NR}'

and with stdev:

samtools depth *bamfile*  |  awk '{sum+=$3; sumsq+=$3*$3} END { print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2)}'
ADD COMMENTlink
2
Entering edit mode

Not sure this works. It calculates coverage over regions but does not account for regions that were not covered with a read. Then it divides all the sum with the number of rows that were reported... but the positions without a read hit are not included. Basically - it gives a too high result.

ADD REPLYlink
6
Entering edit mode

samtools depth -a

Using the '-a' flag outputs all positions (including zero depth), which should solve this problem.

ADD REPLYlink
3
Entering edit mode

[FYI] Only available from v1.3, I couldn't find that parameter in my version

https://github.com/samtools/samtools/issues/374

ADD REPLYlink
3
Entering edit mode

That's true. To be fair, the denominator needs to be the size of genome where the BAM file was generated against, rather than the number of bases covered at least once (that's what NR is in this context). To get the genome size from the BAM file:

samtools view -H *bamfile* | grep -P '^@SQ' | cut -f 3 -d ':' | awk '{sum+=$1} END {print sum}'
ADD REPLYlink
0
Entering edit mode

Please correct me if I am wrong, but none of these answers seem to work with nanopore data.

ADD REPLYlink
0
Entering edit mode

why would it not?

ADD REPLYlink
0
Entering edit mode

In the first command, its $3 for me

ADD REPLYlink
0
Entering edit mode

I agree, and edited Russ' answer accordingly to avoid confusing anyone else.

ADD REPLYlink
0
Entering edit mode

samtools depth is the most straightforward way to do this

ADD REPLYlink
21
Entering edit mode
17 months ago
Bergen, Norway

And the mandatory R solution. The Rsamtools package can be used to read in the BAM file.

Here is a code example:

library(Rsamtools)
bamcoverage <- function (bamfile) {
  # read in the bam file
  bam <- scanBam(bamfile)[[1]] # the result comes in nested lists
  # filter reads without match position
  ind <- ! is.na(bam$pos)
  ## remove non-matches, they are not relevant to us
  bam <- lapply(bam, function(x) x[ind])
  ranges <- IRanges(start=bam$pos, width=bam$qwidth, names=make.names(bam$qname, unique=TRUE))
  ## names of the bam data frame:
  ## "qname"  "flag"   "rname"  "strand" "pos"    "qwidth"
  ## "mapq"   "cigar"  "mrnm"   "mpos"   "isize"  "seq"    "qual"
  ## construc: genomic ranges object containing all reads
  ranges <- GRanges(seqnames=Rle(bam$rname), ranges=ranges, strand=Rle(bam$strand), flag=bam$flag, readid=bam$rname )
  ## returns a coverage for each reference sequence (aka. chromosome) in the bam file
  return (mean(coverage(ranges)))      
}

And the output:

> bamcoverage("~/test.bam")
gi|225184640|emb|AL009126.3| 
                    15.35720

The good thing about this is you can do much more with the genomic ranges object than computing the coverage, the downside is, that the code is not that efficient.

ADD COMMENTlink
0
Entering edit mode

@Michael : What maximum size of input bam file that can be handled by this solution ? It slurps the whole bamfile in memory ??

ADD REPLYlink
1
Entering edit mode

Sorry for the late reply. Yes it does, and it is not very memory efficient. For a 850MB BAM file with approximately 17 million alignments the R process uses up to 15GB of memory during processing.

ADD REPLYlink
4
Entering edit mode

A reply four years after.....

ADD REPLYlink
0
Entering edit mode

@Micheal: Will it work for paired end BAM files?

ADD REPLYlink
0
Entering edit mode

Hi! I used this R code with Rsamtools with my bam file and the result was:

6503.159

What could have gone wrong?

ADD REPLYlink
0
Entering edit mode

You might have incredibly high coverage :) I would try a different tool, now 8 years later. Honestly, this code was only intended to show it can be done in R too. But I am not even sure it works correctly.

ADD REPLYlink
0
Entering edit mode

Thanks!!! I'll look for another tools :) genomecoverage from bedtools looks good too. I'll try that one. Thanks for the advice!

ADD REPLYlink
15
Entering edit mode
15 months ago
France/Nantes/Institut du Thorax - INSE…

did you try the GATK "DepthOfCoverage" ? http://www.broadinstitute.org/gsa/wiki/index.php/Depth_of_Coverage_v3.0

or you can run 'samtools pileup' and calculate the mean value for the coverage column.

ADD COMMENTlink
2
Entering edit mode

I have used the samtools method (which excludes bases with 0 coverage) and the BEDTools version (which includes bases with 0 coverage). The former will give a higher mean coverage. This will have an impact on genome where there are large areas that are not sequenced.

ADD REPLYlink
1
Entering edit mode

Sorry I am new to bioinformatics. I have used 'samtools mpileup', but how can I get mean value for the depth column. Thanks.

ADD REPLYlink
1
Entering edit mode

I tried GATK DepthOfCoverage last July with no luck - complicated java.lang.RuntimeExceptions. I did find success with BEDtools however (see answer below/above)

ADD REPLYlink
1
Entering edit mode

I use this method to calculate % of sites with at least a coverage of 20x in a list of sites from a bed. It provides that number directly as output.

ADD REPLYlink
0
Entering edit mode

That's the gte_20 value from the sample_cumulative_coverage_proportions file, correct?

ADD REPLYlink
0
Entering edit mode

I was interested in this issue too, but I wanted to know the coverage for certain genes. after looking to this tool's wiki I've realized that I can feed it with a list of genes intervals (which of course I'll have to retrieve from UCSC or Ensembl) and I'll surely sort my problem out! look what an unexpected great piece of information I've just found! once again thanks a lot, Pierre.

ADD REPLYlink
0
Entering edit mode

Thanks Pierre. I haven't tried Depthof Coverage but I tried Samtools. Samtools would give you a base by base coverage, and I use it successfully like Jorge explained but I 'll look at GATK for sure.

ADD REPLYlink
14
Entering edit mode
6.3 years ago
Palo Alto

Here's a very simple perl script I wrote that does exactly this that you're free to use:

($num,$den)=(0,0);
while ($cov=<STDIN>) {
    $num=$num+$cov;
    $den++;
}
$cov=$num/$den;
print "Mean Coverage = $cov\n";

Just pipe samtools pileup to it like so:

/path/to/samtools pileup in.bam | awk '{print $4}' | perl mean_coverage.pl

It will also work easily for genomic regions. Just index the BAM file and use samtools view like so:

/path/to/samtools view -b in.bam <genomic region> | /path/to/samtools pileup - | awk '{print $4}' | perl mean_coverage.pl

Hope someone finds this useful.

ADD COMMENTlink
4
Entering edit mode

One more comment on this pipeline... In my data, where I have relatively low coverage there are quite a few genomic positions not covered at all. These will not show up in the pileup command, so I guess that the mean coverage calculated this way will be the mean coverage of the _covered_ positions. Am I right? Is there a way (in the "samtools pileup" command) to output all positions even if they are not covered by any read?

ADD REPLYlink
0
Entering edit mode

looks promising. Thanks

ADD REPLYlink
0
Entering edit mode

I am certainly going to try this. Thanks for the answer.

ADD REPLYlink
0
Entering edit mode

Nice answer! Tried it and it's working! The only thing I'd like to mention is that I think that the BAM file needs to be sorted.

ADD REPLYlink
0
Entering edit mode

That's correct, panos. This simple solution is only if every base is covered at least once because samtools pileup doesn't report bases with zero coverage. Thanks for pointing that out!

ADD REPLYlink
0
Entering edit mode

can you give me a detailed description about $cov=[HTML] you used ? Thanks!

ADD REPLYlink
0
Entering edit mode

Very useful thanks! Used it on my own dataset (sorry to bump an old thread but still very handy)

ADD REPLYlink
0
Entering edit mode

I made a BioStars account to upvote your contribution. I've been trying to find something this simple and straightforward for some time now.. Thanks!

ADD REPLYlink
3
Entering edit mode

this is a very old answer. you would find then useful a simpler samtools depth based solution:

samtools depth [-b regions.bed] input.bam | awk '{c++;s+=$3}END{print s/c}'
ADD REPLYlink
0
Entering edit mode

@Jorge Amigo

samtools depth bamfile | awk '{sum+=$3} END { print "Average = ",sum/NR}'

by @russdurrett also gives same output for my paired-end bam file from RNA-Seq:-)

ADD REPLYlink
1
Entering edit mode

that's because NR is an awk variable that stores the number of rows, therefore the c variable on my code is not wrong, but it is superfluous.

ADD REPLYlink
0
Entering edit mode

I ran:

samtools depth zm_cp.bam | awk '{sum+=$3} END { print "Average = ",sum/NR}'

and the result was:

Average =  6495.19

:O is this okay?

ADD REPLYlink
0
Entering edit mode

Does this also works with mpileup? I ran:

samtools mpileup zm_cp.bam | awk '{print $4}' | perl mean_coverage.pl

and this was the result:

[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

Mean Coverage = 5107.87408859271

:O is this okay?

ADD REPLYlink
5
Entering edit mode
6.3 years ago
Heikki • 360

@michael.james.clark, thank you. I wrote for myself a script that I can add to when the need arises:

#!/usr/bin/env perl

use strict;
use warnings;

# Usage: samtools pileup file.bam | bam_coverage.pl

my $num;        # per residue coverage
my $len;        # sequence length counter 
my $min = 1000; # minimum coverage
my $max = 0;    # maximum coverage

while (<>) {
    my @a = split /\t/;
    $num += $a[3];
    $min = $a[3] if $min > $a[3];
    $max = $a[3] if $max < $a[3];
    $len++;
}

printf  "Mean coverage  : %.1f\n", $num/$len;
printf  "Coverage range : %d - %d\n", $min, $max;
ADD COMMENTlink
1
Entering edit mode

Heiki thanks for the codes. It is working for me

ADD REPLYlink
0
Entering edit mode

This doesn't work because mpileup does not report the bases with coverage of zero

ADD REPLYlink
4
Entering edit mode
14 months ago
Rm 7.8k
Danville, PA

use

samtools pileup accepted_hits.bam | awk '{ count++ ; SUM += $4 } END { print "Total: " SUM "\t" "Nucleotides: " count "\t" "Average_coverage: " SUM/count }'

(use $8 instead; if pileup with -cf option)

ADD COMMENTlink
0
Entering edit mode

I used the above command to extract the average coverage on my aligned bam files and my statistics are: Total: 8011581243 Nucleotides: 510234050 Average_coverage: 15.7018

I want to know that the average_coverage is stating that on a distribution each of my base is read atleast 15 time right? What does the total signifies and the Nucleotides count: 510234050. Is this the total no of bases that are in the exonic region? then how will I calculate the number of reads from here that actually got mapped on in the exome region? Can we do this from the nucleotides count? I know for each reads consist of 100 bases so if I divide this 510234050 bases with 100 it gives me the reads but that will not give me the reads that actually got mapped in the exome region. How t extract that?

ADD REPLYlink
4
Entering edit mode
16 months ago
William ♦ 4.4k
Europe

How about just running qualimap in GUI mode or to create a QC PDF? Your can run it on the whole genome or on 1 or multiple regions using a bed file.

global coverage stats

coverage histogram

percentage of genome covered x times plot

http://qualimap.bioinfo.cipf.es/

ADD COMMENTlink
4
Entering edit mode
14 months ago
Belgium

A modern, very fast, solution for this would be mosdepth.

ADD COMMENTlink
0
Entering edit mode

how do i install other than bioconda?

ADD REPLYlink
0
Entering edit mode

Why not use bioconda?

More installation options and instructions can be found on the GitHub page.

ADD REPLYlink
3
Entering edit mode
14 months ago
5heikki 8.4k
Finland

This is for people who have a sam/bam file consisting of reads mapped to multiple contigs (samtools needs to be in $PATH):

BBMap: http://sourceforge.net/projects/bbmap/

bbmap/pileup.sh -h

Written by Brian Bushnell
Last modified May 19, 2015

Description:  Calculates per-scaffold coverage information from an unsorted sam file.

Usage:  pileup.sh in=<input> out=<output>

Input may be stdin or a SAM file, compressed or uncompressed.
Output may be stdout or a file.

Input Parameters:
in=<file>           The input sam file; this is the only required parameter.
ref=<file>          Scans a reference fasta for per-scaffold GC counts, not otherwise needed.
fastaorf=<file>     An optional fasta file with ORF header information in PRODIGAL's output format.  Must also specify 'outorf'.
unpigz=t            Decompress with pigz for faster decompression.

..
ADD COMMENTlink
0
Entering edit mode

does it work with a bam file ?

ADD REPLYlink
1
Entering edit mode

Yes it does. Use the latest version.

$ pileup.sh
Written by Brian Bushnell
Last modified September 1, 2016
Description:  Calculates per-scaffold coverage information from an unsorted sam or bam file.
ADD REPLYlink
0
Entering edit mode

Could you introduce a bed file, say exome, and get the coverage stats?

ADD REPLYlink
2
Entering edit mode
6.3 years ago
Mdperry • 40

BioPerl has a library named Bio-Samtools which provides a familiar Perl/BioPerl access to samtools. I heard the author, Lincoln Stein, give a talk on on how he wove the two together use XS. As I recall, one of his examples was exactly the question you posed

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3