Biostar Beta
What Is The Fastest Method To Determine The Number Of Positions In A Bam File With >N Coverage?
15 months ago
Washington University School of Medicin…

I have two very large BAM files (high depth, human, whole genome). I have a seemingly simple question. I want to know how many positions in each are covered by at least N reads (say 20). For now I am _not_ concerned about requiring a minimum mapping quality for each alignment or a minimum read quality for the reads involved.

Things I have considered:

  • samtools mpileup (then piped to awk to assess the minimum depth requirement, then piped to wc -l). This seemed slow...
  • samtools depth (storing the output to disk so that I can assess coverage at different cutoffs later). Even if I divide the genome into ~133 evenly sized pieces, this seems very slow...
  • bedtools coverage?
  • bedtools genomecov?
  • bedtools multicov?
  • bamtools coverage?

Any idea which of these might be fastest for this question? Something else I haven't thought of? I can use parallel processes to ensure that the performance bottleneck is disk access but want that access to be as efficient as possible. It seems that some of these tools are doing more than I need for this particular task...
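For reference, the mpileup/awk/wc idea from the first bullet looks like this when applied to samtools depth output (chrom, pos, depth in column 3); `aln.bam` is a placeholder name, and the demo below runs the filter on a few fabricated depth lines instead of a real BAM:

```shell
# Real usage would be:
#   samtools depth aln.bam | awk -v N=20 '$3 >= N' | wc -l
# Demonstrated here on fake "chrom pos depth" lines:
printf 'chr1\t100\t25\nchr1\t101\t19\nchr1\t102\t20\n' \
  | awk -v N=20 '$3 >= N' | wc -l
```

This prints 2: only the positions with depth 25 and 20 pass the N=20 cutoff.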


Of the bedtools options, genomecov is the one you want (see the answer from Jordan). It is admittedly not the fastest; there are ways to speed it up, but we haven't yet done so. I'd be interested to know the run times you see as a function of the number of reads.


All the answers were helpful, but 'bedtools genomecov' was the method I wound up using in the end. 'samtools depth' has convenient output. I expect Pierre's solution is probably the fastest.

12 months ago
France/Nantes/Institut du Thorax - INSE…

Use the basic code for the Samtools API: http://samtools.sourceforge.net/sam-exam.shtml

change:

 if ((int)pos >= tmp->beg && (int)pos < tmp->end)

to

 if (n>20 && (int)pos >= tmp->beg && (int)pos < tmp->end)

Compile (it should be something like):

gcc -O3 -o mytool mytool.c -I /path/to/samtools -L /path/to/samtools -lbam -lz

Edit: and you can run this code using one process per chromosome...

14 months ago
Jordan ♦ 1.1k
Pittsburgh

You could try bedtools genomeCoverageBed. It takes a BAM file and a reference genome as input and outputs coverage stats.

Here is an example:

genomeCoverageBed -ibam file.bam -g human_g1k_v37.fasta > coverage.txt

The output file has 5 columns:

genome    0    1670110268    3101804739    0.538432
genome    1    696287280    3101804739    0.224478
genome    2    329358858    3101804739    0.106183
genome    3    144809487    3101804739    0.0466856
genome    4    66917607    3101804739    0.0215738
genome    5    32869034    3101804739    0.0105967
genome    6    17980502    3101804739    0.00579679
genome    7    10955385    3101804739    0.00353194
  • Column 1: Genome/chromosome name
  • Column 2: Depth
  • Column 3: Number of bases in the BAM file with depth equal to Column 2
  • Column 4: Genome length
  • Column 5: Fraction of the genome with depth equal to Column 2

So, if you need coverage >= 5, in this case it would be 6.26%: add the fractions for depths 0 through 4 (0.538432 + 0.224478 + 0.106183 + 0.0466856 + 0.0215738) and subtract the sum from 1.
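That arithmetic can be done with awk on the "genome" summary rows (the rows below are the ones pasted above, trimmed to depths 0-4): sum the column-5 fractions for depths below N and subtract from 1.

```shell
# Fraction of the genome covered at >= 5x, from genomecov output.
awk -v N=5 '$1 == "genome" && $2 < N { below += $5 }
            END { printf "%.4f\n", 1 - below }' <<'EOF'
genome 0 1670110268 3101804739 0.538432
genome 1 696287280 3101804739 0.224478
genome 2 329358858 3101804739 0.106183
genome 3 144809487 3101804739 0.0466856
genome 4 66917607 3101804739 0.0215738
EOF
```

This prints 0.0626, matching the ~6.26% worked out above.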

I hope this is what you are looking for.


This just takes 5-10 min depending on the system you have. Never exceeded 10 min for me.

14 months ago
United States

I pipe samtools depth to a C++ program I wrote that counts positions at each depth X. From these counts I can quickly calculate the percent of the BAM covered at X or greater.

I also take missing sites into account.

For example, if samtools depth outputs:

chr1 1 10

chr1 34 5

then positions 2-33 have depth 0.
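A sketch of that gap handling: since samtools depth omits zero-depth positions, the percent of the genome covered at >= N needs the total genome length as the denominator rather than the number of output lines. GLEN=100 below is a toy value, and the input lines are the example from the answer:

```shell
# Percent of genome covered at >= N, dividing by genome length (GLEN)
# so the implicit zero-depth gaps (positions 2-33 here) count against it.
printf 'chr1\t1\t10\nchr1\t34\t5\n' \
  | awk -v N=5 -v GLEN=100 '$3 >= N { c++ }
                            END { printf "%.2f\n", 100 * c / GLEN }'
```

Both input positions pass N=5, so this prints 2.00 (2 of 100 toy positions).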


Yeah, samtools depth is the method I was using. But for these massive BAM files, just running samtools depth seems to take a long time...


I've switched over to using sambamba depth. It's much faster than samtools depth and is also multithreaded. It also gives a parsed, pileup-like output with coverage by base type, and it has additional options to calculate coverage in overlapping windows.


I don't see an equivalent to depth or bedcov in the sambamba docs

12 months ago
drkennetz • 370

To update this thread: sambamba depth is a wonderful tool that supports multithreading, which drastically increases speed while providing the same functionality as samtools depth.

I would recommend everybody check out this package: Sambamba

12 months ago
Paul ♦ 1.3k
European Union

For very large BAM files (like WGS) I tested Mosdepth (link). It is very fast and there are a lot of options. Mosdepth can report the mean depth in 500-base windows genome-wide in under 9 minutes of user time with 3 threads.

Example of syntax:

mosdepth -n --fast-mode --by 500 sample.wgs $sample.wgs.cram
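For the original question, mosdepth also writes a `<prefix>.mosdepth.global.dist.txt` file whose "total" rows give the cumulative proportion of bases covered at >= each depth, so the >= 20x fraction can be read off directly. The rows below are fabricated for illustration; real output would come from the file written by the command above:

```shell
# Read the cumulative proportion of bases covered at >= 20x from
# mosdepth's global distribution output (fabricated example rows).
printf 'total\t21\t0.89\ntotal\t20\t0.91\ntotal\t19\t0.93\n' \
  | awk '$1 == "total" && $2 == 20 { print $3 }'
```

This prints 0.91 for the example rows; multiply by the genome length to get a position count.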

