Median coverage from BAM file
2
1
Entering edit mode
6.3 years ago
firestar ★ 1.6k

Any tool that can do median per base coverage over pre-defined windows from a BAM file?

I could make the actual per base coverage using samtools depth or genomeCoverageBed.

chr pos read_depth
1   1   6
1   2   8
...

I already have the windows prepared using

bedtools makewindows \
    -g "genome.fa.fai" \
    -w 1000 > "windows.bed"

I am looking for a result like below:

Median per base coverage over 1000bp windows

chr start  end   median_read_depth
1   1      1000  5
1   1000   2000  8
...
coverage DNA-Seq • 4.2k views
ADD COMMENT
2
Entering edit mode
6.3 years ago

I 've written : http://lindenb.github.io/jvarkit/BamStats04.html

$ java -jar dist/bamstats04.jar -B src/test/resources/toy.bed.gz src/test/resources/toy.bam 2> /dev/null | column -t 

#chrom  start  end  length  sample  mincov  maxcov  meancov  mediancov  nocoveragebp  percentcovered
ref     10     13   3       S1      3       3       3.0      3.0        0             100
ref2    1      2    1       S1      2       2       2.0      2.0        0             100
ref2    13     14   1       S1      6       6       6.0      6.0        0             100
ref2    16     17   1       S1      6       6       6.0      6.0        0             100
ADD COMMENT
1
Entering edit mode

Oh that's brilliant! Includes all sorts of summaries.

ADD REPLY
0
Entering edit mode

reformat.sh from BBMap suite also produces summaries of all sorts. Look at the histogram parameters.

ADD REPLY
0
Entering edit mode

I have tried bbmap pileup and it was great. It's blazing fast. Unfortunately, it only gives the total number of mapped reads (sum) over windows. It would've been awesome if it could provide other summary metrics (mean, median etc) as well. I am not familiar with reformat.sh, but I will look into it.

ADD REPLY
0
Entering edit mode
6.3 years ago

Using BEDOPS bedmap and bam2bed, you can calculate a BED file containing reference elements (genomic windows over which you want to calculate coverage) and a semi-colon-delimited list of read overlap sizes in the final column:

$ bedmap --echo --echo-overlap-size --delim '\t' windows.bed <(bam2bed reads.bam) > coverage.bed

Once you have this, you can use Unix tools like paste and awk to paste the first through N-1 columns together with the median read overlap size calculated from the final, Nth column:

$ paste <(awk '{ for (i=1; i<(NF-1); i++) { printf("%s\t", $i); } printf("%s", $(NF-1)); }' coverage.bed) <(awk '{ n=split($NF, a, ";"); n=asort(a); if (n % 2) { printf("%d", a[(n + 1) / 2]); } else { printf("%d", a[(n / 2)] + a[(n / 2) + 1]) / 2.0); } }' coverage.bed) > answer.bed

In answer.bed, the first through N-1 columns contain the window interval, and the last column contains the median overlap size.

You can put in as many metrics here as you want, mean, etc. just by adjusting the second awk command to print those statistics.

ADD COMMENT

Login before adding your answer.

Traffic: 1685 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6