Tutorial:Determine % of reference genome covered by aligned SAM/BAM
1
5
Entering edit mode
6.6 years ago

Thanks to brs1111 for the idea for this.

If you have aligned your reads to a reference genome, here hg38.fasta, and want to know how much of the reference genome has been covered by aligned reads, the following code will tell you:

#Determine number of bases at 0 read depth
zero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4==0 {bpCountZero+=($3-$2)} {print bpCountZero}' | tail -1)

#Determine number of bases at >0 read depth, i.e., non-zero bases
nonzero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4>0 {bpCountNonZero+=($3-$2)} {print bpCountNonZero}' | tail -1)

#Calculate the percent of the reference genome coverd by >0 read depth bases
#Round up to 6 decimal places
percent=$(bc <<< "scale=6; ($nonzero / ($zero + $nonzero))*100")

echo $percent

The genomecov function of BEDTools normally computes coverage in your BAM file for each position in your supplied reference genome. The output looks like this:

    chr1    0   11794364    0
    chr1    11794364    11794488    3
    chr1    11794488    11794490    5
    chr1    11794490    11794514    6
    chr1    11794514    11794639    3
    chr1    11794639    14022492    0
    chr1    14022492    14022535    1
    chr1    14022535    97082336    0
    chr1    97082336    97082442    1
    chr1    97082442    97082486    2
    chr1    97082486    97082593    1
    chr1    97082593    97305303    0
    chr1    97305303    97305377    2
...

This should take ~1-5 minutes to run with any BAM file on a 'modern' laptop.

reference-coverage ngs • 4.2k views
ADD COMMENT
1
Entering edit mode

You may use the -bg instead of the -bga option for genomecov, as this will only report non-zero regions. Subtracting the sum of the spanned bp from the total chr length will save you one run of genomecov (have also a look at mawk instead of classical awk, it is way faster for these simply count/filter tasks). Also keep in mind that this result is only meaningful if you prefiltered your bam file to exclude low quality alignments (at least remove MAPQ=0, I always require MAPQ to be 20 or higher) as low quality alignments are either placed random (MAPQ=0) or of low reliability. I may also doubt a bit that a 50x WGS bam will be finished in 5min^^

ADD REPLY
0
Entering edit mode

Thanks for the input friend. It does run rapidly for me on my high-spec laptop... but, when I'm on projects in the Tropics, my CPU typically reaches 90 Celsius (194 Fahenheit) and shuts down the computer :/

ADD REPLY
0
Entering edit mode

MAPQ of 20 is way too low for a clinical application - that's a 1-in-100 chance of misalignment. For clinical tests, I push it up to MAPQ of 60. As bioinformatics methods are now entering the clinical realm, we all need to think more about quality control and proper coding.

ADD REPLY
0
Entering edit mode

That's silly, it's not like you're relying on only a single alignment for variant calling. The MAPQ scores from most aligners aren't as meaningful as you seem to think (most don't even go up to 60, since their bounds are fairly artificial).

ADD REPLY
0
Entering edit mode

Devon, your reply is somewhat naive. There are regions of the genome that exhibit a high level of homology to others, and all reads over these regions will exhibit low MAPQ. Should a variant call over these regions even be trusted if all mapped reads over them have MAPQ <30? If a patient is waiting on a result at the end of the day, would you make that call? I think not.

ADD REPLY
0
Entering edit mode

You think wrong. If that variant made the most sense in the clinical context and when considering the various other discovered variants then of course that's what would and should be reported. Again, you're placing vastly too much trust into MAPQ values.

ADD REPLY
0
Entering edit mode

I upvoted your comment, by the way!

ADD REPLY
3
Entering edit mode
6.6 years ago

Alternative methods:

  1. plotFingerprint --outQualityMetrics from deepTools. The X-intercept is the fraction of the genome not covered. This has the benefit of (A) being multithreaded and (B) allowing filtering of the alignments.
  2. plotCoverage from deepTools.
ADD COMMENT
0
Entering edit mode

Cheers for providing alternatives!

ADD REPLY

Login before adding your answer.

Traffic: 1954 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6