Question

Getting The Average Coverage From The Coverage Counts At Each Depth.

1

Entering edit mode

10.9 years ago

Jordan ★ 1.3k

Hi,

I have read quite a few posts here about coverage already. But I still had a few questions. I have a BAM file I'm trying to find the coverage of it (typically like say 30X).

So, I decided to use genomeCoverageBed for my analysis. And I used the following command:

genomeCoverageBed -ibam file.bam -g ~/refs/human_g1k_v37.fasta > coverage.txt

As many are aware, the output of the file looks something like this:

genome    0    26849578    100286070 0.26773
genome    1    30938928    100286070     0.308507
genome    2    21764479    100286070    0.217024
genome    3    11775917    100286070    0.117423
genome    4    5346208    100286070    0.0533096
genome    5    2135366    100286070    0.0212927
genome    6    785983    100286070    0.00783741
genome    7    281282    100286070    0.0028048
genome    8    106971    100286070    0.00106666
genome    9    47419    100286070    0.000472837
genome    10    27403    100286070    0.000273248

To find the coverage, I multiplied col2 (depth) with col3 (number of bases in genome with that depth) and then summed the entire column. Then, I divided it by genome length to get the coverage. In this case, col2 * col3 is:

And the sum is: 150098740. Since the genome length is 100286070, the coverage is 150098740/100286070 = 1.5. That is to say it is, 1.5X. I have only considered the first 10 depth from the file here, but you get the idea. So, is this the right way to get the physical coverage?

Note: While the output file (coverage.txt) gives individual chromosome details, I only took the genome details i.e., col1 labeled as genome.

coverage depth-of-coverage bedtools • 5.5k views

ADD COMMENT • link updated 10.9 years ago by Jorge Amigo 14k • written 10.9 years ago by Jordan ★ 1.3k

score 1 · Answer 1 · 2013-05-22

Picard tools' CalculateHsMetrics does perfectly the job directly from the bam file, allowing even to distinguish target from bait design in case it's needed (not on the following example though):

java -jar \$HOME/bin/picard-tools/CalculateHsMetrics.jar \
REFERENCE_SEQUENCE=hg19.fa \
TARGET_INTERVALS=intervals.bed \
BAIT_INTERVALS=intervals.bed \
INPUT=input.bam \
OUTPUT=hsmetrics.txt

all the output fields in that hsmetrics.txt are described at the HsMetrics' class description, which of course include the MEAN_TARGET_COVERAGE you're after.

score 0 · Answer 2 · 2013-05-20

0

Entering edit mode

10.9 years ago

Istvan Albert 100k

You kind of seem to go the other way, from more specific answers to less specific ones. Typically the average coverage is very easy to get:

number of reads * length of each read / size of the genome

You have results already counted by individual coverage, and while the formula that you use seems to be right note how the result that you are getting are not that useful. You average coverage ends up to be 1.5x even though 26% of the bases have no coverage at all and 30% of reads have a coverage of 1. I think this is because your distribution is heavily skewed, in which case averages are not that meaningful quantities.

ADD COMMENT • link 10.9 years ago by Istvan Albert 100k

0

Entering edit mode

Does this logic apply to exomes too? Like say, human exome is 30MB I think. And I have 30 million mapped reads, with 32bp reads. Then, the average coverage would be 30X right?

ADD REPLY • link 10.9 years ago by Jordan ★ 1.3k

0

Entering edit mode

yes, to get the idealized coverage you would divide with the total length of the DNA that is supposedly the source of the reads.

ADD REPLY • link 10.9 years ago by Istvan Albert 100k