Calculate percentage of bases covered less than nX in targeted sequencing experiment
Asked 9.3 years ago by Ram

Hi,

I'm looking to calculate some coverage statistics on the alignment in my targeted sequencing experiment. I have a BAM file and a BED file with the target regions. I wish to determine the percentage of bases covered above 30X, 40X, 50X etc. How can I best do this?

Also, we sequenced a LOT of samples in a single lane, so we had to repeat the experiment multiple times to get sufficient coverage. Will the above procedure help me determine if I have reached sufficient coverage? I wish to have at least 70X across the target region, but I can make do with a small percentage of low coverage regions.

I already have stats on average coverage per sample (computed by dividing the number of bases sequenced on target by the total length of the target region).

Tags: coverage • NGS • targeted sequencing
Answered 9.3 years ago

Have you tried just piping samtools depth to awk or do you want a more complicated analysis than that?
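A minimal sketch of that pipe, assuming the standard three-column output of `samtools depth` (chromosome, position, depth); file names would be placeholders, and here synthetic depth lines stand in for the real `samtools depth -a -b targets.bed sample.bam` output so the awk part is self-contained:

```shell
# Real pipeline (not run here):
#   samtools depth -a -b targets.bed sample.bam | awk '...'
# Synthetic chrom<TAB>pos<TAB>depth lines stand in for samtools output:
printf '1\t1\t35\n1\t2\t45\n1\t3\t55\n1\t4\t10\n' \
  | awk '{ n++; for (t = 30; t <= 50; t += 10) if ($3 >= t) c[t]++ }
         END { for (t = 30; t <= 50; t += 10)
                 printf "%dX: %.0f%%\n", t, 100 * c[t] / n }'
# 30X: 75%
# 40X: 50%
# 50X: 25%
```

The `-a` flag matters: without it, zero-coverage positions are omitted and the denominator is wrong.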


Me to samtools: "I don't know half of you half as well as I should like; and I like less than half of you half as well as you deserve."

Thank you, Devon. I'll explore samtools!


Devon, samtools depth | awk seems like a really speedy option in my case. I have 30 BAM files, and what samtools depth did in under 24 hours (all samples put together), GATK's DepthOfCoverage is unable to do for even 5 chromosomes of a single sample.

I haven't tried Brian's tool yet, because I'm not sure if it works with targeted sequencing. I'd prefer if I did not have to do the filtration step after the depth analysis.


Thanks for getting back to us on that, it's good to know. I'm not surprised that GATK's tool is slow, their walker methods are generally absurdly slow. BisSNP is a modified version of GATK for methylation realignment/calling and it's around 10x slower on processing a file than my tool, which uses htslib, for example.

Let us know if you end up using Brian's tool too. At the end of the day, the speed here should just be limited by disk IO.


I'm using my org's HPC, and I/O is great + not in my control, but yeah, GATK seems quite slow and consumes time that I do not have.

I might try Brian's tool today and check its performance for my specific use case, but I must also focus on getting the work done. samtools depth has given me granular (per base depth) data, and I need to stratify this + check across multiple sequencing runs for consistently low covered regions.

My ultimate aim is to determine if another run of sequencing can reduce the fraction of low covered regions, and I guess I should consult with my sequencing team on how the biology and sample prep affects this as well.
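One way to sketch that cross-run check, assuming each run's `samtools depth -a` output covers the same target positions (so line N matches position-for-position across files); the 70X cutoff is from the thread, everything else is illustrative, and synthetic files stand in for real depth output:

```shell
# Two runs' per-base depth (chrom, pos, depth) over the same target positions.
# Synthetic stand-ins for `samtools depth -a -b targets.bed runN.bam` output:
printf 'chr1\t100\t60\nchr1\t101\t80\nchr1\t102\t50\n' > run1.depth
printf 'chr1\t100\t65\nchr1\t101\t75\nchr1\t102\t90\n' > run2.depth

# Positions below 70X in *both* runs are candidates for persistently low coverage
# (probe or prep issues rather than insufficient sequencing).
paste run1.depth run2.depth \
  | awk '$3 < 70 && $6 < 70 { print $1 "\t" $2 }'
# chr1	100
```

Positions that are low in every run are the ones another lane is unlikely to rescue.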


Focus on getting the work done? Pfft, what's that about :P


Exploration is good, but not at the cost of timely results, no? "I'm working on it" can buy only so much time :P


Agreed :) If only there were infinite time!

Answered 9.3 years ago by Varun Gupta

I think GATK has a DepthOfCoverage tool that does exactly what you want, and you could then filter at 30X, 40X, etc. with awk.
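A hypothetical GATK 3-style invocation (file names are placeholders, and flags differ between GATK versions, so check the docs for yours); `-ct` / `--summaryCoverageThreshold` is repeatable and adds a %_bases_above_N column to the sample summary:

```shell
java -jar GenomeAnalysisTK.jar \
    -T DepthOfCoverage \
    -R reference.fasta \
    -I sample.bam \
    -L targets.bed \
    -o sample_cov \
    -ct 30 -ct 40 -ct 50
```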


I stumbled across this too, along with samtools | awk, but I wished to consult with folks here first. I feel that discussions, not just questions, are the essence of our interactions here :)

Thank you for recommending this - I'll give it a try!


I suspect that Brian's tool will prove the most convenient, they typically do. In any case, I'm curious to hear what turns out to be the best solution. BTW, with the samtools route, do make sure to use -q 1 or higher so you don't double count cases where paired-end reads overlap.


My tool is indeed convenient, but does double-count overlapping paired reads, and uses much more memory (2 bytes/base) than is necessary when processing sorted data. However, double-counting the overlapped region is not necessarily incorrect; they are two independent observations of the same base, from the perspective of sequencing error. I would have much higher confidence in a base call doubly-covered by 10 overlapping read pairs than one covered by 10 non-overlapping reads. Not as much as 20 totally independent non-overlapping reads; somewhere in between, but more on the high side. As such, I prefer overlapped regions double-counted when calculating coverage. Though when variant calling, the best approach (IMO) is to track the number of unique pair IDs that indicate a given variant, which prevents double-counting.


Yeah, one of the things that samtools will do is adjust the quality for bases covered by overlapping reads in a pair (well, it at least penalizes them if they disagree and will in any case convert the phred score of the base coming from read #2 to 0). I should actually check to see if it bumps the scores if the bases agree (I don't think so, but that'd be a good option to add).


Sorry, I edited my response to better reflect my position that calling variants is best done when excluding the X bases closest to the ends (where X is around 8) to allow more specificity with regards to indels versus substitutions. When looking for SNPs only, truncating the ends of both reads to the point that they no longer overlap, and adjusting the qualities accordingly, is probably best, but it reduces the ability to correctly call indels near the middle of the pair.

Answered 9.3 years ago

If you want to use BBTools:

pileup.sh in=mapped.bam hist=histogram.txt

That gives the number of bases at every coverage level. To process bam files, samtools must be in the path; sam files don't need it. The input does not need to be sorted.


Thank you for the detailed response outlining all possible nuances in the approach. I'll try this and get back to you on how it performs!


Brian, the tool is really fast, thank you!

Quick question: My experiment involves targeted sequencing, so is there any problem with the program not accepting a BED file with my target regions?


Pileup's histogram will give you coverage statistics over the entire reference. It can also give you per-base or binned coverage using the "basecov=" or "bincov=" flags. It cannot currently calculate the coverage distribution restricted to the area specified in a bed file, so if that's what you want, you'd have to generate the per-base coverage and process it in conjunction with the bed file yourself.

However, I'm always interested in adding options. Can you explain exactly how you want the bed file to be processed? As in, am I correct in thinking that you just want the coverage distribution of the ranges designated in the bed file?
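Until such an option exists, one hypothetical way to restrict per-base output to BED targets is a two-pass awk; the basecov column layout (refname, 0-based position, coverage) is assumed, one target interval per reference is assumed for brevity, and synthetic files stand in for real `pileup.sh in=mapped.bam basecov=basecov.txt` output:

```shell
# Synthetic stand-ins for a BED target and per-base coverage output:
printf 'chr1\t100\t150\n' > targets.bed
printf 'chr1\t90\t10\nchr1\t120\t40\nchr1\t130\t25\nchr1\t200\t80\n' > basecov.txt

# First pass loads the interval; second pass counts on-target bases >= 30X.
awk 'NR == FNR { start[$1] = $2; end[$1] = $3; next }
     ($1 in start) && $2 >= start[$1] && $2 < end[$1] { n++; if ($3 >= 30) hit++ }
     END { printf "%.0f%% of target bases >= 30X\n", 100 * hit / n }' \
    targets.bed basecov.txt
# 50% of target bases >= 30X
```

Real target BEDs have many intervals per chromosome, so a production version would store interval lists per reference rather than a single start/end pair.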


Thank you for the detailed response, Brian. You are correct - my usage scenario is calculating depth of coverage at particular stretches of sequence as specified in the BED file, where each stretch is an enriched region targeted by our probe. This can currently be done using the -b option in samtools depth and the -L option in GATK's DepthOfCoverage.

BTW, does your code have a web user manual listing all options by any chance?


All of the documentation is currently in the shell scripts. When you run a shell script with no arguments it will print all of the options for that tool.

Answered 5.7 years ago by toni

I know it's an old thread but as I have tested this tool recently, I would definitely use Mosdepth to answer this question.

It's fast and its --thresholds option does exactly this.

Anthony
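A hypothetical invocation for this question (names are placeholders; see mosdepth's docs for your version's exact flags):

```shell
# -n skips the per-base output for speed; --by restricts to the target BED;
# --thresholds requests base counts at each cutoff per region.
mosdepth -n --by targets.bed --thresholds 30,40,50,70 sample_prefix sample.bam
# Writes sample_prefix.thresholds.bed.gz, one row per target region.
```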


Just a note: --thresholds needs region data supplied using the --by parameter, so it might not be great for across-the-genome metrics.


Can that threshold output be converted into a percentage, so I can calculate what percentage of a particular gene is covered in the data?
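The thresholds output gives base counts per region, so the percentage is just count / region length. A sketch, assuming the column layout chrom, start, end, region name, then one count column per threshold (a synthetic line stands in for a decompressed thresholds.bed.gz row):

```shell
# Assumed columns: chrom  start  end  region  bases>=30X  bases>=40X
printf 'chr1\t100\t200\tGENE1\t90\t50\n' \
  | awk '{ len = $3 - $2
           printf "%s\t%.0f%%\t%.0f%%\n", $4, 100*$5/len, 100*$6/len }'
# GENE1	90%	50%
```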
