Question

Determining in-house cutoffs

0

7.7 years ago

skbrimer ▴ 740

Hello group,

My boss wants a table of depth-of-coverage cutoffs versus data quality (e.g. Q10 = 100x, Q20 = 50x, Q30 = 25x, etc.). I'm not sure how to do this, since for so much of what we do the answer is "it depends". I need someone to explain it like I'm 5 (also my favorite Reddit channel).

From my understanding of Phred scores, Q10 means a 0.9 chance that any given base in a read is correct. That should mean that in a 100 bp read with a mean Phred score of 10, I could expect about 10 random bases in the read to be incorrect. However, the odds of a given position being called correctly also increase exponentially each time it is observed again. That would imply that I could have a few reads with a mean Phred score of 10 covering an area and still accurately call a SNP with as little as 3x coverage.

Using the following:

P = 0.9, the probability of being right

Po = (1-P)^n, the probability of being wrong, where n = the number of observations

So for 1 observation Po would equal 0.1, for 2 observations 0.01, for 3 observations 0.001, etc.
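A minimal Python sketch of the arithmetic above. It assumes errors are independent between reads, which is an assumption of this sketch and not something established in the post. The function names are illustrative only.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to an error probability: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_all_wrong(q: float, n: int) -> float:
    """Po from the post: probability that all n observations are wrong.

    Assumes the n observations are independent (sketch assumption).
    """
    return phred_to_error_prob(q) ** n


# Print Po for 1, 2 and 3 observations at Q10.
for n in (1, 2, 3):
    print(f"n={n}: Po ~ {prob_all_wrong(10, n):.3g}")
```

Note that this only models the chance that every observation is wrong at once. It says nothing about mapping errors or systematic errors, which do not behave independently.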

This doesn't seem to jibe with current practices, and I'm not sure what I am missing. Can someone point me to a good reference or explain to me where I am wrong? I would really appreciate it.

Thanks, Sean

coverage • 1.5k views

updated 12 months ago by Ram 43k • written 7.7 years ago by skbrimer ▴ 740

2

Your VCF should contain a variant quality score (the QUAL column) as well as depth of coverage (typically a DP field). If you plot one against the other, you should see some correlation. You can also see at what coverage the quality scores become reasonable (which will depend on the caller).
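As a rough sketch of that first idea, the snippet below pulls QUAL and depth out of VCF records and summarises the median QUAL per depth bin. It assumes an uncompressed, tab-separated VCF whose INFO column carries a site-level `DP=` entry. Some callers report depth only per sample in the FORMAT fields, in which case the parsing would need adapting. The function names, the bin width, and the example records are all made up for illustration.

```python
from collections import defaultdict
from statistics import median


def qual_and_depth(vcf_lines):
    """Yield (depth, qual) for records with a numeric QUAL and an INFO DP."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[5] == ".":
            continue  # malformed record or missing QUAL
        info = dict(
            item.split("=", 1) for item in fields[7].split(";") if "=" in item
        )
        if "DP" in info:
            yield int(info["DP"]), float(fields[5])


def median_qual_by_depth_bin(vcf_lines, bin_width=10):
    """Group variants into depth bins and return {bin_start: median QUAL}."""
    bins = defaultdict(list)
    for depth, qual in qual_and_depth(vcf_lines):
        bins[(depth // bin_width) * bin_width].append(qual)
    return {start: median(quals) for start, quals in sorted(bins.items())}


# Tiny invented example; in practice pass an open VCF file handle instead.
EXAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t12.5\t.\tDP=4",
    "chr1\t200\t.\tC\tT\t18.0\t.\tDP=8;AF=0.5",
    "chr1\t300\t.\tG\tA\t250.0\t.\tDP=35",
]
print(median_qual_by_depth_bin(EXAMPLE))
```

The depth bin at which the median QUAL levels off is one candidate for a caller-specific cutoff.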

A similar experiment would be to split the FASTQ into two. Call variants in each. Compare depth of coverage in one sample versus the other for all called variants. There should be poor correlation at low coverage and high correlation at high coverage. Determine where that border is.
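The splitting step could be sketched as below. This assumes an uncompressed FASTQ with strict 4-line records (no wrapped sequence lines) and sends alternating records to the two halves. The function name is hypothetical. For paired-end data, running it on R1 and R2 separately keeps mates in the same half, because records are assigned by position in the file.

```python
import io
from itertools import islice


def split_fastq(in_handle, out_a, out_b):
    """Write alternating 4-line FASTQ records to out_a and out_b.

    Returns the total number of records read.
    """
    count = 0
    while True:
        record = list(islice(in_handle, 4))
        if not record:
            break
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        target = out_a if count % 2 == 0 else out_b
        target.write("".join(record))
        count += 1
    return count


# Small in-memory demonstration; real use would pass open file handles.
reads = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n")
half_a, half_b = io.StringIO(), io.StringIO()
print(split_fastq(reads, half_a, half_b), "records split")
```

Each half can then go through the same alignment and variant-calling steps, and the depth of shared calls can be compared between the two.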

There are more complicated approaches if you want to turn this into a paper, but this might be enough to answer your actual question.

7.7 years ago by igor 13k

0

Great idea! Thank you, I will try them and see how they shake out!

7.7 years ago by skbrimer ▴ 740

1

Not entirely what you are looking for, but the authors of this paper investigated required coverage vs. the chance of detecting variants: http://www.ncbi.nlm.nih.gov/pubmed/23773188

7.7 years ago by WouterDeCoster 47k

0

Thank you for the link! I will have to read it a few times for it all to sink in but this is helpful!

7.7 years ago by skbrimer ▴ 740

1

It's an interesting article because it gives you an argument for requiring a certain coverage, instead of just choosing a cutoff of e.g. 20x without justification.

In addition, the likelihood of a correct variant call will not always scale linearly with coverage, for example in the presence of elements such as a short tandem repeat, a homopolymer, or a segmental duplication with a paralogous sequence variant.

The coverage of a position is only a rough indicator of the likelihood of variant identification, and the variant quality score (among other parameters such as strand bias) takes the coverage into account.

It's important to make a clear distinction between base quality, variant quality, and mapping quality.