Question

Estimate Expected Frequency Of A Non-SNP Mutation From 1000 Genomes Data

0

11.0 years ago

kiekyon.huang ▴ 10

From a cohort of 50 patient samples, I found 3 germline exonic/UTR mutations (non-SNP) in one gene (FANCA). All three mutations are located at different sites in FANCA.

I would like to estimate the expected frequency of FANCA mutations in the general population using the 1000 Genomes data, to ask whether 3 mutations in 50 samples is statistically more than expected.

Has anyone done anything like this before? Thanks

1000genomes • 2.7k views

ADD COMMENT • link updated 11.0 years ago by Giovanni M Dall'Olio 28k • written 11.0 years ago by kiekyon.huang ▴ 10

1

I am sorry that I cannot answer your question directly. Approaching the problem the way you describe raises an immediate follow-up question for me: given a set of rare variants, without a family pedigree, how can one differentiate mutations from SNPs? With your patient data you might have looked for novel variants (e.g. not in dbSNP or in 1kG). That might introduce circular inference, because by your definition of 'likely a mutation' your new variants are disjoint from the 1kG set. You could just as well have picked up variation with a MAF below the detection limit of the 1kG data. Such an observation is perhaps not that striking, given that every newly analyzed genome will also yield novel variant calls. There will be at least a few; how many could be estimated by simulation, e.g. taking 1 genome out of the 1kG data, re-calling the variants for 999 genomes, then calling variants for the 1 taken out.

To find out whether the variants are associated with the phenotype, it might be more appropriate to re-frame your problem as standard association testing, since that includes the phenotype. Testing solely for deviation from the mutation rate does not, by design, take any phenotype into account, and so cannot deliver any information about phenotype-genotype association.
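To make the association-testing framing concrete, here is a minimal sketch of a one-sided Fisher's exact test on carrier counts, written in plain Python. The 3 carriers in 50 cases come from the question; the control counts are purely hypothetical placeholders, since the thread gives none, and the function name is my own.

```python
from math import comb

def fisher_one_sided(case_carriers, case_total, ctrl_carriers, ctrl_total):
    """One-sided Fisher's exact test on a 2x2 carrier table.

    Returns the probability, under the hypergeometric null of no
    association, of seeing at least `case_carriers` carriers among the cases.
    """
    carriers = case_carriers + ctrl_carriers
    total = case_total + ctrl_total
    denom = comb(total, case_total)
    upper = min(carriers, case_total)
    num = sum(
        comb(carriers, x) * comb(total - carriers, case_total - x)
        for x in range(case_carriers, upper + 1)
    )
    return num / denom

# 3 carriers in 50 cases is from the question.
# 0 carriers in 50 controls is a HYPOTHETICAL value for illustration only.
p_value = fisher_one_sided(3, 50, 0, 50)
print(p_value)
```

With real data, the control row would come from a matched control cohort or from carrier counts in a reference panel, processed with the same variant-calling pipeline as the cases.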

ADD REPLY • link 11.0 years ago by Michael 54k

0

Also, you need to explain how a germline mutation can be linked to a somatic phenotype. Or are you looking at a germline-related phenotype?

ADD REPLY • link 11.0 years ago by Michael 54k

0

Your phrase is incomplete. "To ask if the 3/50 mutations is statistically more than expected" -> more what than expected??

ADD REPLY • link 11.0 years ago by Giovanni M Dall'Olio 28k

score 2 · Answer 1 · 2013-04-17

Having written my comment, I realized that probably the most straightforward approach to estimate the occurrence of novel variants in a dataset of a certain size might be to subsample the 1kG data. Note that this isn't an estimate of the mutation rate unless we have a method to single out point mutations. Historically, SNPs have been defined as variants with a MAF of at least some threshold (e.g. 1%); I am not sure how much sense such an arbitrary threshold makes in this case. Also, a sampling approach might not be practicable because of the computational cost and its sensitivity to the variant calling pipeline and its parameters.
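The subsampling idea can be sketched roughly as follows. This is a simplified illustration only: it assumes variant calls are already available as one set per sample and just counts variants private to each held-out sample, whereas a faithful version would re-run the variant caller on each subset. The sample names and variant labels are made up.

```python
def private_variant_counts(calls):
    """Leave-one-out count of variants seen in one sample and no other.

    `calls` maps a sample name to a set of variant identifiers.
    Returns a dict mapping each sample to its number of private variants.
    """
    counts = {}
    for sample, variants in calls.items():
        # Pool the variants of every other sample.
        others = set().union(*(v for s, v in calls.items() if s != sample))
        counts[sample] = len(variants - others)
    return counts

# Made-up toy calls, not real 1kG data.
toy_calls = {
    "sampleA": {"v1", "v2", "v3"},
    "sampleB": {"v1", "v4"},
    "sampleC": {"v1", "v2"},
}
counts = private_variant_counts(toy_calls)
mean_private = sum(counts.values()) / len(counts)
print(counts, mean_private)
```

Restricting the variant sets to one gene and averaging over many held-out samples would give a rough baseline for how many "novel" variants a new genome is expected to contribute.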

Also, when looking for estimates of mutation rates:

The human mutation rate is higher in the male germ line (sperm) than the female (egg cells), but estimates of the exact rate have varied by an order of magnitude or more. [...] Using data available from whole genome sequencing, the human genome mutation rate is similarly estimated to be ~1.1×10^-8 per site per generation. http://en.wikipedia.org/wiki/Mutation_rates

Using this probability as p (or any other estimate) for a single event, the probability of observing n = 3 or more mutations (successes) in k trials (k := number of exonic bases in the gene) can be calculated with the cumulative distribution function of the binomial distribution. In R you can use the function pbinom(n, k, p, lower.tail = FALSE, log.p = FALSE). Note that with lower.tail = FALSE, pbinom returns P(X > n), so for "n or more" the first argument has to be n - 1. Given that the CDS of human FANCA is 4368 nt, and assuming the highest mutation rate I found, 2.7e-8, pbinom(3, 4368, 2.7e-8, lower.tail = FALSE) yields 8.048903e-18, which is the probability of more than 3 mutations; for 3 or more, pbinom(2, 4368, 2.7e-8, lower.tail = FALSE) gives roughly 2.7e-13. Either way this looks significant, but it depends on the purity of the sequences: if your variants are sampled from a mixture, this naive calculation is void.

This relies on the assumption that the mutation events are independent of each other. This should be justified for real mutations, but not for their accumulation, because of varying levels of purifying selection in different regions. As you probably compared somatic with germline cells, the mutations might be real point mutations, and under the null hypothesis there should be no significant accumulation of mutations in any given region.
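For anyone without R at hand, the binomial tail above can be reproduced with a few lines of plain Python. This is only a sketch: the function name is my own, and unlike pbinom with lower.tail = FALSE (which returns P(X > n)), it takes the threshold inclusively and returns P(X >= n). The inputs are the CDS length and mutation rate quoted above.

```python
import math

def binom_upper_tail(n, k, p):
    """P(X >= n) for X ~ Binomial(k, p).

    Sums the tail terms directly instead of computing 1 - CDF,
    which would lose all precision for probabilities this small.
    """
    term = math.exp(k * math.log1p(-p))  # P(X = 0)
    ratio = p / (1.0 - p)
    tail = 0.0
    for i in range(k + 1):
        if i >= n:
            tail += term
        # Recurrence: P(X = i + 1) = P(X = i) * (k - i) / (i + 1) * p / (1 - p)
        term *= (k - i) / (i + 1) * ratio
    return tail

cds_length = 4368       # FANCA CDS length in nt, as quoted above
mutation_rate = 2.7e-8  # per site per generation, as quoted above

p_three_or_more = binom_upper_tail(3, cds_length, mutation_rate)
p_more_than_three = binom_upper_tail(4, cds_length, mutation_rate)
print(p_three_or_more, p_more_than_three)
```

The second value corresponds to pbinom(3, 4368, 2.7e-8, lower.tail = FALSE) in R; the first corresponds to pbinom(2, ...). The same caveats apply: the calculation treats sites as independent trials with one shared rate.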