Question

Comparing SNPs Across Populations

6

13.6 years ago

Andrea_Bio ★ 2.8k

Hello

I have a very high-level exploratory question about SNPs and comparative genomics.

Let's say I had two different populations of the same species, one resistant to a particular disease and the other susceptible. Naturally I want to find out what confers resistance on the resistant population. How would I go about this using SNP data? I appreciate this is a huge question, but I'm just trying to find out which areas I need to research in more detail.

Let's say I had a full genome sequence for individuals from both populations and knew their SNP alleles. Can the allele frequencies of SNPs in the two populations tell me anything? If the allele frequencies of a SNP differ between the two populations, is that potentially interesting (although it might reflect some difference between the populations other than disease susceptibility)? How many individuals would I need in each population to compare the allele frequencies?

Are there any sorts of statistical analysis I might perform? For example if I found an area of the genome had a higher/lower SNP distribution than the rest of the genome does this tell me anything? For example does a lower SNP distribution mean it is conserved and subject to positive selection? It's been a long time since I studied this so I could be remembering this all wrong.

Many thanks

snp comparative statistics • 9.0k views

Updated 13.5 years ago by David Quigley 11k • written 13.6 years ago by Andrea_Bio ★ 2.8k

score 7 · Answer 1 · 2010-10-09

13.6 years ago

Haibao Tang 3.0k

This is a common problem in association genetics. In theory, the approach of looking at SNP frequencies could work, but you will be swamped by false positives. The problem is population structure.

The ideal case is that the two populations differ in, and only in, the responsible SNP (like mutant and wild-type). However, that's not the general case. You'll most likely get at least tens of thousands of SNPs with frequency differences, and you'll have no idea which SNP is involved in disease susceptibility. The more divergent your populations are, the harder it gets. Say you want to find which SNP makes humans speak while chimps do not: you'll end up with millions of candidate SNPs that have different frequencies between these species.

Do you have more information about where your candidate might be? That could narrow down the search so that your method becomes feasible.

To your last question: based on coalescent theory, a region with an unusually low SNP rate might result from a selective sweep, which might indicate a site undergoing positive selection.
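To get a feel for how quickly divergence produces candidate SNPs, one common per-SNP summary (not named in this answer, so treat it as an illustrative choice) is Fst, which compares heterozygosity within populations to heterozygosity in the pooled sample. The sketch below assumes two equally weighted populations and made-up allele frequencies; the function name and the SNP labels are hypothetical.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Per-SNP Fst for two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    Fst = (Ht - Hs) / Ht, where Hs is the mean within-population expected
    heterozygosity and Ht is the expected heterozygosity of the pooled sample.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # allele is fixed or absent in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies (population 1, population 2) for three SNPs.
snps = {"snp_a": (0.05, 0.20), "snp_b": (0.50, 0.50), "snp_c": (0.0, 1.0)}
for name, (p1, p2) in snps.items():
    print(name, round(fst_two_populations(p1, p2), 3))
```

A SNP with identical frequencies scores 0 and a fixed difference scores 1. In divergent populations many SNPs unrelated to the disease will score high as well, which is the false-positive problem described above.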
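A rough way to look for such regions is to count SNPs in fixed-size windows and flag windows that fall far below the genome-wide average. This is only a sketch: the window size, the z-score cutoff, and the toy positions are assumptions, and low SNP density can have other causes (low mutation rate, background selection, poor read mapping), so a flagged window is a lead rather than evidence of a sweep.

```python
from statistics import mean, stdev


def window_snp_counts(positions, genome_length, window_size):
    """Count SNPs in consecutive non-overlapping windows (0-based positions)."""
    n_windows = -(-genome_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window_size] += 1
    return counts


def low_density_windows(counts, z_cutoff=-2.0):
    """Return indices of windows whose SNP count is far below the mean."""
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []  # every window has the same count
    return [i for i, c in enumerate(counts) if (c - mu) / sigma <= z_cutoff]


# Toy data: one SNP every 10 bp, except a depleted stretch at 400-499
# that carries a single SNP.
positions = [p for p in range(0, 1000, 10) if not 400 <= p < 500] + [450]
counts = window_snp_counts(positions, genome_length=1000, window_size=100)
print(counts)
print(low_density_windows(counts))
```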

Answer • 13.6 years ago by Haibao Tang 3.0k

Thanks for your answer. Let's say we had the ideal case and there was one SNP with a frequency difference. What frequency difference would you expect to see? For example, if the susceptible population had a minor allele frequency (MAF) of 5% (major allele frequency 95%) and the tolerant population had a MAF of, say, 20%, could you say that the minor allele is perhaps conferring resistance? Any other basic examples are welcome. I'm just trying to get a basic 'feel'.
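For a feel of the 5% versus 20% example, the sketch below runs a plain Pearson chi-square test (1 degree of freedom, no continuity correction) on allele counts. The sample sizes are invented: 50 and then 10 diploid individuals per population, that is 100 and 20 chromosomes. With counts this small an exact test would be the safer choice, and a significant difference at one SNP still says nothing about cause.

```python
import math


def allele_chi2(minor1, total1, minor2, total2):
    """Pearson chi-square test on a 2x2 table of allele counts.

    minorX is the minor-allele count and totalX the number of chromosomes
    sampled in population X. Returns (chi2, p_value) for 1 degree of freedom.
    """
    a, b = minor1, total1 - minor1
    c, d = minor2, total2 - minor2
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# 5% vs 20% MAF with 100 chromosomes per population (hypothetical sample).
print(allele_chi2(5, 100, 20, 100))
# Same frequencies with only 20 chromosomes per population.
print(allele_chi2(1, 20, 4, 20))
```

The same frequency difference gives a much weaker result in the smaller sample, so the observed difference and the number of chromosomes sampled have to be read together.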

Reply • 13.5 years ago by Andrea_Bio ★ 2.8k

Can you estimate the spread of disease resistance in either population and also the penetrance of the mutation? Knowing these two might help to predict the differences of MAF.

Reply • 13.5 years ago by Haibao Tang 3.0k

score 6 · Answer 2 · 2010-10-08

13.6 years ago

Mrawlins ▴ 430

The largest differences in SNP distribution between the two populations are ideal targets for further study. Ideally you would find a single SNP that is 100% one allele in one phenotype and 100% the other allele in the other phenotype. That isn't particularly likely in most cases, though. Often, even for phenotypes associated with a single SNP, you don't get 100% identification because of random effects and noise in the data (mis-called SNPs, etc.).

The basic idea behind this type of analysis is that it's a classification problem (predicting which phenotype based on SNP). Any classifier (Naive Bayes, Decision Tree, Artificial Neural Network, etc.) could be of use here. Techniques like principal components analysis could help eliminate some SNPs off the bat and make subsequent analysis easier.

You want to look at the sensitivity and specificity of your classification, and maximize both with the fewest number of SNPs. Pretty much any SNP identified this way needs to be verified with further experimentation before concluding "SNP X causes phenotype Y".
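As a minimal sketch of that evaluation, the simplest possible classifier is "carries the allele, so predict resistant". The code below scores such a rule against known phenotypes; the eight individuals and their labels are made up for illustration.

```python
def sensitivity_specificity(predicted, actual):
    """Sensitivity and specificity of boolean predictions against truth.

    predicted[i] is True if individual i is predicted resistant,
    actual[i] is True if individual i really is resistant.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one individual of each phenotype")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical data: does carrying the candidate allele predict resistance?
carries_allele = [True, True, True, False, False, False, True, False]
is_resistant = [True, True, True, True, False, False, False, False]
sens, spec = sensitivity_specificity(carries_allele, is_resistant)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```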

Answer • 13.6 years ago by Mrawlins ▴ 430

Every individual counts as a single data point. The more data you have, the more reliable the results are. The sensitivity and specificity are only as accurate as 1/N at most. Assume you have a 200/N % chance of missing a useful SNP. What's the smallest N you can live with for your experiment? That's where I start with this sort of analysis. A sample size of 20-50 is probably sufficient if you corroborate your results with additional experiments (e.g. inducing disease state using controlled mutations, etc.).
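For a single SNP, a textbook two-proportion sample-size formula gives a rough lower bound on how many chromosomes you need per group. The sketch below is not the 1/N reasoning above; it is a standard power calculation swapped in for illustration. The allele frequencies (5% and 20%), the 80% power target, and the genome-wide threshold of 5e-8 (a convention from human association studies) are all assumptions.

```python
import math
from statistics import NormalDist


def alleles_per_group(p1, p2, alpha=0.05, power=0.80):
    """Chromosomes needed per group to detect allele frequencies p1 vs p2.

    Uses the usual normal-approximation formula for comparing two
    proportions with a two-sided test at level alpha.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


for alpha in (0.05, 5e-8):
    n_alleles = alleles_per_group(0.05, 0.20, alpha=alpha)
    n_diploid = math.ceil(n_alleles / 2)  # two chromosomes per individual
    print(f"alpha={alpha}: {n_alleles} chromosomes, {n_diploid} individuals per group")
```

Testing one pre-chosen SNP needs far fewer individuals than scanning a whole genome, where the threshold has to be much stricter to keep false positives down.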

Reply • 13.5 years ago by Mrawlins ▴ 430

Thanks for your answer. Do you know how many individuals I would need in each population to draw any meaningful conclusions?

Reply • 13.6 years ago by Andrea_Bio ★ 2.8k

score 2 · Answer 3 · 2010-10-22

Because there could be thousands of small (1 to 100 bp) genetic differences (polymorphisms, insertions, deletions) between your resistant (R) and susceptible (S) strains, it may be necessary to reduce some of that genetic difference by back-crossing. Ugh, long time to see results...

That said, I would consider gene expression differences between the 2 strains as a second source of data of genes affected by those genetic differences. This is exactly what we did for a situation in mouse identical to what you describe. The important findings were comparing the uninfected states of the R and S strains as we learned how R was better primed to handle the challenge.

If you cannot get the SNPs involved (may be long, hard work, lots of mating and sequencing), at least the gene expression data gives you something to report. And these genes are legitimate targets for further research.

score 2 · Answer 4 · 2010-10-22

A short answer is that possessing the complete sequence is helpful but not sufficient.

The classical genetics approach is to cross the two strains together (backcross, intercross, etc) and map the phenotype (susceptibility) to a locus. You then try to refine the locus using congenic strains or other genetic techniques. A classic text on this is Silver (http://www.informatics.jax.org/silver/). One modern approach (used by many groups including ours) is to use gene expression data to refine the phenotype; see the work of Robert Williams' group, or our own papers (Balmain lab) for examples. This is still a very, very hard problem.

If all you have is sequence data, one thing that hasn't been mentioned yet is that polymorphisms that change the protein sequence of a coding exon are better de novo candidates than polymorphisms in non-coding DNA. That's not at all to say that causal polymorphisms must be in exons, but if you have no idea where to look, that's a good place to start.
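As a small illustration of that triage, the sketch below classifies a SNP inside a codon as synonymous, missense, or nonsense using the standard genetic code. It assumes you already know the reference codon (on the coding strand, in frame) and the position of the SNP within it; real annotation tools handle strand, splicing, and transcript choice, none of which is attempted here. The function and variable names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, offset, alt_base):
    """Classify a single-base change at position offset (0-2) of a codon."""
    ref_codon, alt_base = ref_codon.upper(), alt_base.upper()
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-lost"
    return "missense"


print(classify_coding_snp("GAA", 2, "G"))  # GAA -> GAG, both glutamate
print(classify_coding_snp("GAG", 1, "T"))  # GAG -> GTG, glutamate to valine
print(classify_coding_snp("TGG", 2, "A"))  # TGG -> TGA, tryptophan to stop
```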