Question

SNP pruning - Linkage disequilibrium measure, r2 (0.2), and minor allele frequency (0.05), why these values?

5.9 years ago

Volka ▴ 180

Hi all, I am currently learning quality control of GWAS data, and I am at the population stratification step. Here, the pipeline suggests pruning SNPs with a minor allele frequency of <0.05 and an r2 value of more than 0.2. My question is, why do we use these thresholds of 0.05 and 0.2 respectively?

linkage disequilibrium minor allele frequency • 3.7k views

updated 5.9 years ago by Kevin Blighe 87k • written 5.9 years ago by Volka ▴ 180

score 2 · Accepted Answer · 2018-05-19

You seem to imply that these are rigid, i.e., fixed, cut-offs that everyone uses, which is not the case. On one hand, whilst there is some solid statistical basis for choosing 0.05 as a p-value cut-off for statistical significance, there is no solid basis for choosing MAF 0.05 (or r2=0.2). People generally modify the cut-offs from experiment to experiment.

The MAF filter, for example, will depend heavily on the sample size and on the ethnic background of your cohort. The r2 calculation for linkage disequilibrium will depend on these too, and also on the SNP genotyping density. Both will additionally be affected by any imputation that has been performed.

The MAF cut-off of 0.05 was originally regarded as the boundary between 'rare' and 'common' alleles, with the original erroneous view being that common variants had no role in disease, and indeed many highly statistically significant 'common' variants were dismissed from association studies in the past because authors did not understand how a common allele could have a role in disease. It is only in recent years that we have functional evidence of how these play major roles in disease. For further great insight on this, please read: Rare and common variants: twenty arguments.

On the other hand, studying rare alleles is difficult because they are, by definition, rare, and are poorly represented in association study populations. Thus, statistical power is frequently lacking. This is why the UK's 100,000 Genomes Project, whose focus is rare disease, is sequencing ~100,000 genomes, i.e., because it is a large enough number to capture rare alleles with reasonable power.

As you are conducting population stratification, you are not interested in rare alleles because it is less likely that they will be representative of a particular population group. These alleles may later become fixed in a particular population group, but, for now, they are not. Then again, you cannot set the MAF filter too high either, because then population structure will be lost. In studies for population stratification, I have seen the MAF filter set to 0.01, 0.05, 0.1, and all the way up to 0.2.
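To make the MAF filter concrete, here is a minimal sketch in plain Python. It assumes biallelic SNPs coded as 0/1/2 copies of the alternate allele, with None for a missing call. The function names and the toy SNP IDs are made up for illustration and are not from any particular pipeline.

```python
def minor_allele_frequency(genotypes):
    """MAF of one biallelic SNP from 0/1/2 alternate-allele counts (None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    # Each called sample carries two alleles.
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1 - alt_freq)


def filter_by_maf(snps, min_maf=0.05):
    """Keep only SNPs whose MAF is at least min_maf."""
    return {name: g for name, g in snps.items()
            if minor_allele_frequency(g) >= min_maf}


# Toy data: 20 samples per SNP (hypothetical IDs).
snps = {
    "rs_common": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1] * 2,  # alt freq 18/40 = 0.45
    "rs_rare": [0] * 19 + [1],                        # alt freq 1/40 = 0.025
    "rs_flipped": [2] * 19 + [1],                     # alt freq 39/40, so MAF = 0.025
}

for cutoff in (0.01, 0.05):
    print(cutoff, sorted(filter_by_maf(snps, cutoff)))
```

With 20 samples, a single heterozygote already gives a MAF of 0.025, which shows why a sensible cut-off depends on sample size: in a small cohort, a low threshold lets through variants that are supported by only one or two observed alleles.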

The r2 LD filter is necessary in order to remove highly correlated variants that would otherwise add minimal extra information, i.e., it helps to define a 'core' set of independent and non-correlated markers that are most representative of your population groupings. Think of the final list of markers as a 'signature' of your population groups - that's effectively what one is aiming to define with population stratification.
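As a rough illustration of what the r2 filter does, the sketch below computes r2 as the squared Pearson correlation between two SNPs' 0/1/2 genotype vectors, and then greedily keeps a SNP only if it is not correlated above the threshold with any SNP already kept. This is a simplification under stated assumptions: real tools such as PLINK prune within sliding windows along each chromosome (for example via --indep-pairwise) rather than comparing all pairs, and the function names and toy data here are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two 0/1/2 genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)


def ld_prune(snps, max_r2=0.2):
    """Greedy pruning: keep a SNP only if r2 <= max_r2 with every SNP kept so far."""
    kept = []
    for name, genos in snps.items():
        if all(r_squared(genos, snps[k]) <= max_r2 for k in kept):
            kept.append(name)
    return kept


# Toy data: snp_b nearly duplicates snp_a; snp_c is largely independent of both.
snps = {
    "snp_a": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp_b": [0, 1, 2, 0, 1, 2, 0, 2],
    "snp_c": [1, 1, 1, 0, 0, 0, 2, 2],
}

print(ld_prune(snps, max_r2=0.2))
print(ld_prune(snps, max_r2=0.9))
```

Here snp_a and snp_b have r2 of about 0.85, so the stricter cut-off drops snp_b as redundant while the looser one keeps it. Which SNP of a correlated pair survives depends on the order in which they are visited, which is one more reason the pruned set is not unique.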

A useful experiment, of course, would be to repeat your work with different cut-offs.

Kevin