Question

How To Deal With Missing Genotypes In Population PCA Analysis

11.5 years ago

Alex Stoddard ▴ 190

When a principal component analysis is done on genome-wide SNP data, how should missing genotypes be handled?

Naively, I can think of two approaches: (i) Drop the markers with any missing data, but this loses too much data with a big cohort of samples and relatively random genotyping failure. (ii) Set each missing genotype to the average of the samples present for that marker (assuming each marker is coded as 0, 1, 2).
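A minimal sketch of approach (ii), written in plain Python for illustration. It assumes genotypes are stored as one list per sample, coded 0/1/2, with `None` marking a missing call; the function name `impute_marker_means` and the toy data are hypothetical.

```python
from statistics import mean


def impute_marker_means(genotypes):
    """Replace missing calls (None) with the per-marker mean of observed calls.

    genotypes: list of samples, each a list of 0/1/2 calls or None.
    Returns a new matrix; the input is left unchanged.
    """
    n_markers = len(genotypes[0])
    filled = [list(row) for row in genotypes]
    for j in range(n_markers):
        # Calls actually observed for marker j across all samples.
        observed = [row[j] for row in genotypes if row[j] is not None]
        if not observed:
            raise ValueError(f"marker {j} has no observed genotypes")
        marker_mean = mean(observed)
        for row in filled:
            if row[j] is None:
                row[j] = marker_mean
    return filled


# Toy data (assumed): 3 samples x 3 markers.
samples = [
    [0, 1, None],
    [2, None, 1],
    [1, 1, 2],
]
imputed = impute_marker_means(samples)
```

One property worth noting: if each marker is mean-centred before the PCA, a mean-imputed entry becomes zero, so that sample simply contributes nothing for that marker.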

Is approach (ii) reasonable? What would be better approaches?

pca genomics population • 6.0k views

updated 11.5 years ago by zx8754 11k • written 11.5 years ago by Alex Stoddard ▴ 190

score 5 · Answer 1 · 2012-12-05

The process of substituting a reasonable guess for missing data is called imputation and is fairly common practice for large data sets. Packages for performing imputation (using a k-nearest neighbors approach, for example) are available in R. I haven't used any of them recently so I can't comment on which one you should pick.

score 4 · Answer 2 · 2012-12-05

11.5 years ago

brentp 24k

How many markers do you lose if you drop those with any missing data?
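That question can be answered with a quick count. The sketch below assumes the same hypothetical layout as elsewhere in this thread (one list per sample, 0/1/2 calls, `None` for missing); the function name `missingness_summary` and the toy data are illustrative only.

```python
def missingness_summary(genotypes):
    """Report per-marker missingness and how many markers a complete-case filter drops."""
    n_samples = len(genotypes)
    n_markers = len(genotypes[0])
    # Fraction of samples with a missing call, for each marker.
    missing_fraction = [
        sum(row[j] is None for row in genotypes) / n_samples
        for j in range(n_markers)
    ]
    complete = sum(frac == 0 for frac in missing_fraction)
    return {
        "n_markers": n_markers,
        "complete": complete,
        "dropped": n_markers - complete,
        "missing_fraction": missing_fraction,
    }


# Toy data (assumed): 4 samples x 4 markers.
samples = [
    [0, 1, None, 2],
    [2, None, 1, 2],
    [1, 1, 2, 0],
    [0, 2, 1, 1],
]
summary = missingness_summary(samples)
```

Here half of the markers would be dropped even though each has only one missing call, which is the situation described in the question.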

You can set the missing markers to some value, but you may run into problems if there is bias in the missing data. As @Eugen says, inferring a value with KNN would be better than using an average.

There's a very simple-to-use R package that will do the imputation for you using KNN: http://www.bioconductor.org/packages/release/bioc/html/impute.html
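To illustrate the idea only, here is a small KNN imputation sketch in plain Python. It is not the `impute` package's implementation and makes its own assumptions: neighbours are other samples, distance is the mean squared difference over markers observed in both samples, and a missing call is filled with the mean of that marker among the `k` nearest samples. The function name `knn_impute` and the toy data are hypothetical.

```python
def knn_impute(genotypes, k=2):
    """Fill each missing call (None) with the mean of that marker in the k nearest samples."""

    def distance(a, b):
        # Mean squared difference over markers observed in both samples.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum((x - y) ** 2 for x, y in shared) / len(shared)

    filled = [list(row) for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, call in enumerate(row):
            if call is not None:
                continue
            # Candidate neighbours: other samples with marker j observed.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(genotypes)
                if m != i and other[j] is not None
            ]
            candidates = [c for c in candidates if c[0] != float("inf")]
            if not candidates:
                continue  # no usable neighbour; leave the call missing
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            filled[i][j] = sum(value for _, value in nearest) / len(nearest)
    return filled


# Toy data (assumed): two clusters of samples; the first sample lacks marker 3.
samples = [
    [0, 0, 0, None],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [2, 2, 2, 2],
    [2, 2, 1, 2],
]
imputed = knn_impute(samples, k=2)
```

In this toy case the missing call is filled from the two samples that resemble the first one, giving 0, whereas the marker-wide average over all observed samples would give 1. That difference is the reason KNN can behave better than an average when samples fall into distinct groups.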

Is KNN considered appropriate for genotype data and its typical structure? There is much research effort in genotype imputation, but I am looking for the simplest thing that could possibly work to get my data into a PCA for a first pass. It sounds like the danger with using the average is that it will be biased when data isn't missing at random. Provided I'm using a lot of markers (1000s or more) and each marker has only a small percentage of missing genotypes, do I risk much bias?

Reply • 11.5 years ago by Alex Stoddard ▴ 190

score 1 · Answer 3 · 2012-12-07

11.5 years ago

zx8754 11k

Have you considered AISNPs (ancestry informative SNPs)?

Analyses of a set of 128 ancestry informative single-nucleotide polymorphisms in a global set of 119 population samples http://www.investigativegenetics.com/content/2/1/1
