Hi.
I am working with GWAS data and have run into some batch effect problems.
Due to bad experimental design, the case samples (~250) and control samples (~450) were genotyped on different chips (GSA and OmniExpress). There are no matching samples between the platforms.
Now, after association analysis with population stratification and other covariates plus permutations, multiple very strong associations arise with very similar, highly significant p-values and high ORs. In my mind, this raises some questions.
So far, to deal with the batch effect, I have removed all SNPs that have p < 0.05 when one platform is used as case and the other as control, disregarding the actual sample phenotypes. I have also used the PLINK 1.9 --test-missing option, which supposedly tests for association between case/control status and SNP missingness in the samples.
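To make the platform-as-phenotype filter concrete, here is a minimal Python sketch. It assumes the per-SNP allele counts have already been tabulated per chip. The function names, SNP IDs and counts are hypothetical and only for illustration, and it uses a plain 1-df allelic chi-square test rather than whatever model the real association run used.

```python
import math


def allelic_chi2_p(a_alt, a_ref, b_alt, b_ref):
    """P-value of a 1-df chi-square test on a 2x2 allele-count table.

    Rows are the two platforms, columns are alt/ref allele counts.
    """
    n = a_alt + a_ref + b_alt + b_ref
    row_a, row_b = a_alt + a_ref, b_alt + b_ref
    col_alt, col_ref = a_alt + b_alt, a_ref + b_ref
    if 0 in (row_a, row_b, col_alt, col_ref):
        return 1.0  # monomorphic or empty table: no evidence of a difference
    chi2 = n * (a_alt * b_ref - a_ref * b_alt) ** 2 / (
        row_a * row_b * col_alt * col_ref
    )
    # Survival function of a chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def flag_batch_snps(counts, alpha=0.05):
    """Return SNP IDs whose allele frequencies differ between platforms.

    counts maps SNP ID -> (gsa_alt, gsa_ref, omni_alt, omni_ref),
    ignoring the actual disease phenotypes of the samples.
    """
    return [snp for snp, c in counts.items() if allelic_chi2_p(*c) < alpha]


# Hypothetical counts, not real data.
example_counts = {
    "rs_hypothetical_1": (100, 100, 100, 100),  # identical frequencies
    "rs_hypothetical_2": (150, 50, 50, 150),    # strong platform difference
}
print(flag_batch_snps(example_counts))
```

The same 2x2 test could be applied to missing versus non-missing genotype counts per platform, which is the idea behind checking differential missingness.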
My questions:
- I was wondering if there is any other way to deal with this kind of batch effect. I found that one person asked a question like this on a different forum, but after a year he had received no answer.
- Also, is it possible to do some sort of correction for this kind of batch effect, instead of just excluding the supposedly affected SNPs?
- How would one determine how much of an association is due to the batch effect and how much is due to actual allele frequency differences between the two cohorts?
Any suggestions or advice are highly appreciated. Thank you.
So you remove all the peaks?
I forgot to mention that the chip containing the desired phenotype for association also contains samples with 2 other disease phenotypes and some control samples (<50), which brings the chip with the case phenotype to ~650 samples in total. So the idea was to look for allele frequency differences between the chips, rather than between specific disease phenotypes.
I would start by cleaning the data using a procedure like the one here: https://kbroman.org/qtl2/assets/vignettes/do_diagnostics.html (especially the "Array intensity" part). To make the data even more comparable, I would select an unrelated, evenly spaced list of markers at a somewhat lower resolution and impute both array calls to this set of pseudo-SNPs; this will smooth the data.
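As a rough illustration of the evenly spaced pseudo-SNP grid, here is a minimal Python sketch. The function name, chromosome lengths and spacing are hypothetical placeholders, and the imputation of both arrays onto the grid is not shown.

```python
def pseudo_marker_grid(chrom_lengths, spacing_bp):
    """Build evenly spaced pseudo-marker positions for each chromosome.

    chrom_lengths maps chromosome name -> length in base pairs.
    spacing_bp is the distance between neighbouring pseudo-markers; pick it
    larger than the typical marker spacing of either array to get the
    lower-resolution grid.
    """
    if spacing_bp <= 0:
        raise ValueError("spacing_bp must be positive")
    return {
        chrom: list(range(1, length + 1, spacing_bp))
        for chrom, length in chrom_lengths.items()
    }


# Hypothetical toy chromosome lengths, not a real genome build.
grid = pseudo_marker_grid({"chr_toy_1": 1000, "chr_toy_2": 600}, spacing_bp=250)
print(grid)
```

Both arrays would then be imputed to these shared positions, so the downstream comparison no longer depends on which markers each chip happens to carry.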
Thank you for your suggestion, I will give it a try.