I am working with a large genotype dataset for GWAS, specifically the UK Biobank dataset.
It contains 93,095,623 autosomal SNPs on about 0.5 million individuals (stored in bgen v1.2 format, one file per chromosome, about 100 GB per chromosome). (UK Biobank: https://www.nature.com/articles/s41586-018-0579-z )
Although the Neale group has made GWAS results on this genotype dataset public, I have to run the GWAS afresh, because my research is on a new phenotype that was not included in the original GWAS. (Neale group GWAS results: http://www.nealelab.is/uk-biobank )
First, I would like to run basic quality control on the dataset (e.g. missing rate, MAF, HWE).
Afterwards, I would run a GWAS on it with a single phenotype.
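To make the three filters concrete, here is a minimal Python sketch of what I mean, for a single biallelic SNP. It assumes hard-call genotype counts (imputed bgen data actually stores genotype probabilities), it uses a simple 1-df chi-square approximation for HWE rather than the exact test that dedicated tools typically offer, and the function names and default thresholds are purely illustrative, not values I have settled on.

```python
from math import erfc, sqrt


def snp_qc_stats(n_aa, n_ab, n_bb, n_missing):
    """Return (missing_rate, maf, hwe_p) for one biallelic SNP.

    Inputs are hard-call genotype counts (AA, AB, BB) plus the number
    of individuals with a missing call.
    """
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        raise ValueError("no called genotypes for this SNP")

    # Missing rate: fraction of individuals without a genotype call.
    missing_rate = n_missing / (n_called + n_missing)

    # Allele frequency of B, and the minor allele frequency.
    p_b = (2 * n_bb + n_ab) / (2 * n_called)
    p_a = 1.0 - p_b
    maf = min(p_a, p_b)

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (n_called * p_a ** 2, 2 * n_called * p_a * p_b, n_called * p_b ** 2)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square with 1 degree of freedom.
    hwe_p = erfc(sqrt(chi2 / 2))
    return missing_rate, maf, hwe_p


def passes_qc(stats, max_missing=0.05, min_maf=0.01, min_hwe_p=1e-6):
    """Apply illustrative thresholds (assumed values, adjust as needed)."""
    missing_rate, maf, hwe_p = stats
    return missing_rate <= max_missing and maf >= min_maf and hwe_p >= min_hwe_p


# A SNP in perfect HWE passes; one with no heterozygotes at all fails.
print(passes_qc(snp_qc_stats(360, 480, 160, 0)))
print(passes_qc(snp_qc_stats(500, 0, 500, 0)))
```

All three statistics come from the same per-SNP counts, so in principle they can be computed in one pass over the data.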
I found some tools that seem appropriate for this procedure. I think it is important to choose carefully not only the tool for GWAS but also the tool for QC, because QC on a large dataset requires parsing it several times, which takes a lot of time. (explanation of the need for the bgen format: https://www.well.ox.ac.uk/~gav/bgen_format/ )
Could you advise me on selecting the proper software?
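As an illustration of the kind of workflow I have in mind, here is a rough Python sketch that builds one Plink 2.0 command per chromosome and requests all three QC filters in the same invocation. The file names, thresholds and thread count are hypothetical placeholders, `ref-first` is my assumption about the allele order in the bgen files, and I have not verified whether Plink 2.0 internally reads the file only once when filters are combined like this.

```python
import shlex


def plink2_qc_command(chrom, geno=0.05, maf=0.01, hwe=1e-6, threads=8):
    """Build a Plink 2.0 command line for QC on one chromosome.

    File names and thresholds are hypothetical placeholders.
    """
    prefix = f"ukb_imp_chr{chrom}"  # hypothetical input file prefix
    return [
        "plink2",
        "--bgen", f"{prefix}.bgen", "ref-first",  # assumed allele order
        "--sample", f"{prefix}.sample",
        "--geno", str(geno),    # per-variant missing rate filter
        "--maf", str(maf),      # minor allele frequency filter
        "--hwe", str(hwe),      # Hardy-Weinberg p-value filter
        "--threads", str(threads),
        "--make-pgen",          # write the filtered data in Plink 2 format
        "--out", f"ukb_qc_chr{chrom}",  # hypothetical output prefix
    ]


# One command per autosome; each could be submitted as a separate job.
commands = [plink2_qc_command(chrom) for chrom in range(1, 23)]
print(shlex.join(commands[0]))
```

Since the chromosomes are stored in separate files, the 22 commands are independent and could run in parallel on a cluster.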
For quality control,
- qctool (https://www.well.ox.ac.uk/~gav/qctool_v2/ )
pros: qc procedure optimized for bgen format (maybe...)
- Plink 2.0
pros: compatible with bgen, in contrast to Plink 1.9
For GWAS,
- Hail
pros: Scala-based scalable GWAS tool, optimized for cluster computing (in environments like Google Cloud, AWS, etc.)
example code: https://github.com/Nealelab/UK_Biobank_GWAS
- bgenie (https://jmarchini.org/bgenie/ )
pros: GWAS tool optimized for bgen format