Question

What is the best GWAS software for extremely large datasets? (e.g. PLINK, Hail, BGENIE)

5.0 years ago

Kim6845 ▴ 50

Hi, Biostars

I am working with a large genotype dataset for GWAS, specifically the UK Biobank dataset. It contains 93,095,623 autosomal SNPs on about 0.5 million individuals (stored in bgen v1.2 format, one file per chromosome, about 100 GB per chromosome). (UK Biobank: https://www.nature.com/articles/s41586-018-0579-z )

Even though the Neale group has made GWAS results on this genotype dataset public, I have to run the GWAS afresh, because my research is on a new phenotype not included in the original GWAS study. (Neale group GWAS results: http://www.nealelab.is/uk-biobank )

First, I would like to run basic quality control on the dataset (e.g. missing rate, MAF, HWE). Afterwards, I would run a GWAS on it with a single phenotype.
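For concreteness, here is a minimal sketch of the three per-variant QC metrics mentioned above (missing rate, MAF, HWE), computed on hard-called genotypes. It is only an illustration under stated assumptions: the function name `variant_qc` and the input encoding (count of allele B per sample, `None` for missing) are hypothetical, and the HWE p-value uses a simple 1-degree-of-freedom chi-square approximation rather than the exact test that tools such as PLINK apply.

```python
import math
from typing import Dict, Optional, Sequence


def variant_qc(genotypes: Sequence[Optional[int]]) -> Dict[str, float]:
    """Per-variant QC metrics from hard calls.

    genotypes: one entry per sample, the count of allele B (0, 1 or 2),
    or None when the genotype is missing (hypothetical encoding).
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        raise ValueError("no called genotypes for this variant")

    missing_rate = 1.0 - n / len(genotypes)

    # Observed genotype counts: AA, AB, BB.
    n_aa, n_ab, n_bb = (called.count(k) for k in (0, 1, 2))

    # Frequency of allele B, and the minor allele frequency.
    freq_b = (n_ab + 2 * n_bb) / (2 * n)
    maf = min(freq_b, 1.0 - freq_b)

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (
        n * (1.0 - freq_b) ** 2,
        2.0 * n * freq_b * (1.0 - freq_b),
        n * freq_b ** 2,
    )
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))

    return {"missing_rate": missing_rate, "maf": maf, "hwe_p": hwe_p}


# 49 AA, 42 AB, 9 BB plus 5 missing calls: in HWE with allele B frequency 0.3.
in_hwe = variant_qc([0] * 49 + [1] * 42 + [2] * 9 + [None] * 5)
# Only homozygotes: a strong departure from HWE.
out_of_hwe = variant_qc([0] * 50 + [2] * 50)
print(in_hwe)
print(out_of_hwe)
```

At this dataset's scale these metrics would be computed by a dedicated tool in a single pass over each bgen file; the sketch only shows what is being measured.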

I found some tools that seem appropriate for this procedure. I think that choosing the QC tool carefully is as important as choosing the GWAS tool, because QC on a large dataset requires parsing it several times, which takes a lot of time. (Explanation of why the bgen format is needed: https://www.well.ox.ac.uk/~gav/bgen_format/ )

Could you advise me on selecting suitable software?

Thanks!

For quality control,

qctool (https://www.well.ox.ac.uk/~gav/qctool_v2/ ): QC procedures optimized for the bgen format (maybe...)
PLINK 2.0: compatible with bgen, unlike PLINK 1.9

For GWAS,

PLINK 2.0
Hail: Scala-based scalable GWAS tool, optimized for cluster computing in environments like Google Cloud and AWS (example code: https://github.com/Nealelab/UK_Biobank_GWAS )
BGENIE: GWAS tool optimized for the bgen format ( https://jmarchini.org/bgenie/ )

GWAS bgen plink hail QC • 4.4k views

updated 5.0 years ago by chrchang523 • written 5.0 years ago by Kim6845

score 6 · Answer 1 · 2019-05-13

This depends primarily on two things.

Do you need to consider genotype posterior probabilities in your QC and analysis, or just dosages? The bgen format stores genotype probability triples of the form {P(genotype = AA), P(genotype = AB), P(genotype = BB)}, where A and B are the two alleles. However, most QC and analysis steps collapse this triple down to a single dosage value, equal to the expected count of one of the alleles (so P(genotype = AB) + 2 * P(genotype = BB) for allele B). For this reason, and for the efficiency gains that result from only tracking dosages, PLINK 2.0's "pgen" file format only supports dosages. Thus, if PLINK 2.0 is part of your analysis pipeline, any steps that actually need the raw genotype posterior probabilities must happen before conversion to pgen.
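To make the triple-to-dosage collapse concrete, here is a small sketch of the formula above in Python. The function name `b_allele_dosage` and the example triples are made up for illustration; this is not code from any of the tools discussed.

```python
from typing import List, Tuple


def b_allele_dosage(probs: Tuple[float, float, float]) -> float:
    """Expected count of allele B given (P(AA), P(AB), P(BB))."""
    p_aa, p_ab, p_bb = probs
    if abs(p_aa + p_ab + p_bb - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    # Dosage = 0 * P(AA) + 1 * P(AB) + 2 * P(BB)
    return p_ab + 2.0 * p_bb


# Illustrative triples (not real data).
triples: List[Tuple[float, float, float]] = [
    (0.0, 1.0, 0.0),  # confident heterozygote
    (0.1, 0.8, 0.1),  # uncertain call, symmetric around AB
    (0.0, 0.5, 0.5),  # either AB or BB
]
dosages = [b_allele_dosage(t) for t in triples]
print(dosages)
```

The first two triples differ but give the same dosage, which is exactly the information that is lost once only dosages are stored.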

(Note that, when dosages are sufficient, plink 2.0 is consistently 10-100+ times faster than the bgen-based tools.)

How much do you want to customize the main analysis? Plink 2.0 and qctool/BGENIE support the most common QC operations and types of regression; it sounds like both are sufficient for what you want to do today. However, if you want to perform data exploration beyond "standard GWAS", Hail is the best platform I'm aware of for Biobank-sized datasets.
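As a rough illustration of the PLINK 2.0 route described above (convert bgen to pgen while applying standard QC filters, then run a regression on the result), the sketch below only builds the two command lines. All file names and thresholds are hypothetical placeholders, the helper functions are my own, and the flags shown (`--bgen ... ref-first`, `--sample`, `--maf`, `--geno`, `--hwe`, `--make-pgen`, `--pfile`, `--pheno`, `--covar`, `--glm`, `--out`) should be checked against the PLINK 2.0 documentation for your build before use.

```python
import shlex
from typing import List


def plink2_qc_command(bgen: str, sample: str, out_prefix: str,
                      maf: float = 0.01, geno: float = 0.05,
                      hwe: float = 1e-6) -> List[str]:
    """Import one bgen file, apply QC filters, and write a pgen fileset.

    The thresholds are illustrative assumptions, not recommendations.
    """
    return [
        "plink2",
        "--bgen", bgen, "ref-first",  # treat the first allele as REF
        "--sample", sample,
        "--maf", str(maf),    # drop variants below this minor allele frequency
        "--geno", str(geno),  # drop variants above this missing rate
        "--hwe", str(hwe),    # drop variants failing the HWE test
        "--make-pgen",
        "--out", out_prefix,
    ]


def plink2_glm_command(pfile_prefix: str, pheno: str, covar: str,
                       out_prefix: str) -> List[str]:
    """Run an association test on the QC'd pgen fileset."""
    return [
        "plink2",
        "--pfile", pfile_prefix,
        "--pheno", pheno,
        "--covar", covar,
        "--glm",
        "--out", out_prefix,
    ]


# Hypothetical file names for chromosome 22.
qc_cmd = plink2_qc_command("ukb_chr22.bgen", "ukb_chr22.sample", "chr22_qc")
glm_cmd = plink2_glm_command("chr22_qc", "pheno.txt", "covar.txt", "chr22_gwas")
print(shlex.join(qc_cmd))
print(shlex.join(glm_cmd))
# To actually run: subprocess.run(qc_cmd, check=True), then the same for glm_cmd.
```

Doing the conversion once per chromosome and reusing the pgen files avoids re-parsing the bgen files on every QC or analysis pass.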