How to assess performance of a GWAS analysis pipeline
10 weeks ago
lsy9 ▴ 20

Situation: I'm in the process of setting up a GWAS (Genome-Wide Association Study) pipeline. Currently, I have a Python script that uses Hail to read VCF files, perform quality control (QC), apply filters, and run the GWAS analysis. My primary goal is to ensure that the pipeline accurately identifies true positive variants that are significantly associated with a specific trait.

(I need to evaluate the pipeline quickly, so it would be better if I could run just a small number of GWAS analyses to assess its performance. My goal is not to exhaustively compare various tools but to evaluate whether my pipeline is functioning reasonably well.)

Question: What are the common methods used by researchers to assess the performance of their GWAS pipelines? Are there benchmark datasets available for testing, and what metrics should I consider to evaluate the pipeline's performance effectively?

I tried searching on Google Scholar using keywords like 'GWAS benchmark' and 'GWAS performance test,' but unfortunately, I couldn't find the information I was looking for.

benchmark hail GWAS performance
9 weeks ago
LauferVA 4.2k

First, those search queries look good, but you might try similar things in the Biostars search bar too; questions like this have been asked many times, and there are a number of manuscripts and prior posts I would recommend looking through.

Other search terms to include are "GWAS protocols", "GWAS quality control metrics", "GWAS QC metrics", and the like. I'm going to throw a couple of thoughts your way. I don't know of one single tool that will benchmark a whole workflow (though I haven't looked recently), but I can tell you things that are definitely done and that should give you several ideas.

Heuristic 1: unit testing, i.e., benchmarking step by step. Generally, software developers test the robustness of multi-step processes in several ways; one of these is unit testing. Here, instead of benchmarking the entire pipeline, you benchmark each individual step separately as you go. In terms of raw development this might be done with respect to speed or reliability, but you can also do it in terms of the results each step issues.
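As a sketch of what a results-level unit test for one QC step might look like (assuming pytest, and a hypothetical `filter_variants_by_call_rate` function standing in for whatever your pipeline actually exposes), you could simulate a tiny dataset with Hail and assert that the filter behaves as expected:

```python
import hail as hl

hl.init()  # initialize Hail once for this test module


def filter_variants_by_call_rate(mt, threshold=0.95):
    """Hypothetical stand-in for the pipeline's own call-rate QC step."""
    mt = hl.variant_qc(mt)
    return mt.filter_rows(mt.variant_qc.call_rate >= threshold)


def test_call_rate_filter_keeps_complete_variants():
    # Simulate a small dataset so the test runs without real data;
    # balding_nichols_model produces fully called genotypes.
    mt = hl.balding_nichols_model(n_populations=1, n_samples=50, n_variants=100)
    filtered = filter_variants_by_call_rate(mt, threshold=0.95)
    # With no missing calls, the filter should drop nothing.
    assert filtered.count_rows() == mt.count_rows()
```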

Take imputation, for instance. Most imputation algorithms generate accuracy scores by blinding themselves to some known SNPs, imputing them, and then checking how accurate the imputed calls were. If you go back and look at those scores, you can benchmark several imputation algorithms against one another; for example, use the number of SNVs imputed and the accuracy of each algorithm (as indicated by the masking procedure) to pick the one with the best performance.
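If you want to run that masking procedure yourself rather than rely on a tool's built-in accuracy scores, a rough sketch (plain NumPy, with `run_imputation` as a hypothetical wrapper around whatever imputation step your pipeline calls, and genotypes coded 0/1/2 with NaN for missing) could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)


def imputation_concordance(genotypes, run_imputation, mask_fraction=0.05):
    """Mask a random subset of known genotypes, re-impute, and report the
    fraction recovered exactly. `genotypes` is a variants x samples array."""
    geno = genotypes.astype(float).copy()
    known = np.argwhere(~np.isnan(geno))                  # positions with observed calls
    n_mask = int(mask_fraction * len(known))
    picked = known[rng.choice(len(known), size=n_mask, replace=False)]
    truth = geno[picked[:, 0], picked[:, 1]].copy()
    geno[picked[:, 0], picked[:, 1]] = np.nan             # hide the truth from the imputer
    imputed = run_imputation(geno)                        # hypothetical call into your pipeline
    recovered = imputed[picked[:, 0], picked[:, 1]]
    return float(np.mean(np.round(recovered) == truth))
```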

Heuristic 2:

Recovery of known "true positive" results: let's take seropositive rheumatoid arthritis (RA) as a disease phenotype. A lot of GWAS of RA have been done; indeed, these have been organized into meta-analyses and redone with vast sample sizes (GWAMA). Nearly all of these studies nominate the HLA region (specifically HLA-DRB1). Of those that don't, most or all do not separate seronegative and seropositive RA patients as carefully or as completely.

So, if a researcher is doing yet another GWAS of seropositive rheumatoid arthritis, he or she should expect to see a strong association at HLA-DRB1. Failing to find that association most likely means an error was made somewhere in the pipeline. The same goes for the rest of the top 10 strongest loci: if the pipeline fails to identify any of them, something is definitely wrong.
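As a concrete sketch of that sanity check in Hail, assuming `results` is the table returned by hl.linear_regression_rows or hl.logistic_regression_rows (keyed by locus, with a p_value field) and the data are on GRCh37, you could count genome-wide significant hits inside the approximate MHC region:

```python
import hail as hl


def check_known_locus(results, interval='6:28477797-33448354',
                      build='GRCh37', alpha=5e-8):
    """Count genome-wide significant hits inside a known-positive region
    (the default interval is roughly the GRCh37 MHC region)."""
    region = hl.parse_locus_interval(interval, reference_genome=build)
    n_hits = results.filter(
        region.contains(results.locus) & (results.p_value < alpha)
    ).count()
    if n_hits == 0:
        print('WARNING: no genome-wide significant hit in the expected region; '
              'check the pipeline before trusting the other results.')
    return n_hits
```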

Heuristic 3:

Other QC metrics: the other thing to do, I think, is to take a look at well-known QC metrics, like lambda GC to look for genomic inflation, HWE to flag bad SNVs, or LD support to see whether a variant's association tracks with other variants strongly linked to it in the same region. It is also standard practice to run sample QC and SNP QC separately.
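For the genomic-inflation part, a minimal lambda GC calculation might look like the following, again assuming `results` is a Hail association-results table with a p_value field; a value near 1.0 is reassuring, while values well above 1 suggest inflation from population stratification or QC problems:

```python
import numpy as np
from scipy.stats import chi2


def lambda_gc(results):
    """Genomic inflation factor from a Hail table with a `p_value` field
    (e.g. the output of hl.linear_regression_rows)."""
    pvals = np.array(results.p_value.collect())
    pvals = pvals[~np.isnan(pvals)]
    observed = chi2.isf(pvals, df=1)            # p-values -> 1-df chi-square statistics
    return float(np.median(observed) / chi2.ppf(0.5, df=1))
```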

Conclusions: one could conceivably create an end-to-end benchmark, but it would probably be built by linking together lots of these heuristics and QC metrics in the first place. So, unfortunately, I don't know a better way forward than taking a good, rigorous look at each step separately.
