How To Analyze Imputed GWAS Data
30 · 13.1 years ago · Stephen 2.8k

I received genome-wide association study (GWAS) data from a colleague who has supposedly done all the imputation and quality control according to the consortium's standards. Genotyping was done on the Illumina 660 chip and imputed to HapMap (3.2 million SNPs total).

The data came to me as a matrix of 11,000 samples (rows) and 3.2 million SNPs (columns). There's a header row giving the SNP name for each column, and genotypes are coded as the number of minor alleles (or the expected allele dosage for imputed SNPs).

Here are a few rows and columns to show you what it looks like:

rs1793851 rs9929479 rs929483
      2.0         0        1
      1.6         0        1
      2.0         NA       0
      2.0         0        1
      1.6         0        0
      2.0         1        NA
      1.0         0        0
      1.9         0        2

I've always used PLINK for GWAS data management, QC, and analysis because it handles large genotype datasets efficiently. However, this kind of data can't be imported directly into PLINK or converted into a pedigree-format file. (PLINK does handle imputed data, and so does SNPTEST, but both of these require genotype probabilities, and I only have the expected allele dosage.)

I did write some R code to read in the data in chunks (roughly like the sketch after this list) and run some simple summary and association statistics, but this is clunky and suboptimal for many reasons:

  1. The dataset first has to be split up (I used a Perl wrapper around UNIX cut to do this). After splitting the dataset into several hundred files, each with all my samples and a subset of SNPs, computing sample-level measures (sample call rate, relatedness, ethnic outliers) is going to be a real coding nightmare.
  2. Subsetting analyses is going to be difficult (not as easy as PLINK's --exclude, --include, --keep, --remove, --cluster, etc.).
  3. PLINK integrates SNP annotation info (from the map file) into your results. Joining QC and analysis results to genomic position, minor allele, etc., will require lots of SQL joins.
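
For concreteness, the kind of chunked reading I mean looks roughly like this minimal sketch (the file name, total SNP count, and chunk size are placeholders):

# Read a block of SNP columns at a time by setting colClasses to "NULL"
# for every column outside the block (read.table skips "NULL" columns).
read_snp_chunk <- function(file, n_snps_total, start, chunk = 10000) {
    end <- min(start + chunk - 1, n_snps_total)
    classes <- rep("NULL", n_snps_total)
    classes[start:end] <- "numeric"
    read.table(file, header = TRUE, colClasses = classes)
}

# e.g. dosages for SNPs 1-10,000 across all samples
chunk1 <- read_snp_chunk("dosages.txt", n_snps_total = 3200000, start = 1)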

Ideally I don't want to rewrite software for GWAS data management, QC, and analysis. I've considered (1) analyzing only genotyped SNPS, or (2) rounding the allele dosage to the nearest integer so I can use PLINK, but both of these methods discard useful data.

Does anyone have any suggestions on how I should start to QC and analyze this data without re-inventing the wheel or rewriting PLINK? Any other software suggestions that could take this kind of data? Keep in mind, my dataset is nearly 100GB.

Thanks in advance.

gwas r snp imputation plink • 43k views

Looked through all the beagle utilities and didn't see much that would help.


Hi Stephen, have you successfully managed this kind of data? I got a similar matrix (markers by sample genotype). Is there a way to convert it to PLINK format? Thanks


Just a hint, Shirley: this thread is 5.7 years old, so I don't know if you'll get an answer from Stephen.


Can you post a sample file? If it has individuals as rows and markers as columns, then you can read it directly into PLINK; use the options below according to your input file. If instead you have markers as rows and individuals as columns, you'll have to transpose first:

python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

and then use the "output" file with these options:

--no-parents           # if you do not have information about parents
--no-fid               # if you have no family ID
--no-pheno             # if you have no phenotype information
--no-sex               # if you have no sex information
--compound-genotypes   # if the genotype codes in the input file are in the form AG rather than A G
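
For example, a hypothetical invocation (the "output" file from the transpose above serves as the .ped, snps.map is a map file you would have to supply, and you keep only the --no-* flags that apply to your file; add --compound-genotypes if needed):

plink --ped output --map snps.map --no-fid --no-parents --no-sex --no-pheno --make-bed --out mydata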

Hope that helps!


Hello everyone,

I have received imputed data (from IMPUTE) divided into chromosomes 1 to 22 (.bed, .bim, and .fam files).

In the next step, I would like to perform post-imputation QC and association analysis.

I have experience working with a single-file dataset, but this is the first time I have data split into per-chromosome (chr1 to chr22) files.

Please let me know if there is a protocol/tool I can follow to perform the post-imputation QC and association analysis.

Thanks in advance, M

25 · 13.1 years ago · Avsmith ▴ 300

For imputation, the primary QC really should be done on the input genotypes, rather than directly on the imputations. Prior to imputation, I remove SNPs with MAF < 0.01 or HWE p-value < 1e-6, and I only use SNPs present in at least 97% of samples. Also, since Illumina data has relatively few A/T and G/C SNPs, I remove those as well, which completely eliminates any potential strand issues (most of my runs are meta-analyzed with other studies, so I think this step is worthwhile). In my opinion, these steps are a key part of good imputations.
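
As a rough sketch (not my exact pipeline; file names are placeholders, and the A/T-G/C list is something you would generate yourself), those pre-imputation filters in PLINK look like:

plink --bfile raw_genotypes \
      --maf 0.01 \
      --hwe 1e-6 \
      --geno 0.03 \
      --exclude at_gc_snps.txt \
      --make-bed --out imputation_input

Here --geno 0.03 drops SNPs missing in more than 3% of samples (i.e. keeps SNPs present in at least 97%), and at_gc_snps.txt is a list of the ambiguous A/T and G/C SNP IDs.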

However, you already have imputations, so I will assume that they are of high quality. The exact answer to your question depends on the software used for imputation. Personally, I use MACH for most of my imputation work, and I use ProbABEL for association analysis. ProbABEL reads the MACH files directly without modification. MACH2QTL and MACH2DAT can be used similarly, but I don't have any experience with those. In the analysis, ProbABEL does track the imputation quality score from the input files; typically SNPs are filtered at R2 < 0.3, but that is usually done as part of the meta-analysis.
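
To give a sense of the workflow (an illustrative sketch only, not my exact command; file names are placeholders, and you should check the option names against the ProbABEL manual for your version), a per-chromosome run on MACH dosages looks roughly like:

palinear -p phenotypes.txt -d chr22.mldose -i chr22.mlinfo -m chr22.map -o chr22_results

palogist and pacoxph work the same way for logistic and Cox models.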

Some points:

  • ProbABEL reads the entire file into memory, which will be sizable for large chromosomes and sample sizes.
  • The newer versions of ProbABEL support a "filevector" format which gets around this issue. I've just started experimenting with it, and it removes the memory problem I just mentioned. If the documentation is unclear, there is a better explanation here on how to use the files. The "filevector" format needs to be made via the mach2databel function of GenABEL (a minimal sketch is below this list).
  • To get this all to work, I have a custom job script written in Perl to queue the jobs, and I usually run several in parallel via the Parallel::ForkManager module, but you should choose your own poison.
  • ProbABEL can't take compressed files as input, which is not optimal because the uncompressed files are huge. MACH2QTL and MACH2DAT can take compressed files, but I've still not jumped to that.
  • If your imputations came from IMPUTE, SNPTEST can take the input directly, but I have no experience with SNPTEST.
  • The GenABEL package also has an impute2databel function to make filevector format for use with ProbABEL. I've not worked with this.
  • Previous answers cited association software for use with Beagle imputations. I have no experience with that myself.
  • These packages typically allow fairly standard association models, and if you want "unsupported" models, the workarounds are much slower. I've loaded all my genotypes into netCDF files and work via R with the ncdf library.
  • I don't recommend working with the "rounded" genotypes, as the decimal values appropriately capture the uncertainty in the imputations.
  • You don't want to read the unrounded genotypes straight into R as text, as R does not support single-precision floats: everything is read as double, and memory quickly becomes an issue. That is the reason for the netCDF workaround.
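
A minimal sketch of that filevector conversion (I'm going from memory on the argument order, so check ?mach2databel; file names are placeholders):

library(GenABEL)   # pulls in DatABEL, which provides the filevector backend

# convert one chromosome of MACH dosages for use with ProbABEL
# arguments as I recall them: dosage file, mlinfo file, output file prefix
mach2databel("chr22.mldose", "chr22.mlinfo", "chr22.dose")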

Hopefully, this gives you some insight.


Since I wrote this last night, all the GenABEL & ProbABEL links appear to have changed. Will re-edit after the links are all fixed at the new site.


Thanks Albert. I've written some code to read in SNPs in "chunks" (groups of 10,000 SNPs at a time). However, because all the files are text and because imputed genotypes are, as you said, read as doubles, this is extremely memory intensive and SLOW.

I'm wondering now if it would be better to look into netCDF or DatABEL.

7 · 12.7 years ago · Adairama ▴ 70

Actually, PLINK does now support dosage formats: (a) you need to set format=1, and (b) the dosage column names need to be in "FID IID" format.

plink --dosage test.trdose format=1 --fam test.fam

Of course you need to transpose the mldose file (and rename "Al1" to "A1" and "Al2" to "A2"), as you would with the mlprob file. See http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml for other option arguments.


Can you extract certain SNPs from the dosage files for further analyses in PLINK?

5 · 13.1 years ago · Avsmith ▴ 300

As I mentioned in my previous answer, the packages mentioned above are good for standard, pre-defined models, but can't always do exactly the analysis you'd like to do. When that's the case, I typically work via netCDF-stored versions of the genotypes. I don't work this way unless necessary, as it is MUCH, MUCH slower than these optimized packages, which typically read all the genotypes in one go.

As you've requested an example of how to do this, I've put up some code. (I did edit it slightly to remove some local stuff, and I didn't test the edited version. It should still be close to working.)

This code loads imputations from MACH, but hopefully the logic is apparent enough that you can modify as needed.
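
A minimal sketch of that conversion step (not my production script), assuming the old ncdf package, one MACH .mldose/.mlinfo pair per chromosome, and numeric sample IDs; the variable names are chosen to match what the Cox regression code below expects:

library(ncdf)

# Read the MACH output for one chromosome (placeholder file names).
# For very large chromosomes you would read the .mldose file in blocks instead.
mldose <- read.table("chr22.mldose", header = FALSE, stringsAsFactors = FALSE)
mlinfo <- read.table("chr22.mlinfo", header = TRUE, stringsAsFactors = FALSE)

ids  <- as.numeric(sub("->.*", "", mldose[[1]]))  # MACH IDs look like "FID->IID"; assumes numeric IDs
dose <- as.matrix(mldose[, -(1:2)])               # drop the ID and MLDOSE columns; samples x SNPs
nsnp <- ncol(dose)
nid  <- nrow(dose)

snpdim <- dim.def.ncdf("snp", "count", 1:nsnp)
iddim  <- dim.def.ncdf("sample", "count", 1:nid)
vars <- list(
    var.def.ncdf("sampleID", "", iddim, missval = -1, prec = "integer"),
    var.def.ncdf("rsid", "", snpdim, missval = -1, prec = "integer"),
    var.def.ncdf("pos", "", snpdim, missval = -1, prec = "integer"),
    var.def.ncdf("ref", "", snpdim, missval = -1, prec = "byte"),
    var.def.ncdf("alt", "", snpdim, missval = -1, prec = "byte"),
    var.def.ncdf("dose", "", list(snpdim, iddim), missval = -1, prec = "single")
)
genonc <- create.ncdf("chr22.mldose.ncdf", vars)

put.var.ncdf(genonc, "sampleID", ids)
put.var.ncdf(genonc, "rsid", as.integer(sub("^rs", "", mlinfo$SNP)))
put.var.ncdf(genonc, "ref", match(mlinfo$Al1, c("A", "C", "G", "T")))
put.var.ncdf(genonc, "alt", match(mlinfo$Al2, c("A", "C", "G", "T")))
# "pos" would be filled from a separate map file; the .mlinfo file has no positions
put.var.ncdf(genonc, "dose", t(dose))             # stored SNP x sample, as the code below reads it
close.ncdf(genonc)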

Then, once all these files are created, the following code does a Cox Regression. (I realize this is a model supported by ProbABEL, but it gives you a sense of how to work with netcdf files.)

library(ncdf)
library(splines)
library(survival)
library(foreign)

FILE <- "data.txt"
PHENO <- "PHENO"   # phenotype name; used below when naming the output files
pheno <- read.table(FILE, header = T)
# Function to run COX regression
# This the main thing to change when running
# different models
coxfitter <- function(geno, phenodf, formula, init) {
    environment(formula) <- environment()
    formula <- update(formula, . ~ geno + .)
    model <- coxph(formula, init = init, data = phenodf)
    beta <- coef(model)["geno"]
    se <- coef(summary(model))["geno", "se(coef)"]
    pz <- coef(summary(model))["geno", "Pr(>|z|)"]
    return(c(beta = beta, se = se, n = model$n, pz = pz))
}
# Loop over the chromosomes.
# Restrict the range (e.g. 'for (CHR in 22:22){') to test a single chromosome first.
for (CHR in 1:23) {
    # Open NetCDF file
    genonc <- open.ncdf(sub("@", as.character(CHR), "chr@.mldose.ncdf"))
    # Get list of SNPs
    SNPs <- get.var.ncdf(genonc, "rsid")
    # Get list of IDs
    IDs <- get.var.ncdf(genonc, "sampleID")
    # Number of SNPs
    NSNP <- dim(SNPs)
    # Number of Samples
    NID <- dim(IDs)
    # Placeholders for results
    beta <- rep(NA, NSNP)
    se <- rep(NA, NSNP)
    n <- rep(NA, NSNP)
    pz <- rep(NA, NSNP)
    ref <- c("A", "C", "G", "T")[get.var.ncdf(genonc, "ref")]
    alt <- c("A", "C", "G", "T")[get.var.ncdf(genonc, "alt")]
    id <- get.var.ncdf(genonc, "rsid")
    pos <- get.var.ncdf(genonc, "pos")
    # Used to compare order in dosages and input samples
    keep <- match(pheno$IID, IDs)
    # Basic model
    m <- coxph(Surv(FOLLOWUP, PHENO) ~ SEX + AGE, data = pheno)
    formula <- formula(m)
    init <- c(0, coef(m))
    # Get the dosages for the current chromosome
    for (i in 1:NSNP) {
        dose <- get.var.ncdf(genonc, "dose", start = c(i, 1), 
            count = c(1, NID))
        dose <- dose[keep]
        f <- coxfitter(dose, pheno, formula, init)
        beta[i] <- f[1]
        se[i] <- f[2]
        n[i] <- f[3]
        pz[i] <- f[4]
    }
    # Write out the results
    chrom <- rep(CHR, NSNP)
    results <- data.frame(chr = chrom, snp = paste("rs", id, 
        sep = ""), pos, ref, alt, strand = "+", n, beta = signif(beta, 
        5), se = signif(se, 5), p = signif(pz, 5))
    write.table(results, paste(PHENO, ".coxph.", "chr", CHR, 
        ".results", sep = ""), sep = "\t", quote = FALSE, row.names = FALSE)
}

Note: I used version 1.8 of the ncdf package in R. The CRAN repository version (1.6.x) had a bug where missing values of byte variables were not coming through as NA in R; because of that I moved to 1.8. Version 1.6 has been bumped since I found the bug, so I'm not sure whether the bug still exists in 1.6.5. The author of the package recommended I use 1.8; R ncdf 1.8 can be found here.


I should add that I am intrigued by compression options in ncdf4, as these files are HUGE. However, I've never ported over to ncdf4.


Do you think this approach (via ncdf) would work with imputed data in VCF format? Now that ProbABEL and GenABEL seem to be discontinued, the approach you suggested 7 years ago seems more and more attractive...

2 · 13.1 years ago · Genotepes ▴ 950

It looks like you have the input data for the MACH2DAT program (dosage data). However, I guess ProbABEL can do it as well.

You are supposed to have a file with the SNP descriptions and a phenotype file.

However, I don't know of any way of reordering or handling the data, whereas it could be done with IMPUTE.
