Biostar Beta. Not for public use.
How To Analyze Imputed Gwas Data
22
Entering edit mode
4.1 years ago
Stephen 2.7k
@Stephen818

I received genome-wide association (GWAS) data from a colleague who's supposedly done all the imputation and quality control according to the consortium's standards. Genotyping was Illumina 660, imputed to HapMap (3.2 million SNPs total).

The data came to me as a matrix of 11,000 samples (rows) and 3.2 million SNPs (columns). There's a header row for each SNP, and genotypes are coded as the number of minor alleles (or allele dosage for imputed SNPs).

Here's a few rows and columns to show you what it looks like:

rs1793851 rs9929479 rs929483
2.0         0        1
1.6         0        1
2.0         NA       0
2.0         0        1
1.6         0        0
2.0         1        NA
1.0         0        0
1.9         0        2


I've always used PLINK for GWAS data management, QC, and analysis because of its efficient data handling capabilities for GWAS data. However, this kind of data can't be imported directly into PLINK or converted into a pedigree format file. (PLINK does handle imputed data, and so does SNPTEST, but both of these require genotype probabilities and I only have the expected allele dosage).

I did write some R code to read in the data in chunks and run some simple summary and association statistics, but this is clunky and suboptimal for many reasons:

1. The dataset first has to be split up (I used a perl wrapper around UNIX/cut to do this). After splitting the dataset into several hundred files with all my samples and a subset of SNPs, computing sample-level measures (sample call rate, relatedness, ethnic outliers) is going to be a real coding nightmare.
2. Subsetting analyses is going to be difficult (not as easy as PLINK's --exclude, --include, --keep, --remove, --cluster, etc).
3. PLINK integrates SNP annotation info (in the map file) to your results. Joining QC and analysis results to genomic position, minor allele, etc, will require lots of SQL joins.

Ideally I don't want to rewrite software for GWAS data management, QC, and analysis. I've considered (1) analyzing only genotyped SNPS, or (2) rounding the allele dosage to the nearest integer so I can use PLINK, but both of these methods discard useful data.

Does anyone have any suggestions on how I should start to QC and analyze this data without re-inventing the wheel or rewriting PLINK? Any other software suggestions that could take this kind of data? Keep in mind, my dataset is nearly 100GB.

gwas r snp imputation plink • 33k views
0
Entering edit mode
0
Entering edit mode

Looked through all the beagle utilities and didn't see much that would help.

22
Entering edit mode
9.9 years ago
Avsmith • 280
@Avsmith1474

For imputation, the primary QC really should be done on the input genotypes, rather than directly on the imputations. Prior to imputation, I remove SNPs at MAF < 0.01, HWE < 1e-06, and only used SNPs present on 97% of the samples. Also, since Illumina data has relatively few A/T and G/C SNPs, I also remove those, as it completely removes any potential strand issues (most of my runs are meta-analyzed with other studies, so I think this step is worthwhile). In my opinion, these steps are a key part of good imputations.

However, you already have imputations, so I will assume that they are to high quality. The exact answer to your question depends on the software used for imputation. Personally, I use MACH for most of my imputation work, and I use ProbABEL for association analysis. ProbABEL reads the MACH files directly without modification. Also MACH2QTL and MACH2DAT can be used similarly, but I don't have any experience with those. In the analysis, ProbABEL does track the Imputation Quality score from the input files, and typically those are filtered at R2<0.3, but that is typically done as part of the meta-analysis.

Some points:

• ProbABEL reads the entire file into memory, which will be sizable for large chromosomes and sample size.
• The newer versions of ProbABEL have support for a "filevector" format which gets around these issue. I've just started experimenting with this, and it will remove the memory issues I just mentioned. If the documentation is unclear, there is a better explanation here on how to use the files. The "filevector" format needs to be made via the mach2databel function of GenABEL.
• To get this all to work, I have a custom job script written in Perl to queue the jobs, and I usually do several in parallel via the Parallel::Forkmanager module, but you should choose your own poison.
• ProbABEL can't take compressed files as input, which is not optimal, because the uncompressed files are huge. MACH2QTL and MACH2DAT can take compressed files, but I've still not jumped to that.
• If your impuations came from IMPUTE, SNPTEST can take the input directly, but I have no experience with SNPTEST.
• The GenABEL package also has a impute2databel function to make filevector format for use with ProbABEL. I've not worked with this.
• Previous answers cited association software for use with Beagle imputations. I have no experience with that myself.
• These softwares typically allow for fairly standard association models, and if you want "unsupported" models, the work arounds are much slower. I've loaded all my genotypes into netcdf files and work via R with the ncdf library.
• I don't recommend working with the "rounded" genotypes, as the decimal values appropriately capture uncertainty in the imputations.
• You don't want to read the unrounded genotypes into R, as R does not support floats. Everything is read as double, and memory quickly becomes an issue. That is the reason for the netcdf workaround.

Hopefully, this gives you some insight.

1
Entering edit mode

Since I wrote this last night, all the GenABEL & ProbABEL links appear to have changed. Will re-edit after the links are all fixed at the new site.

0
Entering edit mode

Thanks Albert. I've written some code to read in SNPs in "chunks" - groups of 10,000 snps at a time. However because all the files are text and because imputed genotypes are, as you said, read as double, this is extremely memory intensive and SLOW.

I'm wondering now if it would be better to look into netCDF or DatABEL.

7
Entering edit mode
9.5 years ago

Actually PLINK does now support dosage formats. a) You need to set format=1 and b) the dosage column names needs to "FID IID" format.

plink --dosage test.trdose format=1 --fam test.fam

Of course you need to transpose the mldose file (and rename "Al1" to "A1" and "Al2" to "A2"), as you would with the mlprob file. See http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml for other option arguments.

0
Entering edit mode

Can you extract certain SNPs for further analyses in the dosage files in plink?

5
Entering edit mode
9.8 years ago
Avsmith • 280
@Avsmith1474

As I mentioned in my previous answer, the packages mentioned above are good at set models, but can't alway do exactly the analysis you'd like to do. When doing that, I typically work via netcdf stored versions of the genotypes. I don't work this way unless necessary, as it is MUCH , MUCH slower compared to these optimized packages which typically can read all the genotypes as one go.

As you've requested an example on how to do this, I've put some code. (I did edit this slightly to remove some local stuff, and I didn't test the edited version. Should be close to working, still).

This code loads imputations from MACH, but hopefully the logic is apparent enough that you can modify as needed.

library(ncdf)
require(ncdf)
imputed <- sub("@", as.character(chromosome), "chr@.dose")
idfile <- "sampleIDs.txt"
snpfile <- sub("@", as.character(chromosome), "chr@.info")
#Number of SNPs
NSNPS <- nrow(snps)
#Number of Inds
NID <- nrow(ids)
# Create the dimensions for SNPs and Inds (2D netcdf)
snpdim <- dim.def.ncdf("snp", "bases", 1:NSNPS)
sampledim <- dim.def.ncdf("id", "th", 1:NID, unlim = TRUE)
# Variatble for IDs (NOTE: netcdf is not good at holding
#   text, so only storing integernames)
varID <- var.def.ncdf("sampleID", "person", dim = sampledim,
missval = 0, prec = "integer")
# Variable for SNPs (NOTE: this will be the position of the
#   SNP)
varSNP <- var.def.ncdf("pos", "bp", dim = snpdim, missval = 0,
prec = "integer")
# Create dimension for dosasges. Float used to save space
#   in files
vargenotype <- var.def.ncdf("dose", "dose", dim = list(snpdim,
sampledim), missval = -1, prec = "float")
# Holderss for alleles, will be entered in bytye
varrefallele <- var.def.ncdf("ref", "base", dim = snpdim,
missval = 0, prec = "byte")
varaltallele <- var.def.ncdf("alt", "base", dim = snpdim,
missval = 0, prec = "byte")
# rsID for SNPs, stored as integers
varsnpid <- var.def.ncdf("rsid", "base", dim = snpdim, missval = 0,
prec = "integer")
# Store both MAF and EAF for tracking.
# Would be easy to calculate on the fly, but kept here from
#   laziness
vareaf <- var.def.ncdf("EAF", "freq", dim = snpdim, missval = -1,
prec = "float")
varmaf <- var.def.ncdf("MAF", "freq", dim = snpdim, missval = -1,
prec = "float")
# Track inmputation quality statistics
varqual <- var.def.ncdf("Quality", "qual", dim = snpdim,
missval = -1, prec = "float")
varrsq <- var.def.ncdf("Rsq", "rsq", dim = snpdim, missval = -1,
prec = "float")
# Flag for whether the SNPs is imputed or not
varimp <- var.def.ncdf("usedimp", "imp", dim = snpdim, missval = -1,
prec = "byte")
# Create output file name
outfname <- sub("@", as.character(chromosome), "chr@.mldose.ncdf")
# Create output ncdf file
outfile <- create.ncdf(outfname, list(varID, varSNP, varrefallele,
varaltallele, varsnpid, vareaf, varmaf, varqual, varrsq,
vargenotype, varimp), verbose = FALSE)
put.var.ncdf(outfile, varSNP, as.integer(snps$POS), start = 1, count = NSNPS) put.var.ncdf(outfile, varsnpid, as.integer(sub("^(rs|is)", "", snps$SNP)), start = 1, count = NSNPS)
put.var.ncdf(outfile, varrefallele, match(snps$Al1, c("A", "C", "G", "T"), nomatch = 0), start = 1, count = NSNPS) put.var.ncdf(outfile, varaltallele, match(snps$Al2, c("A",
"C", "G", "T", "-"), nomatch = 0), start = 1, count = NSNPS)
put.var.ncdf(outfile, vareaf, snps$Freq1, start = 1, count = NSNPS) put.var.ncdf(outfile, varmaf, snps$MAF, start = 1, count = NSNPS)
put.var.ncdf(outfile, varqual, snps$Quality, start = 1, count = NSNPS) put.var.ncdf(outfile, varrsq, snps$Rsq, start = 1, count = NSNPS)
put.var.ncdf(outfile, varID, ids$IID, start = 1, count = NID) # File with genotypes genotypes <- file(imputed, "r") print(paste(NSNPS, "SNPs to process")) # Iterate over file to load for (i in 1:NID) { # Scan in a single line snpi <- scan(genotypes, what = character(0), nlines = 1, quiet = TRUE, skip = 0) # Load the dosages (from MACH imputation, skip the first # two entries. File has one line per ind) put.var.ncdf(outfile, vargenotype, vals = as.double(snpi[3:length(snpi)]), start = c(1, i), count = c(NSNPS, 1)) # Tracking the results if (!(i%%100)) print(paste("individual #", i, " of ", NID, sep = "")) } close(genotypes) close(outfile) } # Loop over the chromosomes to load for (i in 1:23) { print(paste("Starting: chr", i, sep = "")) load.dosages.ncdf(i) print(warnings()) }  Then, once all these files are created, the following code does a Cox Regression. (I realize this is a model supported by ProbABEL, but it gives you a sense of how to work with netcdf files.) library(ncdf) library(splines) library(survival) library(foreign) FILE <- "data.txt" pheno <- read.table(FILE, header = T) # Function to run COX regression # This the main thing to change when running # different models coxfitter <- function(geno, phenodf, formula, init) { environment(formula) <- environment() formula <- update(formula, . ~ geno + .) model <- coxph(formula, init = init, data = phenodf) beta <- coef(model)["geno"] se <- coef(summary(model))["geno", "se(coef)"] pz <- coef(summary(model))["geno", "Pr(>|z|)"] return(c(beta = beta, se = se, n = model$n, pz = pz))
}
# Loop over the chromosome.
# Current test form only does chr22
# Change to 'for (CHR in 1:22){' for entire genome
for (CHR in 1:23) {
# Open NetCDF file
genonc <- open.ncdf(sub("@", as.character(CHR), "chr@.mldose.ncdf"))
# Get list of SNPs
SNPs <- get.var.ncdf(genonc, "rsid")
# Get list of IDs
IDs <- get.var.ncdf(genonc, "sampleID")
# Number of SNPs
NSNP <- dim(SNPs)
# Number of Samples
NID <- dim(IDs)
# Placeholders for results
beta <- rep(NA, NSNP)
se <- rep(NA, NSNP)
n <- rep(NA, NSNP)
pz <- rep(NA, NSNP)
ref <- c("A", "C", "G", "T")[get.var.ncdf(genonc, "ref")]
alt <- c("A", "C", "G", "T")[get.var.ncdf(genonc, "alt")]
id <- get.var.ncdf(genonc, "rsid")
pos <- get.var.ncdf(genonc, "pos")
# Used to compare order in dosages and input samples
keep <- match(pheno\$IID, IDs)
# Basic model
m <- coxph(Surv(FOLLOWUP, PHENO) ~ SEX + AGE, data = pheno)
formula <- formula(m)
init <- c(0, coef(m))
# Get the dosages for the current chromosome
for (i in 1:NSNP) {
dose <- get.var.ncdf(genonc, "dose", start = c(i, 1),
count = c(1, NID))
dose <- dose[keep]
f <- coxfitter(dose, pheno, formula, init)
beta[i] <- f[1]
se[i] <- f[2]
n[i] <- f[3]
pz[i] <- f[4]
}
# Write out the results
chrom <- rep(CHR, NSNP)
results <- data.frame(chr = chrom, snp = paste("rs", id,
sep = ""), pos, ref, alt, strand = "+", n, beta = signif(beta,
5), se = signif(se, 5), p = signif(pz, 5))
write.table(results, paste(PHENO, ".coxph.", "chr", CHR,
".results", sep = ""), sep = "\t", quote = FALSE, row.names = FALSE)
}


Note: I used ncdf package 1.8 in R. The CRAN repository version (1.6.x) has had a bug where missing values of byte variables were not coming as NA in R. Because of that I moved to 1.8. The 1.6 has been bumped since I found the bug, so I'm not sure whether the bug still exists in 1.6.5 now. The author of the package recommended I use 1.8. R ncdf 1.8 can be found here.

0
Entering edit mode

I should add that I am intrigued by compression options in ncdf4, as these files are HUGE. However, I've never ported over to ncdf4.

0
Entering edit mode

Do you think this approach (via ncdf ) would work with imputed data in .vcf-format? Now that ProbAbel and GenAbel seem to be discontinued, this approach you suggested 7 years ago seems more and more attractive...

2
Entering edit mode
9.8 years ago
Genotepes • 940
@Genotepes1556

Looks like you have data input for MACH2DAT program for dosage data. However, I guess ProbABEL can do it as well.

You are supposed to have a file with SNPs description and a phenoytype file.

However, I don't know of any way of reordering ot handling the data - whereas it could be done in IMPUTE.

Similar Posts
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.