I have a working script which takes a list of genes and maps all the SNPs with MAF >= 0.05 to each gene using BiomaRt. I have run this for small lists of genes (~250) so far but would like a genome wide solution such that I have a lists of MAF >= 0.05 for all protein coding genes. The current script produces a list of valid rsIDs for each gene (using HGNC IDs) in a folder called {HGNC GENE}_snps
and a log file entry for each gene (see below).
My current method is not ideal as:
- It's a bit slow.
- I regularly receive time outs, I think mainly due to larger genes making the query too lengthy, which means I can't retrieve all the data I need.
Is it possible to download the BioMart databases to run this locally?
Or is there a more efficient way to do this, I don't necessarily need to use BioMart or R.
Here is my script:
library(tidyverse)
library(biomaRt)
library(readxl)
## Set variables ---------------------------------------------------------------------
IMP_GENES <- snakemake@input$genes
LOG_FILE <- snakemake@output
OUT_DIR <- snakemake@params
## Load Data ------------------------------------------------------------------------
genes <- read_excel(IMP_GENES, sheet = 'Human_All_Known IG_NoStatus') %>%
filter(!grepl("placenta", Gene)) %>%
drop_na(Gene) %>%
arrange(Gene) %>%
pull(Gene)
## Use Biormart to extract gene boundaries based on hgnc_symbol --------------------
genemart = useMart(host="www.ensembl.org",
biomart = "ENSEMBL_MART_ENSEMBL",
dataset="hsapiens_gene_ensembl")
gene_coord <- getBM(attributes=c("hgnc_symbol",
"chromosome_name",
"start_position",
"end_position",
"gene_biotype"),
filters = c("hgnc_symbol"),
values = list(genes), mart = genemart)
## Use Biormart to filter SNPs in gene boundaries on MAF > 0.05 --------------------
snpmart <- useMart(host="www.ensembl.org",
biomart="ENSEMBL_MART_SNP",
dataset="hsapiens_snp")
sink(paste0(LOG_FILE), append = FALSE, split = TRUE)
for (i in 1:nrow(gene_coord)) {
tryCatch({
gene_ID <- gene_coord$geneID[i]
gene_chr <- gene_coord$chr[i]
gene_start <- gene_coord$start[i]
gene_stop <- gene_coord$stop[i]
cat("\n----------------------------------------------------------\n")
cat(paste0("Retrieving SNPs for: ", gene_ID, " - chr", gene_chr,":", gene_start, "-", gene_stop))
cat("\n----------------------------------------------------------\n")
snps <- getBM(attributes=c("refsnp_id", "allele", "chr_name", "chrom_start", "chrom_end", "minor_allele_freq"),
filters = c("chr_name", "start", "end"),
values = list(gene_chr, gene_start, gene_stop), mart = snpmart)
snps_maf <- snps[ which(snps$minor_allele_freq >= 0.05), ]
cat(paste0("Total loci in ", gene_ID, ": ", nrow(snps)))
cat(paste0("\nTotal loci after MAF filter (>= 0.05): ", nrow(snps_maf)))
cat("\n----------------------------------------------------------\n")
write.table(snps_maf, paste0(OUT_DIR, gene_ID, "_snps"), quote = FALSE, sep = "\t",
col.names = FALSE, row.names = FALSE )
}, error = function(e) {
cat("ERROR: For ", gene_ID, "\n", conditionMessage(e), "\n")
})
}
sink()
And an example of the log entry:
----------------------------------------------------------
Retrieving SNPs for: ZDBF2 - chr2:206274663-206314427
----------------------------------------------------------
Total loci in ZDBF2: 9422
Total loci after MAF filter (>= 0.05): 40
----------------------------------------------------------
----------------------------------------------------------
Retrieving SNPs for: ZFAT - chr8:134477788-134713049
----------------------------------------------------------
ERROR: For ZFAT
Timeout was reached: [www.ensembl.org:80] Operation timed out after 300000 milliseconds with 1062424 out of -1 bytes received
Many Thanks.