Hello everyone, I am new to bioinformatics. I need to analyze my SNPs with SIFT, which requires chromosome coordinates as input. I have rs numbers for my SNPs in FASTA format. How can I get the chromosome coordinates?
Hello Ebtihal Kamal,
i have rs numbers for my SNPs in FASTA format
That's a bit odd: rs numbers are just dbSNP identifiers, while FASTA describes sequences, so rs numbers don't really come "in FASTA format".
To get the coordinates for a given set of rs numbers, use Ensembl's VEP (available for hg19 and hg38). It also outputs the SIFT prediction.
fin swimmer
One scalable approach is to parallelize searches over a dbSNP VCF build that has been converted to BED, split by chromosome, and written to compressed Starch archives. Compressed archives of the v151 SNP set take up about 10 GB.
Preparing files
Here's a makefile for preparing the archive files:
SHELL := /bin/bash
CWD=$(shell pwd)
# cf. ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/
VCF=ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz
all: All_20180418 split_all_20180418_by_chr
#
# starch
#
All_20180418:
	module add bedops; \
	wget -qO- ${VCF} \
	| gunzip -c - \
	| convert2bed --input=vcf --sort-tmpdir=$(CWD) - \
	| awk '{ print "chr"$$0; }' - \
	| starch --omit-signature - \
	> All_20180418.starch
split_all_20180418_by_chr:
	for chr in `unstarch --list-chr All_20180418.starch`; do \
		echo $${chr}; \
		starchstrip --include $${chr} All_20180418.starch > All_20180418.$${chr}.starch; \
	done
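As an aside, the conversion step shifts coordinates: VCF positions are 1-based, while BED intervals are 0-based and half-open, so a SNV at position POS becomes the interval [POS-1, POS). A minimal awk sketch of that shift on a toy record (a simplification of convert2bed's actual output columns):

```shell
# Toy VCF record: CHROM, POS, ID, REF, ALT (tab-separated).
# VCF POS is 1-based; BED is 0-based and half-open.
printf '1\t10177\trs367896724\tA\tAC\n' \
  | awk -F'\t' 'BEGIN{ OFS="\t" } { print "chr"$1, $2-1, $2, $3, $4, $5 }'
# -> chr1    10176   10177   rs367896724     A       AC
```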
Parallel searches
Given a file of rs IDs in query_snps.txt, you can write results to a subdirectory called parallel_grep_file_results:
$ ./parallel_grep_by_file_of_snp_ids.sh $(pwd)/query_snps.txt $(pwd)/parallel_grep_file_results email@example.com
At the end of the search, whether it ends in success or failure, an email is sent to email@example.com (change as needed). The output directory contains a BED file with all matches.
Parallel script
The parallel_grep_by_file_of_snp_ids.sh script might look something like the following. It requires a SLURM job scheduler and an Environment Modules system for loading modules. Change rootStarchDir, partition, and other variables as needed, depending on where you store files, which partitions you have set up, and so on.
#!/bin/bash
queryFn=$(realpath ${1})
resultsDir=$(realpath ${2})
contactEmail=${3}
function joinBy { local IFS="$1"; shift; echo "$*"; }
module add bedops
rootStarchDir=/net/seq/data/genomes/informatics/GRCh38/dbSNP/v151
rootStarchPrefix=${rootStarchDir}/All_20180418
rootStarchFn=${rootStarchPrefix}.starch
partition=queue0
memPerCPU=8000
if [ -d ${resultsDir} ]; then
    echo "Error: Results directory already exists!"
    exit 1
fi
mkdir -p ${resultsDir}
jobIds=()
for chr in `unstarch --list-chr ${rootStarchFn}`; do
    echo ${chr}
    jobId=`sbatch --mem-per-cpu=${memPerCPU} --parsable --partition=${partition} --job-name="parallel_grep_dbSNP_${chr}" --output=${resultsDir}/${chr}.out --error=${resultsDir}/${chr}.err --wrap="${rootStarchDir}/parallel_grep_file.sh ${queryFn} ${rootStarchPrefix}.${chr}.starch ${resultsDir}/${chr}.hits"`
    jobIds+=("${jobId}")
done
dependencies=$(joinBy : "${jobIds[@]}")
sbatch --mem-per-cpu=${memPerCPU} --mail-type=END --mail-user=${contactEmail} --partition=${partition} --job-name="parallel_grep_dbSNP_reduce" --output=${resultsDir}/reduce.out --error=${resultsDir}/reduce.err --dependency=afterok:${dependencies} --wrap="${rootStarchDir}/parallel_grep_file_reduce.sh ${rootStarchFn} ${resultsDir}"
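The joinBy helper just concatenates the collected job IDs with a separator, which is how the --dependency=afterok:... list is built. A quick standalone check of that helper (POSIX function syntax shown, equivalent to the script's definition):

```shell
# Join all arguments after the first with the first argument as
# separator: the first character of IFS is used when expanding "$*"
joinBy() { local IFS="$1"; shift; echo "$*"; }

joinBy : 1001 1002 1003
# -> 1001:1002:1003
```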
The reduce script parallel_grep_file_reduce.sh:
#!/bin/bash
rootStarchFn=$(realpath ${1})
resultsDir=$(realpath ${2})
resultsFn=${resultsDir}/results.bed
module add bedops
for chr in `unstarch --list-chr ${rootStarchFn}`; do
    cat ${resultsDir}/${chr}.hits >> ${resultsFn}.tmp
done
sort-bed --max-mem 6G --tmpdir ${resultsDir} ${resultsFn}.tmp > ${resultsFn}
# cleanup
rm -f ${resultsDir}/*.out ${resultsDir}/*.err ${resultsDir}/*.hits ${resultsFn}.tmp
This is a "map-reduce" search across nuclear chromosomes. The per-chromosome search results are merged (reduced) at the end.
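The same map-reduce shape can be sketched with plain files and grep, with no SLURM or Starch required (toy data; all file names here are illustrative):

```shell
# Map-reduce sketch: per-chromosome searches run concurrently,
# then the hits are merged and sorted
workdir=$(mktemp -d)

# Toy per-chromosome BED files and a query list of rs IDs
printf 'chr1\t100\t101\trs111\nchr1\t200\t201\trs222\n' > "${workdir}/chr1.bed"
printf 'chr2\t300\t301\trs333\n' > "${workdir}/chr2.bed"
printf 'rs111\nrs333\n' > "${workdir}/query_snps.txt"

# Map: search each chromosome file in a background job
# (-F fixed strings, -w whole-word match, -f patterns from file)
for chr in chr1 chr2; do
  grep -wFf "${workdir}/query_snps.txt" "${workdir}/${chr}.bed" > "${workdir}/${chr}.hits" &
done
wait

# Reduce: concatenate per-chromosome hits and sort by chromosome and start
cat "${workdir}"/*.hits | sort -k1,1 -k2,2n > "${workdir}/results.bed"
cat "${workdir}/results.bed"
# -> chr1  100  101  rs111
#    chr2  300  301  rs333
```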
If you are talking about chromosome coordinates, you can get them from either a GTF file or a BED file for the species you are working with, depending on whether it has been completely sequenced and annotated.
Here's another solution:
0. Prerequisites: bcftools, GNU parallel, and BEDOPS (for convert2bed)
1. Get dbSNP
For hg19:
$ wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz
For hg38:
$ wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz
2. Convert vcf to bcf
$ bcftools convert -Ob -o dbSNP_b151.bcf.gz 00-All.vcf.gz
3. Index the bcf file
$ bcftools index dbSNP_b151.bcf.gz
4. Query the bcf file for the SNPs and convert the output to bed
$ parallel -j 4 'bcftools filter -i "ID=@snps.txt" -r {} dbSNP_b151.bcf.gz' ::: {1..22} X Y MT | grep -v "^#" | convert2bed --input=vcf > result.bed
Adjust the value of -j to define how many jobs should run in parallel. snps.txt is the file that contains the rs IDs (one per line).
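For intuition, the -i "ID=@snps.txt" expression keeps only records whose ID column appears in snps.txt. That set-membership lookup can be sketched in awk on toy VCF-like records, with no bcftools required:

```shell
workdir=$(mktemp -d)
printf 'rs111\nrs333\n' > "${workdir}/snps.txt"

# First pass (NR==FNR) loads the wanted IDs into a set; the second
# input (stdin, "-") is then filtered on its third column, the VCF ID field
printf '1\t100\trs111\tA\tG\n1\t200\trs222\tC\tT\n2\t300\trs333\tG\tA\n' \
  | awk -F'\t' 'NR==FNR { ids[$1]; next } $3 in ids' "${workdir}/snps.txt" -
# -> 1  100  rs111  A  G
#    2  300  rs333  G  A
```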
fin swimmer
Hi, Ebtihal Kamal.
You can use a browser-based tool to convert your rs IDs to chromosomes and positions.
There are two assemblies available (build 150): GRCh37 and GRCh38.
Could you post some example data here, Ebtihal Kamal?
https://rest.ensembl.org/vep/human/id/rs56116432?content-type=application/json;refseq=1
The output gives you the SIFT and PolyPhen predictions for each ID in JSON format; parse the JSON for the rsID, coordinates, transcript, change, and prediction. Ebtihal Kamal, since these variants are supposed to be clinically significant (as you are looking for SIFT scores), you could also try the ClinVar VCF for these IDs; the ClinVar VCF carries multiple levels of information for each ID.