Hello everyone, I am new to bioinformatics. I need to analyze my SNPs with SIFT, which requires chromosome coordinates as input. I have rs numbers for my SNPs in FASTA format. How can I get the chromosome coordinates?
Hello Ebtihal Kamal,
i have rs numbers for my SNPs in FASTA format
That's a bit odd: rs numbers are just dbSNP identifiers, while FASTA describes sequences, so rs numbers don't really come "in FASTA format".
To get the coordinates for a given set of rs numbers, use Ensembl's VEP (available for hg19 and hg38). It also outputs the SIFT prediction.
fin swimmer
One scalable approach is to parallelize searches over a dbSNP VCF build that has been converted to BED, split by chromosome, and written to compressed Starch archives. Compressed archives of the v151 SNP set take up about 10 GB.
Preparing files
Here's a makefile for preparing the archive files:
SHELL := /bin/bash
CWD=$(shell pwd)
# cf. ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/
VCF=ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz
all: All_20180418 split_all_20180418_by_chr
#
# starch
#
All_20180418:
	module add bedops; \
	wget -qO- ${VCF} \
	| gunzip -c - \
	| convert2bed --input=vcf --sort-tmpdir=$(CWD) - \
	| awk '{ print "chr"$$0; }' - \
	| starch --omit-signature - \
	> All_20180418.starch
split_all_20180418_by_chr:
	for chr in `unstarch --list-chr All_20180418.starch`; do \
		echo $${chr}; \
		starchstrip --include $${chr} All_20180418.starch > All_20180418.$${chr}.starch; \
	done
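As an aside, the conversion step shifts coordinates: VCF positions are 1-based, while BED intervals are 0-based and half-open, so a SNV at position POS becomes the interval [POS-1, POS). A minimal awk sketch of that shift on a toy record (a simplification of convert2bed's actual output columns):

```shell
# Toy VCF record: CHROM, POS, ID, REF, ALT (tab-separated).
# VCF POS is 1-based; BED is 0-based and half-open.
printf '1\t10177\trs367896724\tA\tAC\n' \
  | awk -F'\t' 'BEGIN{ OFS="\t" } { print "chr"$1, $2-1, $2, $3, $4, $5 }'
# -> chr1    10176   10177   rs367896724     A       AC
```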
Parallel searches
Given a file of rs IDs in query_snps.txt, you can write results to a subdirectory called parallel_grep_file_results:
$ ./parallel_grep_by_file_of_snp_ids.sh $(pwd)/query_snps.txt $(pwd)/parallel_grep_file_results email@example.com
At the end of the search, whether it ends in success or failure, an email is sent to email@example.com (change as needed). The output directory contains a BED file with all matches.
Parallel script
The parallel_grep_by_file_of_snp_ids.sh script might look something like the following. It requires a SLURM job scheduler and an Environment Modules system for loading modules. Change rootStarchDir, partition, and other variables as needed, depending on where you store files, which partitions you have set up, and so on.
#!/bin/bash
queryFn=$(realpath ${1})
resultsDir=$(realpath ${2})
contactEmail=${3}
function joinBy { local IFS="$1"; shift; echo "$*"; }
module add bedops
rootStarchDir=/net/seq/data/genomes/informatics/GRCh38/dbSNP/v151
rootStarchPrefix=${rootStarchDir}/All_20180418
rootStarchFn=${rootStarchPrefix}.starch
partition=queue0
memPerCPU=8000
if [ -d ${resultsDir} ]; then
    echo "Error: Results directory already exists!"
    exit 1
fi
mkdir -p ${resultsDir}
jobIds=()
for chr in `unstarch --list-chr ${rootStarchFn}`; do
    echo ${chr}
    jobId=`sbatch --mem-per-cpu=${memPerCPU} --parsable --partition=${partition} --job-name="parallel_grep_dbSNP_${chr}" --output=${resultsDir}/${chr}.out --error=${resultsDir}/${chr}.err --wrap="${rootStarchDir}/parallel_grep_file.sh ${queryFn} ${rootStarchPrefix}.${chr}.starch ${resultsDir}/${chr}.hits"`
    jobIds+=("${jobId}")
done
dependencies=$(joinBy : "${jobIds[@]}")
sbatch --mem-per-cpu=${memPerCPU} --mail-type=END --mail-user=${contactEmail} --partition=${partition} --job-name="parallel_grep_dbSNP_reduce" --output=${resultsDir}/reduce.out --error=${resultsDir}/reduce.err --dependency=afterok:${dependencies} --wrap="${rootStarchDir}/parallel_grep_file_reduce.sh ${rootStarchFn} ${resultsDir}"
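The joinBy helper just concatenates the collected job IDs with a separator, which is how the --dependency=afterok:... list is built. A quick standalone check of that helper (POSIX function syntax shown, equivalent to the script's definition):

```shell
# Join all arguments after the first with the first argument as
# separator: the first character of IFS is used when expanding "$*"
joinBy() { local IFS="$1"; shift; echo "$*"; }

joinBy : 1001 1002 1003
# -> 1001:1002:1003
```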
The reduce script parallel_grep_file_reduce.sh:
#!/bin/bash
rootStarchFn=$(realpath ${1})
resultsDir=$(realpath ${2})
resultsFn=${resultsDir}/results.bed
module add bedops
for chr in `unstarch --list-chr ${rootStarchFn}`; do
    cat ${resultsDir}/${chr}.hits >> ${resultsFn}.tmp
done
sort-bed --max-mem 6G --tmpdir ${resultsDir} ${resultsFn}.tmp > ${resultsFn}
# cleanup
rm -f ${resultsDir}/*.out ${resultsDir}/*.err ${resultsDir}/*.hits ${resultsFn}.tmp
This is a "map-reduce" search across nuclear chromosomes. The per-chromosome search results are merged (reduced) at the end.
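The same map-reduce shape can be sketched with plain files and grep, with no SLURM or Starch required (toy data; all file names here are illustrative):

```shell
# Map-reduce sketch: per-chromosome searches run concurrently,
# then the hits are merged and sorted
workdir=$(mktemp -d)

# Toy per-chromosome BED files and a query list of rs IDs
printf 'chr1\t100\t101\trs111\nchr1\t200\t201\trs222\n' > "${workdir}/chr1.bed"
printf 'chr2\t300\t301\trs333\n' > "${workdir}/chr2.bed"
printf 'rs111\nrs333\n' > "${workdir}/query_snps.txt"

# Map: search each chromosome file in a background job
# (-F fixed strings, -w whole-word match, -f patterns from file)
for chr in chr1 chr2; do
  grep -wFf "${workdir}/query_snps.txt" "${workdir}/${chr}.bed" > "${workdir}/${chr}.hits" &
done
wait

# Reduce: concatenate per-chromosome hits and sort by chromosome and start
cat "${workdir}"/*.hits | sort -k1,1 -k2,2n > "${workdir}/results.bed"
cat "${workdir}/results.bed"
# -> chr1  100  101  rs111
#    chr2  300  301  rs333
```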
If you are talking about chromosome coordinates, you can get them from either a GTF file or a BED file for the species you are working with, depending on whether it has been completely sequenced and annotated.
Here's another solution:
0. Prerequisites: bcftools, GNU parallel, and BEDOPS (for convert2bed)
1. Get dbSNP
For hg19:
$ wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz
For hg38:
$ wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz
2. Convert vcf to bcf
$ bcftools convert -Ob -o dbSNP_b151.bcf.gz 00-All.vcf.gz
3. Index the bcf file
$ bcftools index dbSNP_b151.bcf.gz
4. Query the bcf file for the SNPs and convert the output to bed
$ parallel -j 4 'bcftools filter -i "ID=@snps.txt" -r {} dbSNP_b151.bcf.gz' ::: {1..22} X Y MT | grep -v "^#" | convert2bed --input=vcf > result.bed
Adjust the value of -j to define how many jobs should run in parallel. snps.txt is the file that contains the rs IDs (one per line).
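For intuition, the -i "ID=@snps.txt" expression keeps only records whose ID column appears in snps.txt. That set-membership lookup can be sketched in awk on toy VCF-like records, with no bcftools required:

```shell
workdir=$(mktemp -d)
printf 'rs111\nrs333\n' > "${workdir}/snps.txt"

# First pass (NR==FNR) loads the wanted IDs into a set; the second
# input (stdin, "-") is then filtered on its third column, the VCF ID field
printf '1\t100\trs111\tA\tG\n1\t200\trs222\tC\tT\n2\t300\trs333\tG\tA\n' \
  | awk -F'\t' 'NR==FNR { ids[$1]; next } $3 in ids' "${workdir}/snps.txt" -
# -> 1  100  rs111  A  G
#    2  300  rs333  G  A
```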
fin swimmer
Hi, Ebtihal Kamal.
You can use a browser-based tool to convert your rs IDs to chromosomes and positions.
There are two assemblies available (build 150): GRCh37 and GRCh38.
Could you post some example data here, Ebtihal Kamal?
https://rest.ensembl.org/vep/human/id/rs56116432?content-type=application/json;refseq=1
The output gives you the SIFT and PolyPhen predictions for each ID in JSON format; parse the JSON for the rsID, coordinates, transcript, change, and prediction. Ebtihal Kamal, since these variants are supposed to be clinically significant (as you are looking for SIFT scores), you could also try the ClinVar VCF for these IDs; the ClinVar VCF carries multiple levels of information for each ID.