Format VCF for dbSNP upload
8.5 years ago
willgilks ▴ 360

Prior to submission to NCBI dbSNP, a VCF generated by e.g. HaplotypeCaller requires several modifications:

  1. Add in-house identifiers. --> done
  2. Exclude variants whose alternate allele is "*", i.e. sites that fall inside a deletion. --> probably use SelectVariants or VariantFiltration.
  3. Exclude variants whose ref or alt allele is longer than 50 bp. --> SelectVariants (--maxIndelSize 50) or VariantFiltration.
  4. Exclude variants whose ref and alt alleles do not share a common leading base. --> Not sure ... removing larger indels won't exclude all of these.
  5. Add VRT (variant type) to the INFO field. --> e.g. VRT=1 for an SNV, VRT=2 for an indel, etc. Use GATK + SnpEff.
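
Steps 2 to 4 can probably be done in a single pass without GATK. Below is a minimal awk sketch, not a tested pipeline. It assumes an uncompressed, tab-delimited VCF with REF in column 4 and ALT in column 5. The function name `filter_dbsnp_alleles` is made up for illustration. The leading-base check is only applied to ALT alleles whose length differs from REF, which is my reading of the requirement rather than something the dbSNP document spells out here.

```shell
# Hypothetical helper: keep header lines, drop records that fail steps 2-4.
# Assumes REF is column 4 and ALT is column 5 of a tab-delimited VCF.
filter_dbsnp_alleles() {
    awk -F'\t' -v max=50 '
        /^#/ { print; next }                   # pass header lines through unchanged
        {
            ref = $4
            if (length(ref) > max) next        # step 3: REF longer than 50 bp
            n = split($5, alt, ",")
            for (i = 1; i <= n; i++) {
                if (alt[i] == "*") next         # step 2: "*" alternate allele
                if (length(alt[i]) > max) next  # step 3: ALT longer than 50 bp
                # step 4 (assumption): an ALT of different length than REF
                # must start with the same base as REF
                if (length(alt[i]) != length(ref) &&
                    substr(alt[i], 1, 1) != substr(ref, 1, 1)) next
            }
            print
        }
    ' "$1"
}
```

Usage would be something like `filter_dbsnp_alleles basic.f1.${vcf} > filtered.basic.f1.${vcf}`, where the output name is again just a placeholder. Unlike a line-based sed delete, this only looks at the ALT column, so a "*" elsewhere on the line does not trigger removal.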

Could anyone provide any good tips on excluding and annotating variants appropriately for NCBI?

I'm looking into all of this today, and will post if I get any solutions.

dbSNP submission format: http://www.ncbi.nlm.nih.gov/SNP/docs/dbSNP_VCF_Submission.pdf

formatting vcf gatk dbSNP SNPeff • 2.3k views
7.2 years ago
willgilks ▴ 360

Answering my own question. The core of the solution is roughly:

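## Remove variants with a null ("*") alternate allele.
## Note: this only matches "*" when it follows a comma, i.e. when it is not the first ALT allele.
sed '/,\*/d' basic.f1.${vcf} > naa.basic.f1.${vcf}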

## In the header, add more info after the fileformat line: file date, lab handle, batch, reference assembly and population.
## In the records and the INFO header, replace the Broad-GATK variant type annotation with the NCBI dbSNP VRT code.

sed -e 's|##fileformat=VCFv4.1|##fileformat=VCFv4.1\n##fileDate=20160423\n##handle=MORROW_EBE_SUSSEX\n##batch=GILKS_LHM_RG\n##reference=GCA_000001215.4\n##population_id=LHM_RG_hemiclones|g' \
    -e 's|;VariantType=SNP;set=variant|;VRT=1|g' \
    -e 's|;VariantType=MULTIALLELIC_SNP;set=variant|;VRT=1|g' \
    -e 's|;VariantType=INSERTION.*;set=variant|;VRT=2|g' \
    -e 's|;VariantType=DELETION.*;set=variant|;VRT=2|g' \
    -e 's|;VariantType=MULTIALLELIC_COMPLEX.Other;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_COMPLEX;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_MIXED.Other;set=variant|;VRT=8|g' \
    -e 's|;VariantType=MULTIALLELIC_MIXED;set=variant|;VRT=8|g' \
    -e 's|INFO=<ID=VariantType,Number=1,Type=String,Description="Variant type description">|INFO=<ID=VRT,Number=1,Type=Integer,Description="Variation type,1 - SNV: single nucleotide variation,2 - DIV: deletion/insertion variation,3 - HETEROZYGOUS: variable, but undefined at nucleotide level,4 - STR: short tandem repeat (microsatellite) variation, 5 - NAMED: insertion/deletion variation of named repetitive element,6 - NO VARIATON: sequence scanned for variation, but none observed,7 - MIXED: cluster contains submissions from 2 or more allelic classes (not used),8 - MNV: multiple nucleotide variation with alleles of common length greater than 1,9 - Exception">|g' naa.basic.f1.${vcf} > format.naa.basic.f1.${vcf}

## Re-add GATK variant type for completeness and vcf indexing.

GenomeAnalysisTK -R ${refseq} \
    -T VariantAnnotator \
    -V format.naa.basic.f1.${vcf} \
    -A VariantType \
    -o dbSNP.${vcf}
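
Before uploading, it seems worth checking that every record actually picked up a VRT tag, since any VariantType value not covered by the sed expressions above would be left as it was. A rough sketch, assuming an uncompressed VCF with INFO in column 8; `count_vrt` is a name invented here, not part of GATK or any NCBI tool:

```shell
# Hypothetical helper: tally records per VRT code; records without a VRT tag
# in the INFO column (column 8) are counted under "missing".
count_vrt() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        {
            vrt = "missing"
            n = split($8, kv, ";")             # INFO is a ;-separated list
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^VRT=/) vrt = substr(kv[i], 5)
            count[vrt]++
        }
        END { for (v in count) print v "\t" count[v] }
    ' "$1" | sort
}
```

Running `count_vrt dbSNP.${vcf}` prints one line per VRT code with its record count. A "missing" line would point at variant types that the substitutions did not catch.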