Biostar Beta. Not for public use.
Vcf File --> Gene/Protein Mutation Information
5
Entering edit mode
6.3 years ago
newDNASeqer • 630
United States

I used AnnoVar to annotate the VCF files, and it seems to me that the VCF files contain mutation site that only has chromosomal loci/position. I was wondering if there's any tool to transform these mutation information to gene/protein that has the mutated amino acid. For example, I want the final results to show mutations with the corresponding gene names (e.g. p53: Thr 200 --> Glu mutation)

What program can do this with the annotated vcf? THANKS a lot

ADD COMMENTlink
7
Entering edit mode
19 months ago
Seattle, WA USA

If you have a sorted BED file containing genes and their coordinates, you can use BEDOPS bedmap to map positions in a VCF file to those genes.

The overall process is:

  1. Convert the VCF file to BED format with the vcf2bed script
  2. Get genes from somewhere and put them into BED format
  3. Map the converted file to the genes

Conversion

$ vcf2bed < positions.vcf > positions.bed

Obtaining Genes

Now grab some genes and turn them into a BED file. Let's say you are working with mouse (build mm10). We'll take Ensembl gene annotations:

$ wget -O - ftp://ftp.ensembl.org/pub/mnt2/release-71/gtf/mus_musculus/Mus_musculus.GRCm38.71.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene \
    > mm10.genes.bed

You'd modify this step as needed for whichever organism and build you are working with. We use gtf2bed here to convert the GTF-formatted Ensembl records to sorted BED, and we then filter the output specifically for genes.

Mapping

We have two sorted, BED-formatted inputs: the VCF-sourced positions and our gene annotations. We're now ready to use bedmap to do mapping:

$ bedmap --echo --echo-map-id positions.bed mm10.genes.bed > answer.bed

The file answer.bed is a sorted BED file that contains the mutation positions and a semi-colon-delimited list of gene IDs which overlap the position:

[ mutation-1 ] | [ gene-1-1 ] ; ... ; [ gene-1-i ]
[ mutation-2 ] | [ gene-2-1 ] ; ... ; [ gene-2-j ]
...
[ mutation-N ] | [ gene-N-1 ] ; ... ; [ gene-N-k ]

Where no genes overlap a mutation, nothing is printed. The default overlap parameter is one or more bases. You can make this setting more stringent by adjusting overlap criteria.

If you want the entire gene record — and not just the ID — use --echo-map in place of --echo-map-id. Other mapping --echo-* options are available, depending on what you want bedmap to report.

ADD COMMENTlink
0
Entering edit mode

I keep running into your answers now that I'm working with variants again... hello! :D

ADD REPLYlink
0
Entering edit mode

Hey, congrats on the new job! Hope you're well.

ADD REPLYlink
3
Entering edit mode
5.4 years ago
Rochester, NY USA

Please take another look at the on-line Annovar documentation, especially regarding gene-based annotation. Also the quick startup guide. Annovar certainly can output cDNA position info and the amino acid substitution. You just have to ask for it. What command did you run to get your output?

ADD COMMENTlink
1
Entering edit mode
20 months ago
swbarnes2 5.7k
United States

You can also give data in vcf format to ensembl's Variant Effect Predictor, if the organism is in their database.

You can find that here:

http://www.ensembl.org/tools.html

It's a bit of a pain to parse; since each gene has multiple transcripts, it will output the same SNP multiple times, to account for its effect in every relevant transcript.

ADD COMMENTlink
0
Entering edit mode
2.7 years ago
GPR • 320
Mexico

A while ago I was recommended the BEDOPs tools, which coupled to a RefFlat-Genes table, will append a gene name per chromosome location.

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1