Why multiple SYMBOLS, Consequences... for Variant Effect Predictor (VEP)
1
0
Entering edit mode
16 months ago
gernophil ▴ 80

Hey everyone,

I have a question about the VEP results. Why are there for some variants multiple features like consequence, gene symbol, ensemble id...? And why does it get more, if I have more samples?

Shouldn't the gene be specified by the position on the genome?

An example is this (around 20 samples, after bcftools +vep-split):

CHROM   POS REF ALT ID  Consequence SYMBOL  Existing_variation  VARIANT_CLASS   Gene
17  81645307    G   A   .   intron_variant&non_coding_transcript_variant,non_coding_transcript_exon_variant,missense_variant,missense_variant,missense_variant&NMD_transcript_variant,regulatory_region_variant NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,.    rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617 SNV,SNV,SNV,SNV,SNV,SNV ENSG00000182446,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,.

The same variant with around 500 samples (including the above 17):

CHROM   POS REF ALT ID  Consequence SYMBOL  Existing_variation  VARIANT_CLASS   Gene
17  81645307    G   A   .   intron_variant&non_coding_transcript_variant,intron_variant&non_coding_transcript_variant,non_coding_transcript_exon_variant,non_coding_transcript_exon_variant,missense_variant,missense_variant,missense_variant,missense_variant,missense_variant&NMD_transcript_variant,missense_variant&NMD_transcript_variant,regulatory_region_variant,regulatory_region_variant NPLOC4,NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,.,.   rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617 SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV ENSG00000182446,ENSG00000182446,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,.,.

One explanation for multiple entries that I could think of could be that a variant can sit in a coding region for one gene and in a regulatory region for another. However, this does not explain, why there's a different amount at different n. Can someone explain this to me? My VCF are called with Haplotypecaller per sample and then merged and the merged VCF is then annotated.

VEP • 1.1k views
ADD COMMENT
2
Entering edit mode
16 months ago
barslmn ★ 2.1k

Ensembl VEP annotates for every allele, gene and transcript. You can flag or pick alleles or transcripts with pick options.

https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick

If you add pick flags you can explode this line with -d option of bcftools split-vep, you can select later the annotations you're interested in with bcftools expressions like -i 'PICK~"1"'

ADD COMMENT
0
Entering edit mode

Could you elaborate this in a little more detail. I get such an output e.g. (already after bcftools +split-vep to make it more readable and I left out some columns):

1   930227  C   T   upstream_gene_variant,missense_variant,missense_variant,missense_variant,missense_variant,missense_variant,missense_variant,missense_variant,missense_variant,missense_variant  SAMD11,SAMD11,SAMD11,SAMD11,SAMD11,SAMD11,SAMD11,SAMD11,SAMD11,SAMD11

How do I know which of these annotations is for what? I haven't found this comma form in the manpages for vep.

ADD REPLY
0
Entering edit mode

If you have the latest bcftools you can use --select argument. If you're going to pick more than one transcript you might want to add -d parameter.

bcftools +split-vep -f "columns" -d --select PICK=1

More about it can be found in the how to guide. https://samtools.github.io/bcftools/howtos/plugin.split-vep.html

ADD REPLY
0
Entering edit mode

Ok, I types a really long answer and then I hit the edit button on another message and it was gone. Maybe it would be nice to have drafts saved here, but that's another topic :). So, let's try again (in a shorter way):

My question is not really about filtering, but more about understanding the output. One of the first answers was: "VEP annotates for every allele, gene and transcript". So, isn't there a way to check, which annotation is for wich? So which is for the gene, which for the alleles and which for the different transcripts? If I understand it correctly, using -d -s PICK=1 simply takes the first entry, but that's not really, what I want to do.

E.g. I also annotated with AlphaMissense and here I have the problem that some entries, even though they show an aa change, don't show an AlphaMissense annotation. So simply picking the first would definitely not work here:

#CHROM  POS REF ALT Consequence SYMBOL  Codons  STRAND  am_class    am_pathogenicity
1   930330  A   G   missense_variant,missense_variant,missense_variant,missense_variant,[truncated] 
SAMD11,SAMD11,SAMD11,SAMD11,[truncated] aAg/aGg,aAg/aGg,aAg/aGg,aAg/aGg,[truncated] 
1,1,1,1,[truncated] .,likely_benign,likely_benign,.,[truncated] .,0.1312,0.1312,.,[truncated]

So, all are missense, but only 2 and 3 have AM annotations. That's why I would rather get all outputs and then check, which one to use, rather then simply picking one.

ADD REPLY
0
Entering edit mode

Ok, I types a really long answer and then I hit the edit button on another message and it was gone. Maybe it would be nice to have drafts saved here, but that's another topic :).

That problem is not exclusive to biostars but to any form you're filling on a web browser. I suggest you do all your editing in a reliable text editor and just copy and paste to form after you finish editing, or you can use something like ghosttext.

My question is not really about filtering, but more about understanding the output. One of the first answers was: "VEP annotates for every allele, gene and transcript". So, isn't there a way to check, which annotation is for wich? So which is for the gene, which for the alleles and which for the different transcripts? If I understand it correctly, using -d -s PICK=1 simply takes the first entry, but that's not really, what I want to do.

You can see that from the annotation itself. You will see multiple genes, transcripts lines for the given input line.

Let's inspect a locus that overlaps two genes:

igv_ss

Let's create a multi-allelic example input for VEP:

1 45340254 . T C,A,G . . .

And let's inspect the results.

| index | pos | allele | consequence | symbol | transcript |
| 1 | 1:45340254-45340254 | C | start_lost | TOE1 | ENST00000372090.6 |
| 2 | 1:45340254-45340254 | A | start_lost | TOE1 | ENST00000372090.6 |
| 3 | 1:45340254-45340254 | G | start_lost | TOE1 | ENST00000372090.6 |
| 4 | 1:45340254-45340254 | C | start_lost | MUTYH | ENST00000372098.7 |
| 5 | 1:45340254-45340254 | A | start_lost | MUTYH | ENST00000372098.7 |
| 6 | 1:45340254-45340254 | G | start_lost | MUTYH | ENST00000372098.7 |
| 7 | 1:45340254-45340254 | C | upstream_gene_variant | MUTYH | ENST00000372104.5 |
| 8 | 1:45340254-45340254 | A | upstream_gene_variant | MUTYH | ENST00000372104.5 |
| 9 | 1:45340254-45340254 | G | upstream_gene_variant | MUTYH | ENST00000372104.5 |

The rows 1,2,3 are output for C, A, G alleles for TOE1 gene, annotations for different input alleles.

The rows 4,5,6 are again annotations for different alleles but also for different gene MUTYH.

And the rows 7,8,9 are annotation for different alleles for the gene MUTYH but this time for a different transcript of the MUTYH.

E.g. I also annotated with AlphaMissense and here I have the problem that some entries, even though they show an aa change, don't show an AlphaMissense annotation. So simply picking the first would definitely not work here:

AlphaMissense might only have annotations for some transcripts. That might be why you're not seeing the results in all of them. I suggest you check the AlphaMissense tsv file you downloaded and see for yourself.

ADD REPLY

Login before adding your answer.

Traffic: 1762 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6