What are valid VCF symbolic alternate alleles for structural variants?
4.8 years ago

I've noticed custom symbolic alternate alleles for structural variants in a few VCFs. For example, the gnomAD SV VCF

...
##ALT=<ID=CTX,Description="Reciprocal chromosomal translocation">
...
12  60718971    gnomAD_v2_CTX_12_13 N   <CTX>   999 PASS    END=57020218;SVTYPE=CTX;CHR2=13;SVLEN=-1

and the 1000 genomes phase3 integrated call set

...
##ALT=<ID=CN2,Description="Copy number allele: 2 copies">
...
1   668630  esv3584976  G   <CN2>   100 PASS    AC=64;AF=0.0127796

The problem I'm having, though, is that the VCF 4.3 spec reads (emphasis added, 4.2 and 4.1 specs are similar):

1.4.5 Alternative allele field format

Symbolic alternate alleles are described as follows:

##ALT=<ID=type,Description=description>

Structural Variants

In symbolic alternate alleles for imprecise structural variants, the ID field indicates the type of structural variant, and can be a colon-separated list of types and subtypes. ID values are case sensitive strings and must not contain whitespace or angle brackets. The first level type must be one of the following:

  • DEL Deletion relative to the reference
  • INS Insertion of novel sequence relative to the reference
  • DUP Region of elevated copy number relative to the reference
  • INV Inversion of reference sequence
  • CNV Copy number variable region (may be both deletion and duplication)
  • BND Breakend

The CNV category should not be used when a more specific category can be applied. Reserved subtypes include:

  • DUP:TANDEM Tandem duplication
  • DEL:ME Deletion of mobile element relative to the reference
  • INS:ME Insertion of a mobile element relative to the reference

The way I read that, the symbolic alternate allele must start with one of those ID values. i.e. <CTX> should really be something like <BND:CTX> and <CN2> should be something like <CNV:CN2>.
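To make that reading concrete, here is a minimal Python sketch of the check it implies. The names `SPEC_SV_TYPES`, `first_level_type`, and `uses_spec_sv_type` are hypothetical and not from any VCF library; the set only reproduces the first-level types quoted above.

```python
# First-level SV types listed in the quoted VCF 4.3 text (section 1.4.5).
SPEC_SV_TYPES = {"DEL", "INS", "DUP", "INV", "CNV", "BND"}


def first_level_type(alt):
    """Return the first-level type of a symbolic ALT such as '<DUP:TANDEM>'.

    Returns None when the allele is not written in angle brackets.
    """
    if len(alt) < 3 or not (alt.startswith("<") and alt.endswith(">")):
        return None
    # IDs are colon-separated lists of type and subtypes; keep the first part.
    return alt[1:-1].split(":", 1)[0]


def uses_spec_sv_type(alt):
    """True if the symbolic allele starts with one of the listed SV types."""
    return first_level_type(alt) in SPEC_SV_TYPES


for alt in ["<CTX>", "<CN2>", "<DUP:TANDEM>", "<BND:CTX>", "G"]:
    print(alt, first_level_type(alt), uses_spec_sv_type(alt))
```

Under this strict reading, `<CTX>` and `<CN2>` fail the check, while the hypothetical `<BND:CTX>` passes.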

Are <CTX>, <CN2>, etc. actually invalid, or am I misunderstanding the spec?

vcf gnomad 1000 genomes sv specification • 2.3k views

Yes, I was just about to comment that that question (CTX in INFO/SVTYPE, which is apparently valid) is what led me down the path to this question (CTX in ALT, which may be invalid). Thanks.

4.8 years ago

Immediately after the text you've quoted, the rest of §1.4.5 says

IUPAC ambiguity codes

Symbolic alleles can be used also to represent genuinely ambiguous data in VCF, for example:

##ALT=<ID=R,Description="IUPAC code R = A/G">
##ALT=<ID=M,Description="IUPAC code M = A/C">

So “The first level type must be one of the following” relates only to the Structural Variants section.

Thus §1.4.5 IMHO allows for arbitrary symbolic alleles, lists all the (first-level) keywords used for the VCF spec's SV representations, and shows examples of using them for IUPAC ambiguity codes.

So you're certainly allowed to invent your own symbolic alleles like <CTX> or <CN2>. Whether you're allowed to use such things for Structural Variants is another question.

IMHO the reasonable interpretation is: of course! Not everybody loves the BND notation, which is admittedly somewhat cumbersome, and it would be ridiculous for the spec to provide facilities to invent your own tags (which VCF has always done) and then claim you can't use them for a particular topic such as SVs.

IMHO that section should be read as saying you can invent your own tags for SV-related activities, but they won't be what the spec declares to be SV notation and so you can't expect general tools to interpret them cleverly like they are able to for the well-known keywords listed in that itemised list.
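As a rough illustration of what that means for tooling, the Python sketch below scans header lines and reports declared ALT IDs whose first-level type is not in the spec's SV list, i.e. the ones a general tool has no built-in meaning for. The function name `alt_ids_outside_sv_list` and the regular expression are my own assumptions, not part of any VCF library, and the pattern assumes `ID` is the first key in the `##ALT` line. The example header is assembled from lines quoted in this thread, not copied from a real file.

```python
import re

# First-level SV types listed in VCF 4.3 section 1.4.5.
SPEC_SV_TYPES = {"DEL", "INS", "DUP", "INV", "CNV", "BND"}

# Assumes ID is the first key of the ##ALT line, as in the examples here.
ALT_HEADER = re.compile(r"^##ALT=<ID=([^,>]+)")


def alt_ids_outside_sv_list(header_lines):
    """Return declared ALT IDs whose first-level type is not a spec SV type."""
    ids = []
    for line in header_lines:
        match = ALT_HEADER.match(line)
        if match is None:
            continue  # not an ##ALT declaration
        alt_id = match.group(1)
        if alt_id.split(":", 1)[0] not in SPEC_SV_TYPES:
            ids.append(alt_id)
    return ids


example_header = [
    '##fileformat=VCFv4.3',
    '##ALT=<ID=CTX,Description="Reciprocal chromosomal translocation">',
    '##ALT=<ID=CN2,Description="Copy number allele: 2 copies">',
    '##ALT=<ID=DUP:TANDEM,Description="Tandem duplication">',
    '##ALT=<ID=R,Description="IUPAC code R = A/G">',
]
print(alt_ids_outside_sv_list(example_header))
```

Note that the IUPAC code `R` is reported alongside `CTX` and `CN2`: the list only tells you which alleles fall outside the spec's SV keywords, not whether they are invalid.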

VCF provides well-known tags that tools will assume mean what the spec says, and hopes that these tags will be suitable for your use. It also provides a framework for invention of new bespoke tags, and if they become popular they will be candidates for tool support and addition to the spec.

These specifications were not carried down from a mountain carved in stone. They are the works of a small and fallible group of people, and their wording is from time to time not particularly clear or precise. So questions like this are good candidates for hts-specs issues, as even if the spec is correct in some particular area, queries like yours indicate that there is room for improvement in clarity.

Lack of clarity is always a bug in the specification. Unfortunately VCF in particular is a very flexible format and its specification is not always clear or complete…
