Question: ID and Variants in VCF
  1. Can there be more than one variant at a given position within the same VCF file, given that it contains variants of several individuals?

  2. What does the ID field in the body correspond to? Is it the ID of that particular variant? Or is it the ID of the individual that this variant belongs to?

sanna.aizad • 10, 14 months ago (updated 14 months ago by Erin_Ensembl • 370)

Hello sanna.aizad,

These seem like assignment questions. Are they? Have you tried reading the VCF specifications?

RamRS • 21k, 14 months ago

Yes, I have read the VCF specs several times and I still couldn't figure these out.

sanna.aizad • 10, 14 months ago

Are these assignment questions though?

Also, can you please explain to me your understanding of what a variant is? This understanding is critical to answering your first question.

The VCF format can be understood by thinking of it as a 3D matrix: each row is a variant, each non-fixed field (column) is a sample, and each variant-sample intersection ("cell") is itself a small matrix describing the nature of the specific variant in the specific sample.

This is an example from the specifications doc:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
1 2827694 rs2376870 CGTGGATGCGGGGAC C . PASS SVTYPE=DEL;END=2827708;HOMLEN=1;HOMSEQ=G;SVLEN=-14 GT:GQ 1/1:14

The non-fixed field here is NA00001 and the variant entry is for position 1:2827694. As you can see, the matrix at the intersection is as follows:

        NA00001
GT    1/1
GQ    14

The GT and GQ part is obtained from the FORMAT entry, and is uniform across samples.
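
To make the "cell" idea concrete, here is a minimal Python sketch that splits the example record above into one small key/value table per sample. The helper name `sample_cells` is made up for illustration, and the lines are split on any whitespace because the example is shown that way here; real VCF files are tab-delimited.

```python
# Header and record copied from the example above.
# Assumption: whitespace-separated as displayed; real VCF uses tabs.
header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001"
record = ("1 2827694 rs2376870 CGTGGATGCGGGGAC C . PASS "
          "SVTYPE=DEL;END=2827708;HOMLEN=1;HOMSEQ=G;SVLEN=-14 GT:GQ 1/1:14")


def sample_cells(header_line, record_line):
    """Return {sample_name: {format_key: value}} for one VCF record.

    Hypothetical helper for illustration, not part of any library.
    """
    columns = header_line.lstrip("#").split()
    fields = record_line.split()
    format_keys = fields[8].split(":")  # the 9th column is FORMAT
    cells = {}
    # Columns 10 onwards are the samples; each value follows the FORMAT order.
    for name, raw in zip(columns[9:], fields[9:]):
        cells[name] = dict(zip(format_keys, raw.split(":")))
    return cells


cells = sample_cells(header, record)
print(cells)  # {'NA00001': {'GT': '1/1', 'GQ': '14'}}
```

With more sample columns, the same FORMAT keys are reused for every sample in that row, which is what "uniform across samples" means.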

RamRS • 21k, 14 months ago

Thank you for this. I am trying to understand what a variant is, so I don't understand what you mean by assignment.

I have tried to represent the 3D matrix in 2D to understand it better, using the example from the VCF specs v4.1.

I have a feeling I may have gotten the Alts wrong. But here is what I have understood:

https://ibb.co/6R0bK3f

sanna.aizad • 10, 14 months ago

Looks about right. The confusion you have with "which of the two numbers is the ref allele" goes away once you know that 0 is always the ref allele. Usually, in unphased VCF files [1], you'd see heterozygous genotypes as 0/1. However, phased VCF entries can show 1|0 (note the | pipe symbol as opposed to the / forward-slash). This means that the genotype for that individual is heterozygous and that the order of the alleles is meaningful: the first allele (1) sits on one copy of the chromosome and the second allele (0) on the other, consistently with neighbouring phased variants. Some pipelines (trio-based phasing, for example) use the order to record parent of origin, but the VCF spec itself does not say which side is paternal or maternal. Phasing adds this haplotype information on top of the zygosity.

[1]: This could change for multi-allelic variants, and will change for non-diploid cells (see specs below for example). An unphased entry for a biallelic variant in a diploid organism is easier to understand, and other cases start adding layers of complexity.

From the specs doc:

GT (String): Genotype, encoded as allele values separated by either of / or |. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1, 1 | 0, or 1/2, etc. Haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondrion, are indicated by having only one allele value. A triploid call might look like 0/0/1. If a call cannot be made for a sample at a given locus, ‘.’ must be specified for each missing allele in the GT field (for example ‘./.’ for a diploid genotype and ‘.’ for haploid genotype). The meanings of the separators are as follows (see the PS field below for more details on incorporating phasing information into the genotypes):

  • / : genotype unphased
  • | : genotype phased
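
As a rough illustration of the GT rules quoted above, the following Python sketch splits a GT string into allele indices plus a phased flag. `parse_gt` is a made-up helper name, and it assumes a single separator type per genotype, which covers the diploid, haploid and triploid examples in the quote.

```python
def parse_gt(gt):
    """Split a GT string into (alleles, phased).

    alleles: list of ints (0 = REF, 1 = first ALT, ...), None for a missing '.'.
    phased:  True if the '|' separator is used. A haploid call has no
             separator and is reported as unphased here (a choice made
             for this sketch, not something the quote above dictates).
    """
    phased = "|" in gt
    parts = gt.replace("|", "/").split("/")
    # strip() tolerates the spaced form "1 | 0" used in the quoted text
    alleles = [None if p.strip() == "." else int(p) for p in parts]
    return alleles, phased


print(parse_gt("0/1"))    # ([0, 1], False)
print(parse_gt("1|0"))    # ([1, 0], True)
print(parse_gt("./."))    # ([None, None], False)
print(parse_gt("0/0/1"))  # ([0, 0, 1], False)
```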
RamRS • 21k, 14 months ago

By the way, see How to add images to a Biostars post to add your images properly. You need the direct link to the image, not the link to the webpage that has the image embedded (which is what you have used here)

RamRS • 21k, 14 months ago

To answer question 1: A better way to think of it is that each row in the VCF file is a specific variant location in the genome. For a SNP there are only four possible nucleotides: A, C, G or T. One of them is the reference, which leaves at most three options for the alternate allele column, and more than one alternate allele may be listed there. This is still classed as the same variant. Sometimes you can also have multiple variants at the same location because they were identified by different databases.

Only the first eight of these columns (#CHROM to INFO) are mandatory in a VCF file; FORMAT is needed only when genotype columns follow it:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

Columns after these are non-fixed fields, as RamRS points out, and commonly contain genotype data, where an individual (e.g. NA00001) is either homozygous for the reference allele (0|0), heterozygous (1|0 or 0|1), or homozygous for an alternate allele (1|1, or 2|2 if there is more than one alternate allele, for example). Here's an example for a 1000 Genomes variant, rs11725853, for a select few individuals (called HG#####):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 HG00106 HG00107 HG00108 HG00109
4   175642699   rs11725853  G   C,A 100 PASS    GT  0|0 0|0 0|2 2|0 0|0 0|1 0|0 1|1 2|1 2|1 0|2 2|2

To answer question 2: The VCF specifications that RamRS linked to describe the ID column as follows:

"ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant the rs number(s) should be used. No identifier should be present in more than one data record. If there is no identifier available, then the MISSING value should be used. (String, no white-space or semi-colons permitted, duplicate values not allowed.)"

You can actually put your own labels in this field, as long as they contain no whitespace or semicolons. Sometimes more than one identifier applies at a location, e.g. a dbSNP germline variant and a COSMIC somatic variant. They would both be listed here, separated by semicolons. These IDs are not for the individuals: as we see above, individuals are listed as columns after the standard VCF file header columns.
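
Going by the spec text quoted above (semicolon-separated identifiers, with a MISSING value when none is available), reading the ID column could be sketched like this in Python. `parse_ids` is a made-up helper name, and the identifiers in the last call are invented placeholders, not real database entries.

```python
def parse_ids(id_field):
    """Return the identifiers in a VCF ID field as a list.

    '.' is the VCF missing value and yields an empty list.
    Hypothetical helper for illustration only.
    """
    if id_field == ".":
        return []
    return id_field.split(";")


print(parse_ids("rs11725853"))            # ['rs11725853']
print(parse_ids("."))                     # []
print(parse_ids("rs0000001;MY_LABEL_1"))  # ['rs0000001', 'MY_LABEL_1'] (made-up IDs)
```

Note that nothing here refers to a sample: the function only ever sees the third column of a record.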

Erin_Ensembl • 370, 14 months ago

So, using your example:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 HG00106 HG00107 HG00108 HG00109
4   175642699   rs11725853  G   C,A 100 PASS    GT  0|0 0|0 0|2 2|0 0|0 0|1 0|0 1|1 2|1 2|1 0|2 2|2

Would this mean:

 Sample    GT     Allele     Genotype(Ref-ph/unph- Alt)
 HG00096   0|0      G        G-ph-G
 HG00097   0|0      G        G-ph-G
 HG00099   0|2      A        G-ph-A
 HG00100   2|0      G        A-ph-G
 HG00101   0|0      G        G-ph-G
 HG00102   0|1      C        G-ph-C     
 HG00103   0|0      G        G-ph-G
 HG00105   1|1      C        C-ph-C
 HG00106   2|1      C        A-ph-C
 HG00107   2|1      C        A-ph-C
 HG00108   0|2      A        G-ph-A
 HG00109   2|2      A        A-ph-A
sanna.aizad • 10, 14 months ago

Yes that looks correct to me!
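
If you want to check a table like this yourself, a small Python sketch along these lines does the lookup: index 0 is REF and the ALT alleles follow in order. The REF, ALT, sample names and genotypes are copied from the rs11725853 example; `to_bases` is a made-up helper, and it does not handle missing alleles (`.`).

```python
ref = "G"
alts = "C,A".split(",")
alleles = [ref] + alts  # index 0 = REF (G), 1 = first ALT (C), 2 = second ALT (A)

samples = ("HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 "
           "HG00103 HG00105 HG00106 HG00107 HG00108 HG00109").split()
gts = "0|0 0|0 0|2 2|0 0|0 0|1 0|0 1|1 2|1 2|1 0|2 2|2".split()


def to_bases(gt, alleles):
    """Translate allele indices in a GT string into bases, keeping the separator.

    Hypothetical helper; assumes no missing ('.') alleles.
    """
    sep = "|" if "|" in gt else "/"
    return sep.join(alleles[int(i)] for i in gt.split(sep))


decoded = {sample: to_bases(gt, alleles) for sample, gt in zip(samples, gts)}
for sample, bases in decoded.items():
    print(sample, bases)
```

For example, HG00099 (0|2) comes out as G|A and HG00106 (2|1) as A|C, matching the Genotype column in the table.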

Erin_Ensembl • 370, 14 months ago
