2.2 years ago
To answer question 1:
A better way to think of it is that each row in the VCF file is a specific variant location in the genome. For a SNP, there are only four possible nucleotides at that position: A, C, G or T. One of them is the reference allele, which leaves at most three options for the alternate allele (ALT) column, and a single row can list more than one alternate allele. This is still classed as the same variant. Sometimes you will see more than one variant reported at the same location because they were identified by different databases.
Only the first eight of these columns are mandatory in a VCF file; FORMAT is required only when genotype (sample) columns follow:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
The columns after these are not fixed fields, as RamRS points out. They commonly contain genotype data, one column per individual (e.g. NA00001). Each individual is either homozygous for the reference allele (0|0), heterozygous (1|0 or 0|1), or homozygous for an alternate allele (1|1, or 2|2 when there is more than one alternate allele, for example).
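To make the genotype encoding concrete, here is a minimal Python sketch that labels a diploid GT value. The function name `classify_genotype` is made up for illustration, and the sketch assumes diploid calls separated by `|` (phased) or `/` (unphased), with `.` meaning a missing allele.

```python
import re

def classify_genotype(gt):
    """Label a diploid GT string such as '0|1' or '1/1' (illustrative helper)."""
    # Tolerate spaces around the separator, e.g. '0 | 1'.
    alleles = re.split(r"[|/]", gt.replace(" ", ""))
    if "." in alleles:
        return "missing"
    if len(alleles) != 2:
        raise ValueError("expected a diploid genotype, got %r" % gt)
    a, b = (int(x) for x in alleles)
    if a != b:
        # Covers 0|1 and 1|0, and also two different alternates such as 1|2.
        return "heterozygous"
    return "homozygous reference" if a == 0 else "homozygous alternate"

for gt in ["0|0", "1|0", "0|1", "1|1", "2|2"]:
    print(gt, classify_genotype(gt))
```

The numbers are indexes into the allele list: 0 is REF, 1 is the first ALT allele, 2 is the second, and so on.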
Here's an example for the 1000 Genomes variant rs11725853 for a select few individuals (called HG#####). The INFO value is left out of the data row, so FORMAT (GT) follows FILTER (PASS) directly:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 HG00106 HG00107 HG00108 HG00109
4 175642699 rs11725853 G C,A 100 PASS GT 0|0 0|0 0|2 2|0 0|0 0|1 0|0 1|1 2|1 2|1 0|2 2|2
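As a rough illustration of how to read that row, the Python sketch below turns each genotype back into nucleotides. It only uses the two lines shown above. Because the data row as printed has one field fewer than the header (no INFO value), the sketch takes the genotypes from the end of the row rather than by fixed column index; that workaround is my assumption for this example, not how a real VCF parser works.

```python
header = ("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
          "HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 "
          "HG00103 HG00105 HG00106 HG00107 HG00108 HG00109").split()
row = ("4 175642699 rs11725853 G C,A 100 PASS GT "
       "0|0 0|0 0|2 2|0 0|0 0|1 0|0 1|1 2|1 2|1 0|2 2|2").split()

# Sample names are everything after the nine standard header columns.
samples = header[9:]
# The printed row has no INFO value, so count genotypes from the end.
genotypes = row[-len(samples):]

# Allele index 0 is REF; 1, 2, ... are the ALT alleles in order.
alleles = [row[3]] + row[4].split(",")  # G, C, A

def decode(gt):
    """Translate a phased GT such as '0|2' into nucleotides such as 'G|A'."""
    return "|".join(alleles[int(i)] for i in gt.split("|"))

calls = {sample: decode(gt) for sample, gt in zip(samples, genotypes)}
for sample, call in calls.items():
    print(sample, call)
```

So HG00099 (0|2) carries G and A, HG00105 (1|1) is homozygous for C, and HG00109 (2|2) is homozygous for A.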
To answer question 2:
The VCF specification that RamRS linked to describes the ID column as follows:
"ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant the rs number(s) should be used. No identifier should be present in more than one data record. If there is no identifier available, then the MISSING value should be used. (String, no white-space or semi-colons permitted, duplicate values not allowed.)"
You can actually put your own labels in this field, as long as they contain no whitespace or semicolons. Sometimes more than one identifier applies to a location, e.g. a dbSNP germline variant and a COSMIC somatic variant; both would be listed here, separated by semicolons. These IDs do not refer to the individuals: as we see above, individuals are listed as columns after the standard VCF header columns.
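If you need to pull the individual identifiers out of the ID column, a small sketch along these lines should do. The helper name `parse_ids` and the second identifier `COSM_EXAMPLE` are placeholders I made up for illustration, not a real COSMIC record.

```python
def parse_ids(id_field):
    """Split a VCF ID field into its identifiers (illustrative helper)."""
    # '.' is the VCF missing value: no identifier is available.
    if id_field == ".":
        return []
    # Multiple identifiers are separated by semicolons.
    return id_field.split(";")

print(parse_ids("rs11725853"))
print(parse_ids("rs11725853;COSM_EXAMPLE"))  # COSM_EXAMPLE is a made-up placeholder
print(parse_ids("."))
```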