What Type Of Information Does The Data Downloaded From The 1000 Genomes Contain?
11.5 years ago
Pappu ★ 2.1k

With this command, I only get a vcf file containing numbers:

tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr11.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 11:62379194-62382592 >1000.vcf

Could you tell me what is going wrong, or how to interpret the output (first 11 columns from awk)?

11 62379317 rs181324718 G T 100 PASS ERATE=0.0004;LDAF=0.0013;AA=G;AN=2184;VT=SNP;THETA=0.0006;RSQ=0.7300;SNPSOURCE=LOWCOV;AC=2;AVGPOST=0.9992;AF=0.0009;ASN_AF=0.0035 GT:DS:GL 0|0:0.000:-0.01,-1.79,-5.00 0|0:0.000:-0.04,-1.06,-5.00

11 62379455 rs146374152 C T 100 PASS ERATE=0.0004;AA=C;AN=2184;VT=SNP;RSQ=0.5436;THETA=0.0056;SNPSOURCE=LOWCOV;AC=1;AVGPOST=0.9992;LDAF=0.0008;AF=0.0005;AFR_AF=0.0020 GT:DS:GL 0|0:0.000:-0.10,-0.69,-4.70 0|0:0.000:-0.00,-2.48,-5.00

11 62379545 rs139680986 C T 100 PASS ERATE=0.0004;AA=C;AN=2184;VT=SNP;THETA=0.0006;LDAF=0.0007;SNPSOURCE=LOWCOV;AC=1;RSQ=0.6432;AVGPOST=0.9995;AF=0.0005;ASN_AF=0.0017 GT:DS:GL 0|0:0.000:-0.01,-1.49,-5.00 0|0:0.000:-0.00,-2.59,-5.00

Can you show us some of the output?


There should be some lines starting with # at the head of this VCF file that walk you through these values. Normally, the VCF file will have the variant calls for all the samples in columns. Counting the columns and rows and reading through the # lines should offer some insight.

11.5 years ago
Chris Whelan ▴ 570

This is hard to read because the tabs aren't showing up here, but the first few fields describe the variant: a SNP (VT=SNP) on chromosome 11 at position 62379317 (your query region was 11:62379194-62382592, so the leading "11" is the chromosome). The SNP has ID rs181324718 and changes a G in the reference to a T in the alternate allele.
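As a rough illustration of how those leading fields break down, here is a minimal Python sketch that splits one data line into the eight fixed VCF columns and turns the INFO column into a dictionary. The function name `parse_variant` is made up for this example, and the line is split on any whitespace only because the tabs were lost in the post; a real VCF line would be split on tab characters.

```python
def parse_variant(line):
    """Split one VCF data line into its fixed fields and an INFO dictionary."""
    # Assumption: whitespace split, because the pasted line lost its tabs.
    # For a real VCF file use line.rstrip("\n").split("\t") instead.
    fields = line.split()
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]

    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        # INFO keys without "=value" are flags; store them as True.
        info_dict[key] = value if value else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt,
        "qual": qual,
        "filter": filt,
        "info": info_dict,
    }


if __name__ == "__main__":
    # First record from the question, truncated to one sample column.
    example = (
        "11 62379317 rs181324718 G T 100 PASS "
        "ERATE=0.0004;LDAF=0.0013;AA=G;AN=2184;VT=SNP;THETA=0.0006;"
        "RSQ=0.7300;SNPSOURCE=LOWCOV;AC=2;AVGPOST=0.9992;AF=0.0009;"
        "ASN_AF=0.0035 GT:DS:GL 0|0:0.000:-0.01,-1.79,-5.00"
    )
    variant = parse_variant(example)
    print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
    print(variant["info"]["VT"], variant["info"]["AF"])
```

The INFO values are kept as strings here; converting AF, AC and so on to numbers is left out to keep the sketch short.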

The numeric values that follow "GT:DS:GL" are the genotype (GT), genotype dosage (DS) and genotype likelihood (GL) values for each of the samples in the 1000 Genomes Project. It's hard to see here because the tabs were lost in your formatting; each sample's entry has the format:

0|0:0.000:-0.01,-1.79,-5.00

In this case, 0|0 is the genotype for a particular sample; 0 means the reference allele and 1 means the alternate allele, so this sample is homozygous for the reference allele. If the sample were called as heterozygous, the field might be 1|0 or 0|1, and a sample homozygous for the alternate allele would be 1|1. The "|" separator means the genotype is phased. DS is the genotype dosage, i.e. the expected number of alternate alleles (between 0 and 2) for that sample; the VCF header gives the exact definition. The three comma-separated numbers are the log10-scaled likelihoods of the 0/0, 0/1 and 1/1 genotypes (AA, AB and BB).
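To make the per-sample entry concrete, the following Python sketch decodes one GT:DS:GL string. It is only an illustration: `decode_sample` is a hypothetical helper, it assumes a diploid call with no missing alleles ("."), and it treats DS as a genotype dosage and GL as log10-scaled likelihoods, which is how these tags are usually defined; the file's own header should be checked for the exact meaning.

```python
def decode_sample(sample, fmt="GT:DS:GL"):
    """Decode one per-sample entry such as '0|0:0.000:-0.01,-1.79,-5.00'."""
    # Pair each FORMAT key with the matching colon-separated value.
    values = dict(zip(fmt.split(":"), sample.split(":")))

    gt = values["GT"]
    phased = "|" in gt  # "|" marks a phased genotype, "/" an unphased one
    # Assumption: no missing alleles, so every allele is an integer index.
    alleles = [int(a) for a in gt.replace("|", "/").split("/")]

    n_alt = sum(1 for a in alleles if a != 0)
    if n_alt == 0:
        zygosity = "homozygous reference"
    elif n_alt == len(alleles):
        zygosity = "homozygous alternate"
    else:
        zygosity = "heterozygous"

    return {
        "alleles": alleles,
        "phased": phased,
        "zygosity": zygosity,
        # Assumed meaning: dosage of the alternate allele.
        "dosage": float(values["DS"]),
        # Assumed meaning: log10 likelihoods for 0/0, 0/1, 1/1 in that order.
        "gl": [float(x) for x in values["GL"].split(",")],
    }


if __name__ == "__main__":
    print(decode_sample("0|0:0.000:-0.01,-1.79,-5.00"))
```

For the entry above, the genotype decodes to two reference alleles, so the sample is reported as homozygous reference and phased.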

The header of the VCF should tell you which sample goes with which column.
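The sample-to-column mapping comes from the last header line, the one beginning with #CHROM: the first nine names are the fixed columns (CHROM through FORMAT) and everything after them is a sample ID. A small Python sketch, assuming that layout; `sample_columns` is a made-up helper name and the two sample IDs in the usage line are only placeholders.

```python
def sample_columns(header_line):
    """Map each sample ID in a '#CHROM ...' header line to its column index."""
    # Split on whitespace so the sketch also works on text that lost its tabs;
    # a real VCF header line is tab-separated.
    columns = header_line.lstrip("#").split()
    # Columns 0-8 are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT.
    return {name: index for index, name in enumerate(columns[9:], start=9)}


if __name__ == "__main__":
    # Illustrative header with two placeholder sample IDs.
    header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097"
    print(sample_columns(header))
```

With that mapping, the genotype entry for a given sample on any data line is simply the field at the returned index.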

See http://www.1000genomes.org/node/101 for more details.


Hi

This is a great answer. A quick note to say that the README file explains some of these tags:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/README.phase1_integrated_release_version3_20120430

If you add the option -h to your tabix command, you will also get the header, which should contain full documentation for all the other tags.
