Deleted:Questions about learning to understand my 23andme imputed genome (VCF files)
2.9 years ago

Hey,

I recently got access to my own imputed genome from 23andme (v6 imputation), and decided to teach myself a bit of bioinformatics using it. I have a basic background in genetics, but am more of a CS person.

I started by downloading the raw BCF files that 23andme provides for export, and took a look at the files. I have the following questions about the data/file format/understanding of genomics:

1) My imputed genome download has data for chromosomes 1-22 and X, but not for my mtDNA or Y chromosome. I am male, and both appear in my raw .txt data. Why are they absent from the imputed genome (which I downloaded directly from 23andme's archive portal for my profile)? Are the mtDNA and Y chromosome handled separately during 23andme's processing?

2) The BCF (VCF) files that I downloaded have the following structure:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1

chrX 3617243 . T C . PASS . HDS 0,0

Am I correct that POS is the base-pair offset along the chromosome (the locus), and that "REF" is the allele found in the reference human genome 23andme uses? If so, is the "ALT" column the genotype that I actually have?
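To make the column layout concrete, here is a minimal sketch that splits one row into the standard VCF fields. It assumes the BCF has already been converted to whitespace-separated VCF text (for example with bcftools view); parse_record and the sample name "1" are illustrative names taken from the header shown above, not part of any library.

```python
# Fixed VCF columns, in the order shown in the header line above.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_record(line, sample_names=("1",)):
    """Split one VCF data line into its fixed fields and per-sample values.

    Hypothetical helper: assumes whitespace-separated text and one value
    column per sample after the FORMAT column.
    """
    fields = line.split()
    fixed = dict(zip(VCF_COLUMNS, fields[:9]))
    fixed["POS"] = int(fixed["POS"])  # 1-based position on the chromosome

    # FORMAT lists the keys (colon-separated) used in each sample column.
    format_keys = fixed["FORMAT"].split(":")
    samples = {}
    for name, raw in zip(sample_names, fields[9:]):
        samples[name] = dict(zip(format_keys, raw.split(":")))
    return fixed, samples


fixed, samples = parse_record("chrX 3617243 . T C . PASS . HDS 0,0")
print(fixed["CHROM"], fixed["POS"], fixed["REF"], fixed["ALT"])
print(samples["1"])
```

Note that REF and ALT only describe the alleles that exist at the site; what the sample carries is encoded in the sample column (here the HDS values), not in ALT itself.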

3) What exactly is the HDS score, in simple terms? My basic understanding is that it is a frequency attached to imputed variant calls, corresponding to the fraction of the population that has the imputed variant. I feel like I'm not fully understanding this. Do rows that have "0,0" indicate that the data is not imputed?
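One way to explore this is to read HDS the way some imputation tools (Minimac4, for example) define it: an estimated ALT-allele dosage per haplotype, each value between 0 and 1. Whether 23andme's files follow that convention is an assumption here; the ##FORMAT line in the file header should give the actual definition. Under that assumption, a minimal sketch (all function names are hypothetical):

```python
def parse_hds(hds_field):
    """Turn an HDS string such as "0.92,0.92" into a list of floats."""
    return [float(value) for value in hds_field.split(",")]


def total_dosage(hds_field):
    """Sum the per-haplotype values into one ALT dosage.

    ASSUMPTION: each HDS value is the estimated ALT dosage of one haplotype.
    """
    return sum(parse_hds(hds_field))


def hard_call(hds_field, threshold=0.5):
    """Round each haplotype dosage to 0 (REF) or 1 (ALT).

    The 0.5 threshold is an illustrative choice, not a 23andme setting.
    """
    return [1 if value >= threshold else 0 for value in parse_hds(hds_field)]


for field in ("0,0", "0.92,0.92", "0.1,0.1"):
    print(field, total_dosage(field), hard_call(field))
```

Under this reading, "0,0" would mean that neither haplotype is estimated to carry the ALT allele at that row, rather than that the row was not imputed.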

4) 23andme further divided my X chromosome into three files. Two cover the pseudoautosomal regions (PAR), which contain heterozygous alleles (didn't know about this before, but cool to learn that this is a thing!). The third file (labeled "nonpar") I assume contains the rest of my X chromosome. In that file I have rows that look like this:

//2 variants at the same place?

chrX 3617243 . T C . PASS . HDS 0,0

chrX 3617243 . T TATAC . PASS . HDS 0,0

// 2 variants but as a result of imputation?

chrX 2801739 . C G . PASS . HDS 0.92,0.92

chrX 2801739 . C T . PASS . HDS 0,0

// another example of 2 alleles with some sort of imputed value?

chrX 2783144 . C T . PASS . HDS 0.94,0.94

chrX 2783144 . C A . PASS . HDS 0.1,0.1

// 4 variants???

chrX 3597495 . A AAC . PASS . HDS 0,0

chrX 3597495 . A AAAAAAAAC . PASS . HDS 0,0

chrX 3597495 . A AC . PASS . HDS 0,0

chrX 3597495 . A C . PASS . HDS 0,0

These rows seem to describe different alleles at the same locus, each with its own genotype for me. If this were on chromosomes 1-22, I would interpret it as having one allele from each parent (one for each copy of the chromosome). Since I only have one X chromosome (and I believe these variants are not in the PAR regions, since those would have been in a different file), this is more confusing to interpret.

I see this same pattern on other chromosomes like chromosome 1, where this interpretation makes sense:

chr1 54829 . G T . PASS . HDS 0,0

chr1 54829 . G A . PASS . HDS 0,0

I found a brief explanation of multi-allelic sites here (https://gatk.broadinstitute.org/hc/en-us/articles/360035890771-Biallelic-vs-Multiallelic-sites) but it talks about them in terms of "samples in a cohort", which I don't understand since all of these samples should just be from me (unless this means my sample was just really contaminated or something).
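To look at these rows systematically, here is a minimal sketch that groups records by site and lists which ALT alleles have a non-negligible HDS. It assumes that rows sharing a position are one multi-allelic site written as one record per ALT allele, which is a common convention but not something I have confirmed for 23andme's files. The helper names and the 0.5 threshold are illustrative.

```python
from collections import defaultdict

# Rows copied from the examples above (whitespace-separated VCF text).
rows = """\
chrX 2801739 . C G . PASS . HDS 0.92,0.92
chrX 2801739 . C T . PASS . HDS 0,0
chrX 3597495 . A AAC . PASS . HDS 0,0
chrX 3597495 . A AAAAAAAAC . PASS . HDS 0,0
chrX 3597495 . A AC . PASS . HDS 0,0
chrX 3597495 . A C . PASS . HDS 0,0
"""


def group_by_site(text):
    """Collect (ALT, HDS values) pairs under each (CHROM, POS, REF) key."""
    sites = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split()
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        hds = [float(value) for value in fields[9].split(",")]
        sites[(chrom, pos, ref)].append((alt, hds))
    return dict(sites)


def likely_alleles(alts, threshold=0.5):
    """Return the ALT alleles whose highest HDS value reaches the threshold."""
    return [alt for alt, hds in alts if max(hds) >= threshold]


sites = group_by_site(rows)
for (chrom, pos, ref), alts in sites.items():
    print(chrom, pos, ref, [alt for alt, _ in alts], likely_alleles(alts))
```

Under that assumption, the rows list every ALT allele known at the site (for example from a reference panel), and only the ones with a high HDS would be alleles I am estimated to carry.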

Have I just totally misunderstood what these rows represent?

Thank you for reading this far if you made it! I would appreciate answers to any/all of these questions!

23andme HDS bcf vcf alleles • 1.1k views