Comparing genotypes from NA12878 to those of her parents (NA12891 and NA12892)
9.1 years ago
sichan ▴ 90

Hello,

I'm interested in comparing the genotypes from Genome in a Bottle's NA12878 (GIAB) to those of her parents (NA12891 and NA12892).

I downloaded GIAB's NA12878 vcf from here.

After a lot of searching around, I found this page from Broad describing a vcf containing the genotypes for the trio.

And I downloaded the variants for NA12891 and NA12892 from here.

In total, there are ~3.3 million variants in the GIAB vcf. I compared the alternate alleles and genotypes in the GIAB vcf with the corresponding values in her parents and found that ~27% of the positions had parental genotypes that didn't make sense.

e.g. a position in the daughter is genotyped as 1/1, but the father is 0/1 and the mother is 0/0. That is, it's impossible for the daughter to be 1/1 if her parents are 0/1 and 0/0.
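To make the check concrete, here is a minimal Python sketch of the kind of per-site test described above. The function names are made up for illustration. It assumes diploid autosomal calls, and it assumes the allele indices mean the same thing in all three files (same POS, REF and ALT), which is not guaranteed when the files come from different pipelines.

```python
from itertools import product

def parse_gt(gt):
    """Split a VCF GT string such as '0/1' or '1|0' into a tuple of allele strings."""
    return tuple(gt.replace("|", "/").split("/"))

def is_mendelian_consistent(child, father, mother):
    """Return True if the child's genotype can be built from one paternal and
    one maternal allele, False if it cannot, and None if any call is missing."""
    c = sorted(parse_gt(child))
    f, m = parse_gt(father), parse_gt(mother)
    if "." in c or "." in f or "." in m:
        return None  # missing call: cannot decide
    # Try every (paternal allele, maternal allele) combination.
    return any(sorted(pair) == c for pair in product(f, m))

# The example from this post: child 1/1, father 0/1, mother 0/0.
print(is_mendelian_consistent("1/1", "0/1", "0/0"))
```

For the example above this reports a violation, because the mother has no alternate allele to transmit.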

I'm aware that the GIAB vcf has gone through a lot more curation than those of her parents, so perhaps that accounts for the discrepancy?

I'm pretty sure I'm using the correct files, but if anyone thinks otherwise, please let me know.

Thank you.

SNP next-gen genome • 8.8k views

You need to make sure to restrict your analyses to the high confidence regions provided in the NIST bed file.


According to the README from Genome in a Bottle, the VCF contains highly confident hetero- and homozygous variant calls, thus implying that those variants are in highly confident regions. Any position in the confident BED file but not the VCF can be confidently treated as homozygous reference.

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/release/NA12878_HG001/latest/README.GIAB.v0.2.txt

As a quick sanity check, I previously used bedtools to confirm that there were zero positions in the VCF that were not in the confident BED file.
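For anyone who wants to repeat that check without bedtools, a rough Python sketch is below. The helper names are hypothetical, and it assumes the BED intervals do not overlap (as in a merged confident-regions file). It only tests the variant's POS, not the full span of the REF allele.

```python
import bisect

def load_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)}.
    BED coordinates are 0-based and half-open."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions

def in_confident_region(regions, chrom, pos):
    """Return True if the 1-based VCF position falls inside a BED interval."""
    intervals = regions.get(chrom, [])
    p = pos - 1  # convert the 1-based VCF POS to a 0-based coordinate
    # Index of the last interval whose start is <= p.
    i = bisect.bisect_right(intervals, (p, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= p < intervals[i][1]

bed = ["chr1\t100\t200", "chr1\t300\t400"]
regions = load_bed(bed)
print(in_confident_region(regions, "chr1", 150))
```

Counting the VCF records for which this returns False should reproduce the "zero positions outside the BED" result, under the assumptions stated above.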

9.1 years ago
Len Trigg ★ 1.6k

Besides the better Mendelian consistency you should get by restricting to the high confidence regions, much of the discrepancy may be due to differences in variant representation between the NIST set and what is produced by GATK. As noted here:

IMPORTANT NOTE: Some differences between the integrated calls and your datasets are likely due to different representations of the same complex variants, so be careful about this. In our experience, for some datasets, over half of the putative false positive snps and indels can be due to different correct representations of complex variants. Running vcflib vcfallelicprimitives on your vcf should allow proper comparison of all homozygous complex variants, but not all heterozygous complex variants since our calls are currently unphased. Real Time Genomics has freely released their vcf comparison algorithm vcfeval, which can properly compare most unphased heterozygous complex variants. Currently for complex variants, our calls generally use the representation from Real Time Genomics caller.

(RTG Tools includes the vcfeval tool for comparing a call set vs baseline handling the representational difficulties, and also a separate tool for flagging mendelian violations as you have been doing, but AFAIK there isn't something that does both together)


Len Trigg has a great point about variant normalization. This is especially important for indels. Here is another way to normalize variants: http://genome.sph.umich.edu/wiki/Variant_Normalization
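As a rough illustration of what normalization does, here is a small Python sketch of left-aligning and trimming a biallelic variant, following the general trim-right, extend-left, trim-left idea. It is a toy under stated assumptions, not the code of any particular tool: the function name is made up, the reference is passed as a plain string, and multi-allelic records are not handled.

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim a biallelic variant.
    ref_seq is the chromosome sequence; pos is the 1-based VCF POS."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    changed = True
    while changed:
        changed = False
        # Drop a shared last base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

seq = "GGGCACACACAGGG"
# Two ways of writing the same CA deletion inside the CA repeat.
print(normalize(seq, 9, "ACA", "A"))
print(normalize(seq, 3, "GCACA", "GCA"))
```

Both calls end up with the same left-aligned record, which is the point: two callers can place the same deletion at different positions in a repeat, and a naive position-by-position comparison would count that as a disagreement.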


Normalization does help somewhat, but there are still plenty of problematic situations with complex variants where you need to go beyond that -- the endgame involves replaying the variants into the reference so that comparisons are carried out at the local haplotype level. See: http://www.slideshare.net/GenomeInABottle/140127-rtg-vcfeval-vcf-comparison-tool

AFAIK, only RTG vcfeval and possibly the Java version of SMaSH do this with any degree of sophistication.
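To show what "replaying the variants into the reference" means, here is a deliberately simplified Python sketch. It is not how vcfeval is implemented; it only applies a set of non-overlapping variants on a single haplotype to a reference string, so that two call sets can be compared by the sequence they produce. The function name and the toy sequences are assumptions for illustration.

```python
def apply_variants(ref_seq, variants):
    """Apply non-overlapping (pos, ref, alt) variants (1-based pos) to ref_seq
    and return the resulting haplotype sequence."""
    out, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants")
        if ref_seq[start:start + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        out.append(ref_seq[cursor:start])  # unchanged reference up to the variant
        out.append(alt)                    # substitute the alternate allele
        cursor = start + len(ref)
    out.append(ref_seq[cursor:])
    return "".join(out)

# One MNP versus two adjacent SNPs: different records, same haplotype.
as_mnp = apply_variants("ACGT", [(2, "CG", "TA")])
as_snps = apply_variants("ACGT", [(2, "C", "T"), (3, "G", "A")])
print(as_mnp == as_snps)
```

No amount of left-alignment turns the MNP record into the two SNP records, but the replayed sequences are identical. A real comparison also has to deal with unphased heterozygous calls, which this sketch ignores.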
