Mouse Training Set for VariantRecalibrator
12.1 years ago

I am calling SNPs from mouse whole genome sequences by using GATK.

Right now I'm stuck on Variant Quality Score Recalibration (VQSR) because I don't know what to use as a training set for mouse SNPs. Every example I have seen concerns human genome analyses, where people generally use both HapMap and Omni data.

Is there someone doing the same thing in mouse who might know a good training set for this species?

Thanks

mouse gatk snp • 4.0k views
12.1 years ago

The Sanger Institute's Mouse Genomes Project provides VCFs for 18 strains that you can use for variant quality score recalibration.

Look under Data Release:

http://www.sanger.ac.uk/resources/mouse/genomes/


To overcome that, I was thinking of repeating the SNP calling on all the strains I'm comparing. That way I would follow exactly the same steps and parameters for each dataset.


I've already tried those VCFs, but they appear to be in an older version of the format, VCF3, that GATK no longer supports. At least that is what the error message says. I also tried to convert them using vcftools, but they are very large and the conversion takes a long time.
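Before starting a long conversion, it can help to confirm which VCF version a file actually declares. The following is a minimal sketch, assuming the version is stated on the first header line (for example `##fileformat=VCFv3.3`); the function name `vcf_version` is hypothetical and not part of GATK or vcftools. It reads only the first line, so it is fast even on very large files.

```python
import gzip
import re


def vcf_version(path):
    """Return the version declared on the first header line of a VCF, or None.

    Handles plain-text and gzip/bgzip-compressed files. Only the first line
    is read, so file size does not matter.
    """
    path = str(path)
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        first_line = handle.readline().strip()
    # Accept both "##fileformat=VCFv4.1" and the older "##format=VCFv3.2" spelling.
    match = re.match(r"##(?:file)?format=VCFv(\d+(?:\.\d+)*)", first_line)
    return match.group(1) if match else None
```

Calling `vcf_version("strain.vcf.gz")` (a placeholder file name) returns a string such as `"3.3"` or `"4.1"`, which shows whether a conversion is needed at all.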


Welcome to the world of big data. Can I ask what your end goal is? Perhaps there is another way. I have been working on calling variants in mouse tumors. There is a fair amount of mouse sequence data in the Short Read Archive, but if you don't want to convert the VCF, you almost certainly won't want to deal with raw reads.


There's no problem converting this file. I just wanted to know if there was another way or other files. I just received the whole-exome sequence for one strain, and our goal is to call variants, especially SNPs, and compare them to other strains fully sequenced in Sanger's mouse project, as you mentioned. The file is actually being converted right now.


Good deal. One thing I would watch out for is pipeline discrepancies. I found that you can get a lot of false positives if you don't have control over the backgrounds (what you're comparing your exome to).


Still on the question of the training set: do you use the Sanger VCFs as truth sites or just for training?
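For context on the distinction being asked about: in VariantRecalibrator, each resource carries `known`, `training`, and `truth` flags plus a prior. Training sites are used to build the model, while truth sites are used to set the sensitivity tranche cutoffs. The sketch below only assembles a command line and does not run GATK. It assumes GATK 3-style syntax (`-T VariantRecalibrator`); the file names, the resource label `sanger_mgp`, the prior of 12.0, the annotation list, and both helper functions are illustrative assumptions, not recommendations from this thread.

```python
def resource_arg(name, vcf_path, known, training, truth, prior):
    """Build one GATK 3-style -resource argument pair (tag, file path)."""
    def flag(value):
        return "true" if value else "false"

    tag = (
        f"-resource:{name},known={flag(known)},"
        f"training={flag(training)},truth={flag(truth)},prior={prior}"
    )
    return [tag, vcf_path]


def build_vqsr_command(reference, input_vcf, resources, annotations,
                       jar="GenomeAnalysisTK.jar"):
    """Assemble (but do not execute) a VariantRecalibrator command for SNPs."""
    cmd = ["java", "-jar", jar, "-T", "VariantRecalibrator",
           "-R", reference, "-input", input_vcf]
    for res in resources:
        cmd += resource_arg(**res)
    for annotation in annotations:
        cmd += ["-an", annotation]
    cmd += ["-mode", "SNP",
            "-recalFile", "output.recal",
            "-tranchesFile", "output.tranches"]
    return cmd


# Hypothetical example: the converted Sanger VCF used as both training and truth.
command = build_vqsr_command(
    reference="mouse_reference.fasta",      # placeholder path
    input_vcf="my_strain.raw.vcf",          # placeholder path
    resources=[dict(name="sanger_mgp", vcf_path="sanger_strains.vcf",
                    known=False, training=True, truth=True, prior=12.0)],
    annotations=["QD", "FS", "MQRankSum", "ReadPosRankSum"],
)
print(" ".join(command))
```

Switching `truth=True` to `truth=False` in the resource dictionary is how the "training only" option would be expressed; which setting suits the Sanger calls depends on how much you trust them.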
