HaplotypeCaller output - All records are non-variant - Have I gone wrong somewhere?
1
0
Entering edit mode
5.9 years ago
BioinfGuru ★ 1.7k

Hello everyone,

Following the GATK germline short variant discovery instructions, I have written a script that runs the pipeline from fastq pre-processing through to creating a gvcf with HaplotypeCaller. On inspection I have noticed that my GVCF:

1) Does not have a ##reference entry in the header even though i did pass a reference in the command

--reference /NGS/musRefs_10/gatk/ref/ensembl92/Mus_musculus.GRCm38.dna.toplevel.fa

2) Every record except 1 contains "END=" for example:

9 36758659 . C NON_REF . . END=36758689 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,38

The 1 record does not contain "END=":

"9 36758714 rs45725377 T C,<non_ref> 22.58 . DB;DP=2;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=7200.00 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:49,6,0,49,6,49:0,0,1,1"

I have read What is a GVCF and how is it different from a 'regular' VCF? which states:

The second thing to look for is the END tag in the INFO field of non-variant block records.

Is it normal that every entry except for 1 is a non-variant block record? No errors are being returned. The only thing I can think of is that I ran the script on too small a test subset of the whole data. The test data is 2 fastq files of 10,000 lines.

I'm really not sure where I went wrong here. I appreciate any guidance.

Regards, Kenneth

SNP haplotypecaller NON_REF gatk • 2.4k views
ADD COMMENT
2
Entering edit mode
5.9 years ago

Hello YaGalbi,

10,000 lines in a fastq file means you have 2500 reads per file. That is a very small dataset. Depending on the library prep this means you have a lot of non covered regions or just a very small region of interest. For both variants it is not realy surprising that you just found one position with a variant.

fin swimmer

ADD COMMENT
0
Entering edit mode

Ok finswimmer...thanks for that. It is a whole exome dataset and I use gene coords as the intervals option.

I just created a larger subset of 1,000,000 lines and that should be done in a few hours so hopefully. Ill see a proportional increase.

EDIT: That didnt take as long as I thought - 30min (nice optimizing!). 6142 variants returned from 250000 reads. I think there is no problem.

ADD REPLY

Login before adding your answer.

Traffic: 2620 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6