GATK based variant calling yields excessive Indels
1
0
Entering edit mode
23 months ago
evoclive • 0

Dear all

I have a GATK based workflow for my .fq files:

  1. Trimming and adapter clipping by using Trimmomatic
  2. Read alignment against reference with Bowtie2 (all defaults)
  3. Conversion of .bam to .sam and sorting with Samtools
  4. Then I use GATK AddOrReplaceReadGroups:
java -jar picard.jar AddOrReplaceReadGroups \
I=$file \
O=$outfile \
SORT_ORDER=coordinate \
CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT \
RGID=unique_arbitrary_value \
RGLB= unique_arbitrary_value \
RGPL=ILLUMINA \
RGPU= unique_arbitrary_value \
RGSM= unique_arbitrary_value

Then GATK MarkDuplicates

java -jar $EBROOTPICARD/picard.jar MarkDuplicates \
I=$file \
O=$file.mkd_dup.bam \
M=$file.mkd_dup.txt

Then samtools index $file.mkd_dup.bam Then GATK HaplotypeCaller

gatk --java-options "-Xmx4g" HaplotypeCaller \
   -R REF.fna \
   -I $file.mkd_dup.bam \
   -O $file.g.vcf.gz \
   -ERC GVCF

What I have found was after filtering out loci with too much missing data etc my remaining combined gzipped vcf file had around only 600 SNP markers out of 20,000 remaining – with the rest being INDELS. This seems far too high a percentage of Indels – can anyone suggest improvements to my pipeline?

Incidentally, I merged my vcf.gz files with VCFtools: vcf-merge $gz | bgzip -c > mge.vcf.gz (gave me around 10million identified loci) Because merging with GATK: gatk --java-options "-Xmx4g" CombineGVCFs - created a weird file with ALT alleles called only as “<NON_REF>” (i.e. with no accompanying variant call e.g. “C,<NON_REF>”) and all calls for each sample as “./.”

indels calling GATK variant PICARD • 835 views
ADD COMMENT
1
Entering edit mode
23 months ago

unless I don't understand what you're doing, you shouldn't call your VCF in GVCF mode (-ERC GVCF) if you don't plan to combine many GVCFs file with CombineGVCFs followed by a call to GenotypeGVCFs.

ADD COMMENT
0
Entering edit mode

Hi, well I did initially try to use CombineGVCFs with:

gatk --java-options "-Xmx4g" CombineGVCFs \
   -R REF.fna \
   --variant gatk/AORRG_1.g.vcf.gz \
   --variant gatk/AORRG_10.g.vcf.gz \

but every locus call looks like: CM014959.1 527473 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:1:3:1:0,3,11 ./.:1:3:1:0,3,16 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0

Which looks like nonsense to me. Do I need to use GenotypeGVCFs after that? Thus, I used vcftools.

ADD REPLY
1
Entering edit mode

Which looks like nonsense to me.

This is "just" a statistical block. Read https://github.com/broadinstitute/gatk-docs/blob/master/gatk3-faqs/What_is_a_GVCF_and_how_is_it_different_from_a_%27regular%27_VCF%3F.md

Do I need to use GenotypeGVCFs after that?

yes, and no other tool.

ADD REPLY
0
Entering edit mode

Aaah! Very sorry - thanks for your time. Is there a one-document tutorial for this process outlining a full work-flow and what arguments to use under different circumstances (e.g. sequencing platforms)? Or is it just from the webpages?

ADD REPLY

Login before adding your answer.

Traffic: 1632 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6