Question

GATK based variant calling yields excessive Indels

0

Entering edit mode

23 months ago

evoclive • 0

Dear all

I have a GATK based workflow for my .fq files:

Trimming and adapter clipping by using Trimmomatic
Read alignment against reference with Bowtie2 (all defaults)
Conversion of .bam to .sam and sorting with Samtools
Then I use GATK AddOrReplaceReadGroups:

java -jar picard.jar AddOrReplaceReadGroups \
I=$file \
O=$outfile \
SORT_ORDER=coordinate \
CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT \
RGID=unique_arbitrary_value \
RGLB= unique_arbitrary_value \
RGPL=ILLUMINA \
RGPU= unique_arbitrary_value \
RGSM= unique_arbitrary_value

Then GATK MarkDuplicates

java -jar $EBROOTPICARD/picard.jar MarkDuplicates \
I=$file \
O=$file.mkd_dup.bam \
M=$file.mkd_dup.txt

Then samtools index $file.mkd_dup.bam Then GATK HaplotypeCaller

gatk --java-options "-Xmx4g" HaplotypeCaller \
   -R REF.fna \
   -I $file.mkd_dup.bam \
   -O $file.g.vcf.gz \
   -ERC GVCF

What I have found was after filtering out loci with too much missing data etc my remaining combined gzipped vcf file had around only 600 SNP markers out of 20,000 remaining – with the rest being INDELS. This seems far too high a percentage of Indels – can anyone suggest improvements to my pipeline?

Incidentally, I merged my vcf.gz files with VCFtools: vcf-merge $gz | bgzip -c > mge.vcf.gz (gave me around 10million identified loci) Because merging with GATK: gatk --java-options "-Xmx4g" CombineGVCFs - created a weird file with ALT alleles called only as “<NON_REF>” (i.e. with no accompanying variant call e.g. “C,<NON_REF>”) and all calls for each sample as “./.”

indels calling GATK variant PICARD • 835 views

ADD COMMENT • link updated 23 months ago by Pierre Lindenbaum 161k • written 23 months ago by evoclive • 0

score 1 · Answer 1 · 2022-06-15

1

Entering edit mode

23 months ago

Pierre Lindenbaum 161k

unless I don't understand what you're doing, you shouldn't call your VCF in GVCF mode (-ERC GVCF) if you don't plan to combine many GVCFs file with CombineGVCFs followed by a call to GenotypeGVCFs.

ADD COMMENT • link 23 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi, well I did initially try to use CombineGVCFs with:

gatk --java-options "-Xmx4g" CombineGVCFs \
   -R REF.fna \
   --variant gatk/AORRG_1.g.vcf.gz \
   --variant gatk/AORRG_10.g.vcf.gz \

but every locus call looks like: CM014959.1 527473 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:1:3:1:0,3,11 ./.:1:3:1:0,3,16 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0

Which looks like nonsense to me. Do I need to use GenotypeGVCFs after that? Thus, I used vcftools.

ADD REPLY • link 23 months ago by evoclive • 0

1

Entering edit mode

Which looks like nonsense to me.

This is "just" a statistical block. Read https://github.com/broadinstitute/gatk-docs/blob/master/gatk3-faqs/What_is_a_GVCF_and_how_is_it_different_from_a_%27regular%27_VCF%3F.md

Do I need to use GenotypeGVCFs after that?

yes, and no other tool.

ADD REPLY • link 23 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Aaah! Very sorry - thanks for your time. Is there a one-document tutorial for this process outlining a full work-flow and what arguments to use under different circumstances (e.g. sequencing platforms)? Or is it just from the webpages?

ADD REPLY • link 23 months ago by evoclive • 0

1

Entering edit mode

https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-

ADD REPLY • link 23 months ago by Pierre Lindenbaum 161k