Dear all
I have a GATK based workflow for my .fq files:
- Trimming and adapter clipping by using Trimmomatic
- Read alignment against reference with Bowtie2 (all defaults)
- Conversion of .bam to .sam and sorting with Samtools
- Then I use GATK AddOrReplaceReadGroups:
java -jar picard.jar AddOrReplaceReadGroups \ I=$file \ O=$outfile \ SORT_ORDER=coordinate \ CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT \ RGID=unique_arbitrary_value \ RGLB= unique_arbitrary_value \ RGPL=ILLUMINA \ RGPU= unique_arbitrary_value \ RGSM= unique_arbitrary_value
Then GATK MarkDuplicates
java -jar $EBROOTPICARD/picard.jar MarkDuplicates \ I=$file \ O=$file.mkd_dup.bam \ M=$file.mkd_dup.txt
Then samtools index $file.mkd_dup.bam Then GATK HaplotypeCaller
gatk --java-options "-Xmx4g" HaplotypeCaller \ -R REF.fna \ -I $file.mkd_dup.bam \ -O $file.g.vcf.gz \ -ERC GVCF
What I have found was after filtering out loci with too much missing data etc my remaining combined gzipped vcf file had around only 600 SNP markers out of 20,000 remaining – with the rest being INDELS. This seems far too high a percentage of Indels – can anyone suggest improvements to my pipeline?
Incidentally, I merged my vcf.gz files with VCFtools: vcf-merge $gz | bgzip -c > mge.vcf.gz (gave me around 10million identified loci) Because merging with GATK: gatk --java-options "-Xmx4g" CombineGVCFs - created a weird file with ALT alleles called only as “<NON_REF>” (i.e. with no accompanying variant call e.g. “C,<NON_REF>”) and all calls for each sample as “./.”
Hi, well I did initially try to use CombineGVCFs with:
but every locus call looks like:
CM014959.1 527473 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:1:3:1:0,3,11 ./.:1:3:1:0,3,16 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0 ./.:0:0:0:0,0,0
Which looks like nonsense to me. Do I need to use GenotypeGVCFs after that? Thus, I used vcftools.
This is "just" a statistical block. Read https://github.com/broadinstitute/gatk-docs/blob/master/gatk3-faqs/What_is_a_GVCF_and_how_is_it_different_from_a_%27regular%27_VCF%3F.md
yes, and no other tool.
Aaah! Very sorry - thanks for your time. Is there a one-document tutorial for this process outlining a full work-flow and what arguments to use under different circumstances (e.g. sequencing platforms)? Or is it just from the webpages?
https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-