Question

GATK HaolotypeCaller takes too much time for variant calling

1

Entering edit mode

6.4 years ago

J.F.Jiang ▴ 910

Hi all,

I am using GATK HaplotypeCaller to genotype ~3000 SNPs from amplicon sequencing data, using --genotyping_mode GENOTYPE_GIVEN_ALLELES --alleles $vcf --output_mode EMIT_ALL_SITES mode.

However, it take huge time to call these variants, ~20h.

And I found that adding -nct option did not work, and it even increase to 23h.

Is there any option to accelerate the calling process?

Thanks,
Junfeng

HaplotypeCaller GATK • 8.4k views

ADD COMMENT • link updated 11 days ago by Ram 43k • written 6.4 years ago by J.F.Jiang ▴ 910

score 1 · Answer 1 · 2018-05-30

1

Entering edit mode

5.9 years ago

BioinfGuru ★ 1.7k

GATK4 HaplotypeCaller no longer has the option to use -nt or -nct. HaplotypeCaller in Spark is in development so that the program can be parallelised, however this is only in beta and is not yet recommended.

To speed things up, I am running HaplotypeCaller with the -L option. I run the program once for each chromosome and then concatenate the results with bcftools (cat won't work because of the file headers)

Pseudocode:

@chr.list = (1..22, X, Y) # make a list of chromosomes
foreach $chr (@chr.list) # for each chromosome in the list
    {
         gatk HaplotypeCaller 
         --input example.bam 
         --output example.gvcf.gz 
         --reference human.fasta 
         --emit-ref-confidence GVCF 
         --dbsnp knownsnps.vcf 
         --native-pair-hmm-threads 32  # not -nt or -nct , default = 4 (1/3 extra runtime)
         --L $chr
    }
bcftools concat -o example.gvcf 1.gz 2.gz ... Y.gz # must be in order
rm *.gz *.tbi
gzip example.gvcf

ADD COMMENT • link 5.9 years ago by BioinfGuru ★ 1.7k

0

Entering edit mode

Thanks, YaGalbi! I have a related question --
I am trying to figure out how to do exatly what you are talking about, but my reference genome is a denovo assembly with scaffolds, not chromosomes yet. Will that work with just scaffold numbers? I have ±400 scaffolds in a 300Mb genome reference so ideally would do it in 10 steps of 30Mb.. Thanks for you answer!

ADD REPLY • link 5.5 years ago by alslonik ▴ 310

0

Entering edit mode

OK, it works fine for scaffolds as well.

My commands were:

list=$(cat scaffolds.list)

for i in $list

nohup gatk --java-options "-Xmx50g" HaplotypeCaller -R genome.fasta -I markedDup.bam -I sorted_MarkedDup.bam -L "$i" --genotyping-mode DISCOVERY -O "$i.vcf" &> $i.log &

done

vcf-concat *.vcf > all.vcf

ADD REPLY • link 5.5 years ago by alslonik ▴ 310

score 1 · Answer 2 · 2021-05-28

Hello. this is a few years late, but I hope this helps someone else out. I made an improved version of YaGalbi's code above so that it is much faster (9-fold faster). By adding & done; wait to the end of the @chr loop allows all processes to be run simultaneously. wait makes the machine stop before continuing onto the concatenation step. However, you must adjust your resource partitioning in --java-options "Xmx__g" and --native-pair-hmm-threads so that they do not exceed the machine limit for all chromosomes being computed. For example, I run on an HPC server and have 16gb RAM and 6 CPU for each of my 10 chromosomes. This means I allocated 172 gb RAM and 64 CPU (a little extra just in case). This only works if you can allocate these kinds of resources. Here is the updated code:

@chr.list = (1..22, X, Y) # make a list of chromosomes
foreach $chr (@chr.list) # for each chromosome in the list
    {
         gatk --java-options "-Xmx16g" HaplotypeCaller 
         --input example.bam 
         --output example.gvcf.gz 
         --reference human.fasta 
         --emit-ref-confidence GVCF 
         --dbsnp knownsnps.vcf 
         --native-pair-hmm-threads 6
         --L $chr
    } & done;
wait
bcftools concat -o example.gvcf 1.gz 2.gz ... Y.gz # must be in order
rm *.gz *.tbi
gzip example.gvcf

score 0 · Answer 3 · 2017-11-19

0

Entering edit mode

6.4 years ago

bluemonster0808 ▴ 40

use -nt instead of -nct

ADD COMMENT • link 6.4 years ago by bluemonster0808 ▴ 40

0

Entering edit mode

Futhermore， you can split the bam file into different chromosomes using "bamtools split -reference"

And then call GATK HaplotypeCaller simultaneously on all the chromosomes

ADD REPLY • link 6.4 years ago by bluemonster0808 ▴ 40

0

Entering edit mode

-nt can be only applied in UnifiedGenotyper.

Split bam against reference is a kind of simple methods for distributed computation, as bcbio-ngs did.

Any other methods?

ADD REPLY • link 6.4 years ago by J.F.Jiang ▴ 910

0

Entering edit mode

sorry , it's my mistake.

But I used -nct for HaolotypeCaller , and it runs faster.

My command is

java -jar GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37_decoy.fasta -I NA12878-Garvan-Vial1.sort.combine.nodup.base_recalibrated.REF_3.bam --genotyping_mode DISCOVERY -o REF_3.vcf

no -nct, it costs 3.4h

with -nct 16, it costs 0.4h

ADD REPLY • link 6.4 years ago by bluemonster0808 ▴ 40

0

Entering edit mode

Actually I also found such a strange issue, when -nct was applied for WGS or WES data, it significantly decrease the runing time, but not for data sequenced from amplicons.

ADD REPLY • link 6.4 years ago by J.F.Jiang ▴ 910