GATK HaolotypeCaller takes too much time for variant calling
3
1
Entering edit mode
6.4 years ago
J.F.Jiang ▴ 910

Hi all,

I am using GATK HaplotypeCaller to genotype ~3000 SNPs from amplicon sequencing data, using --genotyping_mode GENOTYPE_GIVEN_ALLELES --alleles $vcf --output_mode EMIT_ALL_SITES mode.

However, it take huge time to call these variants, ~20h.

And I found that adding -nct option did not work, and it even increase to 23h.

Is there any option to accelerate the calling process?

Thanks,
Junfeng

HaplotypeCaller GATK • 8.4k views
ADD COMMENT
1
Entering edit mode
5.9 years ago
BioinfGuru ★ 1.7k

GATK4 HaplotypeCaller no longer has the option to use -nt or -nct. HaplotypeCaller in Spark is in development so that the program can be parallelised, however this is only in beta and is not yet recommended.

To speed things up, I am running HaplotypeCaller with the -L option. I run the program once for each chromosome and then concatenate the results with bcftools (cat won't work because of the file headers)

Pseudocode:

@chr.list = (1..22, X, Y) # make a list of chromosomes
foreach $chr (@chr.list) # for each chromosome in the list
    {
         gatk HaplotypeCaller 
         --input example.bam 
         --output example.gvcf.gz 
         --reference human.fasta 
         --emit-ref-confidence GVCF 
         --dbsnp knownsnps.vcf 
         --native-pair-hmm-threads 32  # not -nt or -nct , default = 4 (1/3 extra runtime)
         --L $chr
    }
bcftools concat -o example.gvcf 1.gz 2.gz ... Y.gz # must be in order
rm *.gz *.tbi
gzip example.gvcf
ADD COMMENT
0
Entering edit mode

Thanks, YaGalbi! I have a related question --
I am trying to figure out how to do exatly what you are talking about, but my reference genome is a denovo assembly with scaffolds, not chromosomes yet. Will that work with just scaffold numbers? I have ±400 scaffolds in a 300Mb genome reference so ideally would do it in 10 steps of 30Mb.. Thanks for you answer!

ADD REPLY
0
Entering edit mode

OK, it works fine for scaffolds as well.

My commands were:

list=$(cat scaffolds.list)

for i in $list

nohup gatk --java-options "-Xmx50g" HaplotypeCaller -R genome.fasta -I markedDup.bam -I sorted_MarkedDup.bam -L "$i" --genotyping-mode DISCOVERY -O "$i.vcf" &> $i.log &

done

vcf-concat *.vcf > all.vcf

ADD REPLY
1
Entering edit mode
2.9 years ago
paulsme2 ▴ 10

Hello. this is a few years late, but I hope this helps someone else out. I made an improved version of YaGalbi's code above so that it is much faster (9-fold faster). By adding & done; wait to the end of the @chr loop allows all processes to be run simultaneously. wait makes the machine stop before continuing onto the concatenation step. However, you must adjust your resource partitioning in --java-options "Xmx__g" and --native-pair-hmm-threads so that they do not exceed the machine limit for all chromosomes being computed. For example, I run on an HPC server and have 16gb RAM and 6 CPU for each of my 10 chromosomes. This means I allocated 172 gb RAM and 64 CPU (a little extra just in case). This only works if you can allocate these kinds of resources. Here is the updated code:

@chr.list = (1..22, X, Y) # make a list of chromosomes
foreach $chr (@chr.list) # for each chromosome in the list
    {
         gatk --java-options "-Xmx16g" HaplotypeCaller 
         --input example.bam 
         --output example.gvcf.gz 
         --reference human.fasta 
         --emit-ref-confidence GVCF 
         --dbsnp knownsnps.vcf 
         --native-pair-hmm-threads 6
         --L $chr
    } & done;
wait
bcftools concat -o example.gvcf 1.gz 2.gz ... Y.gz # must be in order
rm *.gz *.tbi
gzip example.gvcf
ADD COMMENT
0
Entering edit mode

By adding & done; wait to the end of the @chr loop allows all processes to be run simultaneously. wait makes the machine stop before continuing onto the concatenation step.

please, use a workflow manager: make, nextflow, snakemake...

ADD REPLY
0
Entering edit mode
6.4 years ago

use -nt instead of -nct

ADD COMMENT
0
Entering edit mode

Futhermore, you can split the bam file into different chromosomes using "bamtools split -reference"

And then call GATK HaplotypeCaller simultaneously on all the chromosomes

ADD REPLY
0
Entering edit mode

-nt can be only applied in UnifiedGenotyper.

Split bam against reference is a kind of simple methods for distributed computation, as bcbio-ngs did.

Any other methods?

ADD REPLY
0
Entering edit mode

sorry , it's my mistake.

But I used -nct for HaolotypeCaller , and it runs faster.

My command is

java -jar GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37_decoy.fasta -I NA12878-Garvan-Vial1.sort.combine.nodup.base_recalibrated.REF_3.bam --genotyping_mode DISCOVERY -o REF_3.vcf

no -nct, it costs 3.4h

with -nct 16, it costs 0.4h

ADD REPLY
0
Entering edit mode

Actually I also found such a strange issue, when -nct was applied for WGS or WES data, it significantly decrease the runing time, but not for data sequenced from amplicons.

ADD REPLY

Login before adding your answer.

Traffic: 2412 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6