Question

Problem parsing interval list file to GenomicsDBImport in GATK4

2

Entering edit mode

6.4 years ago

Yatros ▴ 20

Hello,

I'm trying to combine 6 GVCF files into a single VCF file using GATK4 GenomicsDBImport + GenotypeGVCFs. I'm still using GATK4.beta5, since I'm not able to find the link to GATK4.beta6.

My target.interval file looks like this:

1:8022745-8023035
1:8025283-8025585
1:8029304-8029564

If I run the following command:

gatk-launch GenomicsDBImport \
-V sample_001.gvcf.gz \
-V sample_002.gvcf.gz \
-V sample_003.gvcf.gz \
-V sample_004.gvcf.gz \
-V sample_005.gvcf.gz \
-V sample_006.gvcf.gz \
--genomicsDBWorkspace testDB \
--L 1:8022745-8023035

GATK will work and generate a database with genotypes for this specific interval.

However, if I run the following script:

#!/bin/bash
while read in; do \
gatk-launch GenomicsDBImport \
-V sample_001.gvcf.gz \
-V sample_002.gvcf.gz \
-V sample_003.gvcf.gz \
-V sample_004.gvcf.gz \
-V sample_005.gvcf.gz \
-V sample_006.gvcf.gz \
--genomicsDBWorkspace testDB \
--L "$in"; done < target.interval

GATK runs, but I get the following error for every single interval and nothing is added to the database:

: Problem parsing start/end value in interval string. Value was: 8023035arse Genome Location string: 1:8022745-8023035

: Problem parsing start/end value in interval string. Value was: 8025585arse Genome Location string: 1:8025283-8025585

I understand that the software is recognizing each line of my file correctly but, for some reason, it cannot parse the start and end position of each interval.

I also tried to use a regular bed file as input file with the following format:

1 8022745 8023035
1 8025283 8025585
1 8029304 8029564

In that case the error was the following one:

A USER ERROR has occurred: Badly formed genome unclippedLoc: Contig '1  8022745 8023035' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

Can anybody help me with this issue?

Thank you very much,

Here you have the entire error message for the first two intervals:

09:50:28.057 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/opt/gatk-4.beta.5/gatk-package-4.beta.5-local.jar!/com/inte                                                                     l/gkl/native/libgkl_compression.so
[December 15, 2017 9:50:27 AM PST] GenomicsDBImport  --genomicsDBWorkspace SAmerican_panel --variant 17-50618.gatk.gvcf.gz --variant 17-50619.gatk.gvcf.gz --varian                                                                     t 17-50620.gatk.gvcf.gz --variant 17-50621.gatk.gvcf.gz --variant 17-50622.gatk.gvcf.gz --variant 17-50623.gatk.gvcf.gz --variant 17-50624.gatk.gvcf.gz --variant 1                                                                     7-50625.gatk.gvcf.gz --variant 17-50626.gatk.gvcf.gz --variant 17-50627.gatk.gvcf.gz --variant 17-50628.gatk.gvcf.gz --variant 17-50629.gatk.gvcf.gz --variant 17-5                                                                     0630.gatk.gvcf.gz --variant 17-50631.gatk.gvcf.gz --variant 17-50632.gatk.gvcf.gz --variant 17-50633.gatk.gvcf.gz --variant 17-50634.gatk.gvcf.gz --variant 17-5063                                                                       --genomicsDBSegmentSize 1048576 --genomicsDBVCFBufferSize 16384 --overwriteExistingGenomicsDBWorkspace false --batchSize 0 --consolidate false --validateSampleNa                                                                     meMap false --readerThreads 1 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency                                                                      SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVarian                                                                     tIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 0 --cloudIndexPref                                                                     etchBuffer 0 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_infla                                                                     ter false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[December 15, 2017 9:50:27 AM PST] Executing as user@server on Linux 3.10.0-693.2.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_151-b12; Version: 4.beta                                                                     .5
09:50:28.443 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 5
09:50:28.443 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:50:28.443 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
09:50:28.444 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:50:28.444 INFO  GenomicsDBImport - Deflater: IntelDeflater
09:50:28.444 INFO  GenomicsDBImport - Inflater: IntelInflater
09:50:28.444 INFO  GenomicsDBImport - GCS max retries/reopens: 20
09:50:28.444 INFO  GenomicsDBImport - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree                                                                     /dr_all_nio_fixes
09:50:28.444 INFO  GenomicsDBImport - Initializing engine
09:50:38.228 INFO  GenomicsDBImport - Shutting down engine
[December 15, 2017 9:50:38 AM PST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.17 minutes.
Runtime.totalMemory()=1368915968
***********************************************************************

: Problem parsing start/end value in interval string. Value was: 8023035arse Genome Location string: 1:8022745-8023035

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--javaOptions '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
09:50:47.419 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/opt/gatk-4.beta.5/gatk-package-4.beta.5-local.jar!/com/inte                                                                     l/gkl/native/libgkl_compression.so
[December 15, 2017 9:50:47 AM PST] GenomicsDBImport  --genomicsDBWorkspace SAmerican_panel --variant 17-50618.gatk.gvcf.gz --variant 17-50619.gatk.gvcf.gz --varian                                                                     t 17-50620.gatk.gvcf.gz --variant 17-50621.gatk.gvcf.gz --variant 17-50622.gatk.gvcf.gz --variant 17-50623.gatk.gvcf.gz --variant 17-50624.gatk.gvcf.gz --variant 1                                                                     7-50625.gatk.gvcf.gz --variant 17-50626.gatk.gvcf.gz --variant 17-50627.gatk.gvcf.gz --variant 17-50628.gatk.gvcf.gz --variant 17-50629.gatk.gvcf.gz --variant 17-5                                                                     0630.gatk.gvcf.gz --variant 17-50631.gatk.gvcf.gz --variant 17-50632.gatk.gvcf.gz --variant 17-50633.gatk.gvcf.gz --variant 17-50634.gatk.gvcf.gz --variant 17-5063                                                                       --genomicsDBSegmentSize 1048576 --genomicsDBVCFBufferSize 16384 --overwriteExistingGenomicsDBWorkspace false --batchSize 0 --consolidate false --validateSampleNa                                                                     meMap false --readerThreads 1 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency                                                                      SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVarian                                                                     tIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 0 --cloudIndexPref                                                                     etchBuffer 0 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_infla                                                                     ter false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[December 15, 2017 9:50:47 AM PST] Executing as user@server on Linux 3.10.0-693.2.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_151-b12; Version: 4.beta                                                                     .5
09:50:47.976 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 5
09:50:47.976 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:50:47.976 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
09:50:47.976 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:50:47.976 INFO  GenomicsDBImport - Deflater: IntelDeflater
09:50:47.976 INFO  GenomicsDBImport - Inflater: IntelInflater
09:50:47.976 INFO  GenomicsDBImport - GCS max retries/reopens: 20
09:50:47.976 INFO  GenomicsDBImport - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree                                                                     /dr_all_nio_fixes
09:50:47.976 INFO  GenomicsDBImport - Initializing engine
09:50:53.175 INFO  GenomicsDBImport - Shutting down engine
[December 15, 2017 9:50:53 AM PST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.10 minutes.
Runtime.totalMemory()=1389887488
***********************************************************************

: Problem parsing start/end value in interval string. Value was: 8025585arse Genome Location string: 1:8025283-8025585

***********************************************************************

gatk interval parsing GVCF • 6.6k views

ADD COMMENT • link updated 6.0 years ago by prnsml • 0 • written 6.4 years ago by Yatros ▴ 20

0

Entering edit mode

Hello,

I have tried to handle this issue with a WDL + CROMWELL script, but I am still facing a problem. I will detail my input files and my script to see if you can help me solve the problem.

I am running the following command:

java -Dconfig.file=application.conf -jar $CROMWELL run my.wdl --inputs my.json

My application.conf looks like this:

backend {
  providers {
    Local {
      config {
          concurrent-job-limit = 25
        filesystems {
          local {
            localization: [
              "soft-link", "hard-link", "copy"
            ]
            caching {
              duplication-strategy: [
                "soft-link"
              ]
            }
          }
        }
      }
    }
  }
}

my.json file looks like this:

{
"Merge_gvcfs.human_g1k_v37_fa":"/mnt/ref/human_g1k_v37.fasta",
 "Merge_gvcfs.human_g1k_v37_fai":"/mnt/ref/human_g1k_v37.fasta.fai",
 "Merge_gvcfs.human_g1k_v37_dict":"/mnt/ref/human_g1k_v37.dict",
 "Merge_gvcfs.wdir":"/mnt/mywdir/",
 "Merge_gvcfs.target_interval":"/mnt/mywdir/Test.interval"
}

my.wdl script is the following one:

    workflow Merge_gvcfs {

    File human_g1k_v37_decoy_fa
    File human_g1k_v37_decoy_fai
    File human_g1k_v37_decoy_dict
    String wdir
    File target_interval
    Array[String] intervals = read_lines(target_interval)
    Array[File] in_files = merge_gvcfs.out


    # list all gvcf.gz files in wdir
    call get_file_names {
        input:
            wdir = wdir
    }


    # Run GenomicsDBImport for each interval in the intervals file
    scatter(inputs_row in intervals) {
        String interval = inputs_row
        call merge_gvcfs {
                input:
                    wdir = wdir,
                    interval = interval,
                    file_names = get_file_names.file_names
        }
    }   

    call generate_vcf {
        input:
            human_g1k_v37_decoy_fa = human_g1k_v37_decoy_fa,
            in_files = in_files
    }

}


    task get_file_names {
    String wdir
    command <<<
        readlink -f ${wdir}*.gvcf.gz>file_names.txt
        >>> 
        output {
            File file_names = "file_names.txt"
        }
    }   

    task merge_gvcfs{
    String interval
    String wdir
    File file_names
    Array[String] samples = read_lines(file_names)
            command {
            java -jar $GATK GenomicsDBImport \
            -V ${sep= ' -V ' samples} \
            --genomicsDBWorkspace Test \
            --intervals ${interval}
        }
        output {
        File out = "Test"
        }
        }

    task generate_vcf {
        File human_g1k_v37_decoy_fa
        Array[File] in_files
        command {
            java -jar $GATK GenotypeGVCFs \
            -R ${human_g1k_v37_decoy_fa} \
            -V gendb://${in_files} \
            -G StandardAnnotation -newQual \
            -O Test.vcf 
        }
    }

The error that I am getting with this script is the following one:

Error attempting to Execute java.lang.UnsupportedOperationException: Expression 'WdlExpression(<string:79:20 identifier "aW5fZmlsZXM=">)' evaluated to an Array but no 'sep' was specified

If I specified the 'sep' option I would have several input values for the GenotypeGVCFs -V option and I would get the following error:

A USER ERROR has occurred: Argument '[V, variant]' cannot be specified more than once.

My main problem is that the output of the merge_gvcfs task is an array of files and GenotypeGVCFs -V option allows only a single input value. Even if I flattened the merge_gvcfs array into a string, I would still have several input values for the generate_vcf task.

Is there any way to combine into a single folder the multiple output folders from the scatter function that loops through the intervals file list? This way I would be able to provide a single input value for the GenotypeGVCFs command.

Thank you very much for your help,

Best,

Yatros

ADD REPLY • link updated 6.3 years ago by Ram 43k • written 6.3 years ago by Yatros ▴ 20

0

Entering edit mode

I also tried to use a regular bed file as input file with the following format:

1 8022745 8023035

what is the output of

grep -w -m1 8022745 your.bed | tr "\t" "#"

?

ADD REPLY • link 6.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

The output from this grep is the following one: 1#8022745#8023035

I do not think that GenomicsDBImport is recognizing the input file in the right way. And I think that the input file should have the following format 1:8022745-8023035.

When I do a script using with WDL+CROMWELL with the input file using the format 1:8022745-8023035, the script runs without any problem, but it returns a different folder for every single interval and it is tough to merge them afterwards with GenotypeGVCFs, since this option from GATK allows only one input folder at a time, not several ones.

Thanks

ADD REPLY • link 6.3 years ago by Yatros ▴ 20

0

Entering edit mode

GATK sequence dictionary derived from the reference

don't you use a chromosome notation like "1" while the reference is using chr1 ?

ADD REPLY • link 6.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Dear Pierre-Lindenbaum. I already checked this. My reference file has no 'chr' notation. This is how the first line of my .dict file looks like:

@SQ SN:1    LN:249250621

Thanks anyway

ADD REPLY • link 6.3 years ago by Yatros ▴ 20

0

Entering edit mode

By the way, I already switched to GATK4.beta.6 and the problem with this script persists,

Thanks,

ADD REPLY • link 6.3 years ago by Yatros ▴ 20

2

Entering edit mode

Did you try getting in touch with the GATK team? They're super helpful in general, plus they're actively seeking feedback on GATK4.

ADD REPLY • link 6.3 years ago by Ram 43k

score 0 · Answer 1 · 2017-12-19

Hello,

I know that this solution is not elegant, nor solves the issue of adding the GenomicsDBImport command to a WDL pipeline, but for now, this is what works best for me:

#!/bin/bash
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8022745-8023035
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8025283-8025585
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8029304-8029564
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8030853-8031123
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8037611-8037898
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:8044853-8045214
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:11073684-11074123
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:11076800-11077164
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:11078689-11079031
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:11080385-11080757
java -jar $GATK GenomicsDBImport -V sample_001.gvcf.gz -V sample_002.gvcf.gz -V sample_003.gvcf.gz -V sample_004.gvcf.gz -V sample_005.gvcf.gz -V sample_006.gvcf.gz --genomicsDBWorkspace SAmerican_panel --L 1:11082080-11083405

[...]

Ram · Answer 2 · 2018-04-17

0

Entering edit mode

6.0 years ago

prnsml • 0

This command works for me

#!/bin/sh
while read i; do java -Xmx6g -jar $GATK GenomicsDBImport --sample-name-map samples.txt --genomicsdb-workspace-path DB_$i -L $i; done < contigs.list

where samples.txt contains <sampleID>tab<sampleID.g.vcf.gz>

and contigs.list includes one contig name per line:

Contig0
Contig3
...

ADD COMMENT • link updated 6.0 years ago by Ram 43k • written 6.0 years ago by prnsml • 0