How do I specify the elements of an array of output files?
5.9 years ago
biokcb ▴ 170

Hi -

I would like to have a workflow with the following set up:

Step 1: creates an array of N files with specific naming conventions

Step 2: scatters over the output of step 1

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow

requirements:
 - class: ScatterFeatureRequirement

inputs:
  input_file: File

steps:
  step1:
    run: step1.cwl
    in:
      input_file: input_file
    # step1.cwl (below) names its output "junctions", so the workflow must use that name
    out: [junctions]
  step2:
    run: step2.cwl
    scatter: input_file
    in:
      input_file: step1/junctions
    out: [output_files]

outputs: 
  final_out: 
    type: File[]
    outputSource: step2/output_files

Step 1 looks something like this. The command is just a shell script that splits the input into 4 independent files, each with a specific naming convention:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

baseCommand: split_file.sh

inputs:
  input_file:
    type: File
    inputBinding:
      position: 1

outputs:
  junctions:
    type: File[]
    outputBinding:
      glob:
       - $(inputs.input_file.basename).a.tmp
       - $(inputs.input_file.basename).b.txt
       - $(inputs.input_file.basename).c.fastq
       - $(inputs.input_file.basename).d.fasta

I know that I could just use glob: "*" to gather all these outputs, but I want to check specifically that each output exists before moving on to Step 2. When I tried the above, Step 1 returned an empty array as its output, even though the script did produce each file in the temp directory. If I use secondaryFiles instead, the workflow doesn't scatter across them.

Is it currently possible to achieve something like this with CWL, and if so, what would be the best way? Note that I cannot use ExpressionTool, as the runner we are using doesn't support it yet.

Thanks!

cwl • 1.4k views
5.9 years ago

Hello @biokcb,

I think you have the right approach. Maybe your split_file.sh script isn't writing its outputs to the correct location? glob matches its patterns relative to the job's output directory, which is also the working directory the tool runs in. If you need to pass a path to your script to tell it where to put its outputs, you can use $(runtime.outdir) (no InlineJavascriptRequirement needed).
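
If the script writes its results somewhere else, one option is to hand it the output directory explicitly. Here is a minimal sketch, assuming a hypothetical version of split_file.sh that accepts the target directory as a second positional argument (the real script's interface isn't shown in this thread, so adjust the argument to whatever it actually expects):

```yaml
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

baseCommand: split_file.sh

inputs:
  input_file:
    type: File
    inputBinding:
      position: 1

arguments:
  # Assumption: split_file.sh takes the directory to write into as its
  # second positional argument. $(runtime.outdir) is the job's output
  # directory, which is where glob looks for files.
  - valueFrom: $(runtime.outdir)
    position: 2

outputs:
  junctions:
    type: File[]
    outputBinding:
      # Patterns are relative to the output directory, so bare file names suffice.
      glob:
        - $(inputs.input_file.basename).a.tmp
        - $(inputs.input_file.basename).b.txt
        - $(inputs.input_file.basename).c.fastq
        - $(inputs.input_file.basename).d.fasta
```

With this layout the glob patterns stay as plain file names; there should be no need to prefix them with a relative or absolute directory.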

The following test script works for me:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

inputs:
  input_file:
    type: File

baseCommand: touch

arguments:
  - $(inputs.input_file.basename).a.tmp
  - $(inputs.input_file.basename).b.txt
  - $(inputs.input_file.basename).c.fastq
  - $(inputs.input_file.basename).d.fasta

outputs:
  junctions:
    type: File[]
    outputBinding:
      glob:
       - $(inputs.input_file.basename).a.tmp
       - $(inputs.input_file.basename).b.txt
       - $(inputs.input_file.basename).c.fastq
       - $(inputs.input_file.basename).d.fasta

Example output

/home/michael/cwltool/env3/bin/cwltool 1.0.20180605140423
Resolved '../biostars_319448.cwl' to 'file:///home/michael/cwltool/biostars_319448.cwl'
[job biostars_319448.cwl] /tmp/tmpkegyt21b$ touch \
    README.rst.a.tmp \
    README.rst.b.txt \
    README.rst.c.fastq \
    README.rst.d.fasta
[job biostars_319448.cwl] completed success
{
    "junctions": [
        {
            "class": "File",
            "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
            "path": "/home/michael/cwltool/t/README.rst.a.tmp",
            "basename": "README.rst.a.tmp",
            "location": "file:///home/michael/cwltool/t/README.rst.a.tmp",
            "size": 0
        },
        {
            "class": "File",
            "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
            "path": "/home/michael/cwltool/t/README.rst.b.txt",
            "basename": "README.rst.b.txt",
            "location": "file:///home/michael/cwltool/t/README.rst.b.txt",
            "size": 0
        },
        {
            "class": "File",
            "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
            "path": "/home/michael/cwltool/t/README.rst.c.fastq",
            "basename": "README.rst.c.fastq",
            "location": "file:///home/michael/cwltool/t/README.rst.c.fastq",
            "size": 0
        },
        {
            "class": "File",
            "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
            "path": "/home/michael/cwltool/t/README.rst.d.fasta",
            "basename": "README.rst.d.fasta",
            "location": "file:///home/michael/cwltool/t/README.rst.d.fasta",
            "size": 0
        }
    ]
}
Final process status is success
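
On the goal of checking that each file exists before Step 2: a File[] output with a list of glob patterns simply collects whatever matches, so a missing file only shortens the array. One possible alternative, sketched below without ExpressionTool, is to declare each file as its own required File output and merge them back into an array in the workflow. As far as I understand the v1.0 spec, a required File output whose glob matches nothing makes the tool fail, but that is worth verifying on your runner. The output names part_a to part_d are made up for illustration.

```yaml
# step1.cwl: one required File output per expected file
cwlVersion: v1.0
class: CommandLineTool
baseCommand: split_file.sh
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs:
  part_a:
    type: File
    outputBinding:
      glob: $(inputs.input_file.basename).a.tmp
  part_b:
    type: File
    outputBinding:
      glob: $(inputs.input_file.basename).b.txt
  part_c:
    type: File
    outputBinding:
      glob: $(inputs.input_file.basename).c.fastq
  part_d:
    type: File
    outputBinding:
      glob: $(inputs.input_file.basename).d.fasta
```

```yaml
# Workflow: merge the four single-file outputs into one array and scatter over it
cwlVersion: v1.0
class: Workflow
requirements:
  - class: ScatterFeatureRequirement
  - class: MultipleInputFeatureRequirement
inputs:
  input_file: File
steps:
  step1:
    run: step1.cwl
    in:
      input_file: input_file
    out: [part_a, part_b, part_c, part_d]
  step2:
    run: step2.cwl
    scatter: input_file
    in:
      input_file:
        # merge_nested builds an array with one entry per listed source
        source: [step1/part_a, step1/part_b, step1/part_c, step1/part_d]
        linkMerge: merge_nested
    out: [output_files]
outputs:
  final_out:
    type: File[]
    outputSource: step2/output_files
```

The trade-off is verbosity: every expected file needs its own output entry, but a missing file should then stop the workflow at Step 1 instead of silently shrinking the scatter.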

Thanks for confirming that the array output is set up correctly, that helps! You were also right about the script: it was writing the output files into another directory, so they were created but not collected. I originally pointed glob at the files using a relative directory path or an absolute path, and neither worked. As it turns out, I just didn't fully understand where to use things like $(runtime.outdir) or how to handle directories in my script. Thanks so much for the response! I just started learning CWL and this really helped clarify things!
