Question

How to process inputs based on a filename pattern using CWL

6.8 years ago

thomas.e ▴ 110

This must be answered somewhere but I can't find it. How do I process all files matching a pattern, e.g.


/path/to/rawdata/*.dat

I've seen examples where multiple inputs are explicitly listed, but none where a pattern is specified.

CWL • 2.3k views


score 2 · Answer 1 · 2017-08-21

For others looking at this thread, this is how I solved the problem. There may be better ways.

Basically, I ran my single-file workflow as a sub-workflow in a step that scatters across the input files. I did not attempt to get CWL to scan directories for input files, so I will build the input file in a separate step.

So, the input looks like:


reads1:
  - class: File
    format: "edam:format_1930"
    location: "../data/S1_R1.fastq.gz"
  - class: File
    format: "edam:format_1930"
    location: "../data/S2_R1.fastq.gz"
reads2:
  - class: File
    format: "edam:format_1930"
    path: "../data/S1_R2.fastq.gz"
  - class: File
    format: "edam:format_1930"
    path: "../data/S2_R2.fastq.gz"


$namespaces: { edam: http://edamontology.org/ }
$schemas: [ http://edamontology.org/EDAM_1.16.owl ]

I then have a scatter workflow:


class: Workflow
cwlVersion: v1.0

requirements:
- class: InlineJavascriptRequirement
- class: ScatterFeatureRequirement
- class: StepInputExpressionRequirement
- class: SubworkflowFeatureRequirement

inputs:
  reads1: File[]
  reads2: File[]
  # I found that with Toil some constant input files also need to be reproduced here
  adapters:
    type: File
    default:
      class: File
      path: /path/to/adapters/TruSeq3-PE.fa
      location: /path/to/adapters/TruSeq3-PE.fa

outputs:
  [all the workflow outputs here]

steps:

  all:
    run: single-file-pl.cwl

    scatter: [read1, read2]
    scatterMethod: dotproduct

    in:
      read1:
        source: reads1
      read2:
        source: reads2
      adapters:
        source: adapters


    out: [all workflow output here]
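
One detail worth spelling out: every output of a scattered step becomes an array, so the workflow-level outputs have to be declared as arrays too. A rough sketch of what the two placeholders above might look like, assuming single-file-pl.cwl exposes a single File output called trimmed_reads (a hypothetical name, not taken from my actual workflow):

```yaml
# Fragment only: `trimmed_reads` is a hypothetical output name used for
# illustration; substitute the real outputs of single-file-pl.cwl.
outputs:
  trimmed_reads:
    type: File[]                      # one File per scattered invocation
    outputSource: all/trimmed_reads

steps:
  all:
    run: single-file-pl.cwl
    scatter: [read1, read2]
    scatterMethod: dotproduct         # pairs reads1[i] with reads2[i]
    in:
      read1: reads1
      read2: reads2
      adapters: adapters
    out: [trimmed_reads]
```

With dotproduct, reads1 and reads2 must have the same length, and the i-th element of the output array corresponds to the i-th pair of inputs.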

Hopefully, this helps someone.

score 1 · Answer 2 · 2017-06-29

Hello thomas.e,

For CWL implementations that consume YAML/JSON input objects you'll need a separate File entry for each file.

Here's an example input object, assuming an input named raw_data of type File[] (equivalent to type: array, items: File):

raw_data:
  - class: File
    path: /path/to/rawdata/000.dat
  - class: File
    path: /path/to/rawdata/001.dat
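
For completeness, here is a minimal sketch of a tool description that could consume such an input object. The choice of cat as the command and the output name combined are only stand-ins for illustration; the part that matters is the raw_data declaration.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: cat          # stand-in command, purely for illustration

inputs:
  raw_data:
    type:
      type: array         # long form of File[]
      items: File
    inputBinding:
      position: 1         # each file is appended as a separate argument

stdout: combined.dat      # hypothetical output file name

outputs:
  combined:
    type: stdout
```

Given the input object above, this would run roughly as cat 000.dat 001.dat, with each path replaced by wherever the runner has staged the file.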

I've made an issue to add a convenience feature to the reference implementation to make this easier: https://github.com/common-workflow-language/cwltool/issues/448

score 1 · Answer 3 · 2017-06-29

I see two scenarios here:

1. This is how you would invoke your tool from the shell, and the shell would expand the glob to a list of files.
2. The tool itself accepts a glob as an input.

In case 1, as I mentioned, the tool never receives *.dat as an argument. It actually gets the resolved glob: a list of file paths. The way to handle this in CWL is with a mini workflow: first you have a tool that receives a list of files as input and creates semantically meaningful outputs, say dat_files, meta_files, other_files. You do that by specifying globs on the outputs of that tool. Next you connect dat_files to your tool, which then receives all the file paths on its command line, precisely as if it had been invoked with a glob.
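
A minimal sketch of such a sorting tool, under a few assumptions: the input name all_files and the *.meta pattern are made up for illustration, and other_files is left out because "everything else" is awkward to express as a plain glob. The files are staged into the working directory so that the output globs can find them.

```yaml
cwlVersion: v1.0
class: CommandLineTool

requirements:
  - class: InitialWorkDirRequirement
    listing: $(inputs.all_files)   # stage every input into the working directory

baseCommand: "true"                # no real work; the output globs do the sorting

inputs:
  all_files: File[]                # hypothetical input name

outputs:
  dat_files:
    type: File[]
    outputBinding:
      glob: "*.dat"
  meta_files:
    type: File[]
    outputBinding:
      glob: "*.meta"               # hypothetical pattern
```

In the surrounding workflow, dat_files would then be wired to the File[] input of the real tool. Note that "true" is quoted so that YAML treats it as a command name rather than a boolean.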

In the other case, the best course of action would be to stage the input files to the working directory and pass the glob as a string to the tool.
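
A rough sketch of that second approach, assuming a hypothetical tool called process_dat that expands the pattern itself. The names data_files and pattern are likewise illustrative only.

```yaml
cwlVersion: v1.0
class: CommandLineTool

requirements:
  - class: InitialWorkDirRequirement
    listing: $(inputs.data_files)  # stage the files next to where the tool runs

baseCommand: process_dat           # hypothetical tool that accepts a glob

inputs:
  data_files: File[]               # staged only; no inputBinding, so not on the command line
  pattern:
    type: string
    default: "*.dat"
    inputBinding:
      position: 1                  # passed as a literal string, not shell-expanded

outputs: []
```

Because there is no ShellCommandRequirement, the runner does not go through a shell, so the tool receives the text *.dat unexpanded and can resolve it against the staged files in its working directory.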