How to handle metadata in bioinformatics pipelines?
7.9 years ago
James Ashmore ★ 3.4k

After reading a review of bioinformatics pipeline frameworks, I've started testing a few to see if I could create a simple ChIP-seq peak-calling pipeline. However, I always seem to run into the same problem: what format should I put my metadata in, and how can I specify sample comparisons (for example, ChIP-factor against naked DNA) so that the pipeline is flexible to different numbers of samples and comparisons?

With Snakemake you can provide a .yaml or .json file to describe the experiment and the comparisons, but in a large study doing this by hand is quite tedious (I wonder if there is a nice way to parse a .tsv sample design file into a .yaml or .json configuration?). Alternatively, if I were to use GNU Make, all of the comparisons would either have to be specified manually in the Makefile, or a single rule would call a script which runs the comparisons by parsing a sample design file. In the latter case I can't take advantage of Make's parallel-jobs feature and would have to rely on the parallel-processing modules available in the scripting language.

I feel that whatever pipeline I wrote with these frameworks would continuously have to be changed or updated to accommodate new numbers of samples or comparisons, when really all I want is to provide a new sample design file and have the pipeline adapt to the new design without too much extra tweaking. Maybe I'm expecting too much, but is there a particular representation of metadata you've found works well, or could you recommend a different framework which handles metadata and comparisons directly?

metadata pipeline snakemake make • 3.8k views
ADD COMMENT

In the case of Snakemake, you can get pretty far by parsing a TSV file and using conditionals in the input/output/params sections. I've yet to see a perfect solution for metadata handling, though.
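In case it helps, here is a minimal sketch of that approach in plain Python (the sample sheet and its column names are hypothetical); the same lookup could be called from a Snakemake input function to pair each ChIP sample with its control:

```python
import csv
import io

# Hypothetical TSV sample sheet: one row per ChIP sample, with the
# matching control (input/naked DNA) named in a "control" column.
SAMPLE_SHEET = """\
sample\tcondition\tcontrol
chip_factor_rep1\ttreated\tinput_rep1
chip_factor_rep2\ttreated\tinput_rep2
"""

def load_samples(handle):
    """Parse a TSV design file into a {sample_name: row} lookup."""
    return {row["sample"]: row
            for row in csv.DictReader(handle, delimiter="\t")}

samples = load_samples(io.StringIO(SAMPLE_SHEET))

def control_for(sample):
    """Return the control paired with a ChIP sample; in a Snakefile
    this would typically be called from a rule's input function."""
    return samples[sample]["control"]
```

Adding a comparison then means adding a row to the TSV, not editing the workflow.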

ADD REPLY

After a bit more testing, I agree that parsing a TSV file is probably the best way to handle metadata until a framework with integrated support comes along. For now I'll use a script I wrote to convert the sample sheet into YAML format, which Snakemake understands.
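For anyone looking for a starting point, a sketch of such a converter might look like the following. It emits JSON (which Snakemake also accepts as a configfile) to stay within the standard library; the design-file columns are hypothetical:

```python
import csv
import io
import json

# Hypothetical sample design file (TSV), one row per sample.
DESIGN = """\
sample\tgroup\tcontrol
chip1\tfactorX\tinput1
chip2\tfactorY\tinput2
"""

def design_to_config(handle):
    """Convert a TSV sample sheet into a Snakemake-style config dict."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    return {
        "samples": {r["sample"]: r for r in rows},
        "comparisons": [[r["sample"], r["control"]] for r in rows],
    }

config = design_to_config(io.StringIO(DESIGN))
print(json.dumps(config, indent=2))  # write this to config.json
```

Swapping `json.dumps` for `yaml.dump` (PyYAML) gives the YAML variant.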

ADD REPLY

I'm not sure I understand exactly what you mean with metadata. Could you clarify that a little? Do you mean a description of the outputs you want the workflow to produce?

Part of your description sounds like something I've been ranting about before: the need for dynamic workflow scheduling, meaning you can run computations where the number of tasks (e.g. comparisons) is determined by the output of an earlier task in the workflow. That is typically possible with dataflow-based systems, so I would expect it to work in Nextflow, and also in my own experimental library SciPipe, where the concrete need for dynamic scheduling was the whole motivation behind writing it.

ADD REPLY

In this case, I would interpret metadata to be things like groups and control samples that are sometimes, but not always, used for normalization. The classic example would be ChIP-seq, where different ChIPs will have different controls. In an ideal world these would have some consistent naming scheme, but on really large projects that may not be a reasonable assumption.

ADD REPLY

A database with a good API would be a good solution.
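To illustrate the idea (the schema is entirely hypothetical), even SQLite gives you a queryable metadata store that a pipeline can hit through a stable API instead of re-parsing flat files:

```python
import sqlite3

# Illustrative in-memory metadata store mapping each ChIP sample
# to its antibody and matched control.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE samples (
    name     TEXT PRIMARY KEY,
    antibody TEXT,
    control  TEXT)""")
con.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("chip_rep1", "H3K4me3", "input_rep1"),
     ("chip_rep2", "H3K4me3", "input_rep2")],
)

def control_for(name):
    """Look up the control paired with a ChIP sample; None if unknown."""
    row = con.execute("SELECT control FROM samples WHERE name = ?",
                      (name,)).fetchone()
    return row[0] if row else None
```

New samples and comparisons then become inserts, not workflow edits.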

ADD REPLY

Hi James,

Sorry to be late to this discussion, but I couldn't find any other way to contact you.

I am not quite clear on one thing: by "number of samples" do you mean the number of input files to be merged for one sample? Because for a ChIP-factor vs. input comparison, you're not talking about multiple actual samples, right? Or are you talking about something like ChIP differential peak analysis, which uses multiple samples? Anyway, I'm not quite clear on the question, but I think you could do this pretty easily with a new system I'm working on called looper. For simple comparisons, you would define a TSV with one line per comparison, and then write a pipeline that runs on that TSV. We use a merge table you define (another TSV) to allow any number of inputs in either category.
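For a concrete picture, such a one-line-per-comparison TSV might look like the following (column names are illustrative, not looper's actual schema):

```tsv
comparison	chip_sample	control_sample
factorX_vs_input	chipX_rep1	input_rep1
factorY_vs_input	chipY_rep1	input_rep2
```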

It's still early days and I've been trying to think of ideas for how to do something like this better. So since you've been thinking about it, if this doesn't solve your issues, I'm all ears.

ADD REPLY
7.9 years ago

I agree with the concerns of the OP, and I think Snakemake and other frameworks should have some more built-in tools to deal with metadata (other than rolling your own lookups).

Snakemake and other implicit frameworks will also suffer when all the filenames are UUIDs (as we are seeing in TCGA/GDC, although they have been kind enough to preserve suffixes so far), so a robust system for filename transformation would also be nice.
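As a sketch of that filename-transformation idea (the manifest layout and columns are hypothetical, not GDC's actual format), a translation layer between sample names and UUID files can be a single lookup:

```python
import csv
import io

# Hypothetical manifest: opaque UUID filenames paired with
# human-readable sample identifiers.
MANIFEST = """\
file_name\tsample_id
0a1b2c3d-4e5f-6071-8293-a4b5c6d7e8f9.bam\tTCGA-AB-1234_tumor
f9e8d7c6-b5a4-9382-7160-5f4e3d2c1b0a.bam\tTCGA-AB-1234_normal
"""

def filename_map(handle):
    """Build sample_id -> file_name, so rules can be written against
    sample names and resolved to UUID files at run time."""
    return {row["sample_id"]: row["file_name"]
            for row in csv.DictReader(handle, delimiter="\t")}

names = filename_map(io.StringIO(MANIFEST))
```

The workflow then only ever sees `TCGA-AB-1234_tumor`, not the UUID.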

I imagine one of the goals of SUSHI was to formalize this kind of thing by encouraging people to do this from the start. I hope there might be some way some of those principles can boil down to frameworks that aren't full-blown Rails apps.

In addition to CWL, it's worth looking at WINGS http://www.wings-workflows.org/

ADD COMMENT
7.9 years ago
igor 13k

Look into Common Workflow Language: http://www.commonwl.org/

The Common Workflow Language (CWL) is an informal, multi-vendor working group consisting of various organizations and individuals that have an interest in portability of data analysis workflows. Our goal is to create specifications that enable data scientists to describe analysis tools and workflows that are powerful, easy to use, portable, and support reproducibility.

For a list of existing workflow systems, many with CWL support, see: https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems

ADD COMMENT

Yes, CWL has nice support for ontologies to define input/output formats:

https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/tools/fastq_to_fasta.cwl

cwlVersion: v1.0
class: CommandLineTool

hints:
  - class: SoftwareRequirement
    packages:
      biopython:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_007173" ]
        version: [ "1.65", "1.66", "1.69" ]

inputs:
  fastq:
    type: File
    format: edam:format_1930  # FASTQ

baseCommand: [ python ]

arguments:
  - valueFrom: |
      from Bio import SeqIO; SeqIO.convert("$(inputs.fastq.path)", "fastq", "$(inputs.fastq.basename).fasta", "fasta");
    prefix: -c

outputs:
  fasta:
    type: File
    outputBinding: { glob: $(inputs.fastq.basename).fasta }
    format: edam:format_1929  # FASTA

$namespaces:
 edam: http://edamontology.org/
 s: http://schema.org/
$schemas:
 - http://edamontology.org/EDAM_1.16.owl
 - https://schema.org/docs/schema_org_rdfa.html

s:license: "https://www.apache.org/licenses/LICENSE-2.0"
s:copyrightHolder: "EMBL - European Bioinformatics Institute"
ADD REPLY
