Forum: Snakemake vs. Nextflow: strengths and weaknesses
6
41
6.8 years ago
ropolocan ▴ 810

I have seen increasing interest in workflow/pipeline management systems such as snakemake and nextflow. In my opinion, both seem very interesting and very promising. There is a very interesting review from 2016 in which bash, make, snakemake and nextflow were compared: https://www.jmazz.me/blog/NGS-Workflows

The author of that review did a very good job of analyzing the strengths and weaknesses of snakemake and nextflow. I am not sure how much has changed since then, but in your experience, what criteria could bioinformaticians use to choose one over the other? Have some of the identified weaknesses of snakemake and nextflow been addressed since then?

snakemake nextflow • 43k views
ADD COMMENT
5

I started using snakemake 6 months ago, and I have now shifted all my pipelines to it (ChIP-seq, RNA-seq, ATAC-seq and DNA-seq). I am pretty happy with it. Once you get the idea of how snakemake works (think in a bottom-up fashion), it is easy to build up your own pipelines. BTW, the documentation is awesome.

You can write a custom script for submitting jobs to the cluster for each platform (LSF, moab...) if you want more control over your jobs, e.g. https://bitbucket.org/snakemake/snakemake/issues/28/clustering-jobs-with-snakemake

The only downside for me is that when I have more than 1000 jobs to submit, it takes time for snakemake to process the metadata associated with each job. Even a dry-run takes minutes. I do not know how fast nextflow is.

ADD REPLY
1

And is BioMake out of the game? It uses Prolog (which is both its weakness and its strength...).

ADD REPLY
0

Hello, @kamiljaron. I was not aware of BioMake; I would have to read up on it. I do not know Prolog, nor have I ever used a logic programming language, but I will read more about what BioMake has to offer.

ADD REPLY
2

No knowledge of Prolog required! You can use GNU Make syntax to specify your workflow.

ADD REPLY
0

How do WDL/CWL compare with snakemake?

ADD REPLY
2

I have never used either of them, but CWL is just a specification for how to describe a pipeline; it does not execute a pipeline itself. Snakemake executes the pipeline in addition to describing it with its own syntax.

ADD REPLY
14
6.8 years ago

Besides what a tool can or cannot do, I like to check the quality of the documentation, whether it is actively developed and maintained, how many developers contribute to it, and the size of the user base.

It seems to me that snakemake and nextflow are pretty much on a draw for all these metrics and both are pretty good (although in terms of user base and developers they are far from tools like luigi). So I think it's a difficult choice between these two...

I haven't tried nextflow, but recently I started working with snakemake and I'm very happy with it. Actually I feel dumb that for years I've been hacking together bash scripts to run pipelines. For me one advantage of snakemake is that a snakemake script is effectively python with additional features on top. So if you know python, putting some complex logic and functions in a snakemake script is straightforward. I guess the same applies to nextflow but using groovy, which is not so popular though.

From the review you link, it seems nextflow doesn't have a "dry run" option. I find dry runs super useful for seeing what would be executed, and great for developing and debugging.

Just my 2p...
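
To make the dry-run idea concrete, here is a toy sketch in plain Python. It is not how snakemake is implemented: the rule table, the file names and the placeholder commands (`align`, `call_variants`) are all hypothetical, and timestamps, wildcards and cyclic rules are ignored. It only shows the bottom-up reasoning: start from the file you want, walk back through the rules that produce its inputs, and list the commands that would run without running any of them.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy rules: output file -> (input files, placeholder command).
RULES: Dict[str, Tuple[List[str], str]] = {
    "sample.bam": (["sample.fastq", "ref.fa"], "align ref.fa sample.fastq > sample.bam"),
    "sample.vcf": (["sample.bam", "ref.fa"], "call_variants ref.fa sample.bam > sample.vcf"),
}


def dry_run(target: str, rules: Dict[str, Tuple[List[str], str]], existing: Set[str]) -> List[str]:
    """Return the commands needed to build `target`, in execution order, without running them."""
    plan: List[str] = []
    done = set(existing)  # files that already exist or are already planned

    def visit(path: str) -> None:
        if path in done:
            return
        if path not in rules:
            raise FileNotFoundError(f"no rule produces {path!r} and it does not exist")
        inputs, command = rules[path]
        for dep in inputs:  # resolve the inputs first (bottom-up)
            visit(dep)
        plan.append(command)
        done.add(path)

    visit(target)
    return plan


if __name__ == "__main__":
    for step in dry_run("sample.vcf", RULES, {"sample.fastq", "ref.fa"}):
        print("would run:", step)
```

If `sample.bam` is already in `existing`, only the variant-calling command is planned, which is the kind of information a dry run gives you before anything is submitted.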

ADD COMMENT
1

Thank you very much for your answer, @dariober.

It seems to me that snakemake and nextflow are pretty much on a draw for all these metrics and both are pretty good (although in terms of user base and developers they are far from tools like luigi). So I think it's a difficult choice between these two...

I am curious about luigi. I have read many good comments about it, and I will be looking into testing it as well. I was testing Snakemake and I can see why it has garnered attention.

Actually I feel dumb that for years I've been hacking together bash scripts to run pipelines. For me one advantage of snakemake is that a snakemake script is effectively python with additional features on top. So if you know python, putting some complex logic and functions in a snakemake script is straightforward. I guess the same applies to nextflow but using groovy, which is not so popular though.

Using snakemake was kind of a "eureka" moment for me as well. It has so much potential, and I look forward to adapting other pipelines I had written in bash or python to snakemake.

ADD REPLY
1

About the dry run option: if I am not wrong, nextflow does not have it because it does not know a priori what the exact execution DAG will be. The Nextflow language is more expressive, and the execution DAG may depend on the input data, for example if you have conditional executions in your workflow (which is not possible in Snakemake, I think?).

ADD REPLY
0

Snakemake allows for conditional creation of the DAG and conditional execution of different code based on the input.

ADD REPLY
1

For me one advantage of snakemake is that a snakemake script is effectively python with additional features on top.

I would call this a disadvantage. The Python ecosystem is a mess to work with when it comes to 3rd party libraries. I tried to install it for myself on our HPC and immediately hit a million issues with environment management, not all of which are solvable with virtualenvs or conda. On the other hand, Nextflow installs seamlessly on any system that has Java 8, including our HPC. Re-learning the few extra programming bits I needed in Groovy was a very small price to pay for Nextflow's greater ease of portability and execution.

ADD REPLY
2

Funny, I find java generally more annoying to deal with. To each their own I guess.

ADD REPLY
1

I've never actually had to deal with Java to get Nextflow to work, beyond making sure it was installed and using Java 8. Installing Nextflow has been a one-liner on every system I've tried. On the other hand, every Python-based workflow management system I have tried (along with most other Python packages) has required a lot of hands-on environment configuration and management, which is not only a pain in the butt but also makes it much less feasible to pop up a pipeline instance on a new system on an ad-hoc basis.

ADD REPLY
0

What is luigi and how come I've never heard of it? It's supposed to do the same as Snakemake/Nextflow, while being even more popular?!

ADD REPLY
9
6.8 years ago

Table 1: Comparison of Nextflow with other workflow management systems

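| Workflow                         | Nextflow   | Galaxy | Toil   | Snakemake       | Bpipe           |
|----------------------------------|------------|--------|--------|-----------------|-----------------|
| Platform [a]                     | Groovy/JVM | Python | Python | Python          | Groovy/JVM      |
| Native task support [b]          | Yes (any)  | No     | No     | Yes (BASH only) | Yes (BASH only) |
| Common workflow language [c]     | No         | Yes    | Yes    | No              | No              |
| Streaming processing [d]         | Yes        | No     | No     | No              | No              |
| Dynamic branch evaluation        | Yes        | ?      | Yes    | Yes             | Undocumented    |
| Code sharing integration [e]     | Yes        | No     | No     | No              | No              |
| Workflow modules [f]             | No         | Yes    | Yes    | Yes             | Yes             |
| Workflow versioning [g]          | Yes        | Yes    | No     | No              | No              |
| Automatic error failover [h]     | Yes        | No     | Yes    | No              | No              |
| Graphical user interface [i]     | No         | Yes    | No     | No              | No              |
| DAG rendering [j]                | Yes        | Yes    | Yes    | Yes             | Yes             |
| Container management             |            |        |        |                 |                 |
| Docker support [k]               | Yes        | Yes    | Yes    | No              | No              |
| Singularity support [l]          | Yes        | No     | No     | No              | No              |
| Multi-scale containers [m]       | Yes        | Yes    | Yes    | No              | No              |
| Built-in batch schedulers [n]    |            |        |        |                 |                 |
| Univa Grid Engine                | Yes        | Yes    | Yes    | Partial         | Yes             |
| PBS/Torque                       | Yes        | Yes    | No     | Partial         | Yes             |
| LSF                              | Yes        | Yes    | No     | Partial         | Yes             |
| SLURM                            | Yes        | Yes    | Yes    | Partial         | No              |
| HTCondor                         | Yes        | Yes    | No     | Partial         | No              |
| Built-in distributed cluster [o] |            |        |        |                 |                 |
| Apache Ignite                    | Yes        | No     | No     | No              | No              |
| Apache Spark                     | No         | No     | Yes    | No              | No              |
| Kubernetes                       | Yes        | No     | No     | No              | No              |
| Apache Mesos                     | No         | No     | Yes    | No              | No              |
| Built-in cloud [p]               |            |        |        |                 |                 |
| AWS (Amazon Web Services)        | Yes        | Yes    | Yes    | No              | No              |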

ADD COMMENT
4

To be fair it would be nice to see the same table compiled or commented by the authors of snakemake... With respect to slurm, I don't know what is meant by "partial" support in snakemake. I started playing with snakemake and running jobs using slurm is incredibly simple.

ADD REPLY
3

Yeah, snakemake has full support for anything that uses drmaa, which I expect is also what Galaxy uses and probably what nextflow uses. Further, the footnote in the table for that section basically amounts to, "Actually, it has full support for these and any future schedulers, you just have to tell it how to execute the commands." I prefer the snakemake way of doing this, since everyone submits jobs through a wrapper I wrote and that way lots of things (temp space, memory usage, queue, etc.) can be conveniently set without including them again and again in snakemake files.
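
A submission wrapper of the kind described above could look roughly like the sketch below. This is an illustration only, assuming a SLURM cluster: the function name, the default values and the dictionary keys are hypothetical, and only standard `sbatch` options (`--partition`, `--mem`, `--cpus-per-task`, `--tmp`) are used. How the wrapper gets hooked into snakemake depends on the snakemake version and is not shown here.

```python
import shlex
from typing import Dict, List, Optional

# Hypothetical site-wide defaults; keys and values are illustrative only.
DEFAULTS: Dict[str, object] = {
    "partition": "normal",
    "mem_mb": 4000,
    "cpus": 1,
    "tmp_mb": 1000,
}


def build_sbatch_command(jobscript: str, overrides: Optional[Dict[str, object]] = None) -> List[str]:
    """Build an sbatch command line from site defaults plus per-job overrides."""
    settings = {**DEFAULTS, **(overrides or {})}  # per-job values win over defaults
    return [
        "sbatch",
        f"--partition={settings['partition']}",
        f"--mem={settings['mem_mb']}M",
        f"--cpus-per-task={settings['cpus']}",
        f"--tmp={settings['tmp_mb']}M",
        jobscript,
    ]


if __name__ == "__main__":
    # Print the command instead of submitting it, so this runs without a cluster.
    print(shlex.join(build_sbatch_command("job.sh", {"mem_mb": 16000, "cpus": 8})))
```

Keeping the defaults in one place is the point: queue, memory and temp space are set once in the wrapper, and a workflow only passes the values it needs to change.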

ADD REPLY
0

Nextflow does not use DRMAA. It uses the scheduler's native directives. Here is an example from the source code. Also note that cluster options such as memory and CPUs can all be set for pipeline processes independently of the actual pipeline script, and you can use profiles to have multiple sets of configurations for different systems (e.g. one pipeline script, and different execution configs for HPC, local, AWS, etc.). Docs here

ADD REPLY
2

All of that largely applies to snakemake too :)

ADD REPLY
2

The table is outdated by now, Snakemake does support Kubernetes AFAICT: https://snakemake.readthedocs.io/en/stable/executable.html#executing-a-snakemake-workflow-via-kubernetes

ADD REPLY
5

The table was never particularly accurate to begin with.

ADD REPLY
1

Thanks for sharing this table, @shenwei356! It is very interesting to see that nextflow has stream processing, workflow versioning, and full support for SLURM, in addition to having native task support for any language. I believe snakemake has native task support for R now as well. Thanks again for your answer.

ADD REPLY
1

I think this table is a little outdated with respect to bpipe, which I use daily. SLURM support exists (at least we are using it in-house), and I am pretty sure that stages in R can be run natively without wrapping them in an Rscript.

ADD REPLY
7
6.8 years ago
Sinji ★ 3.2k

I'm a big fan of Nextflow. I've used Snakemake in the past, and it was originally my go-to workflow language, but the built-in support for Docker, Singularity, and HPC environments that Nextflow provides just can't be beat.

The only downside is you have to use Groovy.

ADD COMMENT
1

Snakemake has singularity support with the singularity directive. I haven't used nextflow, but I would be amazed if it is as flexible as Snakemake. (Note to past self: there is a flexibility vs rigor tradeoff.)

ADD REPLY
3

Get prepared to be amazed.

ADD REPLY
0

Thank you very much for your answer, @Sinji. I also look forward to testing Nextflow. Both workflow systems/languages have so much potential. I think they could have a very important impact on bioinformatics.

ADD REPLY
0

I agree that nextflow is a good and very promising language. Unfortunately, while composing more complicated pipelines I ran into one of its limitations: the static nature of the channel object. I have submitted feature requests, and I believe the nextflow team will remove this limitation in the future.

My problem is this: processA produces many files, in particular paired fastq reads, and processB should take those output files as input and run in parallel, one execution per pair. Say only one processA is running (in practice there can be many). While processA is generating pairs of fastq files (file1_R1.fastq.gz, file1_R2.fastq.gz), (file2_R1.fastq.gz, file2_R2.fastq.gz), ..., I want processB to start on each pair as soon as processA has finished writing it. Right now I have to use two pipelines to do this, because Channel.fromFilePairs("*_R{1,2}.fastq.gz") is empty when the pipeline starts: the channel is constructed at script launch time, so processB never does anything.

I guess implementing such a dynamic feature would need very fundamental changes to the language. There might be other ways to meet my requirement that I have not explored yet, for example watchPath, subscribe, etc.; I am still learning nextflow. If anyone has run into this limitation and found a solution, please let me know.
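
For readers unfamiliar with the `*_R{1,2}.fastq.gz` naming convention, the pairing step can be sketched in plain Python as below. This is not Nextflow code and makes no claim about how `Channel.fromFilePairs` is implemented; the function name and regular expression are hypothetical. It also works on a complete file listing, so it does not address the streaming part of the question.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Matches names such as file1_R1.fastq.gz and file1_R2.fastq.gz.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def pair_fastq_files(filenames: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Group paired-end fastq names into (R1, R2) tuples keyed by sample prefix.

    Names that do not match the pattern, and samples missing a mate, are skipped.
    """
    mates: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match:
            mates[match["sample"]][match["read"]] = name
    return {
        sample: (reads["1"], reads["2"])
        for sample, reads in sorted(mates.items())
        if "1" in reads and "2" in reads
    }


if __name__ == "__main__":
    listing = ["file2_R2.fastq.gz", "file1_R1.fastq.gz", "file1_R2.fastq.gz", "file2_R1.fastq.gz"]
    for sample, pair in pair_fastq_files(listing).items():
        print(sample, pair)
```

Each resulting tuple is one unit of work for the downstream step, so the pairs can be processed independently and in parallel.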

ADD REPLY
1

This sounds like something that goes against the basic premise of Nextflow. It's possible to have multiple outputs, even a dynamic number of outputs, but they are not released from a task until the task is complete. As far as I know, watchPath only works for initially loading items into a channel, not for watching the outputs of a task. I cannot think of a good solution for this in Nextflow, since it would violate the integrity of the task (e.g. what do you do if you release an output and then the task fails after that output has already proceeded through the rest of the pipeline?). Personally, I would just consider this an edge case where you have to accept a small throughput penalty, waiting for processA to finish producing all files before any of them can continue through the pipeline.

ADD REPLY
0

Can you not just output from one process a tuple of the two paired files, and use this channel as input for the next process? See https://www.nextflow.io/docs/latest/process.html?highlight=tuple#output-tuple-of-values

ADD REPLY
4
6.8 years ago
ropolocan ▴ 810

I just read this excellent review by @Jeremy Leipzig. This article can be helpful for deciding which workflow management system is more suitable to each one's needs: https://academic.oup.com/bib/article-lookup/doi/10.1093/bib/bbw020

ADD COMMENT
4
6.4 years ago
ropolocan ▴ 810

I am revisiting this post to mention that snakemake supports automated deployment of software dependencies with conda as well as the specification of conda environments per rule. This is very exciting!

ADD COMMENT
1

This feature has also been added to Nextflow. Link

ADD REPLY
0

Excellent! Thanks for bringing attention to this, @steve. I will definitely try it out.

ADD REPLY
2
6.3 years ago
ropolocan ▴ 810

I thought I would share this Reddit thread on workflow management systems. There are very interesting posts on snakemake vs. nextflow: https://www.reddit.com/r/bioinformatics/comments/73am0k/ncbi_hackathons_discussions_on_bioinformatics/

ADD COMMENT
