Question

Running snakemake on chromosomes in parallel

1

Entering edit mode

3.1 years ago

nadietz ▴ 10

I'm a new snakemake user and struggling with a problem that I have yet to find an answer for from my googling efforts. I am attempting to merge vcf files (split by chromosome) from several different WGS datasets into a single vcf file per chromosome using "bcftools merge". My script structure is as follows.

A .tsv file ('input_files.tsv') where the user can supply the path and prefix of each dataset:

path            prefix
/path/to/d1/    dataset1_Chr
/path/to/d2/    dataset2_Chr

where the files in each folder would look like this:

/path/to/d1/dataset1_Chr01.vcf.gz
/path/to/d1/dataset1_Chr02.vcf.gz

/path/to/d2/dataset2_Chr01.vcf.gz
/path/to/d2/dataset2_Chr02.vcf.gz

I have snakefile that looks similar to this:

import pandas as pd
input_files = pd.read_tsv('input_files.tsv')

rule merge_vcf:
    input: ????
    output: "merged_Chr{num}.vcf"
    shell: "bcftools merge {input} -o {output}"

My question is two-fold:

1) How can I construct the input and output filenames such that snakemake will sequentially merge the Chr01 files for all datasets together, then merge the Chr02 files for each dataset together, and so on.... without merging Chr01+Chr02 into the same file.

2) Since the merge for each chromosome is independent of other chromosomes, I would like to perform these in parallel (I'm using a SLURM workload manager on a HPC cluster here). I came across the "group" parameter in the snakemake docs, however, it's unclear to me if this parameter does what I want.

I recognize that I can explicitly define the files for each dataset in the input, like so:

rule merge_vcf:
    input: 
        d1 = expand("dataset1_Chr{num}.vcf.gz", num = ['01', '02'])
        d2 = expand("dataset2_Chr{num}.vcf.gz", num = ['01', '02'])
    output: "merged_Chr{num}.vcf"
    shell: "bcftools merge {input.d1} {input.d2} -o {output}"

however, that requires me to know in advance how many datasets the user wants to merge. I would like this script to take a dynamic number of datasets in the input file without having to modify the snakefile. I know that I can pass a python list of ALL the files to the input parameter, but then I don't understand how to tell bcftools to only merge Chr01 files together, then Chr02, and so on..

Any help is much appreciated.

snakemake python vcf • 1.7k views

ADD COMMENT • link updated 3.1 years ago by Pierre Lindenbaum 161k • written 3.1 years ago by nadietz ▴ 10

0

Entering edit mode

please explore zip function in snakemake.

ADD REPLY • link 3.1 years ago by cpad0112 21k

0

Entering edit mode

Are you sure 'zip' can be used to solve this?

expand("{dataset}{num}.vcf.gz", zip, dataset = ['dataset1_Chr', 'dataset2_Chr'], num = ['01', '02'])

would yield:

dataset1_Chr01.vcf.gz, dataset2_Chr02.vcf.gz

for the inputs, but I need:

dataset1_Chr01.vcf.gz, dataset2_Chr01.vcf.gz

...then...

dataset1_Chr02.vcf.gz, dataset2_Chr02.vcf.gz

as my inputs. Am I misunderstanding how to use zip?

ADD REPLY • link 3.1 years ago by nadietz ▴ 10

0

Entering edit mode

That is not what I meant. Create two lists for files for each directory, zip the two lists.

ADD REPLY • link 3.1 years ago by cpad0112 21k

0

Entering edit mode

Yea it was difficult to glean what you meant by 'please explore zip,' but your second comment helps a bit. Doesn't this strategy still require me to know in advance how many datasets the user will provide? I would like the solution to be able to process any number of datasets.

ADD REPLY • link 3.1 years ago by nadietz ▴ 10

score 2 · Answer 1 · 2021-03-15

2

Entering edit mode

3.1 years ago

Pierre Lindenbaum 161k

OK , I'm not a snakemake guy so here is a nextflow solution. Search for all your vcf and extract the contigs using 'bcftools index -s'

usage: nextflow run workflow.nf --baseDir /path/to/base/directory

Since the merge for each chromosome is independent of other chromosomes, I would like to perform these in parallel

define a config file where 'executor' is sge

https://www.nextflow.io/docs/latest/executor.html#sge

ADD COMMENT • link 3.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I'm planning on using snakemake for my pipeline. But maybe this will help someone else that is using nextflow.

ADD REPLY • link 3.1 years ago by nadietz ▴ 10

0

Entering edit mode

I'm a new snakemake user

you should give you the chance to compare both tools at this point.

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008622

Using prototyping to choose a bioinformatics workflow management system

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you for the suggestion, but I'm struggling enough just to learn one tool. The question specifically asks for snakemake...

ADD REPLY • link 3.1 years ago by nadietz ▴ 10