Running snakemake on chromosomes in parallel
1
1
Entering edit mode
3.1 years ago
nadietz ▴ 10

I'm a new snakemake user and struggling with a problem that I have yet to find an answer for from my googling efforts. I am attempting to merge vcf files (split by chromosome) from several different WGS datasets into a single vcf file per chromosome using "bcftools merge". My script structure is as follows.

A .tsv file ('input_files.tsv') where the user can supply the path and prefix of each dataset:

path            prefix
/path/to/d1/    dataset1_Chr
/path/to/d2/    dataset2_Chr

where the files in each folder would look like this:

/path/to/d1/dataset1_Chr01.vcf.gz
/path/to/d1/dataset1_Chr02.vcf.gz

/path/to/d2/dataset2_Chr01.vcf.gz
/path/to/d2/dataset2_Chr02.vcf.gz

I have snakefile that looks similar to this:

import pandas as pd
input_files = pd.read_tsv('input_files.tsv')

rule merge_vcf:
    input: ????
    output: "merged_Chr{num}.vcf"
    shell: "bcftools merge {input} -o {output}"

My question is two-fold:

1) How can I construct the input and output filenames such that snakemake will sequentially merge the Chr01 files for all datasets together, then merge the Chr02 files for each dataset together, and so on.... without merging Chr01+Chr02 into the same file.

2) Since the merge for each chromosome is independent of other chromosomes, I would like to perform these in parallel (I'm using a SLURM workload manager on a HPC cluster here). I came across the "group" parameter in the snakemake docs, however, it's unclear to me if this parameter does what I want.

I recognize that I can explicitly define the files for each dataset in the input, like so:

rule merge_vcf:
    input: 
        d1 = expand("dataset1_Chr{num}.vcf.gz", num = ['01', '02'])
        d2 = expand("dataset2_Chr{num}.vcf.gz", num = ['01', '02'])
    output: "merged_Chr{num}.vcf"
    shell: "bcftools merge {input.d1} {input.d2} -o {output}"

however, that requires me to know in advance how many datasets the user wants to merge. I would like this script to take a dynamic number of datasets in the input file without having to modify the snakefile. I know that I can pass a python list of ALL the files to the input parameter, but then I don't understand how to tell bcftools to only merge Chr01 files together, then Chr02, and so on..

Any help is much appreciated.

snakemake python vcf • 1.7k views
ADD COMMENT
0
Entering edit mode

please explore zip function in snakemake.

ADD REPLY
0
Entering edit mode

Are you sure 'zip' can be used to solve this?

expand("{dataset}{num}.vcf.gz", zip, dataset = ['dataset1_Chr', 'dataset2_Chr'], num = ['01', '02'])

would yield:

dataset1_Chr01.vcf.gz, dataset2_Chr02.vcf.gz

for the inputs, but I need:

dataset1_Chr01.vcf.gz, dataset2_Chr01.vcf.gz

...then...

dataset1_Chr02.vcf.gz, dataset2_Chr02.vcf.gz

as my inputs. Am I misunderstanding how to use zip?

ADD REPLY
0
Entering edit mode

That is not what I meant. Create two lists for files for each directory, zip the two lists.

ADD REPLY
0
Entering edit mode

Yea it was difficult to glean what you meant by 'please explore zip,' but your second comment helps a bit. Doesn't this strategy still require me to know in advance how many datasets the user will provide? I would like the solution to be able to process any number of datasets.

ADD REPLY
2
Entering edit mode
3.1 years ago

OK , I'm not a snakemake guy so here is a nextflow solution. Search for all your vcf and extract the contigs using 'bcftools index -s'

usage: nextflow run workflow.nf --baseDir /path/to/base/directory

Since the merge for each chromosome is independent of other chromosomes, I would like to perform these in parallel

define a config file where 'executor' is sge

https://www.nextflow.io/docs/latest/executor.html#sge

ADD COMMENT
0
Entering edit mode

I'm planning on using snakemake for my pipeline. But maybe this will help someone else that is using nextflow.

ADD REPLY
0
Entering edit mode

I'm a new snakemake user

you should give you the chance to compare both tools at this point.

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008622

Using prototyping to choose a bioinformatics workflow management system

ADD REPLY
0
Entering edit mode

Thank you for the suggestion, but I'm struggling enough just to learn one tool. The question specifically asks for snakemake...

ADD REPLY

Login before adding your answer.

Traffic: 1593 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6