I'm a new snakemake user and struggling with a problem that I have yet to find an answer for from my googling efforts. I am attempting to merge vcf files (split by chromosome) from several different WGS datasets into a single vcf file per chromosome using "bcftools merge". My script structure is as follows.
A .tsv file ('input_files.tsv') where the user can supply the path and prefix of each dataset:
path prefix
/path/to/d1/ dataset1_Chr
/path/to/d2/ dataset2_Chr
where the files in each folder would look like this:
/path/to/d1/dataset1_Chr01.vcf.gz
/path/to/d1/dataset1_Chr02.vcf.gz
/path/to/d2/dataset2_Chr01.vcf.gz
/path/to/d2/dataset2_Chr02.vcf.gz
I have snakefile that looks similar to this:
import pandas as pd
input_files = pd.read_tsv('input_files.tsv')
rule merge_vcf:
input: ????
output: "merged_Chr{num}.vcf"
shell: "bcftools merge {input} -o {output}"
My question is two-fold:
1) How can I construct the input and output filenames such that snakemake will sequentially merge the Chr01 files for all datasets together, then merge the Chr02 files for each dataset together, and so on.... without merging Chr01+Chr02 into the same file.
2) Since the merge for each chromosome is independent of other chromosomes, I would like to perform these in parallel (I'm using a SLURM workload manager on a HPC cluster here). I came across the "group" parameter in the snakemake docs, however, it's unclear to me if this parameter does what I want.
I recognize that I can explicitly define the files for each dataset in the input, like so:
rule merge_vcf:
input:
d1 = expand("dataset1_Chr{num}.vcf.gz", num = ['01', '02'])
d2 = expand("dataset2_Chr{num}.vcf.gz", num = ['01', '02'])
output: "merged_Chr{num}.vcf"
shell: "bcftools merge {input.d1} {input.d2} -o {output}"
however, that requires me to know in advance how many datasets the user wants to merge. I would like this script to take a dynamic number of datasets in the input file without having to modify the snakefile. I know that I can pass a python list of ALL the files to the input parameter, but then I don't understand how to tell bcftools to only merge Chr01 files together, then Chr02, and so on..
Any help is much appreciated.
please explore zip function in snakemake.
Are you sure 'zip' can be used to solve this?
expand("{dataset}{num}.vcf.gz", zip, dataset = ['dataset1_Chr', 'dataset2_Chr'], num = ['01', '02'])
would yield:
dataset1_Chr01.vcf.gz, dataset2_Chr02.vcf.gz
for the inputs, but I need:
dataset1_Chr01.vcf.gz, dataset2_Chr01.vcf.gz
...then...
dataset1_Chr02.vcf.gz, dataset2_Chr02.vcf.gz
as my inputs. Am I misunderstanding how to use zip?
That is not what I meant. Create two lists for files for each directory, zip the two lists.
Yea it was difficult to glean what you meant by 'please explore zip,' but your second comment helps a bit. Doesn't this strategy still require me to know in advance how many datasets the user will provide? I would like the solution to be able to process any number of datasets.