Hi folks,
I usually use bash scripts to get things done, and I an going to try my hand at setting up some snakemake scripts to see if I can improve some automation and tracking for some currently ad-hoc analyses.
I've gone through the examples on this page: https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html
But I'm still unclear about how to handle variables to the point where I can do some scatter-gather work. For example, here is a rough outline of what I do in bash, but would like to do in snakemake.
# split the genome into chunks and variant call in parallel
bcftools/vcfutils.pl splitchr -l 1000000 GRCh37-lite.fa.fai | xargs -i echo "gatk-4.0.10.0/gatk --java-options "-Xmx4g" HaplotypeCaller -R GRCh37-lite.fa -I ./GM24385.bam -O ./.{}.vcf.gz --pcr-indel-model NONE -L {}" | parallel -j 20 --eta
# gather the VCF files back into one big VCF
vcftools_0.1.12a/bin/vcf-concat $( ls .*vcf.gz ) > concat.vcf
# sort the VCF
cat concat.vcf | vcftools_0.1.12a/bin/vcf-sort > concat.sorted.vcf
Are there any resources/examples out there showing how to do this or something similar in snakemake?
thanks, Richard
The ability to do this is quite new, have a look here for the relevant documentation.
Based on the tutorial page suggested above I got this far, but am now getting "SyntaxError in line 13 of Snakefile: invalid syntax" Errors. I suspect it is due to my use of quotes in the shell command. I'm not really sure I'm on the right track so please send along any suggestions.
thanks RIchard
yes if you need to use a lot of quotes then just put your whole shell in a triple-quote block """
Great! That seemed to help. Now it looks like I'm struggling to understand the wildcards. In the code below I want to process each BAM files in the folder "input/" to create 1 final VCF per BAM file.
My current error is "RuleException in line 7 of /projects/rcorbettprj2/GATK4.0.10_snakemake/Snakefile: NameError: The name 'sample' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}} "
My SAMPLES variable contains the list of BAM file names that I expect, but I can't seem to pass that variable into
Thanks for everyone's help with this. I was a bit surprised while trying to learn this that there weren't more examples available.
rawCallsDir=directory("outDir/{sample}")
is not going to work (set this to a real file), nor isoutDir/{sample}/{}.vcf.gz
since the second braces need to be escaped so Snakemake knows not to touch themThanks Jeremy,
For the directory command, I was hoping to create a folder to hold the smaller VCF files per sample. That way it keeps all the files for each sample separate. For example, I have the following in "input" bam1.bam bam2.bam bam3.bam
For each of the bam files I wanted to create a folder in which I will create ~3000 small VCF files. The VCF files in each folder will then be merged to create one merged VCF file per starting BAM file. I was hoping to use separate folders for the temporary small files so that if another sample gets added later, the merges won't get triggered for the files that were created before. Perhaps snakemake is smart enough to take care of this with the checkpoint command?
Snake is smart enough if you let Snakemake keep track of the intermediates. I don't see the point of using
parallel
with Snakemake - that's what -j is for.