Hello everyone,
I want to run the rule assessment exactly once, after the classify_sample rule is done. The program run by the assessment rule only takes a folder containing the classify_sample results. However, when I use the output of classify_sample as input for the assessment rule, it naturally runs once per input file (the code lines that I commented out). How should I instead ensure that the assessment rule runs once, after classify_sample has finished?
Based on the documentation, I tried using the touch() function to cue the assessment rule to run after classify_sample, but it gave me the following error:
SyntaxError: Not all output, log and benchmark files of rule classify_sample contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file
It seems Snakemake treats the 'classification.done' file as an output to be produced on every run of the rule, and therefore demands consistent wildcard usage across all output, log, and benchmark files. But I need this file to be produced only once, after the classify_sample rule has finished for all samples, and then to trigger the assessment rule with this file as input.
I haven't figured out the correct way to ensure the assessment rule runs after the classify_sample rule. I hope someone can give me some advice.
import glob
import os

FASTA_EXT = ('*.fasta.gz', '*.fna.gz', '*.fa.gz', '*.fasta', '*.fna', '*.fa')
REFERENCES_FILES = []
for ext in FASTA_EXT:
    REFERENCES_FILES.extend(glob.glob(config["path_to_references"] + ext))
REFERENCES = ['.'.join(os.path.basename(ref).split('.')[:2]) for ref in REFERENCES_FILES]

FASTQ_EXT = ('*.fq.gz', '*.fastq.gz', '*.fq', '*.fastq')
SAMPLE_PATH = config["path_to_samples"]
SAMPLE_FILES = []
for ext in FASTQ_EXT:
    SAMPLE_FILES.extend(glob.glob(SAMPLE_PATH + ext))
SAMPLES = [os.path.basename(f).split('.')[0] for f in SAMPLE_FILES]
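As a side note on the sample-name extraction above: `os.path.basename(file).split('.')[0]` truncates sample names that themselves contain dots (e.g. `sample.v2.fastq.gz` becomes `sample`). A small sketch of a more defensive helper, reusing the extension list from the question; the function name `strip_ext` is my own and not part of the original workflow:

```python
import os

# Known FASTQ suffixes, longest variants first so '.fq.gz' wins over '.fq'
FASTQ_SUFFIXES = ('.fq.gz', '.fastq.gz', '.fq', '.fastq')

def strip_ext(path, suffixes=FASTQ_SUFFIXES):
    """Return the basename with the first matching known suffix removed."""
    name = os.path.basename(path)
    for suf in suffixes:
        if name.endswith(suf):
            return name[:-len(suf)]
    return name  # unknown extension: leave the name untouched
```

With this, `strip_ext('/data/sample.v2.fastq.gz')` keeps the dots in the sample name and returns `'sample.v2'`.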
rule all:
    input:
        # expand(config["path_to_assessment_result"] + 'result_all.csv'),
        # expand(config["path_to_assessment_result"] + '{reference}/{sample}-k25.PNG', sample=SAMPLES, reference=REFERENCES),
        expand(config["path_to_classification_result"] + "{reference}/{sample}-k25.log", sample=SAMPLES, reference=REFERENCES)
rule assessment:
    output:
        csv=config["path_to_assessment_result"] + 'result_all.csv',
        # visual=config["path_to_assessment_result"] + '{reference}/{sample}-k{k}.PNG',
    input:
        # log=glob.glob(config["path_to_classification_result"] + "**/{sample}-k{k}.log"),
        # config["path_to_classification_result"] + "{reference}/{sample}-k{k}.log"
        "classification.done"
    params:
        path=config["path_to_assessment_result"],
        path_logs=config["path_to_classification_result"],
        threshold=config["threshold"],
    shell:
        'python3 assessment.py -logdir {params.path_logs} -threshold {params.threshold} -out {params.path}'
input_sample = []
for ext in FASTQ_EXT:
    input_sample.extend(glob.glob(SAMPLE_PATH + ext))
rule classify_sample:
    output:
        host=config["path_to_classification_result"] + "{reference}/{sample}-k{k}-host.fq.gz",
        # other output files: -graft, -both, -neither, -ambiguous
        log=config["path_to_classification_result"] + "{reference}/{sample}-k{k}.log",
        cue_file=touch("classification.done")
    input:
        index_info=config["path_to_index_result"] + "xengsort-{reference}-k{k}.info",
        index_hash=config["path_to_index_result"] + "xengsort-{reference}-k{k}.hash",
        in1=input_sample
    log:
        config["path_to_classification_result"] + "{reference}/{sample}-k{k}.log",
    benchmark:
        config["path_to_classification_result"] + "{reference}/{sample}-k{k}.benchmark"
    params:
        index=lambda wc: f"{DIRECTORY_OUTPUT}index_result/xengsort-{wc.reference}-k{wc.k}",
        outprefix=lambda wc: f"{DIRECTORY_OUTPUT}classification_result/{wc.reference}/{wc.sample}-k{wc.k}",
        chunksize=16.0,  # MB per chunk per thread
        # numactl="numactl -N 1 -m 1 ",
        prefetch=1,
        debug="-DD",  # debug level 2
    threads: 40
    shell:
        "xengsort {params.debug} classify --out {params.outprefix} --index {params.index} "
        "--threads {threads} --fastq {input.in1} "
        "--chunksize {params.chunksize} --prefetchlevel {params.prefetch} &> {log}"
Thank you.
Best,
Fadlilah
Your classify_sample rule is just an implicit wildcard rule. How is Snakemake supposed to know which explicit files it should produce if you don't specify explicit output(s), like
mypath/hg38/sample1-k2-host.fq.gz
It's actually a chunk of the whole code. I had the wildcards specified in rule all, which gets the SAMPLES and REFERENCES lists from glob.glob().
I see. So classification.done will be touched on every iteration of that rule, which isn't what you want. Put your touch command in rule all.
Thank you for the advice. But I don't yet understand how that helps ensure the assessment rule runs after classify_sample.
Your sentinel file should work; rule assessment is really your final target.

It worked like I intended. Thank you very much!
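For readers landing on this thread: a common way to get this ordering without a sentinel file at all is to make the assessment rule depend on the full expanded set of classify_sample outputs, so the DAG itself forces assessment to run once, after every classification job. A minimal sketch, assuming the SAMPLES/REFERENCES lists, config paths, and k=25 from the question; this is not the poster's exact final Snakefile:

```python
rule all:
    input:
        config["path_to_assessment_result"] + "result_all.csv"

rule assessment:
    input:
        # Depend on every per-sample log: assessment has no wildcards of its
        # own, so it runs exactly once, after all classify_sample jobs finish.
        logs=expand(config["path_to_classification_result"] + "{reference}/{sample}-k25.log",
                    sample=SAMPLES, reference=REFERENCES)
    output:
        csv=config["path_to_assessment_result"] + "result_all.csv"
    params:
        path=config["path_to_assessment_result"],
        path_logs=config["path_to_classification_result"],
        threshold=config["threshold"]
    shell:
        "python3 assessment.py -logdir {params.path_logs} "
        "-threshold {params.threshold} -out {params.path}"
```

Because expand() resolves the wildcards at parse time, the assessment rule itself carries none, which also avoids the "same wildcards" error from the original touch() attempt.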