Snakemake how to use glob_wilcards properly
1
1
Entering edit mode
3.1 years ago
Peter Chung ▴ 200

I have many paired fastq files and I have a problem on after running trim_galore package, as it named the fastq files with _1_val_1 and _2_val_2, for example: AD50_CTGATCGTA_1_val_1.fq.gz and AD50_CTGATCGTA_2_val_2.fq.gz.

I would like continue snakemake and use

import os
import snakemake.io
import glob

DIR="AD50"
(SAMPLES,READS,) = glob_wildcards(DIR+"{sample}_{read}.fq.gz")
READS=["1","2"]

rule all:
    input:
        expand(DIR+"{sample}_dedup_{read}.fq.gz",sample=SAMPLES,read=READS)

rule clumpify:
    input:
        r1=DIR+"{sample}_1_val_1.fq.gz",
        r2=DIR+"{sample}_2_val_2.fq.gz"
    output:
        r1out=DIR+"{sample}_dedup_1.fq.gz",
        r2out=DIR+"{sample}_dedup_2.fq.gz"
    shell:
        "clumpify.sh in={input.r1} in2={input.r2} out={output.r1out} out2={output.r2out} dedupe subs=0"

and the error is:

Building DAG of jobs...
MissingInputException in line 13 of /home/peterchung/Desktop/Rerun-Result/clumpify.smk:
Missing input files for rule clumpify:
AD50/AD50_CTGATCGTA_2_val_2_val_2.fq.gz
AD50/AD50_CTGATCGTA_2_val_1_val_1.fq.gz

I tired another way, somehow the closest is that it detected the missing input like AD50_CTGATCGTA_1_val_2.fq.gz and AD50_CTGATCGTA_2_val_1.fq.gz which is not exist.

I am not sure the glob_wildcards function I used properly since there are many underscore in it. I tired:

 glob_wildcards(DIR+"{sample}_{read}_val_{read}.fq.gz")

but it did not work as well.

glob wildcards snakemake python • 2.8k views
ADD COMMENT
0
Entering edit mode
3.1 years ago

If it's not too late and if feasible for you, I would run trim_galore within your snakefile. That is, add a rule that takes in input the raw fastq files and produces the trimmed *_val_* files. Maybe something like:

rule trim_galore:
    input:
        r1= DIR+"{sample}_1.fq.gz",
        r2= DIR+"{sample}_2.fq.gz",
    output:
        r1=DIR+"{sample}_1_val_1.fq.gz",
        r2=DIR+"{sample}_2_val_2.fq.gz",
    shell:
        r"""
        trim_galore ... {input} {output} 
        """

If you have trimmed files and you don't want to do the trimming again, make sure they are in a directory consistent with the code in the snakefile. Snakemake complains that it cannot find AD50/AD50_CTGATCGTA_2_val_2_val_2.fq.gz. So, where do you have the trimmed files and how are they named? Showing the command line you use to run snakemake would also help.

ADD COMMENT
0
Entering edit mode

my file name is AD50_CTGATCGTA_1_val_1.fq.gz and AD50_CTGATCGTA_2_val_2.fq.gz

my command line is:

snakemake -j4

I think my glob_wildcards is incorrect. it recognises AD50_CTGATCGTA_1_val and AD50_CTGATCGTA_2_val as sample and that's why it stated AD50/AD50_CTGATCGTA_2_val_2_val_2.fq.gz AD50/AD50_CTGATCGTA_2_val_1_val_1.fq.gz is missed.

but I want only AD50_CTGATCGTA as sample and let it recognise _1_val_1 and _2_val_2 rather than _1_val_2 or _2_val_1 which it is not exist. Thanks for your reply.

ADD REPLY

Login before adding your answer.

Traffic: 2512 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6