Question

Wildcards in Snakemake

0

Entering edit mode

6 months ago

Begonia_pavonina ▴ 150

I wish to input the names of my samples in a table with the corresponding files of reads forward and reverse to use them in a Snakemake workflow:

sample  fq1 fq2
1   ../reads/110627_0240_AC0254ABXX_2_SA-PE-001.1.fq.gz ../reads/110627_0240_AC0254ABXX_2_SA-PE-001.2.fq.gz
2   ../reads/110627_0240_AC0254ABXX_2_SA-PE-002.1.fq.gz ../reads/110627_0240_AC0254ABXX_2_SA-PE-002.2.fq.gz
22  ../reads/110802_0249_AD0CM0ABXX_3_SA-PE-022.1.fq.gz ../reads/110802_0249_AD0CM0ABXX_3_SA-PE-022.2.fq.gz

Unfortunately, the management of the wildcards is more complex and the error "No values given for wildcard 'samples'." appear with this workflow. Would anyone know how to manage this kind of input for Snakemake?

import os 
import snakemake.io 
import glob 
import pandas as pd

#Config file: 
configfile: "config/config.yaml"

# Load the samples table: 
table=pd.read_csv("config/samples_reads.tsv", delim_whitespace=True)

# Use the samples table to make lists of samples names/lists: 
samples=table['sample'] 
read1=table['fq1'] 
read2=table['fq2']

# The first rule: 
rule all:
        input:
                expand("mapped/{samples}.bam")

# Alignment: 
rule bwa_mem:
    input:
        reads=[expand("reads/{read1}"), expand("reads/{read2}")],
        idx=multiext("genome", ".amb", ".ann", ".bwt", ".pac", ".sa"),
    output:
        expand("mapped/{samples}.bam"),
    log:
        "logs/bwa_mem/{samples}.log",
    params:
        extra=r"-R '@RG\tID:{samples}\tSM:{samples}'",
        sorting="none",  # Can be 'none', 'samtools' or 'picard'.
        sort_order="queryname",  # Can be 'queryname' or 'coordinate'.
        sort_extra="",  # Extra args for samtools/picard.
    threads: 8
    wrapper:
        "v2.6.0/bio/bwa/mem"

Python workflow Snakemake wrapper • 911 views

ADD COMMENT • link updated 6 months ago by Jeremy Leipzig 22k • written 6 months ago by Begonia_pavonina ▴ 150

score 2 · Answer 1 · 2023-10-18

2

Entering edit mode

6 months ago

Jeremy Leipzig 22k

the samples in expand("mapped/{samples}.bam") is not the samples in the outer scope. You need to define it in the expand i.e. expand("mapped/{samples}.bam",samples=samples) and make sure samples is a list, not a series

ADD COMMENT • link 6 months ago by Jeremy Leipzig 22k

0

Entering edit mode

Thanks for the help, I have modified the script and this specific problem has been solved:

import os
import snakemake.io
import glob
import pandas as pd

# Config file
configfile: "config/config.yaml"

# Load the samples table:
table=pd.read_csv("config/samples_reads.tsv", delim_whitespace=True)

# Use the samples table to make lists of samples names/lists:
SAMPLES=table['sample'].to_list()
R1=table['fq1'].to_list()
R2=table['fq2'].to_list()

# The first rule (here rule all) specifies the files that you would like to create during your snakemake workflow.
rule all:
        input: 
                expand("mapped/{sample}.bam", sample=SAMPLES)

rule bwa_mem:
    input:
        reads=[expand("reads/{read1}", read1=R1 ), expand("reads/{read2}", read2=R2)],
        idx=multiext("genome", ".amb", ".ann", ".bwt", ".pac", ".sa"),
    output:
        expand("mapped/{sample}.bam", sample=SAMPLES),
    log:
        expand("logs/bwa_mem/{sample}.log", sample=SAMPLES),
    params:
        extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
        sorting="none",  # Can be 'none', 'samtools' or 'picard'.
        sort_order="queryname",  # Can be 'queryname' or 'coordinate'.
        sort_extra="",  # Extra args for samtools/picard.
    threads: 8
    wrapper:
        "v2.6.0/bio/bwa/mem"

However, this method to input samples and reads with different names does not work, it seems Snakemake does not like when samples and reads names differ.

Building DAG of jobs...
MissingInputException in rule bwa_mem in file /mnt/shared/scratch/usr/comp_baits_pseudo-chr/baits/workflow/Snakefile, line 31:
Missing input files for rule bwa_mem:
    output: mapped/A.bam, mapped/B.bam, mapped/C.bam
    affected files:
        reads/110802_0249_AD0CM0ABXX_3_SA-PE-022.1.fq.gz
        reads/110627_0240_AC0254ABXX_2_SA-PE-001.2.fq.gz
        reads/110627_0240_AC0254ABXX_2_SA-PE-001.1.fq.gz
        reads/110802_0249_AD0CM0ABXX_3_SA-PE-022.2.fq.gz
        reads/110627_0240_AC0254ABXX_2_SA-PE-002.2.fq.gz
        reads/110627_0240_AC0254ABXX_2_SA-PE-002.1.fq.gz

If anyone has a way around it, I would be interested in.

ADD REPLY • link 6 months ago by Begonia_pavonina ▴ 150

2

Entering edit mode

You should review some example Snakefiles to understand how wildcards work in rules - the idea is that the wildcards match between input and output. Using your current table you would need to write a lookup lambda function for Snakemake to map the sample name (e.g. 1) to the reads (e.g. ../reads/110627_0240_AC0254ABXX_2_SA-PE-001.1.fq.gz). An easier solution would be for you to create a bam column in your table like 110627_0240_AC0254ABXX_2_SA-PE-001.bam so Snakemake could derive the read names from an implicit pattern.

ADD REPLY • link 6 months ago by Jeremy Leipzig 22k

0

Entering edit mode

Thank you Jeremy Leipzig, actually I have already spent some time trying to understand how the dna-seq-gatk-variant-calling workflow works, as it is using as well a .tsv file to use different samples and read names. https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/main/workflow/rules/common.smk

Unfortunately it is a bit too complex for me, and it is why I came to this forum for an answer. Now I know the best way is to use lambda function, and I will work on it, thanks to you!

A more general question would be why all Snakemake users do not want to do the same thing than me? Why would you keep the unreadable reads names on all you Snakemake workflow?

ADD REPLY • link 6 months ago by Begonia_pavonina ▴ 150

1

Entering edit mode

i would say bioinformatics has an uneasy relationship with filenames in general. the tendency is to name things logically but as a general rule you shouldn't put metadata in filenames. i think it's a good idea to use a lookup table that stores a variety of metadata, as you have started. it will just take some time to learn to use it in Snakemake.

ADD REPLY • link 6 months ago by Jeremy Leipzig 22k

0

Entering edit mode

I think the expand() function is not necessary in the input, but I am not sure how to manage the {read1} variable and read1=R1 without this function.

ADD REPLY • link 6 months ago by Begonia_pavonina ▴ 150

0

Entering edit mode

sample should remain a wildcard in the bwa_mem rule. For instance,

rule all:
    input: 
        expand("mapped/{sample}.bam", sample=SAMPLES)

rule bwa_mem:
    input:
        reads=expand("reads/{{sample}}_{reads}.fq.gz", reads=[1,2])
    output:
        "mapped/{sample}.bam"

ADD REPLY • link 6 months ago by Eric Lim ★ 2.1k

0

Entering edit mode

Thanks for the answer @ericlim. The purpose of using a .tsv file was to input reads names and output samples names that are differents (see the .tsv file example above). Is it not possible in Snakemake?

ADD REPLY • link 6 months ago by Begonia_pavonina ▴ 150

0

Entering edit mode

It is possible; it's just not a common use case as wildcards generally match between inputs and outputs. As Jeremy suggested below, you'd need a lambda/function to provide the mapping yourself.

ADD REPLY • link 6 months ago by Eric Lim ★ 2.1k