Merging raw Illumina FASTQ files with snakemake
0
5
Entering edit mode
6.8 years ago
Jokhe ▴ 140

Hi,

I have sequenced DNA samples with Illumina NextSeq which produces 8 FASTQ files per sample. As a part of my snakemake pipeline I would like to merge these files into two larger FASTQ files (forward/reverse) and continue working with these files. I have tried to figure out how this could be done with 'snakemake', so far with no success. My starting point and goal is similar to following example:

#Input files for tumor sample:
./patientAtumor-131216-42530784/patientAtumor_S1_L001_R1_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L001_R2_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L002_R1_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L002_R2_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L003_R1_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L003_R2_001.fastq.gz
./patienttAumor-131216-42530784/patientAtumor_S1_L004_R1_001.fastq.gz
./patientAtumor-131216-42530784/patientAtumor_S1_L004_R2_001.fastq.gz

# Input files for normal (reference) sample
./patientAnormal-131216-42520895/patientAnormal_S2_L001_R1_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L001_R2_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L002_R1_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L002_R2_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L003_R1_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L003_R2_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L004_R1_001.fastq.gz
./patientAnormal-131216-42520895/patientAnormal_S2_L004_R2_001.fastq.gz

 #Desired output would be:
./patientA/patientAtumor_R1.fastq.gz
./patientA/patientAtumor_R2.fastq.gz
./patientA/patientAnormal_R1.fastq.gz
./patientA/patientAnormal_R2.fastq.gz

# Original files can be deleted after merging is complete.

I have tried to construct two 'snakemake' rules that could complete the job but I just can't figure out how I should choose the wildcards. Especially difficult parts in input files are directory names (e.g. -131216-42520895 produced by Illumina BaseSpace) and sample ID in the FASTQ name (e.g. S1 and S2 in examples).

DIRECTORY, SAMPLES, IDS = glob_wildcards("{directory}/{sample}_S{id}_L001_R1_001.fastq.gz")
DIRECTORY, PATIENTS, IDS = glob_wildcards("{directory}/{patient}normal_S{id}_L001_R1_001.fastq.gz"

rule all:
input: expand('{patient}/{sample}_R1.fastq.gz', patient=PATIENTS, sample=SAMPLES),
       expand('{patient}/{sample}_R2.fastq.gz', patient=PATIENTS, sample=SAMPLES)

rule mergeFastqR1:
input: ?
output: {patient}/{sample}_R1.fastq.gz
shell: 'cat {input} > {output}'

rule mergeFastqR2:
input: ?
output: {patient}/{sample}_R1.fastq.gz
shell: 'cat {input} > {output}'

The number of wildcards is unequal in different parts of rules which causes problems when running the pipeline. I think that the use of wildcards is the most difficult thing in 'snakemake' especially if you don't have strong background in Python. I can perform the flollowing tas with bash but I would be more than grateful if someone could suggest how this job could be done properly with snakemake. If anyone has time to help and figure out some sort of solution, I own you my gratitude.

Thank you in advance!

snakemake illumina paired-end reads • 6.3k views
ADD COMMENT
3
Entering edit mode

Try not listing an input, just an output and then use cat {wildcard.sample}_S*_R1.fastq.gz > {output} in the shell.

N.B., bash will expand globs (*) alphabetically. If you ever do something like this outside of bash, then double check that that's being done!

ADD REPLY
0
Entering edit mode

Thanks Devon, it works like a charm! It was so easy solution that I didn't even think about it.

ADD REPLY
0
Entering edit mode

Could you be more specific ? What should we put as input then ? How will the mergeFastqR1 rule looks like ? thanks

ADD REPLY
3
Entering edit mode
rule mergeFastqR1:
    output: "{patient}/{patient}{diseaseState}_R1.fastq.gz"
    shell: """
        mkdir -p {wildcards.patient}
        cat {wildcards.patient}{wildcards.diseaseState}*/*_R1_001.fastq.gz > {output}
        """

Or something along those lines

ADD REPLY
0
Entering edit mode

Except when you have a samples with _S in the name - it will merge them. e.g. if you have samples X and X_S - them samples will end up being combined.

ADD REPLY
0
Entering edit mode

Sure, but since such a case wasn't present there was no reason to include it in the solution.

ADD REPLY

Login before adding your answer.

Traffic: 1483 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6