Question: Trim_galore on multiple fastq.gz files: syntax problem

I was reading the following post about how to run trim_galore on multiple paired-end fastq.gz files: "TrimGalore! on multiple paired fastq files".

I installed GNU parallel and intended to use the same command suggested by eldronzhou:

find  path_to_fastq  -name "*_R1_merged.fastq.gz" | cut -d "_" -f1 | parallel -j 1 trim_galore --illumina --paired --fastqc -o trim_galore/ {}\_R1_merged.fastq.gz {}\_R2_merged.fastq.gz

However, my files are named slightly differently: XXX_XX_L008_R1_001.fastq.gz and XXX_XX_L008_R2_001.fastq.gz. Therefore I changed the command above to the following:

find  path_to_fastq  -name "*_R1_001.fastq.gz" | cut -d "R" -f1 | parallel -j 1 trim_galore --paired --fastqc -o trim_galore/ {}R1_001.fastq.gz {}R2_001.fastq.gz
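To see what the cut -d "R" -f1 step hands on to parallel, it can help to run it on a single file name. The sketch below uses one name from this thread plus a made-up name (RNA1_..., an assumption for illustration only). It suggests that splitting on the first "R" is fragile, and that stripping the known suffix is a sturdier way to get the same prefix.

```shell
#!/bin/sh
# One R1 file name as it appears in this thread.
r1="../../fastq/E970085_CTCAAGC_L008_R1_001.fastq.gz"

# cut keeps everything before the first "R" on the line.
prefix=$(printf '%s\n' "$r1" | cut -d "R" -f1)
echo "${prefix}R1_001.fastq.gz ${prefix}R2_001.fastq.gz"

# Hypothetical sample name with an earlier "R": the prefix is cut short.
risky="../../fastq/RNA1_L008_R1_001.fastq.gz"
short=$(printf '%s\n' "$risky" | cut -d "R" -f1)
echo "cut prefix:   $short"

# Stripping the known suffix with parameter expansion avoids that problem.
safe=${risky%R1_001.fastq.gz}
echo "suffix strip: $safe"
```

For the file names shown in this thread both approaches give the same prefix, so this is a robustness note rather than the cause of the error.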

However, I get the following error (showing up once for each pair of *fastq.gz files):

Please provide an even number of input files for paired-end FastQ trimming! Aborting ...

I'm guessing something is wrong in my syntax and somehow the order of the files I provide is wrong - maybe R1 and R2 are not given in the right pairs somehow? My *fastq.gz files are in a separate folder and I have 20 files (so 10 pairs).

I cannot work out what is wrong, any help would be deeply appreciated.

UPDATE

After running the following --dry-run command, I get the following output:

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970085_CTCAAGC_L008_R1_001.fastq.gz ../../fastq/E970085_CTCAAGC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970096_CGAAGGT_L008_R1_001.fastq.gz ../../fastq/E970096_CGAAGGT_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970084_CCTTGTC_L008_R1_001.fastq.gz ../../fastq/E970084_CCTTGTC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970090_TCAGAAG_L008_R1_001.fastq.gz ../../fastq/E970090_TCAGAAG_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970092_ACAGTAC_L008_R1_001.fastq.gz ../../fastq/E970092_ACAGTAC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970038_CTGGTTG_L008_R1_001.fastq.gz ../../fastq/E970038_CTGGTTG_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970094_AGGACTG_L008_R1_001.fastq.gz ../../fastq/E970094_AGGACTG_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970073_CTAGGTC_L008_R1_001.fastq.gz ../../fastq/E970073_CTAGGTC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970088_GGATCAT_L008_R1_001.fastq.gz ../../fastq/E970088_GGATCAT_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970095_CAGTCAT_L008_R1_001.fastq.gz ../../fastq/E970095_CAGTCAT_L008_R2_001.fastq.gz

I still can't see an obvious mistake...

UPDATE2

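I think I may have found the mistake. When the command goes through GNU parallel, the quotes around "--outdir ..." are lost (see the commands just above). Without them, trim_galore presumably takes "--outdir" alone as the value of --fastqc_args and treats the directory path as a third input file, which would explain the complaint about an even number of files. Escaping the quotes keeps them in the generated command, so I think the answer is: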

find ../../fastq/ -name "*R1_001.fastq.gz" | cut -d "R" -f1 | parallel --dry-run -j 1 trim_galore --paired --fastqc_args \"--outdir /home/user/my_projects/project1/data/qc/trim-galore\" {}R1_001.fastq.gz {}R2_001.fastq.gz

# Output command I want (one example only)
trim_galore --paired --fastqc_args "--outdir /home/user/my_projects/project1/data/qc/trim-galore" ../../fastq/E970085_CTCAAGC_L008_R1_001.fastq.gz ../../fastq/E970085_CTCAAGC_L008_R2_001.fastq.gz
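The quoting effect can be reproduced without trim_galore or parallel installed. The sketch below is only a simulation: count_args and reparse are hypothetical helpers, /tmp/qc and the s_* file names are made up, and reparse mimics a wrapper that joins its arguments with spaces and lets a shell parse the string again. That is assumed here to be close to what happens when a command line is built from a template.

```shell
#!/bin/sh
# Hypothetical stand-in for trim_galore: reports how many arguments it received.
count_args() { echo "$#"; }

# Simulated wrapper: join the arguments with spaces, then parse the string again.
reparse() { eval "count_args $*"; }

# Direct call: the quotes keep "--outdir /tmp/qc" together as one argument.
direct=$(count_args --fastqc_args "--outdir /tmp/qc" s_R1_001.fastq.gz s_R2_001.fastq.gz)

# Through the wrapper, plain quotes are consumed before the second parse,
# so the option string splits into two words.
lost=$(reparse --fastqc_args "--outdir /tmp/qc" s_R1_001.fastq.gz s_R2_001.fastq.gz)

# Escaped quotes survive the first parse and group the words in the second.
kept=$(reparse --fastqc_args \"--outdir /tmp/qc\" s_R1_001.fastq.gz s_R2_001.fastq.gz)

echo "direct=$direct lost=$lost kept=$kept"
```

The extra argument in the "lost" case is the output directory on its own, which is consistent with a tool seeing three input files instead of two.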
Posted 17 months ago by m93

Did you count the number of files in the path_to_fastq directory? Does the folder contain matching R1 and R2 files?

Reply 17 months ago by cpad0112

Yes, there are 20, so they are in 10 pairs. The folder contains a .txt and a .sha1 file but surely, given my command above, that should not be a problem?

Reply 17 months ago by m93

The syntax seems fine to me after checking it with a few dummy files. Add --dry-run immediately after the parallel command and check the commands it prints.

Reply 17 months ago by cpad0112

This is so bizarre. The --dry-run output shows that my files are in the right pairs!

Reply 17 months ago by m93

You would have completed the trim runs by now if you had run them serially :-)

Reply 17 months ago by genomax

Well, true, haha, but I intend to run this on over 100 samples eventually, so I need to know how to do this. I seriously cannot understand what I'm doing wrong.

Reply 17 months ago by m93

@ole.tange is the developer of GNU parallel, so the answer below should work.

Reply 17 months ago by genomax

Why do you need find path_to_fastq -name "*_R1_001.fastq.gz"? A simple ls -1 *_R1_001.fastq.gz should do. Make sure ls -1 *_R1_001.fastq.gz | wc -l returns the same count as ls -1 *_R2_001.fastq.gz | wc -l.
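Equal counts do not guarantee that every R1 file has its own R2 mate, so a per-file check can be a useful complement. The sketch below runs on dummy files in a temporary directory; the sample names s1 and s2 are invented, and s2 deliberately lacks an R2 file.

```shell
#!/bin/sh
# Throwaway directory with dummy files (names are made up for the demo).
dir=$(mktemp -d)
touch "$dir/s1_L008_R1_001.fastq.gz" "$dir/s1_L008_R2_001.fastq.gz" \
      "$dir/s2_L008_R1_001.fastq.gz"

missing=0
for r1 in "$dir"/*_R1_001.fastq.gz; do
  # Derive the expected mate name by swapping _R1_ for _R2_.
  r2=$(printf '%s\n' "$r1" | sed 's/_R1_/_R2_/')
  if [ ! -e "$r2" ]; then
    echo "no R2 mate for: ${r1##*/}"
    missing=$((missing + 1))
  fi
done
echo "R1 files without a mate: $missing"
rm -rf "$dir"
```

On real data, pointing dir at the fastq folder (and dropping the mktemp, touch and rm lines) should report zero missing mates before any trimming is attempted.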

Reply 17 months ago by genomax

There are definitely the right number of files when I do those checks. I think it's something to do with my find command, which is not listing the R1 files in the same order as in the folder. I tried replacing find with ls -1 but I have the same problem. This is so confusing.

Reply 17 months ago by m93

In general, for most tools, output directories and output files are never quoted. I am not sure what trim_galore requires. I think you should first run the program with a bare-bones command, e.g. remove --fastqc_args from the command above.

Reply 17 months ago by cpad0112

This:

find  path_to_fastq  -name "*_R1_001.fastq.gz" |
  parallel -j 1 trim_galore --paired --fastqc -o trim_galore/ {} {= s/_R1_/_R2_/ =}

or:

find  path_to_fastq  -name "*_R1_001.fastq.gz" |
  parallel --plus -j 1 trim_galore --paired --fastqc -o trim_galore/ {} {/_R1_/_R2_}

will run:

trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abaci_R1_001.fastq.gz path_to_fastq/abaci_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/aardvarks_R1_001.fastq.gz path_to_fastq/aardvarks_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/a_R1_001.fastq.gz path_to_fastq/a_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/aardvark_R1_001.fastq.gz path_to_fastq/aardvark_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abalones_R1_001.fastq.gz path_to_fastq/abalones_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abaft_R1_001.fastq.gz path_to_fastq/abaft_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abacus_R1_001.fastq.gz path_to_fastq/abacus_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abacuses_R1_001.fastq.gz path_to_fastq/abacuses_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/aback_R1_001.fastq.gz path_to_fastq/aback_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abalone_R1_001.fastq.gz path_to_fastq/abalone_R2_001.fastq.gz

If that does not work out of the box, try running each command by hand one at a time.
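For running the commands by hand, a plain loop can print the same command lines without involving parallel at all. This is a minimal sketch on dummy files (the sample names a and b are invented); sed 's/_R1_/_R2_/' plays the role of the {= s/_R1_/_R2_/ =} replacement string above, and nothing is executed, only printed.

```shell
#!/bin/sh
# Dummy inputs in a temporary directory (sample names are invented).
dir=$(mktemp -d)
touch "$dir/a_R1_001.fastq.gz" "$dir/a_R2_001.fastq.gz" \
      "$dir/b_R1_001.fastq.gz" "$dir/b_R2_001.fastq.gz"

# Build one command line per R1 file; sed swaps _R1_ for _R2_ to name the mate.
cmds=$(find "$dir" -name "*_R1_001.fastq.gz" | sort | while IFS= read -r r1; do
  r2=$(printf '%s\n' "$r1" | sed 's/_R1_/_R2_/')
  echo "trim_galore --paired --fastqc -o trim_galore/ $r1 $r2"
done)

echo "$cmds"
n_cmds=$(printf '%s\n' "$cmds" | wc -l | tr -d ' ')
rm -rf "$dir"
```

Each printed line can be pasted into the terminal one at a time, which makes it easier to see which pair, if any, triggers an error.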

Posted 17 months ago by ole.tange

Is it necessary to have the find line here? Can we not use:

parallel --plus -j 1 trim_galore --paired --fastqc -o trim_galore/ {} {/_R1_/_R2_} ::: path_to_fastq/*_R1_001.fastq.gz

Reply 17 months ago by cpad0112

find is used because OP used find.

Your solution will often give the same result, but will fail if the files are in subdirs inside path_to_fastq or if there are so many that they do not fit on a single command line.
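The subdirectory case can be checked on dummy files. In the sketch below the directory layout (a run2 subfolder) and the sample names are assumptions made up for the demo; it compares how many R1 files a glob sees with how many find sees.

```shell
#!/bin/sh
# Temporary layout: one R1 file at the top level, one inside a subdirectory.
dir=$(mktemp -d)
mkdir "$dir/run2"
touch "$dir/a_R1_001.fastq.gz" "$dir/run2/b_R1_001.fastq.gz"

# A glob only matches files directly inside the directory.
glob_count=$(ls -1 "$dir"/*_R1_001.fastq.gz | wc -l | tr -d ' ')

# find descends into subdirectories as well.
find_count=$(find "$dir" -name "*_R1_001.fastq.gz" | wc -l | tr -d ' ')

echo "glob: $glob_count  find: $find_count"
rm -rf "$dir"
```

With a flat folder of 20 files, as in the original question, both approaches should list the same inputs.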

Reply 17 months ago by ole.tange
