Question: How to speed up trimming using Trim Galore

I am currently trying to follow a scRNA-seq pipeline, but I have run into a problem. I am using Trim Galore to trim paired-end reads, cutting bases with a quality score below 20. For one particular sample it has been running for over 12 hours and has only finished 4 of the 19 read pairs. Is there a reason for this, or any way to speed up the process? The fastq files are really large for these samples: some are over 5 million KB (roughly 5 GB), and there are 19 paired read files. Other samples worked fine before. Other tools do not work properly; only Trim Galore works well.

 for i in *_1.fastq.gz; do
     # Pair each *_1 file with its *_2 mate and trim at Phred 20
     trim_galore --paired \
                 -q 20 \
                 -o trimmed "$i" "${i%_1.fastq.gz}_2.fastq.gz"
 done
13 months ago lamia_203 • 30 • updated 13 months ago mbk0asis • 420

> other tools do not work properly, only trim galore works well

I find that hard to believe. There are multithreaded trimming tools (e.g. in the BBMap suite) that will run as fast as your disk I/O and the number of available cores allow. That said, if your computer is I/O bound (e.g. you are using a regular spinning disk), then things may already be at their limit.
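A rough way to check both limits before changing tools is sketched below. It only reports the core count and times a plain read of one input file. FASTQ is a hypothetical variable to point at one of your own files, and a repeated run may be served from the page cache and look faster than the disk really is.

```shell
#!/usr/bin/env bash
# FASTQ is a hypothetical variable; set it to one of your own input files.
FASTQ=${FASTQ:-}

# Number of cores available to this shell (nproc is part of GNU coreutils;
# getconf is the fallback on systems without it).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "cores available: $cores"

# Time a plain sequential read of one input file. If this alone takes a large
# share of the trimming time, extra threads or jobs will not help much.
if [ -n "$FASTQ" ] && [ -r "$FASTQ" ]; then
    time cat "$FASTQ" > /dev/null
fi
```

Run it as, for example, FASTQ=sample_1.fastq.gz bash check.sh, where both names are placeholders.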

In addition to the solution below, you can also look into using GNU parallel: "Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them"

13 months ago

Use GNU Parallel!

Your for loop processes one sample at a time: it waits until one sample is finished before starting the next.

Use 'parallel' to process multiple samples simultaneously!

If your read files are named following the pattern <sample>.R1.fq.gz and <sample>.R2.fq.gz, one pair per sample, and so on,

try something like the command below:

ls -1 *.fq.gz | cut -d. -f1 | sort | uniq | parallel -j 10 'trim_galore --paired {}.R1.fq.gz {}.R2.fq.gz'

Note: change the -j parameter to adjust the number of jobs you want to run simultaneously.
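Before launching the real jobs, it can help to see which sample names the ls | cut | sort | uniq part produces. The sketch below is a dry run in a scratch directory with empty files whose names (sampleA, sampleB) are made up for illustration. It prints the commands instead of running them, so neither parallel nor trim_galore is needed.

```shell
#!/usr/bin/env bash
# Dry run in a scratch directory with empty, hypothetical files.
workdir=$(mktemp -d)
touch "$workdir"/sampleA.R1.fq.gz "$workdir"/sampleA.R2.fq.gz \
      "$workdir"/sampleB.R1.fq.gz "$workdir"/sampleB.R2.fq.gz

# Same prefix extraction as in the pipeline: keep the text before the first
# dot and drop duplicates, so each R1/R2 pair yields one sample name.
samples=$(cd "$workdir" && ls -1 *.fq.gz | cut -d. -f1 | sort | uniq)

# Print the command each parallel job would run instead of running it.
for s in $samples; do
    echo "trim_galore --paired $s.R1.fq.gz $s.R2.fq.gz"
done

rm -rf "$workdir"
```

Because cut keeps only the text before the first dot, this assumes sample names contain no dots themselves. GNU parallel also has a --dry-run option that prints the commands it would execute.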
13 months ago mbk0asis • 420


parallel --plus 'trim_galore --paired {...}.R1.fq.gz {...}.R2.fq.gz' ::: *R1.fq.gz
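With --plus, {...} stands for the input with three extensions removed. The sketch below imitates that with plain bash parameter expansion on a made-up file name, just to show which prefix ends up in the command. It does not call parallel itself.

```shell
#!/usr/bin/env bash
# Emulate what {...} expands to: the input with three extensions removed.
# The file name is hypothetical.
f=sampleA.R1.fq.gz
prefix=${f%.*.*.*}    # strips the shortest suffix with three dots: ".R1.fq.gz"

# The command parallel would build for this input file.
echo "trim_galore --paired $prefix.R1.fq.gz $prefix.R2.fq.gz"
```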
13 months ago ♦ 3.4k

Much better! Thanks~

13 months ago • 420

Note that if you use GNU parallel, its citation notice asks you to cite it in your publications, which is why my answer suggested xargs -P instead for a simple use case like this (xargs is also a standard tool on Linux distributions, which helps if you intend to redistribute your code).

However, I'll say one advantage of parallel is that you can distribute your load across different machines using -S with a comma-separated list of hosts, though that's a little more involved.

13 months ago • 830

Depending on your machine's memory and the number of CPUs available, you could try to parallelize it. The for loop you have processes the samples one at a time. Check current usage with top, and see how many jobs you can run in parallel. If you estimate that you can run 10 jobs at the same time, you could do something like:

ls *_1.fastq.gz | xargs -P10 -I@ bash -c 'trim_galore -q 20 --paired -o trimmed "$1" "${1%_1.fastq.gz}_2.fastq.gz"' _ @

Otherwise, see if there's any parameter you can use in your Trim Galore call (e.g. --dont_gzip might be faster, and you can gzip your files afterwards).

More detail about the command: instead of looping, it pipes the list of files to xargs, which runs up to 10 processes in parallel (-P10) and substitutes the @ in the following command with each input file. To keep a shell substitution similar to the one you had, it calls a subshell (bash -c), where _ is just a placeholder argument that gets assigned to $0 (the process name), so your input fastq.gz file becomes $1.
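To see how $0 and $1 get filled in, here is a small dry run of the same xargs pattern. The two file names are made up, and the subshell only echoes the command it would run, so trim_galore is not needed. It uses 2 jobs instead of 10 just to keep the example small.

```shell
#!/usr/bin/env bash
# Dry run: feed two hypothetical file names to xargs and print, rather than
# run, the command each bash -c subshell would execute.
out=$(printf '%s\n' sampleA_1.fastq.gz sampleB_1.fastq.gz |
    xargs -P2 -I@ bash -c \
        'echo "\$0=$0 trim_galore -q 20 --paired -o trimmed $1 ${1%_1.fastq.gz}_2.fastq.gz"' _ @ |
    sort)

# One line per input file; sorted because parallel jobs may finish in any order.
echo "$out"
```

Each printed line starts with $0=_, which shows that the underscore took the $0 slot, followed by the trim_galore call with the _1 file and its derived _2 mate.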

13 months ago manuel.belmadani • 830
