fastp
v0.9 is released (Nov 29, 2017), with a new feature added:
preprocess unique molecular identifier (UMI) enabled data, shift UMI to sequence name. See Tutorial: Use fastp to preprocess FASTQ data with unique molecular identifier (UMI) integrated
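To make "shift UMI to sequence name" concrete, here is a minimal Python sketch of the idea. It is not fastp's code: the function name `shift_umi_to_name`, the `:` separator, and the assumption that the UMI sits at the start of the read are all hypothetical choices made for illustration.

```python
def shift_umi_to_name(name, seq, qual, umi_len):
    """Move the first `umi_len` bases of a read into its name line.

    Assumption (not taken from fastp): the UMI is the read prefix and is
    appended to the read identifier with a ':' separator.
    """
    umi = seq[:umi_len]
    # The identifier is the first space-separated token of the name line;
    # anything after it (the comment field) is kept unchanged.
    parts = name.split(" ", 1)
    parts[0] = parts[0] + ":" + umi
    # Remove the UMI bases and their quality values from the read itself.
    return " ".join(parts), seq[umi_len:], qual[umi_len:]


record = shift_umi_to_name("@read1 1:N:0:ATCACG", "ACGTTTGGCC", "IIIIIIIIII", 4)
print(record)
```

Keeping the UMI in the name means downstream tools that preserve read names can still group reads by UMI after alignment.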
This project is at: https://github.com/OpenGene/fastp
A list of features implemented:
- filter out bad reads (too low quality, too short, or too many N...)
- trim all reads in front and tail
- cut low quality bases of each read at its 5' and 3' ends by evaluating the mean quality in a sliding window (like Trimmomatic but faster).
- cut adapters (adapters are detected automatically for both PE/SE data).
- correct mismatched base pairs in overlapped regions of paired end reads, when one base has high quality while the other has ultra low quality
- preprocess unique molecular identifier (UMI) enabled data, shift UMI to sequence name.
- report JSON format result for further interpreting.
- visualize quality control and filtering results on a single HTML page (like FASTQC but faster and more informative).
- split the output to multiple files (0001.R1.gz, 0002.R1.gz...) to support parallel processing.
- support long reads (data from PacBio / Nanopore devices).
- ...
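The sliding window cutting mentioned in the list can be sketched roughly as follows. This is a simplified Python illustration for the 3' end only, not fastp's implementation: the function names, the default window size of 4, the mean quality threshold of 20, and the Phred+33 encoding are assumptions.

```python
def phred_scores(qual, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(c) - offset for c in qual]


def cut_tail_by_window(seq, qual, window=4, min_mean=20.0):
    """Trim the 3' end of a read using a sliding window.

    The window starts at the tail and moves one base toward the front
    until its mean quality reaches `min_mean`; everything after that
    window is dropped. If no window passes (including reads shorter
    than `window`), an empty read is returned.
    """
    scores = phred_scores(qual)
    end = len(scores)
    while end >= window:
        if sum(scores[end - window:end]) / window >= min_mean:
            return seq[:end], qual[:end]
        end -= 1
    return "", ""


# 'I' decodes to 40 and '#' to 2 under Phred+33.
print(cut_tail_by_window("ACGTACGTAC", "IIIIII####"))
```

Cutting the 5' end works the same way with the window moving from the front toward the tail. Note that a window can pass while still containing a couple of poor bases, which is inherent to averaging.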
A list of features that will be implemented soon:
- Over representation analysis
- Sequencing analysis by lanes/tiles
- Pair merge
- ...
The initial evaluation has shown that fastp is about 10X faster than AfterQC, and also much faster than FASTQC, Trimmomatic, and Cutadapt, while providing most features from all of them, plus some novel and useful functions. I have deployed fastp in my clusters and my cloud system.
I am still calling for new requirements to make it more powerful and useful. If you have good ideas, please reply to this post or file an issue on the GitHub page.
Initial message for call of requirements:
Hi, I am a co-founder of a company owning 10 Illumina NovaSeq sequencers, so my cluster system has to process a very large amount of data per day.
Recently I found that FASTQ data preprocessing has become a bottleneck. Current tools (including tools I developed before) provide only part of the wanted functions (e.g. QC/filtering/adapter-cutting), and are usually too slow (written in scripting languages, no good threading).
So I’m planning to write a new all-in-one FASTQ preprocessing tool, which will implement most required functions, and must be written in C/C++ with good multithreading to provide competitive execution performance.
Now I am posting this thread to call for requirements for this tool. If you have any good suggestions, please comment here, or file an issue on the GitHub project ( https://github.com/OpenGene/fastp/issues/new ).
Your contributions are greatly appreciated.
To your point about multithreading, the operations you're proposing can be parallelized using unix paradigms of small programs that do one thing well and then:
filter | cutadapt | ec | qc | sort | compress > out
I think you'll find that most of these operations can become I/O bound very quickly well before you run out of CPU cores. Have you done some benchmarking of current tools that shows you are CPU bottlenecked?
This could potentially be integrated into bcl2fastq so the sequences can be processed before they ever hit a file.
Good suggestion. For NovaSeq output (S4 chip, 6Tb / run, 1.5 day), bcl2fastq is also very slow and needs to be accelerated.
In my experience, bcl2fastq parallelizes over samples per lane. Before writing something new, it might make sense to first see if your multiplexing strategy can be tweaked to speed things up a bit.
One NovaSeq flow chip has only 4 lanes, and you cannot feed different lanes with different libraries since there is only one library input. The reads of one sample are distributed in all these lanes, so your method doesn't work.
Ah, right, I forgot that it was like a NextSeq in that regard. That'd make it optimally multiplexed anyway.
You will soon be able to load individual lanes with different libraries, as long as you purchase the "NovaSeq Xp" upgrade (which involves some sort of a manifold device, as I understand). I assume an update to the control software would be included that will allow processing of individual lanes.
Much of this can be done with BBTools (e.g., trimming and reordering for higher compression). Those are already multithreaded and fairly performant (they're in Java). I imagine that your time would be better spent adding on to that.
With pigz (and BBMap) and an Isilon storage system capable of ~1.2 GB/s theoretical throughput, it is not uncommon to see 300-500 MB/s streams.
I have experience with BBTools, cool and useful! I like them.
However, if you want to do different processing together (i.e. filtering + trimming + reordering + QC), you have to run different tools separately. That will definitely take more time.
So the basic idea of fastp is to integrate all these tools in a single program with interfaces to flexibly configure the needed functions.
Another issue is QC. I prefer a JSON format QC result and an HTML report, since they can be easily integrated with web applications.
There was a similar post a few months ago: Tool: Collaboration on an empirical QC tool. You could try to contact the people who showed interest in that thread.
Don't forget that in general these tasks are embarrassingly parallel, as you can process each FASTQ file independently, and GNU Parallel can be of great help here.
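For anyone who would rather drive the per-file parallelism from a script than from GNU Parallel, a minimal Python sketch might look like the following. The helper names `build_command` and `run_all`, the `clean.` output prefix, and the default of 4 jobs are hypothetical; the `-i`/`-o` options are the input/output options of fastp, but check them against `fastp --help` for your version.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(fastq, out_dir):
    """Build one command line per input file (output naming is an assumption)."""
    fastq = Path(fastq)
    out = Path(out_dir) / ("clean." + fastq.name)
    return ["fastp", "-i", str(fastq), "-o", str(out)]


def run_all(fastqs, out_dir, jobs=4, runner=subprocess.run):
    """Run one external process per FASTQ file, at most `jobs` at a time.

    Threads are enough here because each worker only waits on an
    external process. `runner` is injectable so the scheduling can be
    tried out without the tool installed.
    """
    commands = [build_command(f, out_dir) for f in fastqs]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() waits for every job and re-raises the first failure.
        return list(pool.map(lambda cmd: runner(cmd, check=True), commands))


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_all(["a.fq.gz", "b.fq.gz"], "out", runner=lambda cmd, check: print(" ".join(cmd)))
```

As noted earlier in the thread, whether this scales depends on storage throughput: once the disks are saturated, adding more jobs will not help.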