Getting the number of reads removed by Trimmomatic
3.8 years ago
United Kingdom

Hi,

I ran Trimmomatic on 300 fastq.gz files to filter out low-quality reads, and now I would like to know how many sequences were removed from each file.

I have FastQC results for each file, so I could parse every FastQC HTML report to get this information. I just wondered whether there is an easier way to do it.

Thank you,

trimmomatic fastq • 1.4k views
15 months ago
Chris Fields ♦ 2.1k
University of Illinois Urbana-Champaign

This is normally reported in Trimmomatic's console output (written to standard error, I believe, rather than standard output), which I redirect to a log file. A particularly thorny example (see the second-to-last line):

TrimmomaticPE: Started with arguments: -threads 8 -phred33 N5_AGTTCCGT_L008_R1_001.fastq.gz N5_AGTTCCGT_L008_R2_001.fastq.gz N5_AGTTCCGT_L008_R1.paired.fastq.gz N5_AGTTCCGT_L008_R1.unpaired.fastq.gz N5_AGTTCCGT_L008_R2.paired.fastq.gz N5_AGTTCCGT_L008_R2.unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:15:10 CROP:98 HEADCROP:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15 MINLEN:30
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 25510326 Both Surviving: 23679651 (92.82%) Forward Only Surviving: 1373354 (5.38%) Reverse Only Surviving: 332039 (1.30%) Dropped: 125282 (0.49%)
TrimmomaticPE: Completed successfully
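
If the logs were saved, the summary line in the log above can be pulled out with a short script. The following is only a minimal sketch: the regular expression is written against the paired-end line exactly as it appears in this example, and the function name `parse_pe_summary` is made up for illustration. Single-end runs print a differently worded line and would need their own pattern.

```python
import re

# Pattern matching the paired-end summary line as it appears in the log above.
# Assumption: other Trimmomatic versions print the line in the same layout.
PE_SUMMARY = re.compile(
    r"Input Read Pairs: (?P<input>\d+) "
    r"Both Surviving: (?P<both>\d+) \(.*?\) "
    r"Forward Only Surviving: (?P<forward_only>\d+) \(.*?\) "
    r"Reverse Only Surviving: (?P<reverse_only>\d+) \(.*?\) "
    r"Dropped: (?P<dropped>\d+)"
)


def parse_pe_summary(log_text):
    """Return the read-pair counts from a TrimmomaticPE log, or None if absent."""
    match = PE_SUMMARY.search(log_text)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


# The summary line copied from the example log.
example = (
    "Input Read Pairs: 25510326 Both Surviving: 23679651 (92.82%) "
    "Forward Only Surviving: 1373354 (5.38%) "
    "Reverse Only Surviving: 332039 (1.30%) Dropped: 125282 (0.49%)"
)

counts = parse_pe_summary(example)
pct_dropped = 100 * counts["dropped"] / counts["input"]
print(counts["dropped"], round(pct_dropped, 2))
```

Note that "Dropped" counts pairs in which neither read survived; pairs where only one mate survived are reported separately, so which number counts as "removed" depends on whether the unpaired reads are used downstream.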

I should also add, MultiQC is a great tool to capture the log reports for multiple samples and collate them into a single report with nice summary graphics.


Thank you for the answer,

I forgot to record the standard output, that is why I wondered if there was a way to get the information afterward.

I used MultiQC to get a summary of the FastQC reports. But now I am more interested in a simple text file with the name of each sample and the number/percentage of removed sequences.
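
When the log was not kept, one way to rebuild such a text file is to count the records in the raw and trimmed files directly. This is a rough sketch rather than a tested pipeline: it assumes standard four-line FASTQ records and gzip-compressed files, and the function names and the column order of the output row are invented for illustration. For paired-end output, the surviving reads for one mate are spread over the paired and unpaired files, so those counts would need to be added together first.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def removed_reads_row(sample, raw_path, trimmed_path):
    """Build one tab-separated row: sample, raw, surviving, removed, % removed."""
    raw = count_fastq_reads(raw_path)
    kept = count_fastq_reads(trimmed_path)
    removed = raw - kept
    # Guard against empty input files to avoid dividing by zero.
    pct = 100 * removed / raw if raw else 0.0
    return f"{sample}\t{raw}\t{kept}\t{removed}\t{pct:.2f}"
```

Calling `removed_reads_row` once per sample and writing the returned strings to a file, one per line, gives the kind of table described above. Decompressing 300 files takes a while, but it needs nothing beyond the files already on disk.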


Run FastQC on both the raw and trimmed data (I typically do this to make sure the trimming addressed any identified issues, unless I skip trimming altogether). Then either run MultiQC once on all the data, retaining the parent directory information (there is a switch for this), or run it separately on the raw and trimmed results. In either case, MultiQC generates a tab-delimited text file, so you could then pull the raw and trimmed FastQC results into R and derive the number of removed reads from that.
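
The same comparison can be done outside R. Below is a hedged Python sketch for the case where MultiQC was run separately on the raw and trimmed FastQC results. It assumes each run produced a tab-delimited table with a `Sample` column and a `Total Sequences` column (in the versions I have seen this is `multiqc_data/multiqc_fastqc.txt`, but check the file and header names in your own output). The function names and the `sample_key` parameter are hypothetical; `sample_key` exists only to map raw and trimmed sample names onto a common key.

```python
import csv


def read_total_sequences(table_path, sample_key=lambda name: name):
    """Map sample name -> total reads from a tab-delimited FastQC summary table.

    Assumes columns named 'Sample' and 'Total Sequences'.
    """
    totals = {}
    with open(table_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Counts may be written as floats (e.g. '1000.0'), so go via float.
            totals[sample_key(row["Sample"])] = int(float(row["Total Sequences"]))
    return totals


def removed_per_sample(raw_table, trimmed_table, sample_key=lambda name: name):
    """Return (sample, removed, percent removed) for samples found in both tables."""
    raw = read_total_sequences(raw_table, sample_key)
    trimmed = read_total_sequences(trimmed_table, sample_key)
    rows = []
    for sample in sorted(raw.keys() & trimmed.keys()):
        removed = raw[sample] - trimmed[sample]
        pct = round(100 * removed / raw[sample], 2) if raw[sample] else 0.0
        rows.append((sample, removed, pct))
    return rows
```

For example, if the trimmed files carry a `.paired` suffix in their sample names, passing `sample_key=lambda n: n.replace(".paired", "")` lines them up with the raw names. Samples that appear in only one table are silently skipped, which is worth checking for when working with 300 files.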


It worked, thank you.
