Getting amount of removed reads after Trimmomatic
7.5 years ago

Hi,

I launched Trimmomatic on 300 fastq.gz files to filter low quality reads and now I would like to know the number of sequences that have been removed in each file.

I have fastqc results for each file so I could parse every fastqc html to get this information. I just wondered if there was an easier way to do it.

Thank you,

trimmomatic fastq • 3.1k views
7.5 years ago
Chris Fields ★ 2.2k

This is normally reported in Trimmomatic's console output (standard error, as far as I know), which I redirect to a log file. A particularly thorny example (see the second-to-last line):

TrimmomaticPE: Started with arguments: -threads 8 -phred33 N5_AGTTCCGT_L008_R1_001.fastq.gz N5_AGTTCCGT_L008_R2_001.fastq.gz N5_AGTTCCGT_L008_R1.paired.fastq.gz N5_AGTTCCGT_L008_R1.unpaired.fastq.gz N5_AGTTCCGT_L008_R2.paired.fastq.gz N5_AGTTCCGT_L008_R2.unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:15:10 CROP:98 HEADCROP:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15 MINLEN:30
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT'
Using Long Clipping Sequence: 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 25510326 Both Surviving: 23679651 (92.82%) Forward Only Surviving: 1373354 (5.38%) Reverse Only Surviving: 332039 (1.30%) Dropped: 125282 (0.49%)
TrimmomaticPE: Completed successfully
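
If the logs were kept, the summary line above can be turned into a per-sample table with a few lines of Python. This is only a minimal sketch: it assumes the paired-end summary format shown in this log, and the names `parse_pe_summary` and `summary_row`, as well as the sample label, are made up for illustration. Note that "Dropped" counts pairs where both reads were removed; pairs where only one read survived are reported separately.

```python
import re

# Matches the paired-end summary line shown in the log above.
# Assumption: other Trimmomatic versions or single-end mode may word this line differently.
PE_SUMMARY = re.compile(
    r"Input Read Pairs: (?P<input>\d+) "
    r"Both Surviving: (?P<both>\d+) \([\d.,]+%\) "
    r"Forward Only Surviving: (?P<forward>\d+) \([\d.,]+%\) "
    r"Reverse Only Surviving: (?P<reverse>\d+) \([\d.,]+%\) "
    r"Dropped: (?P<dropped>\d+) \([\d.,]+%\)"
)


def parse_pe_summary(log_text):
    """Return the read-pair counts from a log as a dict of ints, or None if absent."""
    match = PE_SUMMARY.search(log_text)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def summary_row(sample, counts):
    """Format one tab-separated line: sample, input pairs, dropped pairs, percent dropped."""
    pct = 100.0 * counts["dropped"] / counts["input"] if counts["input"] else 0.0
    return f"{sample}\t{counts['input']}\t{counts['dropped']}\t{pct:.2f}"


# The summary line from the log above; the sample label is a hypothetical name.
example = (
    "Input Read Pairs: 25510326 Both Surviving: 23679651 (92.82%) "
    "Forward Only Surviving: 1373354 (5.38%) Reverse Only Surviving: 332039 (1.30%) "
    "Dropped: 125282 (0.49%)"
)
counts = parse_pe_summary(example)
print(summary_row("N5_AGTTCCGT_L008", counts))
```

Looping this over one log file per sample and writing the rows to a text file would give the sample-by-sample summary.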

I should also add, MultiQC is a great tool to capture the log reports for multiple samples and collate them into a single report with nice summary graphics.


Thank you for the answer,

I forgot to record the standard output, that is why I wondered if there was a way to get the information afterward.

I used MultiQC to have a summary of the fastqc reports. But now I am more interested in a simple text file with the name of the sample and the number/percentage of removed sequences.


Run FastQC on both the raw and trimmed data (I typically do this anyway to make sure the trimming addressed any identified issues, unless I skip trimming altogether), then run MultiQC either once on all the data, retaining the parent directory information (there is a switch for this), or separately on each set. In either case, MultiQC generates a tab-delimited text file, so you can pull the raw and trimmed FastQC results into R and derive the number of removed reads from that.
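
The same subtraction can be sketched in Python instead of R. This is a rough illustration, not a tested pipeline: it assumes MultiQC was run separately on the raw and trimmed FastQC results, that each tab-delimited table has `Sample` and `Total Sequences` columns, and that sample names match between the two tables (in practice the trimmed file names usually need to be normalized first). The tables below are made-up stand-ins, and the function names are hypothetical.

```python
import csv
import io


def total_sequences(table):
    """Map sample name -> read count from an open tab-delimited summary table."""
    reader = csv.DictReader(table, delimiter="\t")
    # Counts may be written as floats (e.g. "1000.0"), so convert via float first.
    return {row["Sample"]: int(float(row["Total Sequences"])) for row in reader}


def removed_reads(raw, trimmed):
    """Yield (sample, raw count, removed count, percent removed) for shared samples."""
    for sample in sorted(raw.keys() & trimmed.keys()):
        removed = raw[sample] - trimmed[sample]
        pct = 100.0 * removed / raw[sample] if raw[sample] else 0.0
        yield sample, raw[sample], removed, round(pct, 2)


# Tiny made-up tables standing in for the raw and trimmed MultiQC outputs.
raw_table = io.StringIO("Sample\tTotal Sequences\nsampleA\t1000.0\nsampleB\t2000.0\n")
trimmed_table = io.StringIO("Sample\tTotal Sequences\nsampleA\t950.0\nsampleB\t1900.0\n")

rows = list(removed_reads(total_sequences(raw_table), total_sequences(trimmed_table)))
for sample, raw_count, removed, pct in rows:
    print(f"{sample}\t{raw_count}\t{removed}\t{pct}")
```

With real data, the two `io.StringIO` objects would be replaced by the opened MultiQC text files.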


It worked, thank you.
