Hello, I have noticed that some reads in my paired-end dataset have low-quality bases distributed along their length. I think these reads account for the outliers in my per-base quality box-and-whisker plot. Do you think it is a good idea to try to remove them?
I have tried Trimmomatic with SLIDINGWINDOW:4:20 and they get discarded. However, I'm not sure why that happens for some reads, because when I do the calculations manually, it seems they should not be discarded. Instead of a sliding-window average, maybe I could use fastp with a threshold on the percentage of bases whose quality falls below 20. What is your opinion?
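To make the two criteria concrete, here is a minimal Python sketch that applies both to a single read. It is a simplified illustration, not Trimmomatic's or fastp's actual code: the function names and the example quality values are hypothetical, and the cut position chosen here (start of the first failing window) may differ from where Trimmomatic really clips. The point it shows is that a sliding window reacts to a short local dip even when the read's overall mean quality is high. If a MINLEN step follows SLIDINGWINDOW (an assumption, the post does not say), a read clipped near its 5' end would then be dropped for being too short.

```python
def phred33(qual_string):
    """Decode a Phred+33 FASTQ quality string into integer scores."""
    return [ord(c) - 33 for c in qual_string]


def sliding_window_keep_length(quals, window=4, threshold=20):
    """Number of 5' bases kept by a simplified sliding-window cut.

    Scans from the 5' end and cuts at the start of the first window whose
    mean quality is below `threshold`. Simplified on purpose; this is not
    a reimplementation of Trimmomatic's SLIDINGWINDOW step.
    """
    if len(quals) < window:
        return 0
    for start in range(len(quals) - window + 1):
        # Compare sums instead of means to avoid floating-point division.
        if sum(quals[start:start + window]) < threshold * window:
            return start
    return len(quals)


def percent_below(quals, threshold=20):
    """Percentage of bases whose quality is below `threshold`."""
    if not quals:
        return 0.0
    return 100.0 * sum(q < threshold for q in quals) / len(quals)


# Hypothetical 150-base read: high quality except for three bases near the 5' end.
quals = [36] * 10 + [12, 12, 14, 36] + [36] * 136

kept = sliding_window_keep_length(quals)   # bases left after the window cut
low_pct = percent_below(quals)             # percent of bases under Q20
mean_q = sum(quals) / len(quals)           # whole-read mean quality
print(kept, round(low_pct, 1), round(mean_q, 1))
```

In this made-up read only 3 of 150 bases are below Q20 and the whole-read mean is about 35, yet the window cut leaves just 9 bases. A percentage-based filter, such as the one fastp exposes through its qualified-quality and unqualified-percent options, would keep this read intact.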
Please find my quality plot attached here:
I have also included a dot plot of the qualities in two of those reads from the same R1 file:
The question you should be looking into is why the Q scores are dropping there. Are there N calls indicating some issue with that cycle?
Thank you for replying. I have separated out the reads with at least one N in them, and they account for 0.2% of the total reads in the file (8'149 out of 3'750'185). The respective plot is attached below.
How could I assess if this represents an issue in the cycle? Thanks in advance.
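One way to look at this is to count N calls per cycle (read position): N calls piling up at one or a few positions would point to a problem in those cycles, while a flat spread would not. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the FASTQ file is assumed to use plain four-line records, and the example sequences are made up.

```python
from collections import Counter
import gzip


def read_fastq_sequences(path):
    """Yield the sequence line of each record in a 4-line-per-record FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record holds the bases
                yield line.strip()


def n_counts_per_cycle(sequences):
    """Count N calls at each 1-based read position across all sequences."""
    counts = Counter()
    for seq in sequences:
        for pos, base in enumerate(seq, start=1):
            if base in "Nn":
                counts[pos] += 1
    return counts


# Made-up sequences standing in for real reads.
example = ["ACGN", "ANGN", "ACGT"]
counts = n_counts_per_cycle(example)
for pos in sorted(counts):
    print(pos, counts[pos])
```

On a real file the same counter could be fed with `n_counts_per_cycle(read_fastq_sequences(path))`, where `path` points to the R1 or R2 FASTQ, and the resulting counts compared against the positions where the Q scores drop.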
This is going to be tough to diagnose. Are you going to align to a reference? If so, go ahead with the alignments and see whether they turn out fine. They may well be.
It is possible that the sequencer was having some sort of an issue (that led to the lower Q scores) but not enough to lead to loss of sequence.
What sequencer is this data from?
Yikes. Well, this is from an Illumina MiniSeq. Sure, I will use alignment, but only to remove host reads. The remaining reads will then be assembled with SPAdes, using the version bundled with Geneious.