Question

Is it reasonable to discard reads whose quality varies along their length?

29 days ago

BRENO • 0

Hello, I have noticed that some reads in my paired-end dataset have low-quality bases scattered along their length. I think reads of this kind account for the outliers in my box-and-whisker plot of quality per base position. Do you think it is a good idea to try to remove them?

I have tried Trimmomatic with SLIDINGWINDOW:4:20, and these reads get discarded. However, I'm not sure why that happens for some of them, because my manual calculations suggest they should not be discarded. Instead of a sliding-window average, maybe I could use fastp with a threshold on the percentage of bases whose quality falls below 20. What is your opinion?
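To make the comparison concrete, here is a minimal Python sketch of the two criteria. It is a simplification under stated assumptions, not Trimmomatic's or fastp's actual code: the window is scanned from the 5' end and the read is cut at the start of the first window whose mean quality is below the threshold, so a short low-quality stretch near the start can leave only a few bases, which a minimum-length filter (if one is used) would then discard. The function names and the example read are hypothetical.

```python
def sliding_window_cut(quals, window=4, threshold=20):
    """Return how many leading bases survive a simplified sliding-window trim.

    Scans 5' -> 3' and cuts at the start of the first window whose mean
    quality is below `threshold`. Simplified sketch; real tools may differ.
    """
    n = len(quals)
    if n < window:
        # Simplification: treat a short read as one single window.
        return n if n and sum(quals) / n >= threshold else 0
    for start in range(n - window + 1):
        if sum(quals[start:start + window]) / window < threshold:
            return start
    return n


def low_quality_fraction(quals, threshold=20):
    """Fraction of bases whose quality is below `threshold`."""
    if not quals:
        return 0.0
    return sum(q < threshold for q in quals) / len(quals)


# Hypothetical 40-base read: Q36 everywhere except three poor bases near the start.
quals = [36] * 6 + [10, 11, 12] + [36] * 31

kept = sliding_window_cut(quals)
fraction = low_quality_fraction(quals)
print(f"bases kept after window trim: {kept} of {len(quals)}")
print(f"fraction of bases below Q20: {fraction:.3f}")
```

In this made-up read only 3 of 40 bases are below Q20, yet the window trim keeps just the first 5 bases, because everything after the first failing window is cut. A percentage-based filter would keep the whole read.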

Please find my quality plot attached here. This plot shows the quality score on the Y axis and the base position on the X axis. The stars show outlier values for each position.

I am also attaching dot plots of the qualities for two of those reads from the same R1 file. Their structure is analogous to the previous plot, but each shows the qualities for just one read as dots.

Trimming • 410 views

ADD COMMENT • link 29 days ago by BRENO • 0

The question you should be looking into is why the Q scores are dropping there. Are there N calls indicating some issue with that cycle?

ADD REPLY • link 29 days ago by GenoMax 141k

Thank you for replying. I have separated out the reads with at least one N in them, and they account for 0.2% of the total reads in the file (8,149 out of 3,750,185). The respective plot is attached below. Ns are the lowest values in the graph.

How could I assess if this represents an issue in the cycle? Thanks in advance.
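One way to look at this, sketched here under assumptions, is to count N calls per base position (cycle): if the Ns pile up at one or a few positions, that points to a cycle-specific problem, whereas Ns spread evenly along the read suggest something else. The sketch assumes a plain, uncompressed four-line-per-record FASTQ; the function name and the tiny in-memory example are hypothetical, not taken from the data discussed here.

```python
import io
from collections import Counter


def n_calls_per_cycle(handle):
    """Count N calls per 1-based read position in a 4-line-per-record FASTQ.

    Returns (per_cycle_counts, reads_with_n, total_reads).
    """
    per_cycle = Counter()
    reads_with_n = 0
    total = 0
    for i, line in enumerate(handle):
        if i % 4 != 1:  # the sequence is the second line of each record
            continue
        seq = line.strip().upper()
        total += 1
        positions = [pos for pos, base in enumerate(seq, start=1) if base == "N"]
        if positions:
            reads_with_n += 1
            per_cycle.update(positions)
    return per_cycle, reads_with_n, total


# Hypothetical three-read example; with real data, pass an open file handle.
demo = io.StringIO(
    "@r1\nACGTNACGT\n+\nIIII#IIII\n"
    "@r2\nACGTNACGA\n+\nIIII#IIII\n"
    "@r3\nACGTAACGT\n+\nIIIIIIIII\n"
)
per_cycle, reads_with_n, total = n_calls_per_cycle(demo)
print(f"{reads_with_n} of {total} reads contain at least one N")
for cycle, count in sorted(per_cycle.items()):
    print(f"cycle {cycle}: {count} N calls")
```

With a real file, the same function could be called on `open(path)` for each of R1 and R2, and the per-cycle counts compared against the positions where the quality outliers appear.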

ADD REPLY • link 29 days ago by BRENO • 0

This is going to be tough to diagnose. Are you going to align to a reference? If so, go ahead and see whether the alignments turn out fine. They may well be.

It is possible that the sequencer was having some sort of issue that led to the lower Q scores, but not a severe enough one to cause loss of sequence.

What sequencer is this data from?

ADD REPLY • link 29 days ago by GenoMax 141k

Yikes. Well, this is from an Illumina MiniSeq. Sure, I will use alignment, but only to remove host reads. The remaining reads will then go through SPAdes, using the version in Geneious.

ADD REPLY • link 29 days ago by BRENO • 0