String Of "B" In Illumina Qseq Files From Paired-End Sequencing Run - Yet QC Passed
12.1 years ago

Hi @ll!

I have a question regarding the way the Illumina pipeline generates its quality check status in the qseq files (11th column according to information from here: http://jumpgate.caltech.edu/wiki/QSeq):

Please take a look at this (representative) example (I've removed the machine ID):

1st paired-end read: HWUSI-XXXXXX    11      7       120     19847   19200   0       1       .AATGATATAGAATGGAATTGAATGGAATGTGCGTGAATGGAATG   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB   1
2nd paired-end read: HWUSI-XXXXXX    11      7       120     19847   19200   0       3       TCTATTCCTTTCTAATCCATTCAATTCCATTTCATTCGATTCCAT   hfhghhcghhhghhhfff]ffhgchhhcghfheehdfdfafffff   1

According to my interpretation of the qseq data format, the 1st paired-end read has passed Illumina QC ("1" in the last column of the line), even though every base has quality "B" (Phred score 2), which suggests the whole read should be disregarded. How is it then possible that this read passed QC? This is one mate of a paired-end read, and the matching read from the second file has also passed QC and has a much better overall Phred score (see above). Could this be the reason? I.e., does the Illumina pipeline consider the "overall" quality of both mates when the read is paired-end?

My issue is that nearly 10% of the reads fall into this category (QC passed, yet Bs for all positions). At this stage I am planning to remove these reads prior to alignment, but I would appreciate some comments/answers from people who have seen similar reads in their experiments.
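A minimal sketch of that removal step, for anyone who wants to try the same thing. It assumes qseq records are tab-separated with 11 columns (sequence in column 9, quality in column 10, filter flag in column 11) and that the two mate files list reads in the same order. The function names `is_all_b_but_passed` and `drop_flagged_pairs` are made up for illustration and are not part of any Illumina tool.

```python
def is_all_b_but_passed(line):
    """Return True if a qseq record passed the filter but has only 'B' qualities."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated columns, got {len(fields)}")
    quality, passed = fields[9], fields[10]
    return passed == "1" and quality != "" and set(quality) == {"B"}


def drop_flagged_pairs(read1_lines, read2_lines):
    """Yield (mate1, mate2) records, skipping pairs where either mate is flagged.

    Assumption: both inputs list the mates in the same order.
    """
    for line1, line2 in zip(read1_lines, read2_lines):
        if is_all_b_but_passed(line1) or is_all_b_but_passed(line2):
            continue
        yield line1, line2
```

Dropping the whole pair rather than only the flagged mate keeps the two files in sync, which most paired-end aligners expect. Whether to keep the good mate as a single-end read instead is a separate choice.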

Thanks in advance!

illumina paired next-gen sequencing qc • 3.9k views
12.1 years ago
Eric Fournier ★ 1.4k

From the Wikipedia page on the FASTQ Format:

The Phred scores 0 to 2 in Illumina 1.5+ have a slightly different meaning. The values 0 and 1 are no longer used and the value 2, encoded by ASCII 66 "B", is used also at the end of reads as a Read Segment Quality Control Indicator [6]. The Illumina manual[7] (page 30) states the following: If a read ends with a segment of mostly low quality (Q15 or below), then all of the quality values in the segment are replaced with a value of 2 (encoded as the letter B in Illumina's text-based encoding of quality scores)... This Q2 indicator does not predict a specific error rate, but rather indicates that a specific final portion of the read should not be used in further analyses.
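To make the quoted rule concrete, here is a small sketch of how the "B" indicator could be handled downstream. It assumes Illumina 1.5+ quality strings with an ASCII offset of 64, so that "B" (ASCII 66) decodes to 2. The helper names `phred64` and `trim_trailing_b` are hypothetical, not functions from any existing package.

```python
def phred64(char):
    """Decode one quality character, assuming an ASCII offset of 64."""
    return ord(char) - 64


def trim_trailing_b(sequence, quality):
    """Cut off the trailing run of 'B' qualities and the matching bases.

    A read whose quality string is entirely 'B' comes back empty.
    """
    kept = len(quality.rstrip("B"))
    return sequence[:kept], quality[:kept]
```

With this rule, a read such as the first mate in the question is trimmed to length zero, which amounts to discarding it, while a read with only a few trailing "B" characters keeps its usable prefix.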


Thanks, but I've read this paragraph several times now, and it does not help me understand how a read can pass QC if all bases have a Phred score of "2". If this is a read segment quality control indicator, then, naively, I would assume that a read with only "2"/"B" scores would not pass QC. Even more confusingly, there are reads that have only "2"/"B" Phred scores and in fact do NOT pass QC (0 in column 11).


I'm no expert on the criteria the Illumina pipeline applies within its QC, but I thought the important part of the passage I quoted was "This Q2 indicator [...] indicates that a specific final portion of the read should not be used in further analyses." I.e., you would be doing the right thing in removing those reads.


The other point of the quote is that all the Bs do NOT mean the quality was extremely low, just that it was "mostly Q15 or below". The QC filtering as it currently works wouldn't necessarily be expected to filter these reads out (although I agree that it might be helpful for them to modify it so that anything that "should not be used in further analyses" is filtered out).
