Question

Is it Acceptable to Have Uniform Quality Scores in a FASTQ File?

0

10 days ago

Κοσμάς • 0

I recently obtained a FASTQ file containing biological sequence data, and upon inspection I noticed that the quality scores are identical at every position of every read. Is this acceptable, or does it indicate a problem with the data?

Here is an example of one read; every other read has the same quality scores:

@Sequence_ID_read_No2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
????????????????????????????????????????????????????????????????????

In my FASTQ file, every quality score is represented by a question mark ('?').

Could someone please clarify whether uniform quality scores in a FASTQ file are acceptable? Under what circumstances might this occur, and what implications does it have for downstream analysis?

Any insights or guidance would be greatly appreciated. Thank you!

FASTQ • 502 views

ADD COMMENT • link updated 9 days ago by Istvan Albert 100k • written 10 days ago by Κοσμάς • 0

0

Which technology is that data from? In theory it is possible to have the same score. Whether it happens to be by chance or by design (e.g. fake scores) may need to be checked. Illumina does score binning anyway. https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf
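One way to check this is to count the distinct quality characters in the file. The sketch below is only an illustration: the helper name `quality_char_counts` is made up, and it assumes a plain, unwrapped FASTQ with exactly four lines per record. A single distinct character points to placeholder scores, a handful of characters is what binned scores would look like, and dozens of characters suggest unbinned scores.

```python
import io
from collections import Counter


def quality_char_counts(handle):
    """Count quality characters in a FASTQ stream (assumes 4 lines per record)."""
    counts = Counter()
    for line_number, line in enumerate(handle):
        # The 4th line of each record (index 3, 7, 11, ...) holds the qualities.
        if line_number % 4 == 3:
            counts.update(line.rstrip("\r\n"))
    return counts


# Tiny in-memory example; for a real file, pass open("reads.fastq") instead
# ("reads.fastq" is a placeholder name).
example = io.StringIO("@read1\nGATTT\n+\n?????\n@read2\nACGTA\n+\n?????\n")
counts = quality_char_counts(example)
print(len(counts), "distinct quality character(s):", sorted(counts))
```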

ADD REPLY • link 10 days ago by GenoMax 142k

0

The data come from an Illumina instrument and were generated for RNA-Seq analysis.

ADD REPLY • link 10 days ago by Κοσμάς • 0

0

It is not bad to have uniform scores in general, but real scores are never as uniform as in your example. An educated guess, based on those scores and the sequence read names, is that these are simulated scores of some kind. Maybe the sequences were excessively trimmed until all the scores were of the same quality.

ADD REPLY • link 10 days ago by Mensur Dlakic ★ 27k

score 2 · Answer 1 · 2024-04-22

2

10 days ago

dsull ★ 6.0k

Yes, that's perfectly fine -- the reason everything is set to ? (phred score 30) is to compress the data. SRA now enables this option when downloading sequencing data. See https://www.ncbi.nlm.nih.gov/sra/docs/sra-data-formats/
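For anyone wondering where the 30 comes from: assuming the standard Phred+33 encoding, the score is the character's ASCII code minus 33. A minimal sketch (the function name `phred33_to_score` is just for illustration):

```python
def phred33_to_score(char: str) -> int:
    """Convert one FASTQ quality character (Phred+33 encoding) to its Phred score."""
    return ord(char) - 33


# '?' has ASCII code 63, so its score is 63 - 33 = 30.
print(phred33_to_score("?"))
```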

ADD COMMENT • link 10 days ago by dsull ★ 6.0k

1

I interpret this as an admission by SRA that phred scores are practically useless. The needless granularity (fake precision) was obvious from the start, but it seems that, in the end, nothing is lost if we drop that score altogether.

ADD REPLY • link 9 days ago by Istvan Albert 100k

0

"The needless granularity (fake precision) was obvious from the start"

You are assuming that all generated data is always of great quality. There are instances where data can be of poor quality (based on the Q scores), and in those cases having the Q scores is essential for making some hard decisions. So having them is useful before the data ever gets to SRA. That said, no sequencing provider should be letting this kind of data go through to the end users.

ADD REPLY • link 9 days ago by GenoMax 142k

0

What I mean is that no sequencing run can be calibrated to the claimed precision, that is, well enough to correctly distinguish between basecalls with phred scores of 30 vs. 31, which translate to error rates of 1/10^(30/10) vs. 1/10^(31/10).

The error in the phred score was always far bigger than the reported precision.

This is the concept of "significant figures": it makes no sense to report a measurement as 9.233421323 if the error is 0.1. The value should be reported as 9.2 +/- 0.1.

In the same way, the error in the phred score is probably in the tens, or even higher, so in my opinion it never made sense to report the FASTQ scores at the granularity at which they are reported.
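To put numbers on the 30 vs. 31 comparison, here is a small sketch using the standard Phred definition, P = 10^(-Q/10); the function name is made up for illustration. The two scores correspond to roughly 1 error in 1,000 bases versus 1 in 1,259.

```python
def phred_to_error_probability(q: float) -> float:
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


for q in (30, 31):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.6f} (about 1 in {1 / p:.0f})")
```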

ADD REPLY • link 9 days ago by Istvan Albert 100k

0

I agree for the most part. However, Heng Li has argued that the one application that they are essential for is short read variant calling. https://lh3.github.io/2020/05/27/base-quality-scores-are-essential-to-short-read-variant-calling

He states: "This is because low-quality Illumina sequencing errors are correlated, in that if one low-quality base is wrong, other low-quality bases tend to be wrong in the same way."

ADD REPLY • link 9 days ago by dsull ★ 6.0k

0

I think the blog post lines up with both of our statements. It seems that 2-way binning (pass/fail) already dramatically improves the calls, which I think is the way this should have gone from the start; for some bases the instrument already knows that the information is lacking. But designating a full byte to track the error (basically the same amount of information as the basecall itself) was a major mistake.
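As a rough illustration of what 2-way binning could look like on a Phred+33 quality string: the cutoff of 20 and the representative values 2 and 30 below are arbitrary assumptions for the sketch, not values taken from the blog post or from any instrument, and `bin_two_way` is a made-up name.

```python
def bin_two_way(quality: str, threshold: int = 20, low: int = 2, high: int = 30) -> str:
    """Collapse a Phred+33 quality string to two levels (fail/pass).

    threshold, low and high are illustrative assumptions only.
    """
    return "".join(
        # Decode each character, compare with the cutoff, re-encode as low/high.
        chr((high if ord(c) - 33 >= threshold else low) + 33)
        for c in quality
    )


# 'I' is Q40 and '5' is Q20 (both pass), '#' is Q2 (fail).
print(bin_two_way("IIII5#"))  # prints ?????#
```

After binning, each base carries one bit of quality information (pass or fail), even though the FASTQ format still spends a full character on it.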

ADD REPLY • link 9 days ago by Istvan Albert 100k