3.0 years ago
@WouterDeCoster
Phred scores are log probabilities, so simply taking an average of those is wrong. For some more background I'd like to refer to a blog post I wrote: Averaging basecall quality scores the right way.
Below I've added my function to average quality scores. I use Biopython for parsing fastq files:
from math import log
def ave_qual(quals):
"""Calculate average basecall quality of a read.
Receive the integer quality scores of a read and return the average quality for that read
First convert Phred scores to probabilities,
calculate average error probability
convert average back to Phred scale
"""
if quals:
return -10 * log(sum([10**(q / -10) for q in quals]) / len(quals), 10)
else:
return None
I use this function as below, in which fq
is the name of a fastq file:
from Bio import SeqIO
def extract_from_fastq(fq):
"""Extract quality score from a fastq file."""
for rec in SeqIO.parse(fq, "fastq"):
yield ave_qual(rec.letter_annotations["phred_quality"]))
This will get you the average quality per read. I'm not sure if your request for an average quality score per file makes sense. What does that mean? It doesn't say a lot about the dataset. But it's certainly feasible using the functions above to write an additional line of code to get your average quality per file.
I don't think average quality score is useful for any practical purpose. If it is really bad then perhaps to confirm that you have horrible data.