Biostar Beta. Not for public use.
FASTQ Phred33 average base quality score
Entering edit mode
3.0 years ago
oars • 150

I have a FASTQ dataset where I'm trying to find the average base quality score. I found this old link that helped somewhat (

Here is my script (I'm trying to stick to awk, bioawk or python):

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz

I was expecting a single average output for the entire set, instead I got a huge dump of data with many rows looking like this:


I assume that the 36.45 is my average phred quality score for that sequence? I was hoping for an average score for the entire file. Could such a script be written?

FASTQ Phred33 ASCII • 1.5k views
Entering edit mode

I was hoping for an average score for the entire file

I don't think average quality score is useful for any practical purpose. If it is really bad then perhaps to confirm that you have horrible data.

Entering edit mode
3.0 years ago

bioawk is to some extent still awk based (though with -c you get a diff read mode, as in 'per entry' rather then 'per line'). You can easily extent this cmdline to get the desired result:

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz | paste - - | awk '{sum += $3} END {print sum/NR}'

should do the trick

Entering edit mode

Many thanks. Where you have | paste - - |, this is where I re-enter my file name?

because as written I just return the following: >

Entering edit mode

no, that's a paste 'trick' : it will take two consecutive lines from the stream and print them on a single line (tab delineated) . so to get the mean qual values . You can also filter the stream with awk to only take each second line.

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz | awk '{ if (NR%2==0) sum += $1} END {print sum/NR}'

You only need to put the input file once in the bioawk part . This all assume you just have results print to screen from bioawk though and thus not to a file!

Entering edit mode

and or BUT do have a look at what @WouterDeCoster writes , 'cus indeed simply taking the avg of phred scores has not much meaning

Entering edit mode

This worked, thank you. The output was 18.035 which I believe translates to at ASCII value of ~ 51.

Entering edit mode
3.0 years ago

Phred scores are log probabilities, so simply taking an average of those is wrong. For some more background I'd like to refer to a blog post I wrote: Averaging basecall quality scores the right way.

Below I've added my function to average quality scores. I use Biopython for parsing fastq files:

from math import log

def ave_qual(quals):
    """Calculate average basecall quality of a read.
    Receive the integer quality scores of a read and return the average quality for that read
    First convert Phred scores to probabilities,
    calculate average error probability
    convert average back to Phred scale
    if quals:
        return -10 * log(sum([10**(q / -10) for q in quals]) / len(quals), 10)
        return None

I use this function as below, in which fq is the name of a fastq file:

from Bio import SeqIO

def extract_from_fastq(fq):
    """Extract quality score from a fastq file."""
    for rec in SeqIO.parse(fq, "fastq"):
        yield ave_qual(rec.letter_annotations["phred_quality"]))

This will get you the average quality per read. I'm not sure if your request for an average quality score per file makes sense. What does that mean? It doesn't say a lot about the dataset. But it's certainly feasible using the functions above to write an additional line of code to get your average quality per file.


Login before adding your answer.

Similar Posts
Loading Similar Posts
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.3