FASTQ Phred33 average base quality score
2
1
Entering edit mode
6.2 years ago
oars ▴ 200

I have a FASTQ dataset where I'm trying to find the average base quality score. I found this old link that helped somewhat (https://www.biostars.org/p/47751/).

Here is my script (I'm trying to stick to awk, bioawk or python):

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz

I was expecting a single average output for the entire set, instead I got a huge dump of data with many rows looking like this:

    >FCD23GWACXX:2:1101:3183:2494
36.45

I assume that the 36.45 is my average phred quality score for that sequence? I was hoping for an average score for the entire file. Could such a script be written?

FASTQ Phred33 ASCII • 9.3k views
ADD COMMENT
0
Entering edit mode

I was hoping for an average score for the entire file

I don't think average quality score is useful for any practical purpose. If it is really bad then perhaps to confirm that you have horrible data.

ADD REPLY
4
Entering edit mode
6.2 years ago

bioawk is to some extent still awk based (though with -c you get a diff read mode, as in 'per entry' rather then 'per line'). You can easily extent this cmdline to get the desired result:

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz | paste - - | awk '{sum += $3} END {print sum/NR}'

should do the trick

ADD COMMENT
0
Entering edit mode

Many thanks. Where you have | paste - - |, this is where I re-enter my file name?

because as written I just return the following: >

ADD REPLY
0
Entering edit mode

no, that's a paste 'trick' : it will take two consecutive lines from the stream and print them on a single line (tab delineated) . so to get the mean qual values . You can also filter the stream with awk to only take each second line.

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz | awk '{ if (NR%2==0) sum += $1} END {print sum/NR}'

You only need to put the input file once in the bioawk part . This all assume you just have results print to screen from bioawk though and thus not to a file!

ADD REPLY
2
Entering edit mode

and or BUT do have a look at what @WouterDeCoster writes , 'cus indeed simply taking the avg of phred scores has not much meaning

ADD REPLY
0
Entering edit mode

This worked, thank you. The output was 18.035 which I believe translates to at ASCII value of ~ 51.

ADD REPLY
10
Entering edit mode
6.2 years ago

Phred scores are log probabilities, so simply taking an average of those is wrong. For some more background I'd like to refer to a blog post I wrote: Averaging basecall quality scores the right way.

Below I've added my function to average quality scores. I use Biopython for parsing fastq files:

from math import log

def ave_qual(quals):
    """Calculate average basecall quality of a read.
    Receive the integer quality scores of a read and return the average quality for that read
    First convert Phred scores to probabilities,
    calculate average error probability
    convert average back to Phred scale
    """
    if quals:
        return -10 * log(sum([10**(q / -10) for q in quals]) / len(quals), 10)
    else:
        return None

I use this function as below, in which fq is the name of a fastq file:

from Bio import SeqIO

def extract_from_fastq(fq):
    """Extract quality score from a fastq file."""
    for rec in SeqIO.parse(fq, "fastq"):
        yield ave_qual(rec.letter_annotations["phred_quality"]))

This will get you the average quality per read. I'm not sure if your request for an average quality score per file makes sense. What does that mean? It doesn't say a lot about the dataset. But it's certainly feasible using the functions above to write an additional line of code to get your average quality per file.

ADD COMMENT
0
Entering edit mode

Two years later... +1! I spent the last 2 hours trying to figure out why my dumb average of phred scores didn't match NanoPlot's (+1 also to that) until I dug into its code and eventually found this thread.

ADD REPLY
0
Entering edit mode

:)

ADD REPLY
0
Entering edit mode

This was very useful for me today. fastqc showed I had loads of reads >Q30 but chopper gave none. Turns out that fastqc is calculating the per read Q values wrong.

ADD REPLY

Login before adding your answer.

Traffic: 2543 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6