I'm wondering is there a convenient way to count quantity of bases in FASTQ file with Phred quality score equal or higher then 30? Thanks!
FastQC provides a graphical quality per base position summary, and if you dig into its output files, you will find mean, median, lower quartile, upper quartile, 10th percentile and 90th percentile quality per base position.
Why do you need total count?
We often use the number of bases with quality score >=30 as a quick assay of data quality. Besides, our sequencing facilities commonly use that statistic.
gunzip -c input.fastq.gz | paste - - - - | cut -f 4 | fold -w 1 | awk '($1>="?")' | wc -l
gunzip -c input.fastq.gz
paste - - - -
cut -f 4
fold -w 1
Thanks! But from where awk may know that for example "=" is less than "?" ? How can it compare just two characters?
Based on ASCII code decimal values. It is comparing value contained in $1 (which would be individual Q score) for being more than or equal to to ?.
more than or equal to
Hi genomax! Thanks a lot! Now it's clear.
Should i add something like this tr -d '\n' somewhere in the pipe to remove new line characters?
echo \n | awk '($0>="?")' results in n
tr -d '\n'
echo \n | awk '($0>="?")'
reformat.sh from BBTools / BBMap package can also do this:
reformat.sh in=file.fq.gz qchist=file.qchist.txt
For paired end files
reformat.sh in1=file1.fq.gz in2=file2.fq.gz qchist=file.qchist.txt
Login before adding your answer.