Question: Tool To Find Out If Fastq Is In Sanger Or Phred64 Encoding?
12

Is there a simple tool I can use to quickly find out if a FASTQ file is in Sanger or Phred64 encoding? Ideally something that tells me 'Encoding XX' somewhere in the terminal output.

ADD COMMENTlink 7.0 years ago 14134125465346445 ♦ 3.4k • updated 3.4 years ago Shicheng Guo ♦ 7.5k
0

FastQC has a good guesser. Alternatively, use the following Perl script: fastqFormatDetect.pl

Both base their results on the characters found in the quality line of the FASTQ file. This is well explained above and on the FASTQ wiki page.

ADD REPLYlink 3.3 years ago
Juke-34
♦ 2.2k
0

That link is out of date and returns a 404.

ADD REPLYlink 3.3 years ago
Xapple
• 30
0

I'm looking for the new URL. In the meantime, I found a GitHub repository that has a copy of it, and I updated the link accordingly.

ADD REPLYlink 3.3 years ago
Juke-34
♦ 2.2k
11
ADD COMMENTlink 7.0 years ago Istvan Albert 80k
2

Thanks, that worked:

gunzip -c file.fastq.gz | awk 'NR % 4 == 0' | head -n 1000000 | python ./guess-encoding.py

ADD REPLYlink 7.0 years ago
14134125465346445
♦ 3.4k
2

Note that you can just pass -n 100000 as an argument to guess-encoding.py.

ADD REPLYlink 7.0 years ago
brentp
23k
0

guess-encoding.py needs to be updated.

ADD REPLYlink 4.7 years ago
Medhat
8.3k
0

It seems guess-encoding.py has a misleading example, suggesting cut -f 5 instead of cut -f 11 to grab quality strings.

ADD REPLYlink 21 months ago
johnsenkyle13
• 0
8

If the quality scores contain the character 0 (ASCII 48), the encoding is either Sanger Phred+33 or Illumina 1.8+ Phred+33. If they also contain the character J, it is Illumina 1.8+ Phred+33; otherwise it is Sanger Phred+33.

If the quality scores do not contain 0, the encoding is Solexa+64, Illumina 1.3+ Phred+64, or Illumina 1.5+ Phred+64:

It is Solexa+64 if the scores contain the character =.

It is Illumina 1.3+ Phred+64 if they contain A.

It is Illumina 1.5+ Phred+64 if they contain neither A nor =.

Take a look at the wiki and try to understand the table.
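
For anyone who wants to script the rule above, here is a minimal Python sketch. The function name guess_by_marker_chars is made up for illustration, and the rule is only a heuristic: a character missing from a small sample proves nothing, so feed it plenty of quality lines.

```python
def guess_by_marker_chars(quality_lines):
    """Apply the marker-character rule to an iterable of quality strings.

    Hypothetical helper, not part of any existing tool.
    """
    seen = set()
    for line in quality_lines:
        seen.update(line.strip())

    if "0" in seen:
        # '0' is ASCII 48, below the lowest Solexa+64 / Phred+64 character.
        if "J" in seen:
            return "Illumina 1.8+ Phred+33"
        return "Sanger Phred+33"
    if "=" in seen:
        return "Solexa+64"
    if "A" in seen:
        return "Illumina 1.3+ Phred+64"
    return "Illumina 1.5+ Phred+64"
```

It expects quality strings only, for example every fourth line of the FASTQ file.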

ADD COMMENTlink 7.0 years ago Irsan ♦ 6.9k
7

head -n 40 file.fastq | awk '{if(NR%4==0) printf("%s",$0);}' | od -A n -t u1 | awk 'BEGIN{min=100;max=0;}{for(i=1;i<=NF;i++) {if($i>max) max=$i; if($i<min) min=$i;}}END{if(max<=74 && min<59) print "Phred+33"; else if(max>73 && min>=64) print "Phred+64"; else if(min>=59 && min<64 && max>73) print "Solexa+64"; else print "Unknown score encoding!";}'

source
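
For reference, the same min/max thresholds as the one-liner above can be written as a small Python function. This is only a sketch: the name guess_by_range is hypothetical, and the cutoffs are copied from the awk code rather than taken from any specification.

```python
def guess_by_range(quality_lines):
    """Classify by the lowest and highest ASCII code seen in the quality lines.

    Hypothetical helper that mirrors the thresholds of the awk one-liner.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return "Unknown score encoding!"

    lo, hi = min(codes), max(codes)
    if hi <= 74 and lo < 59:
        return "Phred+33"
    if hi > 73 and lo >= 64:
        return "Phred+64"
    if 59 <= lo < 64 and hi > 73:
        return "Solexa+64"
    return "Unknown score encoding!"
```

As with the shell version, the answer is only as good as the sample: ten reads (head -n 40) may not cover the full range of scores in the file.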

ADD COMMENTlink 4.7 years ago Medhat 8.3k
5

You can use this tool:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

It has an internal automatic guesser.

T.

ADD COMMENTlink 7.0 years ago toni ♦ 2.1k
3

BBMap has a little tool for this:

$ testformat.sh in=N0174.fq.gz
sanger fastq gz interleaved 150bp

ADD COMMENTlink 4.7 years ago Brian Bushnell 16k
2

If you are looking for a quick and dirty method, just grep for any character that is unique to Sanger or to Phred64. You can find them at http://en.wikipedia.org/wiki/FASTQ_format

grep Z filename # for Phred64 and make sure that the lines are not headers
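
To avoid matching header lines, the check can be restricted to quality lines. A rough Python sketch, assuming plain four-line FASTQ records (no wrapped sequences); the function name count_quality_lines_with is made up for illustration.

```python
def count_quality_lines_with(lines, char):
    """Count quality lines (every 4th line of a record) that contain `char`.

    Hypothetical helper; assumes each record is exactly four lines long.
    """
    return sum(
        1
        for index, line in enumerate(lines)
        if index % 4 == 3 and char in line
    )
```

For example, pass it an open file handle and "Z"; a non-zero count hints at Phred64, while headers that happen to contain Z are ignored.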

ADD COMMENTlink 7.0 years ago Gvj • 440
2

As noted by Medhat above, GNU od or hexdump can be used to convert the quality characters to their decimal values, so

 cat file.fq | awk 'NR%4==0' | tr -d '\n' | hexdump -v -e'/1 "%u\n"' | sort -nu

will display which (decimal) quality scores exist in your file.

According to brentp's "guess-encoding.py" script, the possible ranges are 33-93 (Sanger/Illumina1.8), 64-104 (Illumina1.3 or Illumina1.5) and 59-104 (Solexa). Similarly, FastQC assumes that anything with some scores in the 33-63 range is Sanger and that the rest is Illumina1.3-1.5 (it doesn't know about Solexa scores).
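
The same idea can be sketched in Python: collect the distinct decimal values from the quality lines, then list which of the ranges quoted above they fit into. The helper names (quality_codes, compatible_encodings) and the RANGES table are hypothetical, the ranges are copied from this post rather than from the script itself, and four-line FASTQ records are assumed.

```python
import gzip
from itertools import islice

# Ranges as quoted in the text above (inclusive ASCII codes).
RANGES = {
    "Sanger/Illumina1.8": (33, 93),
    "Illumina1.3 or Illumina1.5": (64, 104),
    "Solexa": (59, 104),
}


def quality_codes(path, max_records=100000):
    """Return the sorted distinct ASCII codes found in the quality lines."""
    opener = gzip.open if path.endswith(".gz") else open
    codes = set()
    with opener(path, "rt") as handle:
        # Lines 4, 8, 12, ... (0-based index 3, 7, 11, ...) hold the qualities.
        for line in islice(handle, 3, 4 * max_records, 4):
            codes.update(ord(ch) for ch in line.rstrip("\n"))
    return sorted(codes)


def compatible_encodings(codes):
    """List every range from RANGES that contains all observed codes."""
    if not codes:
        return []
    lo, hi = min(codes), max(codes)
    return [name for name, (a, b) in RANGES.items() if a <= lo and hi <= b]
```

More than one name can come back, since 64-104 lies entirely inside 59-104; that ambiguity is inherent to range-based guessing.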

ADD COMMENTlink 2.4 years ago n.caillou • 20
1

Install BBMap and then use the following script:

Usage:  reformat.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2>

reformat.sh in=Indx07.read1.fq out=Indx07.read1.phred33.fq qin=64 qout=33
reformat.sh in=Indx07.read2.fq out=Indx07.read2.phred33.fq qin=64 qout=33
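
To see what a qin=64 qout=33 conversion amounts to, here is a minimal Python sketch for a single quality string. It is not BBMap's implementation, the function name phred64_to_phred33 is made up, and it assumes true Phred+64 input: Solexa+64 scores use a different scale and would need a proper score conversion, not just an offset shift.

```python
def phred64_to_phred33(quality):
    """Re-encode a Phred+64 quality string as Phred+33.

    Hypothetical helper: each character's score is ord(ch) - 64 and is
    written back out as chr(score + 33).
    """
    converted = []
    for ch in quality:
        score = ord(ch) - 64
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))
    return "".join(converted)
```

The ValueError is a cheap guard against running the conversion on a file that was already Phred+33.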
ADD COMMENTlink 3.4 years ago Shicheng Guo ♦ 7.5k
1

Hey there, if you run FastQC you can see the quality format in the main output screen, in the section marked "Encoding".

ADD COMMENTlink 2.3 years ago ando.kelli • 40
