Checking fastq is valid
2.9 years ago
flyamer • 30
Russian Federation

Hi, I suspect that one or more of my FASTQ files are corrupted: for example, some reads have sequence and quality strings of different lengths. Would this Python script detect the problem?

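import sys
from Bio import SeqIO

# Iterating over every record forces the parser to read and check each one;
# a malformed record makes SeqIO.parse raise an exception and stop.
for record in SeqIO.parse(sys.stdin, "fastq"):
    pass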

And I use it like this:

pigz -dc -p 3 serum-3_2.fq.gz | python check.py

Will it raise an error if one of the reads is invalid? Or is there a better, quicker way to check this?

Why not try it with a test file?

Ha, I don't know why not. Good idea, thanks! I checked, and it seems to at least pick up differences in sequence and quality lengths.

2.9 years ago
flyamer • 30
Russian Federation

Yes, it raises an error if the sequence and quality strings have different lengths.

15 months ago
John 12k
Germany

It's very difficult to say what is and isn't a valid FASTQ file, as there is no definitive specification.

A better approach is to write some tests for properties that you think should hold, for example, that the SEQ and QUAL values are the same length, that there are always 4 lines per entry, etc. Do not assume that because a bioinformatics tool or parser gives no errors, there are no errors. This is very often not the case, particularly with FASTA/FASTQ, as the specs are so open to interpretation.

Once you have determined that the data is not "nonsense" but rather "missense", to borrow a term from genetics, you should perhaps analyse the distributions of the data to see if they make sense. You can use uQ to do this very quickly with the --peek parameter. For example, the command:
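As a rough illustration of such tests, here is a minimal sketch in plain Python. The function name check_fastq and the particular checks are my own choices, not part of any library, and it assumes strict 4-line records (no wrapped sequences), so a trailing blank line is reported as a truncated record too.

```python
from typing import Iterable, Iterator, Tuple


def check_fastq(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (record_number, problem) for every check that fails.

    Hypothetical helper: assumes each record is exactly 4 lines.
    """
    record = []
    count = 0
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        count += 1
        header, seq, sep, qual = record
        record = []
        if not header.startswith("@"):
            yield count, "header does not start with '@'"
        if not sep.startswith("+"):
            yield count, "separator line does not start with '+'"
        if len(seq) != len(qual):
            yield count, f"SEQ length {len(seq)} != QUAL length {len(qual)}"
    # Leftover lines mean the file did not contain a multiple of 4 lines.
    if record:
        yield count + 1, f"truncated record with {len(record)} line(s)"
```

Because it accepts any iterable of lines, it could be fed sys.stdin from a decompression pipe or an open file handle, and an empty result only means these particular checks passed.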

python /Users/John/Desktop/uq.py -i /Users/John/Downloads/ENCFF000ZZU.fastq.txt --peek

outputs this:

http://pastebin.com/raw/6qTMwNTp

This also checks that SEQ and QUAL are the same length, along with some other very basic things. uQ is just a proof of concept for compressing FASTQ files to be as small as possible; it is not intended for actual use.

23 months ago
apa@stowers • 420
Kansas City

If you only care about read length differing from quality-score length, you could just run this:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' | wc -l

That will give you a count of aberrant records.
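If you would rather stay in Python, the same count can be sketched with the standard library only. The function name count_length_mismatches is made up for this example, and like the one-liner it assumes every record occupies exactly 4 lines.

```python
import gzip
from itertools import zip_longest


def count_length_mismatches(path: str) -> int:
    """Count records whose sequence and quality lines differ in length."""
    mismatches = 0
    with gzip.open(path, "rt") as handle:
        lines = iter(handle)
        # Repeating the same iterator four times groups the lines in fours,
        # much like `paste - - - -` does in the shell pipeline.
        for _header, seq, _sep, qual in zip_longest(*[lines] * 4, fillvalue=""):
            if len(seq.rstrip("\r\n")) != len(qual.rstrip("\r\n")):
                mismatches += 1
    return mismatches
```

Calling count_length_mismatches("fastq.gz") should then return the same number as the pipeline above, although it will likely be slower than zcat and awk on large files.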

Thanks to @apa.

This will give you the aberrant reads themselves:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' |  tr '\t' '\n' > error_reads.fastq
3.3 years ago
guipagui • 10

You can use this tool: FastQValidator

Please provide a link when you refer to a specific program. This is important because programs may have similar names, and searching the web may sometimes lead one down an undesired path (e.g. to malware).

I will include a link for FastQValidator this time.

I will keep that in mind. Thanks.

21 months ago
YaGalbi ♦ 1.4k
Biocomputing, MRC Harwell Institute, Ox…

Try FastQC.
