Checking fastq is valid
7.3 years ago
flyamer ▴ 60

Hi, I suspect that one or more of my FASTQ files is corrupted: for example, some reads have sequence and quality strings of different lengths. Would this Python script detect the problem?

import sys
from Bio import SeqIO
for record in SeqIO.parse(sys.stdin, "fastq"):
    pass

And I use it like this:

pigz -dc -p 3 serum-3_2.fq.gz | python check.py

Will it raise an error if one of the reads is invalid? Or is there a better quick way to check this?

next-gen sequencing • 19k views

Why not try it with a test file?


Ha, I don't know why not, good idea, thanks! I checked, seems like it at least picks up differences in sequence and quality lengths.

7.3 years ago
flyamer ▴ 60

Yes, it raises an error if the sequence and quality strings have different lengths.

7.3 years ago
apa@stowers ▴ 600

If you only care about read length differing from quality-score length, you could just run this:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' | wc -l

That will give you a count of aberrant records.
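If you would rather stay in Python, the same count can be sketched with only the standard library. This is a rough equivalent of the one-liner above, not a drop-in replacement: the function names are my own, and it assumes strict 4-line records with no wrapped sequences.

```python
import gzip
from itertools import zip_longest


def count_length_mismatches(lines):
    """Count 4-line records whose sequence and quality lengths differ."""
    stripped = (line.rstrip("\r\n") for line in lines)
    # Group the stream into chunks of four lines, like `paste - - - -`.
    # A short final chunk is padded with empty strings.
    records = zip_longest(*[stripped] * 4, fillvalue="")
    return sum(1 for _header, seq, _plus, qual in records if len(seq) != len(qual))


def count_in_gzip(path):
    """Open a gzipped FASTQ in text mode and count aberrant records."""
    with gzip.open(path, "rt") as handle:
        return count_length_mismatches(handle)
```

Calling `count_in_gzip("fastq.gz")` should report the same number as the shell pipeline, as long as the file really has four lines per record.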


Thanks to @apa.

This will give you those reads:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' |  tr '\t' '\n' > error_reads.fastq
4.0 years ago

You can use fqlint, a Rust program that identifies a broad range of issues in Illumina-based FASTQ files. To install it, run the following after installing Rust.

cargo install --git https://github.com/stjude/fqlib.git
7.3 years ago
John 13k

It's very difficult to say what is and isn't an invalid FASTQ, as there is no definitive specification.

A better approach is to write some tests that you think should be true, for example, that the SEQ and QUAL values are the same length, that there are always 4 lines per entry, etc. Do not assume that if a bioinformatics tool/parser does not give errors, there are no errors. This is very often not the case, particularly with FASTA/FASTQ, as the specs are so open to interpretation.
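As a minimal sketch of what such hand-written tests might look like: the checks and the function name below are only my illustration of the idea, and they assume the common 4-line layout (no wrapped sequences), which is itself one of the assumptions you would want to state.

```python
def find_fastq_problems(lines):
    """Yield (record_number, message) for each failed check.

    Assumes the common 4-line layout (no wrapped sequences).
    """
    record = []
    number = 0
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        number += 1
        header, seq, plus, qual = record
        record = []
        if not header.startswith("@"):
            yield number, "header does not start with '@'"
        if not plus.startswith("+"):
            yield number, "third line does not start with '+'"
        if not seq:
            yield number, "empty sequence"
        if len(seq) != len(qual):
            yield number, f"SEQ length {len(seq)} != QUAL length {len(qual)}"
    # Leftover lines mean the file stopped mid-record; trailing blank
    # lines alone are ignored.
    if any(record):
        yield number + 1, f"truncated record: {len(record)} of 4 lines"
```

Passing `sys.stdin` as `lines` lets this sit at the end of a `pigz -dc ... | python` pipeline, printing each `(record_number, message)` pair it yields. Add or drop checks to match what you believe should be true for your own data.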

Once you have determined that the data is not "nonsense" but rather "missense", to borrow a term from genetics, you should perhaps analyse distributions of the data to see if they make sense. You can use uQ to do this very quickly with the --peek parameter, for example the command:

python /Users/John/Desktop/uq.py -i /Users/John/Downloads/ENCFF000ZZU.fastq.txt --peek

outputs this:

http://pastebin.com/raw/6qTMwNTp

This also checks that SEQ/QUAL are the same length, and some other very basic things. uQ is just a proof of concept for compressing FASTQ files to be as small as possible, and it's not intended to be used.
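If you just want a quick look at the distributions without uQ, something like the following would do. It is a rough sketch, not what uQ does internally: the function name is made up, it assumes 4-line records, and it assumes Phred+33 quality encoding unless you pass a different offset.

```python
from collections import Counter


def summarize_fastq(lines, phred_offset=33):
    """Return (read_length_counts, quality_score_counts) for 4-line records.

    phred_offset=33 is an assumption; older Illumina files may use 64.
    """
    length_counts = Counter()
    quality_counts = Counter()
    for index, line in enumerate(lines):
        text = line.rstrip("\r\n")
        position = index % 4
        if position == 1:      # sequence line
            length_counts[len(text)] += 1
        elif position == 3:    # quality line
            quality_counts.update(ord(char) - phred_offset for char in text)
    return length_counts, quality_counts
```

Read lengths far from what the sequencer should produce, or quality scores outside the expected range (negative values, for example), are a hint that either the offset or the file itself is wrong.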

7.3 years ago
guipagui ▴ 10

You can check it with this tool: FastQValidator


Please provide a link when you refer to a specific program. This is important since software programs may have similar names, and searching the web may sometimes lead one down an undesired path (e.g. to malware).

I will include a link for FastQValidator this time.


I will keep that in mind. Thanks.

7.3 years ago
BioinfGuru ★ 1.7k

Try FastQC.
