Question

Checking fastq is valid

2

7.3 years ago

flyamer ▴ 60

Hi, I suspect that one or more of my FASTQ files are corrupted: for example, some reads have sequence and quality strings of different lengths. Would this Python script detect the problem?

import sys
from Bio import SeqIO
for record in SeqIO.parse(sys.stdin, "fastq"):
    pass

And I use it like this:

pigz -dc -p 3 serum-3_2.fq.gz | python check.py

Will it raise an error if one of the reads is invalid? Or is there a better quick way to check this?

next-gen sequencing • 19k views

ADD COMMENT • link updated 4.0 years ago by clay.l.mcleod ▴ 40 • written 7.3 years ago by flyamer ▴ 60

2

Why not try it with a test file?

ADD REPLY • link 7.3 years ago by shenwei356 8.4k

0

Ha, I don't know why not. Good idea, thanks! I checked, and it seems to at least pick up differences between sequence and quality lengths.

ADD REPLY • link 7.3 years ago by flyamer ▴ 60

0

A: Fastq Quality Read And Score Length Check

ADD REPLY • link 7.3 years ago by Medhat 9.7k

4

7.3 years ago

apa@stowers ▴ 600

If you only care about read length differing from quality-score length, you could just run this:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' | wc -l

That will give you a count of aberrant records.
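For anyone who would rather stay in Python, here is a minimal sketch of the same check using only the standard library. It makes the same assumption as the `paste - - - -` pipeline, namely that every record is exactly four lines. The function names `mismatched_records` and `count_mismatched` are made up for this illustration.

```python
import gzip
from itertools import zip_longest


def mismatched_records(path):
    """Yield (header, seq, plus, qual) for records whose SEQ and QUAL lengths differ.

    Assumes strict 4-line records, like the paste/awk one-liner.
    """
    # Read gzipped files directly; fall back to plain text otherwise.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        lines = (line.rstrip("\n") for line in handle)
        # Group the stream into chunks of four lines.
        # A truncated final record is padded with empty strings.
        for header, seq, plus, qual in zip_longest(*[lines] * 4, fillvalue=""):
            if len(seq) != len(qual):
                yield header, seq, plus, qual


def count_mismatched(path):
    """Count aberrant records, like piping the awk output to wc -l."""
    return sum(1 for _ in mismatched_records(path))
```

Calling `count_mismatched("fastq.gz")` should give the same number as the one-liner, and iterating over `mismatched_records(...)` gives the offending records themselves.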

ADD COMMENT • link 7.3 years ago by apa@stowers ▴ 600

0

Thanks to @apa.

This will give you those reads:

zcat fastq.gz | paste - - - - | awk -F"\t" '{ if (length($2) != length($4)) print $0 }' | tr '\t' '\n' > error_reads.fastq

ADD REPLY • link updated 4.9 years ago by Ram 43k • written 7.3 years ago by Medhat 9.7k

3

4.0 years ago

clay.l.mcleod ▴ 40

You can use fqlint, a Rust program that identifies a broad range of issues in Illumina-based FASTQ files. To install it, run the following after installing Rust:

cargo install --git https://github.com/stjude/fqlib.git

ADD COMMENT • link 4.0 years ago by clay.l.mcleod ▴ 40

2

7.3 years ago

John 13k

It's very difficult to say what is and isn't an invalid FASTQ, as there is no definitive specification.

A better approach is to write some tests for properties that you think should hold, for example, that the SEQ and QUAL values are the same length, that there are always 4 lines per entry, etc. Do not assume that if a bioinformatics tool/parser does not give errors, there are no errors. This is very often not the case, particularly with FASTA/FASTQ, as the specs are so open to interpretation.
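As a rough illustration of what such tests could look like, here is a minimal Python sketch using only the standard library. The function name `check_fastq` and the particular set of checks are assumptions made for this example, not a definition of a valid FASTQ. It assumes strict 4-line records and counts lines rather than scanning for '@', since a quality string can itself begin with '@'.

```python
def check_fastq(lines):
    """Return a list of (line_number, message) problems found in FASTQ lines.

    `lines` is any iterable of text lines, e.g. an open file or sys.stdin.
    """
    problems = []
    record = []
    n = 0
    for n, line in enumerate(lines, start=1):
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        header, seq, plus, qual = record
        start = n - 3  # line number of this record's header
        if not header.startswith("@"):
            problems.append((start, "header does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((start + 2, "separator does not start with '+'"))
        if len(seq) != len(qual):
            problems.append((start + 1, "SEQ and QUAL lengths differ"))
        record = []
    # Leftover lines mean the file did not end on a record boundary.
    if record:
        problems.append((n - len(record) + 1, "truncated record: fewer than 4 lines"))
    return problems
```

Because it accepts any iterable of lines, it can be fed `sys.stdin` and used behind a decompression pipe in the same way as the script in the question. An empty list only means that these particular checks passed.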

Once you have determined that the data is not "nonsense" but rather "missense", to borrow a term from genetics, you should analyse the distributions of the data to see if they make sense. You can use uQ to do this very quickly with the --peek parameter. For example, the command:

python /Users/John/Desktop/uq.py -i /Users/John/Downloads/ENCFF000ZZU.fastq.txt --peek

outputs this:

http://pastebin.com/raw/6qTMwNTp

This also checks that SEQ and QUAL are the same length, along with some other very basic things. Note that uQ is just a proof-of-concept for compressing FASTQ files to be as small as possible; it is not intended for real use.

ADD COMMENT • link updated 4.9 years ago by Ram 43k • written 7.3 years ago by John 13k

1

7.3 years ago

guipagui ▴ 10

With this tool: FastQValidator

ADD COMMENT • link 7.3 years ago by guipagui ▴ 10

0

Please provide a link when you refer to a specific program. This is important because software programs may have similar names, and searching the web can sometimes lead one down an undesired path (e.g. to malware).

I will include a link for FastQValidator this time.

ADD REPLY • link 7.3 years ago by GenoMax 141k

0

I will keep that in mind. Thanks.

ADD REPLY • link 7.3 years ago by guipagui ▴ 10

0

7.3 years ago

BioinfGuru ★ 1.7k

Try FastQC.

ADD COMMENT • link updated 4.9 years ago by Ram 43k • written 7.3 years ago by BioinfGuru ★ 1.7k

score 1 · Accepted Answer · 2017-01-11

1

7.3 years ago

flyamer ▴ 60

Yes, it raises an error if the sequence and quality strings have different lengths.

ADD COMMENT • link 7.3 years ago by flyamer ▴ 60