TopHat 'prep_reads' error: beginning of quality values record not found!
8.6 years ago
mmitra ▴ 60

Hi all,

I ran tophat on my fastq file and I got the following error:

[2015-09-02 11:23:58] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2015-09-02 11:23:58] Checking for Bowtie
          Bowtie version:     2.2.5.0
[2015-09-02 11:24:00] Checking for Bowtie index files (genome)..
[2015-09-02 11:24:00] Checking for reference FASTA file
[2015-09-02 11:24:00] Generating SAM header
[2015-09-02 11:24:06] Reading known junctions from GTF file
[2015-09-02 11:24:50] Preparing reads
    [FAILED]
Error running 'prep_reads'
Error: beginning of quality values record not found! (@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0:)

Any suggestions for this? Thanks so much!

RNA-Seq fastq tophat • 3.5k views

Run grep -A4 @HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0: Input.fastq and copy what you get here.


Thanks for your help. I did the grep as you suggested and got the following:

grep -A4 @HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0: P_R3R4_filtered75.fastq
grep: 1:N:0:: No such file or directory
P_R3R4_filtered75.fastq:@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0:
P_R3R4_filtered75.fastq-CGCACTCCTGCTCGGACAGCTCCAGGTACGTCTGGTGGTCAATCAGGCCCTTGCGGTA
P_R3R4_filtered75.fastq-@HWI-ST387:212:D1AA6ACXX:4:1102:2629:2187 1:N:0:
P_R3R4_filtered75.fastq-CAACACCACAGCCATTGCTGAGGCCTGGGCTCGCCTGGACCACAAGTTTGACCTGATGTATGCCAAACGTGCCTT
P_R3R4_filtered75.fastq-+

All the reads in this FASTQ file are supposed to be of length 75. I created the file for running rMATS, using the awk command from the thread "Filtering Fastq Sequences Based On Lengths" to extract all reads of length 75.

I also ran TopHat on the original FASTQ file (before extraction), and that ran fine.


Can you paste a cleaner version of the output? I doubt you would normally see something like grep: 1:N:0:: No such file or directory when you run grep. Also, why are we seeing a P_R3R4_filtered75.fastq: or P_R3R4_filtered75.fastq- prefix in front of every line? You know what the FASTQ format looks like, right? The awk solution that you used assumes that each FASTQ record spans exactly four lines. That may be the problem, but this is just my guess. I would rather not speculate further until I see cleaner output. Try:

grep -A4 "^@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0" P_R3R4_filtered75.fastq

and paste the output again.


Sorry, I forgot to put the search item in quotes. I ran the following command:

grep -A4 '@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0:' P_R3R4_filtered75.fastq

I got the following:

@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0:
CGCACTCCTGCTCGGACAGCTCCAGGTACGTCTGGTGGTCAATCAGGCCCTTGCGGTA
@HWI-ST387:212:D1AA6ACXX:4:1102:2629:2187 1:N:0:
CAACACCACAGCCATTGCTGAGGCCTGGGCTCGCCTGGACCACAAGTTTGACCTGATGTATGCCAAACGTGCCTT
+

The problem is that the read @HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0: has no quality information. A FASTQ entry takes four lines: the first line contains the header, the second line contains the sequence, the third line is usually just the + sign, and the fourth line contains the quality string. The entry that is throwing the error is missing the third and fourth lines. If you don't know why this happened, delete this entry, and make sure you also delete any other entries like it.
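To make the four-line layout concrete, here is a minimal Python sketch that checks one candidate record and names the first thing wrong with it. The function name diagnose_record is made up for illustration, and the example lines are copied from the grep output earlier in this thread. It assumes the simple four-lines-per-record layout described above, not multi-line FASTQ.

```python
from typing import List, Optional


def diagnose_record(lines: List[str]) -> Optional[str]:
    """Describe the first problem found in a candidate 4-line FASTQ record.

    Returns None when the four lines look like header / sequence / + / quality.
    Hypothetical helper for illustration; not part of TopHat or any library.
    """
    if len(lines) < 4:
        return "fewer than four lines"
    header, seq, plus, qual = lines[:4]
    if not header.startswith("@"):
        return "header does not start with '@'"
    if not plus.startswith("+"):
        return "third line does not start with '+'"
    if len(seq) != len(qual):
        return "sequence and quality lengths differ"
    return None


# The first four lines returned by grep -A4 in this thread.
reported = [
    "@HWI-ST387:212:D1AA6ACXX:4:1102:2633:2167 1:N:0:",
    "CGCACTCCTGCTCGGACAGCTCCAGGTACGTCTGGTGGTCAATCAGGCCCTTGCGGTA",
    "@HWI-ST387:212:D1AA6ACXX:4:1102:2629:2187 1:N:0:",
    "CAACACCACAGCCATTGCTGAGGCCTGGGCTCGCCTGGACCACAAGTTTGACCTGATGTATGCCAAACGTGCCTT",
]

if __name__ == "__main__":
    print(diagnose_record(reported))
```

For the lines above, the third line is the next read's header instead of +, which is the situation prep_reads complains about.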


Thanks for the suggestion. I am wondering if there is a way to globally check whether the fastq entries of all the reads are fine and to remove the corrupt entries. I have several fastq files and that would be very useful.


I have posted my response as an answer. Please accept it if it has addressed your problem.

8.6 years ago
cat Input.fastq | paste - - - - | awk ' $1 ~ /^@HWI/ && $3 ~/^+/' |  sed 's/\t/\n/g'  > QC_filtered.fastq

This code will remove any group of four lines where the first line doesn't start with @HWI (the matching string may differ between files, so change it accordingly) or the third line doesn't start with +. It works on the assumption that every FASTQ entry spans exactly four lines.


Thanks for the code. I tried but it did not work. It gave an empty output. I also checked for "@HWI" in my input file. I am assuming it worked for you. Any suggestions?


I am not sure why it didn't give you any output, but I can see a potential bug in my code. As soon as it meets the first malformed FASTQ entry, it will reject every entry after it, because it reads four lines at a time and the first bad entry has already thrown that grouping out of step. This is a good piece of code (https://scipher.wordpress.com/2010/05/06/simple-python-fastq-parser/) that will help you find the problem, but you may have to delete the bad FASTQ entries manually.
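One likely reason for the empty output: awk splits fields on any whitespace by default, and these headers contain a space, so after paste - - - - the third field is the sequence and not the + line, and the test on $3 never matches. Separately, the grouping problem described above can be avoided by sliding forward one line at a time whenever four lines do not form a record. The following is a rough Python sketch of that idea, not the parser linked above; the function names are invented for illustration, and it assumes every good record occupies exactly four lines.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str, str]


def looks_like_record(window: List[str]) -> bool:
    """True if four consecutive lines have the shape header / sequence / + / quality."""
    header, seq, plus, qual = window
    return (
        header.startswith("@")
        and plus.startswith("+")
        and len(seq) > 0
        and len(seq) == len(qual)
    )


def filter_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield only complete records, resynchronising after a damaged one."""
    window: List[str] = []
    for raw in lines:
        window.append(raw.rstrip("\r\n"))
        if len(window) < 4:
            continue
        if looks_like_record(window):
            yield (window[0], window[1], window[2], window[3])
            window = []
        else:
            # Discard only the oldest line and retry from the next one, so a
            # single truncated entry cannot shift every later record.
            window.pop(0)
    # Any lines left in the window at end of input are an incomplete record
    # and are dropped.


def clean_file(src: str, dst: str) -> int:
    """Copy the complete records from src to dst and return how many were kept."""
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for record in filter_fastq(fin):
            fout.write("\n".join(record) + "\n")
            kept += 1
    return kept
```

With the file names used in this thread, a call would look like clean_file("P_R3R4_filtered75.fastq", "QC_filtered.fastq"). The check is a heuristic: a quality line may itself begin with @, so it is worth comparing the number of records kept against the number of header lines you expect before trusting the result.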
