Biostar Beta. Not for public use.
Fastq Quality Read And Score Length Check
4
Entering edit mode
6.0 years ago
United States

I have a huge Fastq file and I want to check if the sequence length and the quality score length is the same for each read. That is, for each read in the Fastq file (a read has 4 lines as we are aware), I want to check if the length of the 2nd line (sequence) and the length of the 4th line (quality score) are the same.

What is the most efficient way to go about it ?

Any suggestions will be greatly appreciated.

ADD COMMENTlink
0
Entering edit mode

This answer to a similar question may be helpful if you do find formatting errors.

ADD REPLYlink
5
Entering edit mode
15 months ago
Netherlands

The answer to your question is using awk.

See, for counting the read length, use :

awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' file.fastq > readLength

Output:

 2    ATCTCATTTTTCACGTTTTTTAGTGATTTCATAATTTTTCAAGTCGTAAAGTGGATGTTTCTCATTTTTCATGATT    76
 6    GTGTTAGGTGCCAAGATCCTTCTGTGTTTAAAGTGCCTTCAGGGGTGAGTGGTTTTAGTGCCATGTACTGTGGAAA    76
10    TGTATGGAGTGTTTAGATGCTCCCTTCTAGCTTCTTTTCAGACAATTTACTAGGATGATTGACAGAGAATTGCTTT    76
14    ACTGTAGCCCCGCCACACTGCCACTGGTATAGGACACCAAAAAGCACTCTCTAATAAATACAGAAAGATTAGCAGA    76
18    ACGCTGGAAACTTAGACTAGCAGTTAGAATGTAGTGGTTGATAGAGCATCCAGCAACCAGGGGTAAACTCGGCCTC    76
22    TTTTTAGTGATTTCGTCATTTTTCAAGTCGTCAAGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCCTCGCC    76
26    CAGCAAGCGAAATAGTTATGGCCCATGTTTTCTGGATATCAATTACACAACGGGGAGATTTGGAAGGAGAACATGC    76
30    CATTTCACATGAGTGAGACTCAAGGCTTGGGATGTGTAATACAAGATCAACACTTGGGACAAATTCCAGCATGGGG    76

where first column is the line number in the original fastq file, 2nd is the read in light and third the length of this read

and for the quality length, use :

awk '{if(NR%4==0) print NR"\t"$0"\t"length($0)}' file.fastq > qualityLength

 4    1==DDBEEHHHHDHIHIIIIIFHEHAGGHGIHIGHIIIIGGIIIGAFIIIGDGIHIIHCIIIHEHGEIII@HHGC?    76
 8    BBBFFFFFFHHHHHIGIJJIJIJJJIJIJJIJIHHGHIIIIJJJJ?DFHDFHHIIJJJIIIIJJJJJJJHFHGFGC    76
12    ??@D?D?DFFFDFGGIGH@>IHG@FGAA+CFF>9CGH@9DDH@GGICH<B@FHIBEH<GG@BF<BB=CA;F)7@GC    76
16    @@@DDDDDF>FFHIIGIGEGG??BFEC?9CFFE9DDGAGDDDF<CBE@FEHHGGHEDGHGHBA>EEEDCCBCCCCA    76
20    CCCFFFFFGHHHHJJJJJJIHIJHHGJIIIJHIJFHGADHGHGHDGIIIIIIJJIJJJJJJJI@GIHHHHH=BE?@    76
24    BCCFFDFDHHHHHJJJJJJJJJJGIIJJGHIIJIJ?FHCEHDGIJJIJJJJJIJJJJIJJJJIGJIIJJJIIGIGB    76
28    @CCDDFFDHHFHHJFGEGIIIIJAHGEEDGHGICHEEIIJ@HEEGIIJDCHHIIJ>EACEHFF?@DCEDCB>ACCD    76
32    @@@FFDDDHHDDDFHIJGIIGCGGIIIIIJJGHHCF1:FGFGIGGGHHIJJJDHGHIGHIIIGIJIIJIGHGHHHH    76

and finally, use awk again to compare the columns, here the column number 3

awk 'NR==FNR{a[$3]++;next}!a[$3]' readLength qualityLength

This will output the lines where the length is not matched, the output will be from the 2nd file.

FastQC will tell you the overall picture but this will exactly tell you what you want.

HTH

Edit : The code is better now as no need to use diff and to calculate the line number back (as suggested by @Istvan)

ADD COMMENTlink
1
Entering edit mode

just print out the NR variable on each line - that way one would not need to compute it at all

ADD REPLYlink
0
Entering edit mode

Good suggestion, edited the code and works better now, no need to use diff and line number calculation. Cheers

ADD REPLYlink
0
Entering edit mode

Unnecessary use of cat.

ADD REPLYlink
0
Entering edit mode

edited, late though :)

ADD REPLYlink
0
Entering edit mode

Please let me know how to plot all read length histogram??

ADD REPLYlink
3
Entering edit mode
13 months ago
JC 7.9k
Mexico

simple Perl solution reporting problematic reads:

#!/usr/bin/perl
use strict; 
use warning;
my ($read, $seqLen, $qualLen);
my $nl = 0;
while (<>)  {
    chomp;
    $nl++;
    if ($nl == 1) { 
        $read = $_; 
    }
    elsif ($nl == 2) { 
        $seqLen = length $_; 
    }
    elsif ($nl == 4) { 
        $qualLen = length $_;
        print "$read\t$seqLen != $qualLen\n" if ($seqLen != $qualLen);
        $nl   = 0;
    }
}
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1