Parse Fastq File - Pad Reads With N'S
5
1
Entering edit mode
12.2 years ago
Steffi ▴ 580

I would like to parse a fastq file as follows: Nominal read length is 100, but some reads are shorter than others. In that case I want to pad these reads with N's and their corresponding quality values as well. So in the end all reads should have the same length. Any idea how to do this most quickly? Shell scripting or perl? Or does somebody knows a tool who already offers something along these lines? Thanx!

fastq parsing • 6.1k views
ADD COMMENT
1
Entering edit mode

I would recommend not to overuse shell scripts for complex tasks that cannot be solved with several unix commands. I used to do that for some time, but realized that it is a bad idea. Perl/Python is more portable, more convenient, cleaner and perhaps faster.

ADD REPLY
1
Entering edit mode

this is just nuts, just contact the RUM people about this issue

ADD REPLY
0
Entering edit mode

Is Python OK for you? Do you have a particular quality score in mind for the padded N bases?

ADD REPLY
0
Entering edit mode

Actually I would prefer a cute shell script ;) But I know perl/python would be much better. The shorter reads have different lengths, I have to fill them up to their original length, so 100.

ADD REPLY
0
Entering edit mode

Ah sorry, I did not answer the quality score question: I think it does not matter, just put anything (valid).

ADD REPLY
0
Entering edit mode

I have done this, but til they have checked this I just want to go on with the data analysis...

ADD REPLY
0
Entering edit mode

Can this be done with fasta files?

ADD REPLY
8
Entering edit mode
12.2 years ago

As long as the sequences in the fastq file do not span multiple lines you can get by with a very elegant and fast python solution that uses iterators. This avoids all conditionals and other complexities, it could look like this:

import itertools, string, sys

# strip newlines from the standard input
stream = itertools.imap(string.strip, sys.stdin)

# this works only if sequences span only one line
for head in stream:

    # advance the stream
    seq  = stream.next()
    tmp  = stream.next()
    qual = stream.next()

    # this is how much to pad
    size = 100 - len(seq)

    # pad each line
    seq  = seq  + "N" * size
    qual = qual + "2" * size

    # print the joined elements
    print "\n".join( (head, seq, tmp, qual) )
ADD COMMENT
1
Entering edit mode

Very short and elegant solution! Only 4 lines at a time are loaded in memory? It reads from the stdin, so I can use it like this? python awesome.py < myfile.fastq > mypaddedfile.fastq

ADD REPLY
0
Entering edit mode

yes to all, perhaps I should add it to the description

ADD REPLY
0
Entering edit mode

Thanks! I have used it and it worked perfectly fine and very fast!

ADD REPLY
0
Entering edit mode

This is wonderful. Thanks! A note to anyone else who might run into the same thing I did: I had to change size to 101.

ADD REPLY
0
Entering edit mode

This is an old thread but still.. any reason why you pad the quality line with "2"s? Wouldn't the lowest quality score "!" make more sense here?

ADD REPLY
0
Entering edit mode

I forgot the exact reason and it may not be applicable here - I think the rationale in general when generating qualities is that some tools ignore bases that are very low in quality so we need to avoid making them too low. But then why not make them maximal? There is a reason there too, a quality such as say E can come from different types of encoding, so using that makes the encoding the file ambiguous so that too may be a problem. Picking 2 seems like a good choice between indicating Sanger encoding and avoiding bases being ignored.

ADD REPLY
3
Entering edit mode
12.2 years ago
Rok ▴ 190

GNU Awk version:

awk 'NR%4==0 {print t $0}; NR%4==2 {while(a++<100-length($0)){s=s "Q"; t=t "N"};print s $0}; NR%2==1 {print}' file.fastq

You can define different characters for replacement in sequence line and quality score line(N and Q above).

GNU Sed version (do not use, really slow):

sed -e :a -e '2~4 s/^.\{1,99\}$/N&/;ta' -e '4~4 s/^.\{1,99\}$/Q&/;ta' file.fastq
ADD COMMENT
0
Entering edit mode

this is neat, we have kind of a code-golf going on

ADD REPLY
0
Entering edit mode

I thought awk version would need less time than Python version, but on 2 million lines they needed the same time.

ADD REPLY
0
Entering edit mode

I would have thought the other way around since awk concatenates one by one - real life tests are often surprising

ADD REPLY
0
Entering edit mode

Is the awk version correct? It pads at the beginning of the line, not at the end. Plus, quality and sequence paddings seem to be inverted. Rok can you edit your post, please. I propose this corrected solution: awk 'NR%4==0 {print $0 t}; NR%4==2 {while(a++<100-length($0)){s=s "N"; t=t "Q"};print $0 s}; NR%2==1 {print}' file.fastq

ADD REPLY
3
Entering edit mode
12.2 years ago
Gustavo ▴ 530

No Perl version yet? :)

#!/bin/env perl
$target = 100;
while (<>) {
    if (/^(\@|\+)/) {
        print;
        $_ = <>;
        chomp;
        print $_, ($1 eq '@' ? 'N' : '2') x ($target-length()), "\n";
    }
}
ADD COMMENT
1
Entering edit mode
12.2 years ago

Here's something in python. Doesn't require biopython. It's pretty ugly, but should get the job done:

import sys

inFile = open(sys.argv[1],'r')

header = seq = qual = ''
seqs = quals = False

for line in inFile:
    if line[0] == "@":
        if header != '':
            if len(seq) < 100:
                pad = 100-len(seq)
                seq += 'N' * pad
                qual += '>' * pad

            print "@" + header + "\n" + seq + "\n" + "+" + header + "\n" + qual

        header = line[1:].strip()
        quals = False
        seqs = True
        qual = seq = ''
    elif line[0] == "+":
        if quals:
            qual += line.strip()
        else:
            quals = True
            seqs = False
    else:
        if quals:
            qual += line.strip()
        else:
            seq += line.strip()

if len(seq) < 100:
    pad = 100-len(seq)
    seq += 'N' * pad
    qual += '>' * pad

print "@" + header + "\n" + seq + "\n" + "+" + header + "\n" + qual

Use it so:

python theScript.py yourFile.fastq > output.fastq
ADD COMMENT
0
Entering edit mode

thanks a lot!! I will try it and let you know how it worked!

ADD REPLY
1
Entering edit mode
12.2 years ago

Here's a pure Bash solution. The code could be factorized and a little bit simplified, but it works.

#! /bin/bash
# name: pad_fastq.sh
# usage: bash pad_fastq.sh myfile.fastq

header=""
seq=""
plus=""
qual=""

while read line ; do
    if [[ ${line:0:1} == "@" ]] ; then
        header="${line}"
    elif [[ ${line:0:1} == "+" ]] ; then
        plus="${line}"
    elif [[ "${plus}" ]] && [[ "${line}" ]] ; then
        length=${#line}
        if [[ "${length}" -lt 100 ]] ; then
            diff=$(( 100 - length ))
            padding=$(for((i=1 ; i<=diff ; i++)) ; do printf "%s" ">" ; done)
            qual="${line}${padding}"
        else
            qual="${line}"            
        fi
        echo -e "${header}\n${seq}\n${plus}\n${qual}"
        header=""
        seq=""
        plus=""
        qual=""
    elif [[ "${header}" ]] && [[ "${line}" ]] ; then
        length=${#line}
        if [[ "${length}" -lt 100 ]] ; then
            diff=$(( 100 - length ))
            padding=$(for((i=1 ; i<=diff ; i++)) ; do printf "%s" "N" ; done)
            seq="${line}${padding}"
        else
            seq="$line"            
        fi
    fi
done < "${1}"

exit 0
ADD COMMENT

Login before adding your answer.

Traffic: 1850 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6