fastq parsing inefficiencies with python
2
1
6.2 years ago
brismiller ▴ 50

Hey everybody. I have written a custom python script (below) but it is too slow and I need pointers as to where the code can be improved.

What this script is doing:

I have two files: the first is a large fastq file (single-end reads) and the second is a text file with all of the header lines in the fastq file that I want to keep. The text file and fastq file are in the same order, with some of the fastq reads not having a corresponding header line in the text file. This script walks down both files and writes the desired reads to a new file.

The script works, but it is too slow: my large fastq file is ~20 GB (~100 million reads) and my current estimate puts it at ~20 days to finish (based on a run on a small sample). As someone without much experience with big data, I'm sure there is something obvious I could be doing to speed this up; any help is appreciated.

#!/usr/bin/python3

import sys
from itertools import islice
# first input (sys.argv[1]) is the original fq file name
# second input (sys.argv[2]) is the quality threshold used (e.g. '30')

print("\nentering python script")

header_file_name = sys.argv[2] + "_good_read_headers.txt"
newfile_name = sys.argv[2] + "_good_reads.fastq"

print("...making new fastq file for: " + newfile_name)

file_iterator = 0
head_iterator = 0

while True:
    four_line_fq, header = [], []

    with open(sys.argv[1], 'r') as name_original:  # opening original fq file
        lines_gen = islice(name_original, int(file_iterator * 4), int(file_iterator * 4 + 4))
        for i in lines_gen:
            four_line_fq.append(i)

    with open(header_file_name, 'r') as name_header:  # opening good header names only file
        headers_gen = islice(name_header, head_iterator, head_iterator + 1)
        for i in headers_gen:
            header.append(i)

    if len(four_line_fq) == 0:  # if at end of original fq file
        print("...end of original fastq file reached, done writing")
        break

    if header and four_line_fq[0] == header[0]:  # guard: header list is empty once all good headers are used
        with open(newfile_name, 'a+') as new:
            for i in four_line_fq:
                new.write(i)
        head_iterator += 1
    file_iterator += 1

print("...exiting python script\n")
python RNA-Seq • 2.8k views
ADD COMMENT
2

Completely agree with Ram. Quickly jump to Biopython.

ADD REPLY
2

Your strategy of re-opening the files for each slice is very time consuming: each iteration opens the fastq file again and islice has to read from the top of the file just to reach the next record, so every record costs another scan from the beginning. You should open each input file once, then iterate through the lines. The same goes for the output file, which should be opened just once.
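Something like this minimal sketch, assuming both files really are in the same order and each header line in the text file matches the fastq header line exactly (it reuses the header_file_name and newfile_name variables from the script above):

import sys

# open each file exactly once and walk them in lockstep
with open(sys.argv[1]) as fq, open(header_file_name) as heads, \
        open(newfile_name, 'w') as out:
    wanted = next(heads, None)                      # next header we are looking for
    while wanted is not None:
        record = [fq.readline() for _ in range(4)]  # one 4-line fastq record
        if not record[0]:                           # end of the fastq file
            break
        if record[0] == wanted:                     # header matches -> keep the read
            out.writelines(record)
            wanted = next(heads, None)              # advance to the next wanted header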

ADD REPLY
0

Why are you writing a parser yourself? There are multiple tools that do this (e.g. seqtk subseq or BBMap's filterbyname.sh). IMO people have to stop writing stuff from scratch unless they have an extremely good reason to do so.

ADD REPLY
0

Although I agree that using well tested and optimized modules is the way to go, a good reason to write your own parser would be to learn how to do the programming yourself, rather than using the code of others...

ADD REPLY
0

For tasks like these that are widely performed on large files, I guess I prioritize getting stuff done over learning efficiency intricacies. But that's my bias against tools that re-invent the wheel.

ADD REPLY
4
6.2 years ago

Using Biopython as advised. Assuming you have a header.txt file with one header per line:

from Bio import SeqIO

### Open your result file
filtered_fastq = open("your_filtered_fastq.fastq", "a")
### Create a list to hold the headers
header_array = []
### Store the headers in header_array
with open('header.txt', 'r') as f:
    for line in f:
        ### Remove the trailing \n from the header
        header_array.append(line[:-1])
### Read your fastq file
### SeqIO parses the file record by record (read name, sequence, "+", quality)
for record in SeqIO.parse('your_fastq.fastq', 'fastq'):
    ### For each read, if the fastq header exists in header_array, the record is written to the result file
    if record.id in header_array:
        ### Write the record
        SeqIO.write(record, filtered_fastq, 'fastq')
### Close the output file
filtered_fastq.close()
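A possible further tweak (untested on data this size): a Python list is scanned linearly for every "in" test, so with millions of headers a set should make each lookup roughly constant time while keeping the rest of the code the same (this reuses the SeqIO import and the filtered_fastq handle from above):

### Build a set instead of a list; membership tests on a set are O(1) on average
with open('header.txt', 'r') as f:
    header_set = {line.rstrip('\n') for line in f}

for record in SeqIO.parse('your_fastq.fastq', 'fastq'):
    if record.id in header_set:
        SeqIO.write(record, filtered_fastq, 'fastq')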

I'm interested in the final run time, if you could post it in a comment. Thanks!

ADD COMMENT
2

Or throw in a list comprehension because those are neat:

header_array=[line.rstrip() for line in open('header.txt', 'r')]
ADD REPLY
0

Right, only 0.15 s faster per million lines, but still faster and cleaner!

ADD REPLY
0

Welp that's an impressive speedup :-D I'm just a list/dict comprehension fanboy...

ADD REPLY
1

This is not an efficient solution because you are not taking advantage of the details stated by brismiller, mainly that the files are ordered. There is no reason to use a lot of memory to hold 100M headers, and there is no reason to search for each fastq header among these 100M possible headers. Actually, brismiller does exploit the ordering of the files, albeit incorrectly.

ADD REPLY
1

I think brismiller is storing the entire fastq file in "lines_gen"

lines_gen = islice(name_original, int(file_iterator * 4), int(file_iterator * 4 + 4))

and the same for headers, which is time consuming and memory expensive.

I see what you mean by "exploit the ordering", but if you know a way to catch a record without reading the whole file or holding it in memory, I am interested to know how.

Moreover, even if the headers are ordered, record10.id in the fastq doesn't correspond to header10 in the header file.

ADD REPLY
1

Well, we could try to exploit this ordering by creating a generator from the headers rather than a list, since we only care about matching the fastq header to the next() element?
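Something along these lines, perhaps (untested sketch; it assumes the headers in header.txt are in the same order as the reads and match record.id exactly):

from Bio import SeqIO

def header_gen(path):
    # yield one stripped header at a time instead of holding them all in memory
    with open(path, 'r') as f:
        for line in f:
            yield line.rstrip('\n')

wanted = header_gen('header.txt')
target = next(wanted, None)

with open('your_filtered_fastq.fastq', 'w') as out:
    for record in SeqIO.parse('your_fastq.fastq', 'fastq'):
        if target is None:            # all wanted headers found -> done
            break
        if record.id == target:       # this read is the next one we want
            SeqIO.write(record, out, 'fastq')
            target = next(wanted, None)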

ADD REPLY
0

Thank you everyone for your comments, and thanks Bastien for your code. Biopython is a fantastic resource!!

As stated previously, the total fastq file is 107.4 million reads (21.9 GB). I took the first 100,000 reads and ran Bastien's code on them; it took 51.98268 seconds. This is a great improvement, but it still puts me at ~17.29 hours to parse the whole fastq file.

ADD REPLY
0

I bet there is a faster way somewhere, like said above, taking advantage of the ordering... If you have time, maybe try it in C. Also, if you really want to improve the speed, try to find out which operation takes the longest. I don't think filling header_array will be the problem; I'd rather bet on SeqIO.write(record, filtered_fastq, 'fastq'). How many reads do you have to keep, in other words, how many headers do you have? You could also try splitting your fastq file and multiprocessing the header search (http://sebastianraschka.com/Articles/2014_multiprocessing.html).
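To find the slow step, Python's built-in cProfile is a quick check. A sketch, assuming the filtering loop has been wrapped in a function (filter_reads is just a placeholder name):

import cProfile
import pstats

# profile the (placeholder) filter_reads() call and show the 10 most expensive functions
cProfile.run('filter_reads("your_fastq.fastq", "header.txt", "your_filtered_fastq.fastq")',
             'filter_profile.stats')
pstats.Stats('filter_profile.stats').sort_stats('cumulative').print_stats(10)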

ADD REPLY
