Extract fastq reads from reads list
1
2
Entering edit mode
7.4 years ago
felipelira3 ▴ 40

This is my code to extract the reads from a raw fastq file. I have one list with the headers of the reads and I want to retreive them from the original file. My output is an empty file. Maybe I did it not so well. Any help, improving this script will be appreciate. Less is more.

import sys
from Bio import SeqIO
from Bio.SeqIO.QualityIO import FastqGeneralIterator

list_of_reads = open(sys.argv[1], "r")
fastq_file = open(sys.argv[2], "r")

recover = set()

for line in list_of_reads:
        if line not in recover:
                read_name = line.split()[0]
                recover.add(read_name)  # add value to set

output = open("recovered.fastq", "w")
for title, seq, qual in FastqGeneralIterator(fastq_file):
        if title in recover:
                output.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
output.close()
python fastq extract SeqIO • 3.9k views
ADD COMMENT
2
Entering edit mode

First guess: line in your first for loop ends with '\n', while title in the second for loop does not: add:

line = line.strip()

in the first loop before the if ... .

ADD REPLY
0
Entering edit mode

Wow a beginner mistake. Thank you @dschika

ADD REPLY
2
Entering edit mode

I think (but this should be tested) that it would be quicker to just loop over the list_of_reads and create a list, and create a set out of that later. Something like

recover = set([line.split()[0] for line in list_of_reads])
ADD REPLY
1
Entering edit mode

If performance is important, there are existing tools that will do the job:

A: Extracting fastq files, based on their fasta counterparts

filterbyname will alternatively accept a list of read names as the argument to "names", rather than a fasta file.

ADD REPLY
2
Entering edit mode
7.4 years ago
felipelira3 ▴ 40

improved script

#usage: script [list_of_reads] [fastq_file] [-output_file_name]

import sys
from Bio import SeqIO

list_of_reads= open(sys.argv[1], "r")
recover = set([line.strip().split(' ')[0] for line in list_of_reads])

output = open(sys.argv[3], "w")         # uncomment to write the output file
for record in SeqIO.parse(sys.argv[2], "fastq"):
        if record.id in recover:
                output.write(record.format("fastq"))
output.close()
ADD COMMENT
0
Entering edit mode

Looks good, just another piece of advice: it's better (and more convenient) to use:

with open(sys.argv[1], 'r') as list_of reads:
    recover = set([line.strip().split(' ')[0] for line in list_of_reads])

The with statement takes care of opening and closing the file (which you didn't do for sys.argv[1])

ADD REPLY

Login before adding your answer.

Traffic: 2975 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6