Question

Extract fastq reads from reads list

2

Entering edit mode

7.4 years ago

felipelira3 ▴ 40

This is my code to extract the reads from a raw fastq file. I have one list with the headers of the reads and I want to retreive them from the original file. My output is an empty file. Maybe I did it not so well. Any help, improving this script will be appreciate. Less is more.

import sys
from Bio import SeqIO
from Bio.SeqIO.QualityIO import FastqGeneralIterator

list_of_reads = open(sys.argv[1], "r")
fastq_file = open(sys.argv[2], "r")

recover = set()

for line in list_of_reads:
        if line not in recover:
                read_name = line.split()[0]
                recover.add(read_name)  # add value to set

output = open("recovered.fastq", "w")
for title, seq, qual in FastqGeneralIterator(fastq_file):
        if title in recover:
                output.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
output.close()

python fastq extract SeqIO • 3.9k views

ADD COMMENT • link 7.4 years ago by felipelira3 ▴ 40

2

Entering edit mode

First guess: line in your first for loop ends with '\n', while title in the second for loop does not: add:

line = line.strip()

in the first loop before the if ... .

ADD REPLY • link 7.4 years ago by dschika ▴ 320

0

Entering edit mode

Wow a beginner mistake. Thank you @dschika

ADD REPLY • link 7.4 years ago by felipelira3 ▴ 40

2

Entering edit mode

I think (but this should be tested) that it would be quicker to just loop over the list_of_reads and create a list, and create a set out of that later. Something like

recover = set([line.split()[0] for line in list_of_reads])

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

1

Entering edit mode

If performance is important, there are existing tools that will do the job:

A: Extracting fastq files, based on their fasta counterparts

filterbyname will alternatively accept a list of read names as the argument to "names", rather than a fasta file.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

score 2 · Answer 1 · 2016-11-29

2

Entering edit mode

7.4 years ago

felipelira3 ▴ 40

improved script

#usage: script [list_of_reads] [fastq_file] [-output_file_name]

import sys
from Bio import SeqIO

list_of_reads= open(sys.argv[1], "r")
recover = set([line.strip().split(' ')[0] for line in list_of_reads])

output = open(sys.argv[3], "w")         # uncomment to write the output file
for record in SeqIO.parse(sys.argv[2], "fastq"):
        if record.id in recover:
                output.write(record.format("fastq"))
output.close()

ADD COMMENT • link 7.4 years ago by felipelira3 ▴ 40

0

Entering edit mode

Looks good, just another piece of advice: it's better (and more convenient) to use:

with open(sys.argv[1], 'r') as list_of reads:
    recover = set([line.strip().split(' ')[0] for line in list_of_reads])

The with statement takes care of opening and closing the file (which you didn't do for sys.argv[1])

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k