Question

For loop that continues finding matches in a file

0

Entering edit mode

7.4 years ago

yarmda ▴ 40

I am trying to subset a fastq file by organism by searching through a fastq file for the readID values in a list. However, I'm only getting the first result back, instead of all instances of the readID in the file.

When I try something like

readID = [list of several different readIDs]
record_dict = SeqIO.index("input.fastq", "fastq")
with open("output.fastq","w") as fastq:
    for read in readID:
        fastq.write(record_dict[read].format("fastq"))

I am only retrieving the first instance of that readID in the input fastq, instead of every one that matches.

Any advice on continuing to look through the input fastq file for additional reads that match the readID? That is, how to prevent this block from stopping after its first match?

pysam python fastq loop • 1.7k views

ADD COMMENT • link updated 7.4 years ago by WouterDeCoster 47k • written 7.4 years ago by yarmda ▴ 40

score 0 · Answer 1 · 2016-12-06

0

Entering edit mode

7.4 years ago

WouterDeCoster 47k

I don't think the problem is in your for loop, but rather in your index, which returns a dictionary in which duplicate keys are not possible.

Why don't you just loop over the fastq file and check if the readID matches to one of your list?

ADD COMMENT • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

This might be wiser. I was enjoying the fastq format that is innate to pysam. Is there an easy way I can maintain that while looping through the fastq without the dictionary?

ADD REPLY • link 7.4 years ago by yarmda ▴ 40

0

Entering edit mode

I'm not sure which part of the code is from pysam? Looks like Biopython SeqIO to me, but I can be wrong. I would write it like:

readIDlist = [list of several different readIDs]
mysequences = [record for record in SeqIO.parse("input.fastq", "fastq") if record.id in readIDlist]
SeqIO.write(mysequences, "output.fastq", "fastq")

Note that in this code the "good" sequences are gathered in a list and then written to disk simultaneously. As such this is not really memory efficient. If your fastq file is reasonably small or your memory is large, this is conceptually easy and probably the fastest. Alternatively:

readIDlist = [list of several different readIDs]
with open("output.fastq    ", "w") as output:
for record in SeqIO.parse("input.fastq", "fastq"):
    if record.id in readIDlist:
        output.write(record.format("fastq"))

But I would expect this solution to be slower. Would be interesting to test.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

Sorry, I forgot to post how the readID dictionary is generated using pysam's fetch function. This is now the primary issue that I've identified. I am not getting the correct read.query_names from using fetch this way.

For context:

for line in [list of accession numbers]:
    for read in samfile.fetch(line):
        readID.append(read.query_name)

ADD REPLY • link 7.4 years ago by yarmda ▴ 40

0

Entering edit mode

readID, as you construct it here, is not a dictionary but a list. I'm unfamiliar with using pysam fetch for this, I also don't know what you aim to achieve here.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

Sorry, the terminology gets muddled in my head sometimes. I am trying to extract reads from a fastq file that match a given list of accession numbers. The process is to extract readIDs (query_names) from a BAM file by accession number and then extract the reads themselves from the fastq that was used to generate the BAM file by the readID. Does that make more sense?

I started posting because any time a read mapped to multiple different accession numbers and any of them was on my list, I'd retrieve that read for each mapping instance.

ADD REPLY • link 7.4 years ago by yarmda ▴ 40

0

Entering edit mode

Isn't all information to generate the fastq file already in the bam file? Essentially, you don't have to go back to the original fastq file because bam file also contains nucleotides and qualities.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

I've been operating under the impression that a BAM file only contained the nucleotides that mapped and this may not be the entire read, in some cases.

ADD REPLY • link 7.4 years ago by yarmda ▴ 40

0

Entering edit mode

I don't think that's the case, but I'm not 100% sure about it either. But you can easily check this, using bamtofastq to compare to your original fastq file... might depend on the aligner which was used.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k