Iterating Through Fasta Sequences With Pyfasta Python Module
14.0 years ago

I just started playing with the pyfasta Python module, which is described as providing fast, memory-efficient, pythonic (and command-line) access to FASTA sequence files.

It looks very promising for my work, but I am currently unable to use it for a very simple application: I want to process the sequences from a FASTA file sequentially, in the same order in which they appear in the file. I have tried the following:

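import pyfasta

f = pyfasta.Fasta("coreg.fa")
ncar = 60  # number of characters per output line

with open("output.file", "w") as out_file:
    # f.keys() does not follow the order of the input file
    for header in f.keys():
        name = str(header)
        seq = str(f[header])
        # FASTA header lines start with ">"
        out_file.write(">" + name + "\n")
        # write the sequence in lines of at most ncar characters
        for i in range(0, len(seq), ncar):
            out_file.write(seq[i:i + ncar] + "\n")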

This simple example only reads the sequences from my input FASTA file and writes them back to an output FASTA file. However, the original order is lost, since the headers are stored in a dictionary-based class (if I understood correctly), which does not preserve the order but allows faster lookups. This is not quite the behavior I am trying to achieve. So, my question is:

Is there a way to maintain the order from the original file?

(I do not want to iterate on sorted(f.keys()) since this does not give back the original order either.)

Many thanks!

python fasta • 7.2k views

What exactly do you need to do with the sequences? I have a module that creates a class for FASTA files and should be able to maintain order with some small modifications.


I have 20 different uses for these sequences, depending on the project... However, they all have in common that I want to process the sequences one at a time. The nice feature of pyfasta for me is that I don't have to put all the sequences in an object that has to reside in memory. For multi-gigabyte FASTA files, that will matter. Would you post your alternative, @nuin? I am interested to see it. Cheers!


Mine just puts all the sequences in memory. Send me an email and we can talk (nuin AT genedrift DOT org).

14.0 years ago
brentp 24k

That's a use-case I never thought of, but, as it happens, you can do that with the index. Mostly, you don't need to know about the index, but since it saves the file position, you can use that to order your chromosomes:

import pyfasta

f = pyfasta.Fasta('some.fasta')
# sort the headers by the start position stored in the index
sorted_keys = [x[0] for x in sorted(f.index.items(), key=lambda a: a[1][0])]

And that doesn't re-parse the file. If you're doing that frequently, you could use a subclass:

class Fasta(pyfasta.Fasta):
    def sorted_keys(self):
        return [x[0] for x in sorted(self.index.items(), key=lambda a: a[1][0])]
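
To see why sorting on the stored start position recovers the file order, here is a minimal sketch that runs without pyfasta. The `index` dict is a hypothetical stand-in for `f.index`, assuming (as the one-liner above does) that each value is a (start, stop) pair; the headers and offsets are made up for illustration.

```python
# Hypothetical stand-in for f.index: header -> (start, stop) offsets.
# The values are invented; a real index would come from pyfasta.
index = {
    "chr2": (1250, 2400),
    "chrM": (10, 1200),
    "chr1": (2450, 9000),
}


def keys_in_file_order(index):
    """Return the headers sorted by the start offset of their sequence."""
    items = sorted(index.items(), key=lambda item: item[1][0])
    return [name for name, _offsets in items]


ordered = keys_in_file_order(index)
print(ordered)
```

The sort only touches the small index, never the sequence data, which is why no re-parsing of the FASTA file is needed.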

Thanks Brent! That was nested pretty far! Do you think it would be easy to add a feature that makes this possible out of the box, without too much time overhead compared to a plain FASTA iterator over the file? Cheers
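
For comparison, a plain FASTA iterator of the kind mentioned here could look roughly like the sketch below. It is not part of pyfasta; `iter_fasta` is a hypothetical helper name, and the sketch assumes a simple text file where each record starts with a ">" line. It yields records in file order but holds one whole sequence in memory at a time.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs in file order, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # emit the last record
    if header is not None:
        yield header, "".join(chunks)


# Small in-memory example; a real file would be opened with open(path).
demo = io.StringIO(">seq2 second\nACGT\nAC\n>seq1 first\nGGCC\n")
records = list(iter_fasta(demo))
print(records)
```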


will, it's really just in f.index, so that's not nested too far. rather than add more features (which i have to document, test, maintain, and justify), i'd prefer to add it to the wiki or something.

13.1 years ago
krst ▴ 10

If you have Python 2.7+, you could modify pyfasta to store the keys in an OrderedDict so that they stay in order.
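
As a rough illustration of that idea (not pyfasta's actual code), the sketch below builds a header index in an OrderedDict while scanning a file, so that iterating over the keys gives back the file order. `build_ordered_index` is a hypothetical name, and the offsets are character positions in a text stream, which is an assumption made to keep the example short.

```python
from collections import OrderedDict
import io


def build_ordered_index(handle):
    """Map each header to [start, stop] offsets, keeping file order."""
    index = OrderedDict()
    pos = 0          # offset of the current line within the stream
    current = None   # header of the record being read
    for line in handle:
        if line.startswith(">"):
            current = line[1:].strip()
            # the sequence starts right after the header line
            index[current] = [pos + len(line), pos + len(line)]
        elif current is not None:
            # extend the record to the end of this sequence line
            index[current][1] = pos + len(line)
        pos += len(line)
    return index


demo = io.StringIO(">b\nAC\n>a\nGG\n")
index = build_ordered_index(demo)
print(list(index))
```

On Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict mainly matters for older interpreters.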
