getting error I don't understand using biopython for retrieving sequences from fast file
2
0
Entering edit mode
8.4 years ago
andynkili ▴ 10

I have a fasta file containning PapillomaViruses sequences (entire genomes, partial CDS, ....) and I'm using biopython to retrieve entire genomes (around 7kb) from this files, so here's my code:

rec_dict = SeqIO.index("hpv_id_name_all.fasta","fasta")
for k in rec_dict.keys():
    c=c+1
    if len(rec_dict[k].seq)>7000:
        handle=open(rec_dict[k].description+"_"+str(len(rec_dict[k].seq))+".fasta","w")
        handle.write(">"+rec_dict[k].description+"\n"+str(rec_dict[k].seq)+"\n")
        handle.close()

I'm using a dictionary for avoiding loading everything in memory. The variable "c" is used to know how many iterations are made before THIS error pops up:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
IOError: [Errno 2] No such file or directory: 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta'

When I print the value of "c", I get 9013 while the file contains 10447 sequences, meaning the for loop didn't go through all the sequences (the count is done before the "if" condition, so the I count all the iterations, not only those which match the condition). I don't understand the INPUT/OUTPUT error, it should create the 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta' file instead of verifying its existence, shouldn't it?

biopython fasta • 3.2k views
ADD COMMENT
1
Entering edit mode

You shouldn't use the SeqIO index dictionary like this - every time you access a record with rec_dict[k] then it gets parsed from disk again, which is comparatively slow. Also, by accessing the records from the file in essentially random order (the key order in the dictionary) you are again going to slow down the disk IO. The solution is much simpler - just iterate of the file in one parse.

for record in SeqIO.parse("hpv_id_name_all.fasta", "fasta"):
    if len(record) > 700:
        # do stuff with record

Also you are not using Biopython for the FASTA output, which is sometimes appropriate.

ADD REPLY
0
Entering edit mode

+1 for FASTA line wrapping.

ADD REPLY
0
Entering edit mode

Are you sure? because doing what you've proposed is exactly what i was trying to avoid by using the dictionary. i thought iterate through the file was slower than indexing it and accessing only the record i needed by their key.

ADD REPLY
2
Entering edit mode

In order to get the length of each sequence, you have to look at the entire record for that sequence. So yes, a single parse though the file is the simplest and fastest solution here.

ADD REPLY
0
Entering edit mode

May I add that I've checked for the sequence from which the file should be created (in my code each sequence is written in a fasta file named after the sequence's id in the original fasta file(hpv_id_name_all.fasta, see code)). And this sequence ('EU410347.1|Human papillomavirus FA75/KI88-03_7401') has no particularity that could explain the IOError

ADD REPLY
0
Entering edit mode

This is a duplicate of a question on another site http://stackoverflow.com/questions/34015094/ioerror-while-retrieving-sequences-from-fasta-file-using-biopython - please don't do this (unless you include the link to the other question) as you are wasting volunteers' time helping you if someone else has already answered the question elsewhere.

ADD REPLY
3
Entering edit mode
8.4 years ago
dschika ▴ 320

The '/' in 'EU410347.1|Human papillomavirus FA75/KI88-03_7401' is the problem, as python assumes there should be a folder EU410347.1|Human papillomavirus FA75/ where to write the file KI88-03_7401.fasta. Replace the '/' with e.g. '_'. Btw: I would also replace the '|' symbol in filenames.

ADD COMMENT
0
Entering edit mode

Thank you very much, I changed every / in the file and it worked well.

ADD REPLY
1
Entering edit mode
8.4 years ago

You can also split up a fasta file like this using pyfaidx:

$ pip install pyfaidx
$ faidx hpv_id_name_all.fasta --split-files --size-range 7000,999999999

Or if you want to write your own script I would suggest using the Fasta class from the module.

ADD COMMENT

Login before adding your answer.

Traffic: 1963 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6