Why SeqIO doesn't read the last record
2
1
Entering edit mode
8.0 years ago
-_- ★ 1.1k

I am writing some unit test and encountered a odd behavior from SeqIO in biopython. I have reproduced the problem in the following script

import mock
from Bio import SeqIO

FAKE_FASTA_CONTENT = '''>1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>2
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>3
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
'''

# https://docs.python.org/3.3/library/unittest.mock.html
m = mock.mock_open(read_data=FAKE_FASTA_CONTENT)
with mock.patch('__main__.open', m, create=True):
    with open('foo') as inf:
        io = SeqIO.parse(inf, 'fasta')
        for k, rec in enumerate(io):
            print(rec)

The output is

ID: 1
Name: 1
Description: 1
Number of features: 0
Seq('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...AAA', SingleLetterAlphabet())
ID: 2
Name: 2
Description: 2
Number of features: 0
Seq('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...AAA', SingleLetterAlphabet())

I wonder why it cannot find the third (also last) sequence from the input.

pasting FAKE_FASTA_CONTENT into a text file does work, but what's going on in this testcase? How does SeqIO do parsing?

fasta biopython • 2.3k views
ADD COMMENT
3
Entering edit mode

You are using an weird mock object - and it is not all that clear what that does. Perhaps that mock object is not as direct replacement as you think. Why not use a StringIO object instead. That is really that is supposed to imitate a string being used as a file.

ADD REPLY
1
Entering edit mode

mock is not weird. I was trying to mock out an open call in a module when writing unittests for it, that's why mock I am using mock. But it appears that implementation of mock_open().readline isn't exactly like a file handle, it's more of an iterator.

Is there a way to do mocking with StringIO, too?

ADD REPLY
1
Entering edit mode

what I meant by 'weird' was that it claims to do something but it does not actually do it exactly the same way - which it turned out to be the case. StringIO is a way to create a filehandle to a string as if it were a file that can then be passed to objects expecting a filehandle. It can be used during testing.

ADD REPLY
0
Entering edit mode

I think it agreeable that in this case, the mock_open object should be improved to be more like a file handle. FYI, it doesn't support __iter__ either, so for line in inf won't work for the mock object unless __iter__ method is explicitly written during testing.

ADD REPLY
1
Entering edit mode
8.0 years ago
-_- ★ 1.1k

I have figured out where the problem is, the relevant code is the this SimpleFastaParser at

https://github.com/biopython/biopython/blob/dbb0de6337d5604ea5a5b2276dfb0ae0d2ddd1dc/Bio/SeqIO/FastaIO.py#L24,

which treats handle as a file handle. However, mock_open.readline() behaves like an iterator, so the last call to handle.readline in SimpleFastaParser raise StopIteration Exception, which makes it return early without yielding the last record. A possible fix is shown here:

https://github.com/zyxue/biopython/commit/a4f53b6990c846519d9344d3d1d92b5891a31c9c, and it passes tests.

ADD COMMENT
1
Entering edit mode

thanks for following up on this, it is what I suspected, the mocked object is does not behave exactly like the object it is supposed to mock - which frankly should be the purpose of mocking.

The fix you show seems that it might have problems - it catches the the StopIteration exception and produces a break in its place. That is always a bad idea. The StopIteration exception is what iterators are supposed to raise to indicate the end of iteration (at least in Python 2 that was the rule). This code might now break under other use cases that expect that stop iteration there.

ADD REPLY
0
Entering edit mode

Hi Istvan, I understand what you mean. Could you give an example/link when StopIteration is expected, and the behavior of more than break is expected?

ADD REPLY
0
Entering edit mode

When an iterator runs out of elements it signals that by raising the StopIteration exception.

It is not an exception in the classical sense of something going the wrong way. It signals a normal state of the iterator. Hence I think one is not supposed to catch that and do something else in its place.

I will say that I have not investigated this piece of code in detail to be able to state with confidence that it will misbehave. It just feels like a fix that makes one problem go away but may introduce another subtler one.

ADD REPLY
0
Entering edit mode

Upon further investigation I have now a better sense what the problem is. SimpleFastaParser is an awkwardly written function to begin with. Instead of iterating over the handle directly it calls the readline() function of it.

Then it puts the whole thing into an infinite loop that it will break out of under certain condition.

So it is not using the handle as an iterator at all. Hence it needs to catch the StopIteration exception. The whole function should be rewritten rather than fixing it here and there.

ADD REPLY
0
Entering edit mode

I agree. SimpleFastaParser does look awkward at first glance. But if it has been tested extensively. It should still be fine.

I also feel it better use duck typing, as long as handle has either __iter__ or readline methods, the parsing shall go through successfully. IMHO, __iter__ feels more preferable as it's more common.

Also, I have found the corresponding code for mock_open here.

ADD REPLY
1
Entering edit mode
8.0 years ago
-_- ★ 1.1k

I settled with using a mock_open-like object, mk_open created myself instead of using mock_open from mock package together with StringIO so that is has a more pure readline behavior.

from StringIO import StringIO

with mock.patch('__main__.open') as mk_open:
    handle = mock.MagicMock(spec=file)
    handle.__enter__.return_value = StringIO(FAKE_FASTA_CONTENT)
    mk_open.return_value = handle

    with open('foo') as inf:
        io = SeqIO.parse(inf, 'fasta')
        for k, rec in enumerate(io):
            print(rec)

Now it works:

ID: 1
Name: 1
Description: 1
Number of features: 0
Seq('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...AAA', SingleLetterAlphabet())
ID: 2
Name: 2
Description: 2
Number of features: 0
Seq('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...AAA', SingleLetterAlphabet())
ID: 3
Name: 3
Description: 3
Number of features: 0
Seq('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...AAA', SingleLetterAlphabet())
ADD COMMENT

Login before adding your answer.

Traffic: 2555 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6