I have a question about BLAST+ and the "feature" where identical results get collapsed. (I'm running BLAST locally, version 2.2.30+, on an Ubuntu 12.04 desktop.)
For example, when I do a blastx with the human factor IX sequence against refseq_protein, a few results down, I see this output:
>ref|XP_005594774.1| PREDICTED: coagulation factor IX [Macaca fascicularis]
ref|XP_011739181.1| PREDICTED: coagulation factor IX [Macaca nemestrina]
Length = 461
# (... stats/alignment here... )
I have two questions. First, why does this collapsing occur? Even if the protein sequences are identical, the fact that the two entries have distinct accession numbers suggests to me that the RefSeq database should treat them separately (though in this blog post [http://blastedbio.blogspot.com/2012/05/blast-tabular-missing-descriptions.html], the author notes that in his example the phenomenon is caused by nr, not BLAST). Is this a "feature" of RefSeq, of BLAST itself, or of both?
Second, how can I avoid this problem? For the project I'm working on, I want to see every species represented in the list, even when their protein sequences are identical. Is there some option in BLAST that I can set to disable this behavior? If not, I suppose I'll have to output the data in another format and then parse it in Python (or similar); is there an easy way to do this? The tabular output format (outfmt 6) with the 'salltitles' and 'sscinames' options has gotten me the closest to what I want, but the entries for these collapsed results appear on a single line, which makes them hard to process both by eye and programmatically. I don't think what I'm looking for is particularly complicated, nor do I imagine it's an uncommon need, but despite considerable time spent Googling, I still can't find an answer for why this happens or how to avoid it.
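To make the parsing idea concrete, here is a rough sketch of the kind of post-processing I have in mind. It rests on several assumptions: that the search was run with a custom column list such as -outfmt "6 qseqid sallseqid pident evalue salltitles" (this layout is my own choice for illustration, not a default), that 'sallseqid' separates the merged IDs with ';' and 'salltitles' separates the merged titles with '<>' (as I understand the BLAST+ help text), and that each title ends with the species name in square brackets, as in the output above. The function name expand_collapsed_hit is made up for this example.

```python
import re

# Assumed (hypothetical) column layout, matching
# -outfmt "6 qseqid sallseqid pident evalue salltitles"
COLUMNS = ["qseqid", "sallseqid", "pident", "evalue", "salltitles"]

# Species name assumed to be the last [bracketed] part of a title.
SPECIES_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def expand_collapsed_hit(line):
    """Split one tabular line into one record per merged subject entry."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError("unexpected number of columns: %d" % len(values))
    fields = dict(zip(COLUMNS, values))

    seqids = fields["sallseqid"].split(";")    # assumed ';' separator
    titles = fields["salltitles"].split("<>")  # assumed '<>' separator
    if len(seqids) != len(titles):
        raise ValueError("ID and title counts differ; check the separators")

    records = []
    for seqid, title in zip(seqids, titles):
        match = SPECIES_RE.search(title)
        records.append({
            "qseqid": fields["qseqid"],
            "sseqid": seqid,
            "pident": fields["pident"],
            "evalue": fields["evalue"],
            "title": title.strip(),
            # None when the title has no trailing [species] part
            "species": match.group(1) if match else None,
        })
    return records
```

Each collapsed line would then yield one record per accession, carrying the shared alignment statistics, so the two Macaca entries above would come out as two separate rows. I read the species from the title rather than from 'sscinames' because, as far as I can tell, that field lists only unique names and so may not line up one-to-one with the IDs.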
Thanks!
You may "accept" your own answer - thus indicating to future readers a good solution - as it indeed solves your problem.
Good to know. Thanks!
It should be done cautiously, as it is easy to misuse it, but in this case I think it is warranted.