Biostar Beta. Not for public use.
percent identities for all by all protein alignment
0
Entering edit mode
10 months ago
bioguy • 30

Very simple question – I have a list of, say, 1000 protein sequences in a fasta file. I want to compute percent identities for each pair of sequences. So, in this example, it'd generated a 1000 by 1000 matrix reflected over the diagonal.

I've tried diamond and blastp, but even when I set -max-target-seqs to 0 or set the evalue/min percent identity to appropriately high/low values, they still don't report all possible alignments. There is some sort of filtering still going on, it would appear. Hopefully I'm just missing some parameter?

Any recommendations for how to get this done quickly? I'd hard code it myself but I'd rather have one of these algorithms do it for me, as they'll presumably be faster at full scale.

Thanks in advance.

ADD COMMENTlink
2
Entering edit mode
15 months ago
fishgolden • 390

There is some sort of filtering still going on, it would appear. Hopefully I'm just missing some parameter?

I'd hard code it myself but I'd rather have one of these algorithms do it for me, as they'll presumably be faster at full scale.

I think this is really difficult task since the reason why those programs (at least BLAST) are fast is that they have some filtering steps.

In my opinion, one of the critical filters still remaining in your protocol with BLAST is that the "word size" & I don't know about DIAMOND very much, sorry.

Anyway, I recommend water or needle in EMBOSS package http://emboss.sourceforge.net/ . Or SSEARCH, or GGSEARCH in FASTA package https://github.com/wrpearson/fasta36 for your purpose.

ADD COMMENTlink
0
Entering edit mode
6 weeks ago
Joe 12k
United Kingdom

Neither DIAMOND nor BLAST are the tool for this really.

Just iterate over your file in an exhaustive pairwise approach, and use a normal aligner like MAFFT or CLUSTAL which can give you a score.

I'd advise looking at ways to parallelize this as you have an n choose k == 499500 problem here.

ADD COMMENTlink
1
Entering edit mode

Thank you – ended up going the CLUSTAL route with some basic multi-threading to speed things up. Very much appreciate your help!

ADD REPLYlink
0
Entering edit mode

No problem ;)

ADD REPLYlink
0
Entering edit mode

CLUSTAL and MAFFT are Multiple Sequence Alignment software that they perform stringent DP (edit: jrj.healey said CLUSTAL's alignment is a bit janky Its alignment may not be stringent one...) same as water and needle when they are used as pairwise aligner. I mean, I think CLUSTAL and MAFFT are over spec. But it is ok because the results would be the same when the same parameters were used, i guess.

ADD REPLYlink
1
Entering edit mode

Clustal begins by performing all vs all pairwise alignments anyway, but other threads on the site have shown that its pairwise alignments can be a bit janky. Using needle or water directly is likely a good approach too. In my experience, MAFFT gives better DNA alignments than does CLUSTAL anyway.

(Those were just the first 2 to come to mind, there are many other aligners to experiment with :) )

ADD REPLYlink
0
Entering edit mode

but other threads on the site have shown that its pairwise alignments can be a bit janky.

Oh, really. I didn't know that. Thank you for the comment.

ADD REPLYlink
0
Entering edit mode

I'm mainly referring to this one: what is the problem with using clustal to do pairwise alignment?

But its far from a systematic evaluation, so YMMV.

ADD REPLYlink
0
Entering edit mode

In Clustal Omega, the alignments are then computed using the very accurate HHalign package (Söding, 2005), which aligns two profile hidden Markov models (Eddy, 1998).

http://msb.embopress.org/content/7/1/539

It seems that Clustal Omega uses HHalign.

A distance is calculated between every pair of sequences and these are used to construct the dendrogram which guides the final multiple alignment. The scores are calculated from separate pairwise alignments. These can be calculated using 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur and Lipman (extremely fast but approximate).

ClustalW is using DP by default.

http://www.clustal.org/download/clustalw_help.txt

ADD REPLYlink
0
Entering edit mode

Ah ok, thank you. Will give the others you recommended a try as well if need be. Really appreciate all the help!

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1