Which protein substitution matrix should I use for blastx on divergent sequences?
15 months ago
Darill • 30

I'm trying to find homology between sequences in a genome and several viral proteins in a database. Viruses evolve very quickly, so I expect a lot of divergence between the sequences I am comparing.

In the literature, I see that BLOSUM matrices below 62 are recommended for relatively divergent sequences, as are PAM matrices around 250.

I tried all the available matrices, and here are the results for one particular sequence that blastx should find (a true positive). The e-values are very high, I guess because the database is quite big (around 150,000 protein sequences).
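To make the link between database size and e-value concrete, here is a minimal sketch of the standard Karlin-Altschul relation between bit score and e-value, E = search_space * 2^(-bit_score). The function names are hypothetical, and the "search space" is treated as a single number; how DIAMOND computes its effective search space internally is not assumed here.

```python
import math


def evalue_from_bits(bit_score: float, search_space: float) -> float:
    """Expected number of chance hits scoring at least bit_score.

    Uses the Karlin-Altschul relation E = search_space * 2**(-bit_score),
    where search_space is (effective query length) * (effective db length).
    """
    return search_space * 2.0 ** (-bit_score)


def bits_needed(target_evalue: float, search_space: float) -> float:
    """Bit score required to reach target_evalue in a given search space."""
    return math.log2(search_space / target_evalue)


def implied_search_space(evalue: float, bit_score: float) -> float:
    """Invert the relation: search space implied by a reported (E, bits) pair."""
    return evalue * 2.0 ** bit_score


if __name__ == "__main__":
    # Doubling the search space doubles the e-value at a fixed bit score,
    # which is the same as needing one extra bit to keep the same e-value.
    base = 1.0e12  # hypothetical search space, for illustration only
    for factor in (1, 2, 4, 8):
        space = base * factor
        print(factor, evalue_from_bits(40.0, space), bits_needed(1e-3, space))
```

Under this relation a larger database raises every e-value by the same factor, so it should not change how hits rank against each other within a single search.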

BLOSUM 90
scaf1   104 viral_seq   71.4    14  4   0   401993  401952  19  32  1.8e+01 38.0

BLOSUM 80
scaf1   104 viral_seq   71.4    14  4   0   401993  401952  19  32  2.0e+01 37.9

BLOSUM 65
scaf1   104 viral_seq   21.5    107 78  1   88433   88753   2   102 1.7e+01 38.1

BLOSUM 45
scaf1   104 viral_seq   21.5    107 78  1   88433   88753   2   102 9.8e+01 35.6

PAM250
scaf1   104 viral_seq   21.5    107 78  1   88433   88753   2   102 1.2e+02 35.3

PAM70
scaf1   104 viral_seq   71.4    14  4   0   401993  401952  19  32  1.5e+00 41.7

PAM30
scaf1   104 viral_seq   71.4    14  4   0   401993  401952  19  32  2.4e-01 44.3
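The rows above can be compared programmatically. The sketch below groups them by query coordinates to check whether every matrix reports the same alignment. The column meanings are an assumption (BLAST-style tabular output with identity, alignment length, query start/end, e-value, and bit score in the usual order, plus one extra second column whose meaning is not stated here); the matrix labels are copied from the results as posted.

```python
from collections import defaultdict

# Rows copied from the results above, prefixed with the matrix label.
ROWS = """\
BLOSUM90 scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 1.8e+01 38.0
BLOSUM80 scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 2.0e+01 37.9
BLOSUM65 scaf1 104 viral_seq 21.5 107 78 1 88433 88753 2 102 1.7e+01 38.1
BLOSUM45 scaf1 104 viral_seq 21.5 107 78 1 88433 88753 2 102 9.8e+01 35.6
PAM250 scaf1 104 viral_seq 21.5 107 78 1 88433 88753 2 102 1.2e+02 35.3
PAM70 scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 1.5e+00 41.7
PAM30 scaf1 104 viral_seq 71.4 14 4 0 401993 401952 19 32 2.4e-01 44.3
"""


def parse_rows(text):
    """Parse each row into a dict; field positions are assumed, see lead-in."""
    hits = []
    for line in text.splitlines():
        f = line.split()
        hits.append({
            "matrix": f[0],
            "pident": float(f[4]),
            "length": int(f[5]),
            "qstart": int(f[8]),
            "qend": int(f[9]),
            "evalue": float(f[12]),
            "bits": float(f[13]),
        })
    return hits


def group_by_query_region(hits):
    """Map (qstart, qend) -> list of matrices that reported that alignment."""
    groups = defaultdict(list)
    for h in hits:
        groups[(h["qstart"], h["qend"])].append(h["matrix"])
    return dict(groups)


if __name__ == "__main__":
    for region, matrices in group_by_query_region(parse_rows(ROWS)).items():
        print(region, matrices)
```

Grouping this way shows two distinct query regions: a 14-residue alignment at 71.4% identity and a 107-residue alignment at 21.5% identity. This may matter when comparing e-values, because the matrices are not all scoring the same alignment.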


So I wondered: since PAM30 gives the best e-value for this protein, does that mean it is the best matrix to use for my data?

This seems odd, because I expected a more relevant result from a matrix designed for more distantly related sequences, such as BLOSUM45 (but there the e-value is even higher), since I'm working on viral proteins and they evolve very quickly.

Do you know the best options for finding this hit while keeping a reasonable e-value, even though the database is big?

By the way, I'm using DIAMOND, a BLAST-like aligner that runs faster than BLAST.
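For reference, a matrix sweep like the one described above could be scripted roughly as follows. This is only a sketch: the database name, query file, output file names, and e-value cutoff are hypothetical placeholders, and the exact command line used for the results in this post is not shown. The matrix names are the ones the DIAMOND documentation lists for its `--matrix` option.

```python
import shlex

# Matrices listed in the DIAMOND documentation for --matrix.
MATRICES = [
    "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90",
    "PAM250", "PAM70", "PAM30",
]


def diamond_blastx_cmd(matrix, db="viral_proteins", query="genome.fa",
                       evalue=1000.0, more_sensitive=True):
    """Build (but do not run) a diamond blastx command for one matrix.

    db, query, and the output file name are placeholder assumptions.
    """
    cmd = [
        "diamond", "blastx",
        "--db", db,
        "--query", query,
        "--out", f"hits_{matrix}.tsv",
        "--outfmt", "6",           # BLAST-style tabular output
        "--matrix", matrix,
        "--evalue", str(evalue),   # permissive cutoff so weak hits are kept
    ]
    if more_sensitive:
        cmd.append("--more-sensitive")
    return cmd


if __name__ == "__main__":
    for m in MATRICES:
        # Print the shell-quoted command; pass the list to subprocess.run
        # to actually execute it if DIAMOND is installed.
        print(shlex.join(diamond_blastx_cmd(m)))
```

Building the argument list separately from running it makes it easy to inspect each command before launching a long search.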