BLASTp all-by-all run parameters & post-processing tools
8.3 years ago
Anand Rao ▴ 630

My dataset contains ~20K proteins that are multi-domain and vary extensively in length. As a result, homology detection tools are not generating high-quality clusters of truly homologous sequences: the sequence alignment of proteins within each inferred cluster can be very gappy and crappy. With that as context, here are my questions for forum members:

1. Are there useful run parameters to set when performing my BLASTp all-by-all?

2. Are there useful post-processing tools after generating my BLASTp all-by-all tabular output?

I have come across suggestions elsewhere, such as % query coverage >= 90% and % subject coverage >= 90%.

Please let me know how I may improve the quality of my tabular output to MCL so that the clusters contain homologous sequences - no more, no less. I do realize my dataset is very complex, so I am hoping to improve my results, without any delusional hopes for the perfect solution!
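For reference, the coverage cutoffs mentioned above could be applied with a short script like the one below. This is only a sketch under some assumptions that are not from the post itself: the BLAST run is assumed to have used `-outfmt "6 qseqid sseqid pident length qstart qend sstart send evalue bitscore qlen slen"` so that query and subject lengths are in the table, coverage is computed per HSP (hits split over several HSPs are not merged), and the bit score is used as the edge weight in the three-column "abc" format that MCL can read. The function names and the demo rows are made up for illustration.

```python
import csv
import io

# Assumed column order (matches the -outfmt string in the lead-in, not the post).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend",
           "sstart", "send", "evalue", "bitscore", "qlen", "slen"]


def filter_hits(handle, min_qcov=0.90, min_scov=0.90):
    """Yield (query, subject, bitscore) for HSPs that pass both coverage cutoffs."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        if rec["qseqid"] == rec["sseqid"]:
            continue  # self-hits carry no clustering information
        # Fraction of each sequence covered by this single HSP.
        qcov = (abs(int(rec["qend"]) - int(rec["qstart"])) + 1) / int(rec["qlen"])
        scov = (abs(int(rec["send"]) - int(rec["sstart"])) + 1) / int(rec["slen"])
        if qcov >= min_qcov and scov >= min_scov:
            yield rec["qseqid"], rec["sseqid"], float(rec["bitscore"])


def to_abc(hits):
    """Format hits as tab-separated 'label label weight' lines."""
    return "".join(f"{q}\t{s}\t{w}\n" for q, s, w in hits)


# Hypothetical demo rows: only the A-B hit covers >= 90% of both sequences.
DEMO = (
    "A\tB\t45.0\t95\t1\t95\t1\t92\t1e-40\t180\t100\t100\n"
    "A\tC\t60.0\t50\t1\t50\t1\t50\t1e-20\t90\t100\t300\n"
    "A\tA\t100.0\t100\t1\t100\t1\t100\t0.0\t210\t100\t100\n"
)

if __name__ == "__main__":
    print(to_abc(filter_hits(io.StringIO(DEMO))), end="")
```

With a real file, `filter_hits(open("all_vs_all.tsv"))` would take the place of the demo string (the file name is hypothetical). A strict 90%/90% cutoff will discard hits between multi-domain proteins that share only one domain, so the thresholds are worth varying.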

BLAST coverage length clustering homology • 1.8k views

You could try changing the substitution matrix. BLOSUM62 is the blastp default and tends to give good results for moderately divergent sequences, but you could also try BLOSUM45, which should be more suitable for very low sequence similarities.
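As a rough illustration of the matrix suggestion, here is a small Python helper that assembles a blastp command line with `-matrix BLOSUM45`. It is a sketch, not a recommended configuration: the file names, database name, E-value cutoff, and thread count are placeholders, and `build_blastp_cmd` is a made-up helper name. The flags themselves (`-query`, `-db`, `-matrix`, `-evalue`, `-outfmt`, `-num_threads`, `-out`) are standard BLAST+ options, and the output format adds `qlen` and `slen` so that coverage can be computed afterwards.

```python
import shlex

# Tabular output with query/subject lengths appended for later coverage filtering.
OUTFMT = "6 qseqid sseqid pident length qstart qend sstart send evalue bitscore qlen slen"


def build_blastp_cmd(query, db, out, matrix="BLOSUM45", evalue=1e-5, threads=4):
    """Return the blastp argument list; nothing is executed here."""
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-matrix", matrix,
        "-evalue", str(evalue),
        "-outfmt", OUTFMT,
        "-num_threads", str(threads),
        "-out", out,
    ]


if __name__ == "__main__":
    # Placeholder paths; the database would be built beforehand with makeblastdb.
    cmd = build_blastp_cmd("proteins.fasta", "proteins_db", "all_vs_all.tsv")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Running the same all-by-all once with BLOSUM62 and once with BLOSUM45, then comparing the resulting clusters, is probably the simplest way to see whether the matrix change helps on this dataset.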

As an alternative, if possible, you could try to assign your proteins to existing gene families using HMM profiles, e.g. for animal proteins, you could use TreeFam HMMs and the treefamscan tool.
