Question

Ideal sequence % identity for profile construction

8.4 years ago

Anand Rao ▴ 630

Any multiple sequence alignment (MSA) can be converted to a profile HMM (pHMM). And I DO understand that this mathematical model of the diversity at each alignment position can then be used to score matches with tools such as HMMER2, HMMER3, HHpred, etc.

However, I am curious to know whether there are established guidelines for what the % identity among the sequences should ideally be, in order to balance signal and noise in the pHMM, so that both the sensitivity and the specificity of detecting sequence homologs are as high as possible.

I could argue that an MSA composed of sequences with < 20% pairwise identity would be hard to justify without solid evidence of structural or functional equivalence despite the poor sequence conservation. So where should I stop in terms of sequence diversity when inferring MSAs, if I am going to build pHMMs from them?

Links to any published literature on this topic would be much appreciated. Thanks folks!

hmmbuild sequence-identity profile-HMM HMMER3 • 2.1k views

updated 20 months ago by Ram 43k • written 8.4 years ago by Anand Rao ▴ 630

Ram · Answer 1 · 2015-12-11

8.4 years ago

5heikki 11k

Pfam has been around for quite some time, so perhaps it would be a good idea to read up on their methodology? My guess is that there's no universal optimal value.

8.4 years ago by 5heikki 11k

I've analyzed the seed sequences for 14,831 Pfam profiles, and indeed, as you suspect, there is no uniform average pairwise sequence % identity across these profiles. Some of them are really low (< 20%). How can you infer an accurate MSA when % identity is so low? False positive rates in such cases are very high. So I might question the validity of these MSAs and of the pHMMs inferred from them. It doesn't matter whether Pfam builds them or I build them!

At least that is my current stance. But I would love for someone to correct me or educate me on this aspect. Thanks for your reply.
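For readers who want to reproduce this kind of measurement, here is a minimal sketch of one way to compute an average pairwise % identity over an alignment. It is not the procedure used above, which the post does not describe. The function names `pairwise_identity` and `average_pairwise_identity` are hypothetical, the toy alignment is made up, and the convention of ignoring any column where either sequence has a gap is an assumption (other denominators are in common use and give different numbers).

```python
from itertools import combinations

GAP_CHARS = {"-", "."}


def pairwise_identity(a: str, b: str) -> float:
    """Percent identity between two aligned rows.

    Assumption: columns where either row has a gap are skipped, so the
    denominator is the number of columns where both rows have a residue.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def average_pairwise_identity(msa: list[str]) -> float:
    """Mean of pairwise_identity over all unordered pairs of rows."""
    pairs = list(combinations(msa, 2))
    if not pairs:
        raise ValueError("need at least two aligned sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Made-up toy alignment, for illustration only.
toy_msa = ["ACDEFG", "ACDEYG", "AC-EFG"]
print(f"{average_pairwise_identity(toy_msa):.1f}% average pairwise identity")
```

For a real seed alignment the rows would first have to be read from the alignment file (for example Stockholm format); that parsing step is left out here.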

updated 4.4 years ago by Ram 43k • written 8.4 years ago by Anand Rao ▴ 630

If I remember correctly, they somehow control the false discovery/false positive rate by calculating the p-values with respect to the protein family, i.e., each pHMM has its own adjustment associated with it. Search for "gathering threshold"...

However, for mathematical modelling in general, there is no need for good overall similarity. It is enough to identify the features that are unique to a particular family, group, or whatever. Hypothetically, imagine that a particular stretch of 10 amino acids in a 1000 AA protein is shared by all proteins carrying out a specific function, while no other protein happens to have this stretch... then you need to train your profile to target exactly these 10 AA, no more and no less. The rest of the AA sequence does not matter, so, to answer your question, even 1% overall sequence similarity (10 out of 1000 positions) can be enough.
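As a rough illustration of the idea that a short conserved stretch can carry the signal even when the rest of the alignment is diverse, here is a small sketch that scores each alignment column and finds the most conserved window. This is not how HMMER or Pfam build or weight profiles; the helper names `column_conservation` and `most_conserved_window`, the simple "fraction of rows sharing the most common residue" score, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

GAP_CHARS = "-."


def column_conservation(msa):
    """Per-column score: fraction of rows carrying the most common residue.

    Gaps never count as the common residue, so gappy columns score lower.
    """
    n_rows = len(msa)
    scores = []
    for column in zip(*msa):
        residues = [c for c in column if c not in GAP_CHARS]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n_rows)
    return scores


def most_conserved_window(msa, width):
    """Return (start_index, mean_score) of the best-scoring window."""
    scores = column_conservation(msa)
    if width < 1 or width > len(scores):
        raise ValueError("width must be between 1 and the alignment length")
    starts = range(len(scores) - width + 1)
    best = max(starts, key=lambda i: sum(scores[i:i + width]))
    return best, sum(scores[best:best + width]) / width


# Made-up toy alignment: only the central "GHWCD" stretch is shared.
toy_msa = [
    "AKLMGHWCDEQR",
    "TRPSGHWCDNLV",
    "VEYIGHWCDKAS",
]
start, mean_score = most_conserved_window(toy_msa, 5)
print(start, toy_msa[0][start:start + 5], mean_score)
```

In this toy case the flanking columns share no residues at all, yet the five-column window is perfectly conserved, which is the situation described in the hypothetical above on a smaller scale.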

By the way, this kind of dimensionality reduction happens all over bioinformatics, from biomarker discovery (ignore genes that do not differ between control and experiment) to sequence classification (remove uninformative parts of the sequence)...

8.4 years ago by Manuel Landesfeind ★ 1.4k