I would like to automate identifying whether one or more sequences within a cluster of seemingly homologous sequences (in other words, sequences grouped together by a multiple sequence alignment (MSA) algorithm) genuinely belong to that group. The aim is to decide whether they represent the same protein as the rest of their group, and whether they should be collapsed into the same branch as that group on my tree.

I think PCA could be a way to do this, but I am not sure. My idea is to run PCA on each defined group in my MSA and then choose a threshold eigenvalue to determine which sequences can be grouped together, i.e. to automatically detect which sequences may represent a different protein from the group the MSA assigned them to.

Please advise whether this is a reasonable approach.

Perhaps the method described in this paper would be sufficient for this case, though you'll have to conduct some experiments with various threshold distance values. First, if you haven't already done it, you'll need to encode your nucleotides, perhaps like so:

- Encode each nucleotide as a value 0-4 (or 1-5), e.g. {A, C, G, T, -} = {0, 1, 2, 3, 4}; you could also center the matrix at this point
- For each MSA, build a basis for its encoded matrix using the singular value decomposition (SVD)
- Take the first d left singular vectors (the principal directions) and project the same MSA matrix onto them
- Randomly select and remove one sequence whose distance to its projection is above your user-defined threshold value
- Repeat until satisfied with your results
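
As a rough illustration of the encoding step, here is a minimal sketch in Python with NumPy. The function name `encode_msa`, the `CODES` table and the `center` flag are my own choices for the example, not something taken from the paper, and the code assumes the alignment contains only A, C, G, T and gap characters (anything else, such as N, raises a `KeyError`).

```python
import numpy as np

# Hypothetical lookup table following the mapping suggested above.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "-": 4}

def encode_msa(aligned_seqs, center=False):
    """Encode aligned sequences as a float matrix, one sequence per row.

    Returns an array of shape (n_sequences, alignment_length).
    """
    lengths = {len(s) for s in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    X = np.array(
        [[CODES[ch] for ch in s.upper()] for s in aligned_seqs],
        dtype=float,
    )
    if center:
        # Subtract the mean of each alignment position across sequences.
        X = X - X.mean(axis=0)
    return X

X = encode_msa(["ACGT-", "AAGT-", "ACGTA"])
print(X.shape)  # (number of sequences, alignment length)
```

Transposing the result (`X.T`) gives a matrix with one sequence per column, if that orientation is preferred for the decomposition.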

I believe the sequence vectors should be the columns of your matrix when constructing the basis.
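
Here is a minimal sketch of the projection and removal loop with sequences as columns, again in Python with NumPy. It is my reading of the steps, not a reference implementation of the paper: the function names, the use of the Euclidean norm of the residual as the "distance to the projection", the fixed random seed and the stopping rule (stop when nothing exceeds the threshold, or when only d + 1 sequences remain) are all assumptions. The input `X` is assumed to be an already encoded numeric matrix of shape (alignment length, number of sequences).

```python
import numpy as np

def residual_distances(X, d):
    """Distance of each column of X to its projection onto the
    subspace spanned by the first d left singular vectors of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    basis = U[:, :d]                      # (alignment_length, d)
    projection = basis @ (basis.T @ X)    # same shape as X
    return np.linalg.norm(X - projection, axis=0)

def prune_outliers(X, d, threshold, seed=0):
    """Repeatedly drop one randomly chosen column whose residual
    distance exceeds `threshold`, recomputing the basis each time.

    Returns the indices of the columns (sequences) that are kept.
    """
    rng = np.random.default_rng(seed)
    keep = np.arange(X.shape[1])
    while keep.size > d + 1:
        dist = residual_distances(X[:, keep], d)
        above = np.flatnonzero(dist > threshold)
        if above.size == 0:
            break
        keep = np.delete(keep, rng.choice(above))
    return keep

# Toy data: five identical encoded sequences plus one that differs.
v = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
w = np.array([3, 3, 0, 0, 2, 2, 1, 1], dtype=float)
X = np.column_stack([v] * 5 + [w])
print(prune_outliers(X, d=1, threshold=2.0))
```

Both `d` and `threshold` need tuning per dataset, so it is worth plotting the residual distances for a few groups before settling on values.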
