What to use for a measure of distance between two sequences?
10.0 years ago
evsmithx ▴ 10

I've got a lot of genetic sequences and I'd like to use various clustering algorithms on them, but this requires a measure of the distance between two sequences. I've used BLAST+ and the Needleman-Wunsch algorithm, but both give measures of similarity rather than distance (i.e. two similar sequences have a large similarity score, and a low distance). I've found methods for whole genomes, but here I just want it for pairs of genes.

Is there a good way to get distance from similarity score? Ideally I'd like something where two identical sequences have a distance of zero and distance from sequence A to sequence B is the same as B to A.

Or is there some other method that finds a distance directly from the sequences (without first computing similarity)?

There are various ways I can think of to combine similarity scores into something a bit like a distance (e.g. D(A, B) = 1 / sim(A, B), or D(A, B) = min(sim(A, A), sim(B, B)) / sim(A, B) - 1), but I'm sure someone must have done this before and have a better solution! All help greatly appreciated.
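For what it's worth, the second formula above can be sketched in a few lines of Python. This is only an illustration: `nw_score` is a bare-bones Needleman-Wunsch scorer written for this example, its match/mismatch/gap values are arbitrary assumptions, and `sim_to_distance` is a hypothetical helper name. The formula only makes sense when sim(A, B) is positive, so the sketch raises an error otherwise.

```python
def nw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty.

    The scoring values are illustrative assumptions, not recommended settings.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def sim_to_distance(a, b, sim=nw_score):
    """D(A, B) = min(sim(A, A), sim(B, B)) / sim(A, B) - 1."""
    s_ab = sim(a, b)
    if s_ab <= 0:
        # The ratio is meaningless for zero or negative similarity scores
        raise ValueError("formula needs a positive similarity score")
    return min(sim(a, a), sim(b, b)) / s_ab - 1


if __name__ == "__main__":
    print(sim_to_distance("ACTG", "ACTG"))  # identical sequences give 0.0
    print(sim_to_distance("ACTG", "ACTA"))
```

With a symmetric scoring scheme this gives zero for identical sequences and the same value for (A, B) and (B, A), but nothing here guarantees the triangle inequality, which some clustering methods assume.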

alignment blast genetic-distance • 8.3k views
10.0 years ago

A popular distance metric is the Levenshtein distance. There's a Python package for it, and I guess you could use its normalized similarity ratio as follows to compute a distance between pairs of strings:

import Levenshtein as lv

# ratio() returns a normalized similarity in [0, 1], so 1 - ratio() acts as a distance
print(1 - lv.ratio('ACTG', 'ACTA'))
# prints 0.25

Just a thought...
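If installing a package is not an option, a plain edit distance and the pairwise matrix that clustering code usually expects could be sketched like this. It is a minimal pure-Python version with unit costs for insertion, deletion and substitution; `edit_distance` and `distance_matrix` are names made up for this example, and the quadratic running time will be slow for long genes.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insert/delete/substitute costs."""
    # prev[j] is the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of pairwise edit distances with a zero diagonal."""
    n = len(seqs)
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j])
    return d


if __name__ == "__main__":
    for row in distance_matrix(["ACTG", "ACTA", "AGTA"]):
        print(row)
```

The raw edit distance is zero for identical sequences and symmetric, which matches what the question asks for. Note that it is not the same number as 1 - ratio(), which is normalized by sequence length.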

10.0 years ago

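I think the conventional way of relating similarity and distance is the following:

Distance = 1 - Similarity

This assumes the similarity has been normalized to lie between 0 and 1 (e.g. fractional identity), so that identical sequences get a distance of zero.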
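As a small illustration of that relation, here is a sketch that uses fractional identity as the similarity, so that 1 - similarity is the proportion of differing sites (often called the p-distance). It assumes the two sequences are already aligned to the same length and treats gap characters like any other symbol; `identity` and `p_distance` are names chosen for this example only.

```python
def identity(a, b):
    """Fraction of positions at which two pre-aligned sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def p_distance(a, b):
    """Distance = 1 - Similarity, with similarity taken as fractional identity."""
    return 1.0 - identity(a, b)


if __name__ == "__main__":
    print(p_distance("ACTG", "ACTA"))  # 0.25: one of four sites differs
    print(p_distance("ACTG", "ACTG"))  # 0.0 for identical sequences
```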