Question

What to use for a measure of distance between two sequences?

0

10.0 years ago

evsmithx ▴ 10

I've got a lot of genetic sequences and I'd like to use various clustering algorithms on them, but this requires a measure of the distance between two sequences. I've used BLAST+ and the Needleman-Wunsch algorithm, but both give measures of similarity rather than distance (i.e. two similar sequences have a large similarity score, and a low distance). I've found methods for whole genomes, but here I just want it for pairs of genes.

Is there a good way to get distance from similarity score? Ideally I'd like something where two identical sequences have a distance of zero and distance from sequence A to sequence B is the same as B to A.

Or is there some other method that finds a distance directly from the sequences (without first computing similarity)?

I can think of a variety of ways to combine similarity scores into something a bit like a distance (e.g. D(A, B) = 1 / similarity(A, B), or D(A, B) = min(sim(A, A), sim(B, B)) / sim(A, B) - 1), but I'm sure someone must have done this before and have a better solution! All help greatly appreciated.
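
For anyone wanting to try the two candidate formulas above, here is a rough sketch. Everything in it is illustrative: `toy_similarity` is a hypothetical stand-in (it just counts identical positions in equal-length sequences) for a real alignment score such as one from Needleman-Wunsch, and the function names are made up for this example.

```python
import math


def toy_similarity(a: str, b: str) -> int:
    """Hypothetical stand-in for an alignment score: count identical positions.

    Assumes both sequences have the same length; a real score would come
    from an aligner instead.
    """
    if len(a) != len(b):
        raise ValueError("toy_similarity expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b))


def reciprocal_distance(a: str, b: str, sim=toy_similarity) -> float:
    """D(A, B) = 1 / similarity(A, B)."""
    s = sim(a, b)
    # A non-positive score has no meaningful reciprocal; treat it as "far apart".
    return math.inf if s <= 0 else 1.0 / s


def ratio_distance(a: str, b: str, sim=toy_similarity) -> float:
    """D(A, B) = min(sim(A, A), sim(B, B)) / sim(A, B) - 1."""
    s = sim(a, b)
    if s <= 0:
        return math.inf
    return min(sim(a, a), sim(b, b)) / s - 1.0


if __name__ == "__main__":
    a, b = "ACTG", "ACTA"
    print(reciprocal_distance(a, a))  # identical sequences, first formula
    print(ratio_distance(a, a))       # identical sequences, second formula
    print(ratio_distance(a, b), ratio_distance(b, a))
```

With this toy score, the reciprocal form does not give zero for identical sequences (it gives 1 / length), so it misses the first requirement. The ratio form gives zero for identical sequences and is symmetric whenever the underlying score is symmetric and positive. Neither property is guaranteed for arbitrary scoring schemes, especially ones that can produce negative scores.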

alignment blast genetic-distance • 8.3k views

updated 2.6 years ago by Ram 43k • written 10.0 years ago by evsmithx ▴ 10

Ram · Answer 1 · 2014-06-03

3

10.0 years ago

dariober 14k

A popular distance metric is the Levenshtein (edit) distance. There's a Python package for it, and I guess you could use it as follows to compute a normalized distance between pairs of strings:

>>> import Levenshtein as lv
>>> # ratio() is a similarity in [0, 1], so 1 - ratio() behaves like a distance
>>> 1 - lv.ratio('ACTG', 'ACTA')
0.25

Just a thought...
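
If installing the package is not an option, the edit distance itself is short enough to sketch in plain Python. The version below is a minimal dynamic-programming implementation with unit costs, plus a pairwise matrix of the kind a clustering routine would take as input. Dividing by the longer sequence length is an assumption of this sketch and is not necessarily what `lv.ratio` computes; the function names are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so the result is in [0, 1]."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def distance_matrix(seqs):
    """Symmetric matrix of pairwise normalized distances with a zero diagonal."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = normalized_distance(seqs[i], seqs[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    for row in distance_matrix(["ACTG", "ACTA", "TTTT"]):
        print(row)
```

The unnormalized edit distance is zero for identical sequences and symmetric, which covers both requirements in the question. The pure-Python loop is quadratic in sequence length, so for many long genes a compiled implementation would be the more practical choice.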

updated 4.4 years ago by Ram 43k • written 10.0 years ago by dariober 14k

Ram · Answer 2 · 2014-06-03

2

10.0 years ago

Gergana Vandova ▴ 170

I think the conventional way of relating similarity and distance, for a similarity scaled to lie between 0 and 1, is the following:

Distance = 1 - Similarity
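
As a rough illustration of this relation, the sketch below takes similarity to be the fraction of identical columns in a pairwise alignment that has already been computed (for example by Needleman-Wunsch), so that it falls between 0 and 1. That choice of similarity, the decision to skip gapped columns, and the function names are all assumptions made for the example, not something stated in this thread.

```python
def fractional_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Fraction of identical columns among columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns (one convention among several)
        compared += 1
        matches += (x == y)
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return matches / compared


def identity_distance(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Distance = 1 - Similarity, with similarity taken as fractional identity."""
    return 1.0 - fractional_identity(aligned_a, aligned_b, gap)


if __name__ == "__main__":
    print(identity_distance("ACTG", "ACTG"))
    print(identity_distance("ACTG-A", "ACTATA"))
```

Because the similarity here is symmetric and equals 1 for identical sequences, the resulting distance is symmetric and zero for identical sequences. Raw alignment scores would first need to be rescaled to the 0 to 1 range before the same subtraction makes sense.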

updated 4.4 years ago by Ram 43k • written 10.0 years ago by Gergana Vandova ▴ 170