Similarity between two FASTA files
4
0
Entering edit mode
5.7 years ago
firestar ★ 1.6k

I have short FASTA files (~400nt) of differing lengths. I would like to compare/align them and get a similarity metric. I am not interested in the exact alignment or alignment output. I just need one value that shows how similar they are. Perhaps a distance like hamming distance would work. I want to do this pairwise with loads of sequences. Can anyone suggest a bash command line tool?

similarity distance • 7.0k views
ADD COMMENT
1
Entering edit mode

I haven't tried all the suggestions here, but I came across this BLAST usage:

blastn -query "query.fa" -subject "ref.fa" -task "blastn" -outfmt 6 >> blast_results.txt

And this returns

query     ref    per_ident aln_len mismatch gapopen qstart  qend sstart send evalue   bitscore
query.fa  ref.fa 74.206    252     46       3       1       248  1      237  9.65e-46 165

of which percentage identity (74%) seems to be a usable measure of similarity. I am not sure if this is the best metric for global sequence-to-sequence similarity.

ADD REPLY
1
Entering edit mode

It’s definitively not a measure of global sequence similarity.

The L in BLAST stands for “local”.

This is broadly the approach a few of us have suggested (use alignment and take an alignment score of choice as your metric), but global alignments are probably better suited to your question (though it would be worth you showing us some examples of your sequences).

ADD REPLY
0
Entering edit mode

Ah yes! Good point. But, does local matter for short sequences of few hundred nucleotides?

ADD REPLY
1
Entering edit mode

It might matter, it might not - “short” is all relative. 400bps might seem short, but its more than long enough to carry an entire promoter sequence and various other untranslated features. How you align these would matter if you were interested in such structures.

It sort of depends what you’re trying to show. What your BLAST results are telling you is that, of the amount of sequence it could locally align (in your case 252 nucleotides of your 400ish), they are ~74% identical.

Now, what this tells me is that only a little over half of your sequence is similar to the other. It says almost nothing about the remaining 150 bases that haven’t been aligned.

If you were only interested in a domain or structure or something that fell within that 252, then you could carry on, but otherwise you are ignoring almost a third of the sequence if you take that metric.

ADD REPLY
0
Entering edit mode

I am comparing human X and Y chromosomes. Y is rearranged and jumbled up. But I have already identified short regions in X that correspond with Y. Now I just need to figure out how similar they are. Here is an example

>chrX:153598967-153599086
TCCATGATGAACCTCGCCACTCGGAAGTGCACCCTGGCTTCGGGCGCCGGCTGCCCCTCC
GGGCCCCGCGCTGCAGGCCCCCCTCCGCCGCCCTCCGGGGCTTCCATGAGGCGCCGCGGC
>chrY:26627155-26627274
TCCATGATGAACCTCGCCGCTTGGAAGTGCACCCTGGATTCGGGCGCCGGCTGCCTCTCT
GGGCACCGCGCTGCGGACCCCACTCCGCCGCCCTCCCAGGCTTCCATGAAGCTCCGGGGC

blastn returns this as 88.3% percentage identity. The lengths are same in this particular example but lengths can vary a bit.

ADD REPLY
2
Entering edit mode

@rmf, BLAST isn’t telling you those 2 sequences are 88% identical.

As my comment above, its telling you how similar the maximally aligning subsequence it could find between the 2 sequences are.

Now, I don’t know if your BLAST results did manage to align the full length of those sequences, but I doubt it.

BLAST is not really the tool you want for this task.

ADD REPLY
2
Entering edit mode

blast is definitively not the right tool for pair wise alignment, it was design to quickly scan thru a db. For proper alignment you'll have to use Smith-Waterman for local alignment or Needleman-Wunsch for global. Both are in the EMBOSS linux tool.

ADD REPLY
1
Entering edit mode

In general this is indeed an OK approach.

You just need to keep in mind that it only reports something in relation to the part that can be aligned! so it might report 80% but only align 50% of the sequence (which is then does clearly something different than on 80% overall similarity)

ADD REPLY
0
Entering edit mode

Are the short regions you already identified of somewhat similar length? (taking your original question in mind it seems not)

ADD REPLY
0
Entering edit mode

They are comparable in lengths. The length difference range from 0 to 83 nt.

ADD REPLY
2
Entering edit mode

In that case I might consider using a global aligner , such as NW from EMBOSS . For a global aligner the rule is that they have to be of similar size not strictly of equal size

ADD REPLY
0
Entering edit mode

I just need one value that shows how similar they are

An alignment would give you similarity score which you may be looking for

ADD REPLY
1
Entering edit mode

It's a simple question/request but it might turn out more complex than you hoped for I'm afraid.

Eg. a 50bp seq is 100% part of a 100bp sequence : what would be the expected similarity score here? 1 or 0.5?

It's indeed the

of differing lengths

that makes this a lot more complicated.

That's why most methods/approaches will make the 'equal length' assumption

ADD REPLY
0
Entering edit mode

Just to clarify, are you also interested in their global similarity, or regions of local similarity? (It sounds like the former at the moment)?

ADD REPLY
6
Entering edit mode
5.7 years ago
Benn 8.3k

To get similarity you'll need alignment first, even if you are not interested in it. You can for instance use the identity, similarity, or other scores. Try EMBOSS on linux, it has many alignment options.

ADD COMMENT
2
Entering edit mode

@ rmf, the above answer is the one you need.

Download the EMBOSS commandline tools from here: http://emboss.sourceforge.net/download/

intall, and run:

needle -asequence <fasta1.fa> -bsequence <fasta2.fa> -gapopen 10 -gapextend 0.5 -o <outfile.needle>
ADD REPLY
1
Entering edit mode

I'm with b.nota on this one. I don't think you can bypass the alignment step here, unless you're totally not interested in the "biological" interpretation, but even then ....

FastA might also be worth considering

ADD REPLY
3
Entering edit mode
5.7 years ago
firestar ★ 1.6k

Comparing pairwise similarity measures between ~3000 fasta pairs in the size range 0-10000 nt.

blast_identity: Percentage identity as returned by BLASTN

blastn -query query.fa -subject ref.fa -task blastn -outfmt "6 pident nident"

nw_identity: Percentage identity as returned by needle function in emboss
nw_similarity: Percentage similarity as returned by needle function in emboss

needle ref.fa query.fa -gapopen 10 -gapextend 0.5 -outfile needle_temp.txt

mash_dist: Distance returned by mash (-k 12)

mash dist -k 12 ref.fa query.fa

enter image description here

As mentioned in the comments, needleman-wunsch global alignment is probably the best metric for this purpose. Thank you all for the insightful comments!

ADD COMMENT
0
Entering edit mode

rmf : You should accept @5heikki's mash answer as well since you made use of that in your final solution.

ADD REPLY
0
Entering edit mode

@genomax I used emboss needle as the final solution which is why I accepted that as the answer.

ADD REPLY
0
Entering edit mode

You can accept more than one answers.

ADD REPLY
0
Entering edit mode

Oh. I wasn't aware of that. Good to know. Thx!

ADD REPLY
0
Entering edit mode

You can also accept your own answer since it brings together all elements in one.

ADD REPLY
0
Entering edit mode

Kudos for thoroughly doing the comparisons and reporting back!

I'm a little surprised that the BLAST distributions look the same as the NW!

ADD REPLY
1
Entering edit mode

Probably because there are very little gaps in alignments. Also I suppose the mash distance could be improved by setting a better value for the kmer length. I used 12 arbitrarily.

ADD REPLY
0
Entering edit mode

Yeah that would be my guess too, if the sequences were less similar you'd start to see a bigger gulf appearing between local and global probably.

ADD REPLY
2
Entering edit mode
5.7 years ago
5heikki 11k
mash dist file1.fna file2.fna

I recommend smaller than default k-mer for small files, e.g. 17

Mash

ADD COMMENT
0
Entering edit mode

I thought about mash too, wasn't sure how well it works for very small seqs - but this is good to know!

ADD REPLY
0
Entering edit mode

Well I don't know how well it works with very small files, maybe if setting k-mer length even shorter..

ADD REPLY
2
Entering edit mode
5.7 years ago
Joe 21k

A Hamming distance would have been my suggestion, though that only works for strings of identical length.

You might be better off using the Levenshtein Distance. There are lots of python implementations of it around the web. As b.nota mentioned, best to align them first, regardless (in which case you could just use the alignment score as your metric).

Edit, here's a snipped from one of my own scripts which can calculate Hamming distances:

def hamming_distance(s1, s2):
     """Return the Hamming distance between equal-length sequences"""
     if len(s1) != len(s2):
         raise ValueError("Undefined for sequences of unequal length")
     return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

And a whole wiki page with algo implementations:

https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python

ADD COMMENT

Login before adding your answer.

Traffic: 2611 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6