Question

Comparison between multiple fasta files in terms of presence or absence of certain sequences

0

Entering edit mode

6.8 years ago

Bernhard.T.Werner • 0

Hallo folks.

I am trying to compare fasta files of predicted miRNAs with one another. This problem does not seem to be very hard for someone more skilled in bioinformatics than me but after hours of forum searches I could not solve the problem.

As an example I have 3 fasta files like:

File_1:

>miRNA1 
AGTCGTCA
>miRNA2 
AGTTCGTT

File_2:

>miRNA1
TTGACT
>miRNA2
AGTCGTCA

File_3:

>miRNA3
TTACATT
>miRNA5
TTGACT

From these files I would like to get a table that shows the presence or absence of each sequence. As you can see the same sequence could have a different ID. In the best case even if there is one mismatch between the sequences they should be counted as equal.

I am sorry to ask such a simple question but i couldnt figure it out by myself. So every help is welcome.

The closest solution i found so far is: Count Sequences In Multi Fasta File .

The script Kenosis provided does a part of the job:

"

use strict;
use warnings;
use File::Basename;
use Sort::Naturally;

local $/ = '>';
my ( %files, %query, %omit, @F );
$omit{$_}++ for ( $ARGV[0], basename $0);

while (<>) {
    chomp;
    $query{ $F[0] } = $F[1] if @F = split;
}

local $/ = "\n";
my @files = nsort @ARGV = grep !$omit{$_}, <*>;

while (<>) {
    next if /^>/;
    chomp;
    $files{$_}{$ARGV}++;
}

print join( "\t", 'tag', 'seq', @files ), "\n";

for my $key ( nsort keys %query ) {
    my @seqCount;
    push @seqCount, $files{ $query{$key} }->{$_} || 0 for @files;
    print join( "\t", $key, $query{$key}, @seqCount ), "\n";
}

"

But unfortunately I am unable to adept it to my problem. Because the script compares the sequences by ID and does not allow mismatches it produces many duplicate sequences in the output.

sequence RNA-Seq alignment • 4.3k views

ADD COMMENT • link updated 6.8 years ago by WouterDeCoster 47k • written 6.8 years ago by Bernhard.T.Werner • 0

1

Entering edit mode

Are these long sequences? How many sequences per file? Are they in single-line format, or multi-line?

ADD REPLY • link 6.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

The sequences are short <30bp and the files are in single line format.

ADD REPLY • link 6.8 years ago by Bernhard.T.Werner • 0

1

Entering edit mode

When applying the 1 mismatch between sequences, does this count for length too?

ADD REPLY • link 6.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Yes, this would be perfect.

ADD REPLY • link 6.8 years ago by Bernhard.T.Werner • 0

1

Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

Entering edit mode

I usually like blastclust which is now a legacy software but this post recommending cd-hit might be useful How to cluster sequences present in multiple fasta files

ADD REPLY • link 6.8 years ago by microfuge ★ 1.9k

0

Entering edit mode

Thx for the help. The cd-hit webserver seems to take forever for a small job and can only handle two files. So I dont think this will solve my problem

ADD REPLY • link 6.8 years ago by Bernhard.T.Werner • 0

1

Entering edit mode

6.8 years ago

TriS ★ 4.7k

two approaches you could try are in perl and unix perl: from previous question in biostar UNIX: AWK solution

ADD COMMENT • link 6.8 years ago by TriS ★ 4.7k

0

Entering edit mode

These solutions are a bit over my head. I am not good at programing. I am able to use a script or to combine several scripts and programs. But unfortunately I only understand the simplest scripts. Thx for helping.

ADD REPLY • link 6.8 years ago by Bernhard.T.Werner • 0

0

Entering edit mode

You can try using Sort in MS Excel.

ADD REPLY • link 6.8 years ago by theobroma22 ★ 1.2k

0

Entering edit mode

Yes, I could do this. But I have files with hundreds of sequences and to compare them by hand takes ages. That is why I would like to find an automated solution.

ADD REPLY • link 6.8 years ago by Bernhard.T.Werner • 0

score 4 · Accepted Answer · 2017-07-24

Others may have command-line solutions. Here's one in python, untested, that I think should do what you need.

Fasta files:

one.fasta

>miRNA1
AGTCGTCA
>miRNA2
AGTTCGTT
>miRNA3
AGTCGTGA

two.fasta

>miRNA1
TTGACT
>miRNA2
AGTCGTCA

three.fasta

>miRNA3
TTACATT
>miRNA5
TTGACT
>miRNA3
AGTCGTGA

First, make a file with all the unique sequences from each of your fasta files:

cat file1.fasta file2.fasta file3.fasta | grep -v '>' | sort | uniq > all.txt

Save this code as comp_seqs.py , and run as python comp_seqs.py > output.xls in the location that has the three fasta files, and the new file with all the sequences:

#!/usr/bin/env python
import regex
from collections import defaultdict
from itertools import izip

def open_fasta(file):
        """ Open fasta file, and store in dictionary, as {'sequence:header'} """
        s = {}
        with open(file, 'r') as fasta:
                for line in fasta:
                        if line.startswith('>'):
                                h = line.strip().split('>')[1]
                                n = next(fasta).strip()
                                s[n] = h
        return s

def hamming_distance(s1, s2):
    """ Count number of mismatched characters in equal length strings. """
    if len(s1) != len(s2): raise ValueError('string lengths do not match')
    return sum(a != b for a, b in izip(s1, s2))


def find_seqs(fh, p):
        """ Find sequences from 'all sequence list' that either match as identical sequence, or mismatch of 1, store as defaultdict [{sequence:[id1, id2, id3...]}] """
        seqs = open_fasta(fh)
        for i in seqs:
                if i in x:
                        match[i][p] = seqs[i]
                else:
                        for j in x:
                                if len(i) == len(j):
                                        if hamming_distance(i, j) <= 1:
                                                match[i][p] = seqs[i] + '+1'

# Open concated fasta file, containing unique sequences
with open('all.txt', 'r') as a:
        x = [line.strip() for line in a]

# empty defaultdict for reulsts
match = defaultdict(list)

for i in x:
        for n in range(3):
                match[i].append('')


# Open each fasta, and search for sequences in 'all sequence list'
find_seqs('one.fasta', 0)
find_seqs('two.fasta', 1)
find_seqs('three.fasta', 2)

# Print results in tab delimited format
print 'Sequence\tFile1_ID\tFile2_ID\tFile3_ID'
for m in match:
        print m, '\t', '\t'.join(match[m])

Output from example above:

Sequence    File1_ID    File2_ID    File3_ID
AGTCGTCA    miRNA1  miRNA2  
TTGACT      miRNA1  miRNA5
AGTTCGTT    miRNA2      
AGTCGTGA    miRNA3      miRNA3
TTACATT             miRNA3