Compairng multiple sequence and finding similarity.
1
0
Entering edit mode
7.7 years ago
EVR ▴ 610

HI,

I have two fasta files with same seqnames but with slightly different sequence names like follows:

File_1
    >trinity5_comp_3_0_1
    TTATGCATAT
    >trinity5_comp_9735_1_5
    AAAATGATGA
    >trinity5_comp_645_0_2
    TCGAATGCGA
    >trinity5_comp_3169_0_1
    AGGATATTAC
FIle2
    >trinity5_comp_3_0_1
    TTATGCATAT
    >trinity5_comp_9735_1_5
    AAAATGCCGA
    >trinity5_comp_645_0_2
    TAGAATGCGA
    >trinity5_comp_3169_0_1
    AGGATATTAC

I would like compare each sequence of File1 with respect to corresponding sequences in File2 and compute its percentage of similarity like follows:

trinity5_comp_3_0_1 100%
trinity5_comp_9735_1_5 80%
trinity5_comp_645_0_2 90%
trinity5_comp_3169_0_1 100%

I tried using cd-hit-est-2d but sequences are also compared with other sequences rather than its own corresponding sequences in file2. Kindly guide me.

Thanks in advance

DNA_Sequences Fasta Sequence_comparision • 2.1k views
ADD COMMENT
0
Entering edit mode

You can blast your sequences, with the flag -max_target_seqs = 1, and also filter by % identity or set an e-value threshold.

ADD REPLY
0
Entering edit mode

But it doesn’t guarantee you that it will blast against its corresponding sequence with the same header

ADD REPLY
0
Entering edit mode

True. He is missing the gene cluster ids for each isoform in the assembly, however. @Tom, why do you have two trinity fasta? Are you comparing two different assemblies?

ADD REPLY
0
Entering edit mode

These two fasta files are generated using different approaches and I want to compare how different are they by comparing their percentage similarity

ADD REPLY
0
Entering edit mode

Do they follow same order always ?

ADD REPLY
0
Entering edit mode

Yes they do follow same order

ADD REPLY
0
Entering edit mode

How about splitting the two files into individual sequences and then compare pair wise manner using cd-hit-est-2d ?

ADD REPLY
0
Entering edit mode

Hi, I thought of doing that, but i have around 3000 sequences which could be tiresome. But anyhow I will try that method too

ADD REPLY
0
Entering edit mode

Its not tiresome, I updated my answer.

ADD REPLY
1
Entering edit mode
7.7 years ago

I do not know what is the scale of your data, but something like calculating edit-distance might help you.

from Bio import SeqIO
import editdistance,sys

fa1 = SeqIO.parse("1.fa", "fasta")
fa2 = SeqIO.parse("2.fa", "fasta")

for rec1,rec2 in zip(fa1,fa2):

    if ( rec1.id != rec2.id ):
        print "The sequences are not in order. Exiting"
        sys.exit()

    length=len(rec1.seq)
    dist=editdistance.eval(rec1.seq,rec2.seq)
    print rec1.id,100-((dist*100)/length)

output:

trinity5_comp_3_0_1 100
trinity5_comp_9735_1_5 80
trinity5_comp_645_0_2 90
trinity5_comp_3169_0_1 100

Assumes the sequences are in same order always. If your fasta sequences are very long, this is definitely not a good solution.

If you want to run cd-hit-est-2d on each pair of sequences:

Make a backup of your files.

cp 1.fa 1.fa.bkp
cp 2.fa 2.fa.bkp

Add a prefix to your fasta header. Check how it works for your system.

sed -i '' -e '/^>/s/$/.1/g' 1.fa
sed -i '' -e '/^>/s/$/.2/g' 2.fa

Then split the fasta file into individual sequences.

for fa in {1,2}.fa
do
curl https://raw.githubusercontent.com/gouthamatla/fasta_File_Manipulation/master/SplitFastaFile.py | python - ${fa}
done

Now you have all your sequences split with .1.fasta and .2.fasta suffix.

Then run cd-hit-est-2d in pairwise manner:

for sample in *.1.fasta 
do
base=`basename  ".1.fasta"`
echo "cd-hit-command $base.1.fasta $base.2.fasta" | parallel --jobs 4
done
rm *.1.fasta *.2.fasta

change the --jobs according to available cores.

ADD COMMENT
0
Entering edit mode

I tried your python command, it worked. but as I am not that familiar with python, I wanted to write it to file, so I used file.write() but it throwed me error saying that TypeError: write() takes exactly 1 argument (2 given). How can make the python script to write to file. If possible a bound python script which takes two fasta file as input and write it into a file.

ADD REPLY
0
Entering edit mode

Python program definitely works but you should know how it is doing. To write to a file,

python script.py > out.txt
ADD REPLY
0
Entering edit mode

I already figured out. I tried to delete that comment but I couldnt do it. thanks for your guidance.

ADD REPLY
0
Entering edit mode

Please validate the results with couple of sequences. Take few random pairs of sequences are run cd-hit on them to compare the outputs. It's very important.

ADD REPLY

Login before adding your answer.

Traffic: 2496 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6