Sequence similarity scores between two sets of genes from different genomes
7.1 years ago
avaneesh.t ▴ 20

I have a set of genes from the yeast genome (~3000) and a set of genes from the human genome (~6000). I want to align each yeast gene against each human gene and get a similarity score for each pair. The genes differ in length, and many pairs will be very dissimilar.

1) How would I go about doing this, say with R? 2) Are there specific things I should take into account while doing my analysis?

sequence similarity alignment genome • 2.2k views

Do you need to achieve this in R? There are loads of great command-line utilities for alignment.

7.1 years ago
Benn 8.3k

Maybe inParanoid can help you further?

http://inparanoid.sbc.su.se/cgi-bin/index.cgi

There are R Bioconductor packages available, but they are limited (only one yeast species: S. cerevisiae).

https://bioconductor.org/packages/release/BiocViews.html#___InparanoidDb


InParanoid and other databases give me a list of orthologs. While this would help me validate my pairwise "similarity scores" (orthologs should have higher similarity scores?), they do not tell me how similar non-orthologous genes are.


Do you want 3000 x 6000 similarity scores (18 M)??


Yes. That is the idea. Though, now that you bring that up, I should probably try and target a smaller subset.


It is possible to do these 18M alignments on your computer, but how to interpret the results is something to consider.

If you want to do these 18M pairwise alignments, you can use the EMBOSS command-line tools. Depending on whether you want global or local alignment, you can use needle or water, respectively. The results also contain the identity for each pair, so you'll need some bash skills to extract them properly (e.g., using grep).

For example:

needleall -auto true -asequence yeast.fasta -bsequence human.fasta \
-datafile EDNAFULL -outfile yeast_human.needleall -aformat markx0

grep "Identity:" yeast_human.needleall > yeast_human.needleall.identity
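
The grep above keeps the identity lines but drops the sequence names, so the pairs can no longer be told apart. A minimal Python sketch that keeps names, identity and score together is shown here. It assumes the usual EMBOSS alignment header (lines starting with "# 1:", "# 2:", "# Identity:" and "# Score:", in that order); the function name parse_emboss_pairs and the SAMPLE records are made up for illustration and are not real results.

```python
import re

# Header lines EMBOSS writes above each alignment (layout is an assumption).
FIELD_PATTERNS = {
    "a": re.compile(r"^# 1:\s*(\S+)"),
    "b": re.compile(r"^# 2:\s*(\S+)"),
    "identity": re.compile(r"^# Identity:\s*(\d+)/(\d+)"),
    "score": re.compile(r"^# Score:\s*(-?\d+(?:\.\d+)?)"),
}


def parse_emboss_pairs(lines):
    """Yield (seq_a, seq_b, percent_identity, score) for each alignment."""
    record = {}
    for line in lines:
        for key, pattern in FIELD_PATTERNS.items():
            match = pattern.match(line)
            if not match:
                continue
            if key == "identity":
                matches, length = int(match.group(1)), int(match.group(2))
                record[key] = 100.0 * matches / length if length else 0.0
            elif key == "score":
                record[key] = float(match.group(1))
            else:
                record[key] = match.group(1)
            break
        # "# Score:" is the last of the four fields, so the record is complete.
        if len(record) == 4:
            yield record["a"], record["b"], record["identity"], record["score"]
            record = {}


# Invented header lines, only to show the expected input shape.
SAMPLE = """\
# 1: YAL001C
# 2: HUMAN_GENE_1
# Length: 131
# Identity:     112/131 (85.5%)
# Similarity:   112/131 (85.5%)
# Score: 591.0
# 1: YAL001C
# 2: HUMAN_GENE_2
# Length: 200
# Identity:      50/200 (25.0%)
# Score: 12.5
"""

for seq_a, seq_b, identity, score in parse_emboss_pairs(SAMPLE.splitlines()):
    print(f"{seq_a}\t{seq_b}\t{identity:.1f}\t{score}")
```

For the real output, the same function can be fed an open file handle, e.g. `with open("yeast_human.needleall") as fh:` followed by a loop over `parse_emboss_pairs(fh)`, which streams the file instead of loading 18M records into memory.
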
7.1 years ago
Charles Yin ▴ 180

For a large set of genomes, alignment may not be practical because it takes a very long time. You may want to consider an alignment-free method. My paper is cited below, with MATLAB code available; the link to the programs is inside the paper. The method can process DNA sequences of different lengths (even scaling).

Yin, C., Chen, Y., & Yau, S. S. T. (2014). A measure of DNA sequence similarity by Fourier transform with applications on hierarchical clustering. Journal of Theoretical Biology, 359, 18-28.
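
To give a feel for what "alignment-free" means in practice, here is a rough Python sketch of a much simpler measure: cosine similarity between k-mer count vectors. Note that this is not the Fourier-transform method of the paper above, just a common alignment-free baseline; the function names and the choice k=3 are assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_cosine(seq_a, seq_b, k=3):
    """Cosine similarity of k-mer count vectors, between 0.0 and 1.0."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # A sequence shorter than k has no k-mers; treat as no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy sequences of different lengths; no alignment step is needed.
print(kmer_cosine("ACGTACGTAC", "ACGTTCGTACGG"))
```

Because each sequence is reduced to a count vector once, the 3000 x 6000 comparisons become cheap vector operations, and sequences of different lengths need no special handling. The trade-off is that the score says nothing about where the sequences match.
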

7.1 years ago
Charles Yin ▴ 180

Also, please check this paper for the improved method for even scaling, along with the code.

Yin, C., & Yau, S. S. T. (2015). An improved model for whole genome phylogenetic analysis by Fourier transform. Journal of Theoretical Biology. doi:10.1016/j.jtbi.2015.06.033

https://www.mathworks.com/matlabcentral/fileexchange/52072-phylogenetic-analysis-of-dna-sequences-or-genomes-by-fourier-transform
