Question

Using R to calculate diversity between hundreds of orthologous sequences

9.3 years ago

Adrian Pelin ★ 2.6k

Hello,

I am interested in calculating diversity for a large number of genes in a given phylum.

I took all genes from my organism of interest and used inParanoid to find true orthologues in 3 different taxa. I now have a table that looks like this:

LOC1 Taxa1_Orth_LOC1 Taxa2_Orth_LOC1 Taxa3_Orth_LOC1
LOC2 Taxa1_Orth_LOC2 Taxa2_Orth_LOC2 Taxa3_Orth_LOC2
...

Column 1 holds the name of the locus in my organism; columns 2-4 hold the names of the true orthologous loci in taxa 1, 2 and 3.

I also have FASTA files with the ORFs of all loci from each of my taxa. So for Taxa #1 I have a FASTA file with the sequences Taxa1_Orth_LOC1, Taxa1_Orth_LOC2, and so on.

Given the FASTA files and the table of true orthologues, how can I quickly calculate diversity using R? I know it can be done in codeml, but setting up each alignment by hand would be very laborious.

Any thoughts on how this can be done?

Thank you,

Adrian

R diversity alignment • 2.4k views

updated 9.3 years ago by Siva ★ 1.9k • written 9.3 years ago by Adrian Pelin ★ 2.6k

Ram · Answer 1 · 2015-01-16

If you are willing to consider options other than R, I would suggest needle or needleall from EMBOSS. Both perform global pairwise alignment with the Needleman-Wunsch algorithm and report the global sequence similarity and identity for each pair of sequences. Both programs take two input sequence files.
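To make the reported "identity" concrete, here is a minimal Python sketch that computes percent identity from one pair of already-aligned sequences. The function name `percent_identity` is hypothetical, and the denominator (total alignment length, gap columns included) is an assumption; tools differ on this convention, so check the needle output header before comparing numbers.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same residue.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both residues agree and are not gaps.
    matches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Toy alignment: 4 identical columns out of 6.
    print(round(percent_identity("ATG-CA", "ATGGCC"), 2))
```

A rough per-gene divergence could then be taken as 100 minus this value, averaged over the pairwise comparisons in a group, though that is only a crude proxy for diversity.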

If you want to compare a sequence from your species of interest against each of its orthologs, use needle. Create a file with the sequence from your species of interest and another file with its ortholog sequences.

If you also want to compare the ortholog sequences among themselves (all-against-all), use needleall. Create a multi-FASTA file of all sequences belonging to a single orthologous group and pass that same file as both inputs. Expect redundant comparisons (seq1 vs seq2 as well as seq2 vs seq1).
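Building one multi-FASTA file per orthologous group from the table in the question can be scripted. The following is a sketch in Python, assuming the table is whitespace-delimited with one group per row and that sequence IDs are the first word of each FASTA header. The function names (`read_fasta`, `write_group_fastas`, `needleall_command`) and the output file naming are hypothetical; the gap penalties shown are illustrative values, not a recommendation.

```python
from pathlib import Path


def read_fasta(path):
    """Return {sequence_id: sequence}; the ID is the first word of the header."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def write_group_fastas(table_path, fasta_paths, out_dir):
    """Write one multi-FASTA per row of the ortholog table; return files written."""
    sequences = {}
    for path in fasta_paths:
        sequences.update(read_fasta(path))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(table_path) as table:
        for row in table:
            ids = row.split()
            if not ids:
                continue
            missing = [i for i in ids if i not in sequences]
            if missing:
                # Skip incomplete groups rather than guessing; report them.
                print(f"skipping {ids[0]}: no sequence for {', '.join(missing)}")
                continue
            out_path = out_dir / f"{ids[0]}.fasta"
            with open(out_path, "w") as out:
                for seq_id in ids:
                    out.write(f">{seq_id}\n{sequences[seq_id]}\n")
            written.append(out_path)
    return written


def needleall_command(group_fasta, out_path):
    """Build (but do not run) an all-against-all needleall call for one group."""
    return [
        "needleall",
        "-asequence", str(group_fasta),
        "-bsequence", str(group_fasta),  # same file on both sides
        "-gapopen", "10.0",              # illustrative penalty values
        "-gapextend", "0.5",
        "-outfile", str(out_path),
    ]
```

Each returned command list can be passed to `subprocess.run` once EMBOSS is installed. Self-comparisons and the mirrored seq2 vs seq1 results still need to be filtered from the output afterwards.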

Just a friendly suggestion about terminology ("true orthologues"). We cannot infer homology from sequence-similarity-based methods such as BLAST (inParanoid uses BLAST). At best, we can call the hits putative homologs. We need phylogenetic analyses before we can talk about homology.