Merge "NCBI species" data within Ensembl alignment
2.3 years ago
tlorin • 250
Switzerland

Apologies if this is a really naive question, but I cannot figure out how to do this easily. Here is a related post regarding the best method to find orthologous genes of a species.

Let's say I have a protein alignment downloaded from Ensembl (coming, for instance, from this Ensembl tree).

This gene is present in some other "NCBI species" that I would like to include in my tree (for instance, Stegastes partitus, whose genome is available in the NCBI database but NOT in the Ensembl database). Indeed, if I manually run blastp with the asip protein sequence of D. rerio (extracted from my Ensembl multifasta protein alignment) against the nr database restricted to S. partitus, the first blast hit is the sequence I am looking for. Perfect! And I can manually append it to my initial protein tree.

The problem is that I don't have just one gene and one NCBI species but many of them (let's say p genes and n NCBI species). I already have an Ensembl protein multifasta file for each of my p genes.

My question is: is there an easy way to append to each of my p multifasta files the corresponding homologous protein sequence(s) of the n "NCBI species"?

Thanks for any insight!


Since no one has said anything I will take a stab.

I don't think it would be possible to easily script what you are asking for. There are decisions that need to be made about what to select (from a different site/database) and then add that information to a second site.


Like genomax said, this isn't trivial.

What you may want to do is consider working the other way: take your known asip example (e.g. zebrafish) and blast it against a list of species you are interested in. I'm picturing something where you would have a list of taxonomy IDs for those species, then blast your reference against the NR database restricted to each species in turn. If I recall correctly, you can set up the blast output format to include the sequence of the hit.
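To make that a bit more concrete, here is a minimal Python sketch of the idea, not a tested pipeline. It assumes BLAST+ is installed and that the search is run remotely with `-remote`, in which case `-entrez_query` can restrict nr to one organism (a local database would need a different restriction, e.g. by taxonomy ID). The helper names `build_blastp_command`, `best_hit` and `fasta_record`, the file name in the usage note, and the choice of keeping only the top hit by bit score are all my own assumptions. Note also that the `sseq` output column holds only the aligned part of the hit, so for a full-length protein you would still need to fetch the record by its accession.

```python
import csv
import io

# Tabular output with a custom column list; "sseq" is the aligned part of the hit.
OUTFMT = "6 qseqid sacc evalue bitscore sseq"


def build_blastp_command(query_fasta, organism, db="nr"):
    """Build a remote blastp command restricted to one organism (hypothetical helper)."""
    return [
        "blastp", "-remote",
        "-db", db,
        "-query", query_fasta,
        "-entrez_query", f"{organism}[Organism]",
        "-outfmt", OUTFMT,
        "-max_target_seqs", "5",
    ]


def best_hit(tabular_text):
    """Return (accession, ungapped aligned sequence) of the row with the highest bit score."""
    rows = [r for r in csv.reader(io.StringIO(tabular_text), delimiter="\t") if len(r) == 5]
    if not rows:
        return None
    top = max(rows, key=lambda r: float(r[3]))
    # Alignment gaps appear as "-" in sseq; drop them before writing FASTA.
    return top[1], top[4].replace("-", "")


def fasta_record(accession, organism, sequence, width=60):
    """Format one FASTA record that can be appended to an existing multifasta file."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{accession} {organism}"] + lines) + "\n"
```

In use, you would loop over your p query files and n species, run each command with something like `subprocess.run(build_blastp_command("asip_drerio.fa", "Stegastes partitus"), capture_output=True, text=True)`, pass the resulting stdout to `best_hit`, and append the output of `fasta_record` to the matching multifasta file. Whether the top hit really is the ortholog is exactly the kind of decision that is hard to script, so the hits are worth checking by eye (or with a reciprocal blast) before trusting them.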

After that, you'd have to reconstruct the tree, which is a whole different issue. I'm not sure how you are manually adding things to your tree.