Question

How can I add the associated species name to the Ensembl IDs in a FASTA file (CDS)?

5.8 years ago

dtejadamartinez ▴ 20

Hi,

I downloaded the orthologue FASTA file (CDS) for one gene from Ensembl, but the headers contain only the Ensembl IDs, not the associated species names.

How can I add the associated species name to each Ensembl ID in the FASTA file?

(I need to do this for hundreds of genes.)

Thanks,

Ensembl • 1.9k views

ADD COMMENT • link updated 5.8 years ago by finswimmer 16k • written 5.8 years ago by dtejadamartinez ▴ 20

You can tell a lot from an Ensembl gene ID, as @Eric Lim pointed out with the link to the Ensembl help page.

The ID contains a three-letter species code, e.g. MUS for mouse (Latin name Mus musculus). The BRAF orthologue in mouse, for example, is ENSMUSG00000002413.

So if you know the 3 letter code, you know the species name in your FASTA file. It's that easy. If you don't know what the 3 letter code means, check Ensembl stable ID prefixes.
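To make the idea concrete, here is a minimal Python sketch that splits an Ensembl stable ID into species prefix, feature type and number. The regular expression and the list of feature codes (E, FM, G, GT, P, R, T) are my reading of the Ensembl help page, so treat them as assumptions; `split_stable_id` is a made-up helper name, and the two-entry lookup table only holds the species mentioned in this thread.

```python
import re

# Assumed layout: "ENS" + optional species code + feature code + digits,
# optionally followed by ".version". The species part is matched lazily so
# that the feature code is not swallowed by it.
STABLE_ID = re.compile(r"^(ENS[A-Z]*?)(E|FM|GT|G|P|R|T)(\d+)(?:\.\d+)?$")

# Tiny illustrative lookup; a real table would come from the
# Ensembl stable ID prefix list.
SPECIES = {
    "ENSMUS": "Mus musculus",
    "ENSTNI": "Tetraodon nigroviridis",
}


def split_stable_id(stable_id):
    """Return (species_prefix, feature_code, number) for an Ensembl stable ID."""
    m = STABLE_ID.match(stable_id)
    if m is None:
        raise ValueError("not recognised as an Ensembl stable ID: " + stable_id)
    return m.group(1), m.group(2), m.group(3)


prefix, feature, number = split_stable_id("ENSMUSG00000002413")
print(prefix, feature, SPECIES.get(prefix, "unknown species"))
```

Human IDs carry no species code (they start with plain ENS), which the lazy match handles by returning "ENS" as the prefix.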

ADD REPLY • link 5.8 years ago by Denise CS ★ 5.2k

https://useast.ensembl.org/Help/Faq?id=488

By species associated name, I assume you meant gene symbol? You can parse the GTF yourself or use services like BioMart.

ADD REPLY • link 5.8 years ago by Eric Lim ★ 2.1k

I think what dtejadamartinez wants is to get the species names into the orthologues file that one can download from the Ensembl comparative genomics page. Here is one example. Click the Download orthologues button and then select the FASTA format.

dtejadamartinez: If you use one of the other formats, you should be able to get the species names.

ADD REPLY • link 5.8 years ago by GenoMax 141k

An example CDS entry and the expected output would help.

ADD REPLY • link 5.8 years ago by cpad0112 21k

I am reasonably sure that this is a pre-formatted file pre-computed by Ensembl. I posted an example in my comment above.

ADD REPLY • link 5.8 years ago by GenoMax 141k

Thanks,

If I select another format (not FASTA) in Ensembl, it no longer offers the option to download the CDS.

ADD REPLY • link 5.8 years ago by dtejadamartinez ▴ 20

In that case you will need to extract the Ensembl IDs from your FASTA file. Use the biomaRt package in R (or BioMart on the web) to get the species names, and then add them back to the FASTA file.
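The first step, pulling the IDs out of the FASTA headers, could be sketched in Python as below. It assumes that the ID is the first whitespace-delimited token after `>` and that any `.version` suffix should be dropped; `fasta_ids` is a made-up helper name. The lookup itself (biomaRt or the BioMart web form) is not shown.

```python
def fasta_ids(lines):
    """Collect the unique IDs from FASTA header lines, in order of appearance."""
    seen = set()
    ids = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to extract
        fields = line[1:].split()
        if not fields:
            continue  # bare ">" with no ID
        # Assumption: drop a trailing ".version" if one is present.
        stable_id = fields[0].split(".")[0]
        if stable_id not in seen:
            seen.add(stable_id)
            ids.append(stable_id)
    return ids


# In-memory example; for a real file use: fasta_ids(open("ortho.fa"))
example = [">ENSTNIP00000017949\n", "ATGGCC\n", ">ENSMUSG00000002413.5 demo\n", "ATGAAA\n"]
print("\n".join(fasta_ids(example)))
```

The resulting list, one ID per line, can be pasted into the BioMart web form or read into R as a vector of filter values.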

What are you planning to do with this file BTW?

ADD REPLY • link 5.8 years ago by GenoMax 141k

Thanks, then I will use biomaRt in R as you suggest.

I'm going to do a positive selection analysis (dN/dS).

ADD REPLY • link 5.8 years ago by dtejadamartinez ▴ 20

score 2 · Answer 1 · 2018-06-30

Time to summarize :)

Eric Lim showed what the Ensembl ID tells us. Denise - Open Targets found the correct link to the list of ID prefixes (for some reason the link on the help page is wrong; I've contacted Emily_Ensembl about this). And finally genomax gave us a link to an example file we can use.

First, we need to create a tab-separated file (prefixes.txt) with the species prefix in the first column and the species name in the second.

Then look at the IDs in your downloaded orthologue FASTA file to see which feature type they have. In the linked example, an ID looks like this:

>ENSTNIP00000017949

The last character before the digits is always a P, for protein.

Now we can modify the headers: first read in prefixes.txt, then iterate over the FASTA file, extract the species prefix from every header line (everything between > and P), look the prefix up in our list and append the species name to the line. Note that this needs GNU awk, because the three-argument form of match() is a gawk extension:

$ awk -F "\t" -v OFS="\t" 'FNR==NR {species[$1]=$2; next} {if (match($0, /^>([A-Z]+)P[0-9]/, id) && (id[1] in species)) {print $0, species[id[1]]} else {print}}' prefixes.txt ortho.fa > output.fa

In the output the header line now looks like this:

>ENSTNIP00000017949     Tetraodon nigroviridis (Tetraodon)
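If GNU awk is not available, the same lookup can be sketched in Python. This is only a rough equivalent of the one-liner above, under the same assumptions: prefixes.txt is tab-separated (prefix, then species name) and the IDs are protein IDs, so the prefix is the run of capital letters before the P that precedes the digits. `load_prefixes` and `annotate` are made-up helper names.

```python
import re

# Header such as ">ENSTNIP00000017949": capture the letters before "P<digit>".
HEADER = re.compile(r"^>([A-Z]+)P\d")


def load_prefixes(lines):
    """Build {prefix: species name} from tab-separated lines."""
    species = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            species[fields[0]] = fields[1]
    return species


def annotate(fasta_lines, species):
    """Yield FASTA lines, appending a tab and the species name to known headers."""
    for line in fasta_lines:
        text = line.rstrip("\n")
        m = HEADER.match(text)
        if m and m.group(1) in species:
            text = text + "\t" + species[m.group(1)]
        yield text + "\n"


# Usage with the file names from the awk command (not run here):
# with open("prefixes.txt") as p, open("ortho.fa") as f, open("output.fa", "w") as out:
#     out.writelines(annotate(f, load_prefixes(p)))
```

Headers whose prefix is not in the list, and all sequence lines, pass through unchanged, as in the awk version.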

Good team play!

fin swimmer