Question

How to align a protein set to a genome?

1

8.4 years ago

Dan ▴ 530

Hi,

I have a new genome assembly and I want to align the protein sequences of the original assembly against it. What is the best tool for this job?

protein genome alignment prediction annotation • 5.1k views

ADD COMMENT • link updated 8.4 years ago by jomaco ▴ 200 • written 8.4 years ago by Dan ▴ 530

Ram · Answer 1 · 2015-11-18

4

8.4 years ago

Juke34 8.5k

Hi, several suggestions:

If you want approximate alignments, you can use Pmatch or tblastn.
If you want something precise, you can use exonerate or GeneWise, which give splice-aware alignments.

This publication reviews the performance of 7 tools that do spliced alignment of proteins (it also looks at 12 tools that do DNA alignment):
Hiroaki Iwata and Osamu Gotoh Nucleic Acids Res. 2012 Nov; 40(20): e161. doi: 10.1093/nar/gks708

The second approach is more time-consuming if you run these tools directly on the whole genome. Often the two steps are coupled: the first step is used to define chunks of the genome that are then sent to the second-step tools (e.g. within the Maker and Ensembl annotation pipelines).
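To make the coupling concrete, here is a minimal sketch of the first half: turning approximate hits into padded genomic regions that could then be handed to a splice-aware aligner. It assumes tblastn was run with tabular output (`-outfmt 6`, default 12 columns). The function name `hits_to_regions` and the `padding` and `max_gap` values are hypothetical choices for illustration, not what Maker or Ensembl actually do.

```python
from collections import defaultdict

def hits_to_regions(blast_lines, padding=2000, max_gap=20000):
    """Merge tabular tblastn hits into padded regions per (protein, contig).

    Assumes the default 12-column -outfmt 6 layout, where columns 9 and 10
    are the subject start and end. Strand is ignored in this sketch.
    """
    by_pair = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        # On the minus strand sstart > send, so normalise to (low, high).
        by_pair[(qseqid, sseqid)].append((min(sstart, send), max(sstart, send)))

    regions = []
    for (qseqid, sseqid), spans in sorted(by_pair.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start - cur_end <= max_gap:
                # Close enough to be another exon of the same locus.
                cur_end = max(cur_end, end)
            else:
                regions.append((qseqid, sseqid, max(1, cur_start - padding), cur_end + padding))
                cur_start, cur_end = start, end
        # The upper bound is not clipped to the contig length here.
        regions.append((qseqid, sseqid, max(1, cur_start - padding), cur_end + padding))
    return regions
```

Each region can then be extracted from the genome and aligned against only the protein that hit it, which is where most of the time saving comes from.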

Cheers

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Juke34 8.5k

1

I think you meant tblastn and not Blastx.

ADD REPLY • link 5.5 years ago by alslonik ▴ 310

0

Right, I will update the post!

ADD REPLY • link 5.5 years ago by Juke34 8.5k

0

May I also ask here whether Promer (MUMmer) has an option for aligning a proteome to a genome? From what I can see it uses only nucleotide sequences. Am I missing something?

ADD REPLY • link 5.5 years ago by alslonik ▴ 310

1

In the MUMmer4 publication they state "It is not restricted to DNA and can also align protein sequences". It is not clearly stated in the manual, but it looks like you can use a proteome as input to Promer. This approach does not provide splice-aware alignments.

Another really fast tool would be PSimScan.

Otherwise, if you are looking for splice-aware alignment, have a look at this publication, which compares the performance of 7 different tools for protein alignment:
Hiroaki Iwata and Osamu Gotoh Nucleic Acids Res. 2012 Nov; 40(20): e161. doi: 10.1093/nar/gks708

ADD REPLY • link 5.5 years ago by Juke34 8.5k

0

I tried to align a proteome to a genome using promer, but it treats the proteins as IUPAC nucleotide codes and turns every letter it does not recognize into N... Anyway, I'll look more into their publication, thanks. PSimScan looks like a good tool too! I just need an "approximate" alignment at this stage, but I will look into the publication you mentioned for future reference. Many thanks!

ADD REPLY • link 5.5 years ago by alslonik ▴ 310

0

Here is what they say in the MUMmer4.x README:

promer is for the protein level, all-vs-all comparison of nucleotide sequences contained in multi-FastA data files. The nucleotide input files are translated in all 6 reading frames and then aligned to one another via the same methods as nucmer.

I think it can only deal with nucleotides.
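For readers unfamiliar with what "translated in all 6 reading frames" means, here is a small illustrative sketch using the standard genetic code. It is not promer's implementation; the names `translate` and `six_frames` are made up for this example, and unknown codons are simply written as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def translate(seq):
    """Translate complete codons only; codons with non-ACGT letters become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames
```

Feeding amino-acid letters into a step like this makes no sense, which fits the behaviour described above where protein input gets read as nucleotide codes.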

ADD REPLY • link 5.5 years ago by alslonik ▴ 310

0

That's a pity: they don't check whether the input is AA or DNA and skip the six-frame translation if it is already protein. You should create an issue and ask if it could be implemented in a future version.
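Such a check could be as simple as the heuristic sketched below. This is only an assumption about how one might do it, not something MUMmer provides; the function name `guess_alphabet` and the 0.9 threshold are arbitrary choices for illustration.

```python
def guess_alphabet(sequence, threshold=0.9):
    """Guess 'nucleotide' or 'protein' from the fraction of A/C/G/T/U/N letters.

    Heuristic only: a short or low-complexity protein made mostly of
    A, C, G, T and N residues would be misclassified.
    """
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    nt_like = sum(c in "ACGTUN" for c in letters)
    return "nucleotide" if nt_like / len(letters) >= threshold else "protein"
```

Running it on the first few records of a FASTA file before launching an aligner is a cheap way to catch a proteome passed where a genome was expected.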

ADD REPLY • link 5.5 years ago by Juke34 8.5k

0

You're right. I will.

ADD REPLY • link 5.5 years ago by alslonik ▴ 310

score 1 · Answer 2 · 2015-11-18

I would use blat or exonerate. Blat is better for more closely related species, and the nice thing is that both can produce a blast-style table for easy parsing (though with exonerate you have to use the 'roll-your-own' output, `--ryo`, with a custom format string, which I could share). Exonerate is used by Maker for protein alignments and has many more options that let you control the splicing and intron modeling, codon alignment, etc. Blat is a lot faster, so that is a trade-off to consider.
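As a rough idea of what a 'roll-your-own' table can look like, here is a sketch that builds an exonerate command line and parses the resulting tab-separated rows. The format string and the column names in `RYO_FIELDS` are my own assumption, not the custom string mentioned above, and the helper names `build_exonerate_cmd` and `parse_ryo` are hypothetical.

```python
# Assumed column layout: query id, target id, percent identity,
# query begin/end, target begin/end, raw score.
RYO_FIELDS = ["qseqid", "sseqid", "pident", "qstart", "qend", "sstart", "send", "score"]
# Literal backslash escapes are kept so that exonerate expands them itself.
RYO_FORMAT = "%qi\\t%ti\\t%pi\\t%qab\\t%qae\\t%tab\\t%tae\\t%s\\n"

def build_exonerate_cmd(proteins, genome):
    """Return an argv list suitable for subprocess.run (not executed here)."""
    return [
        "exonerate",
        "--model", "protein2genome",
        "--showalignment", "no",
        "--showvulgar", "no",
        "--ryo", RYO_FORMAT,
        "--query", proteins,
        "--target", genome,
    ]

def parse_ryo(lines):
    """Parse rows written with RYO_FORMAT, skipping banner and footer lines."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != len(RYO_FIELDS):
            continue  # e.g. "Command line: ..." or "-- completed ..." lines
        row = dict(zip(RYO_FIELDS, parts))
        try:
            row["pident"] = float(row["pident"])
            for key in ("qstart", "qend", "sstart", "send", "score"):
                row[key] = int(row[key])
        except ValueError:
            continue  # not a data row after all
        rows.append(row)
    return rows
```

Check exonerate's coordinate convention in its man page before mixing these rows with BLAST tables, since the two tools do not number positions the same way.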

score 0 · Answer 3 · 2015-12-11

If you wish to align those proteins to a reference assembly, you could use the exonerate (http://www.ebi.ac.uk/~guy/exonerate/) protein2genome model, which models introns. I used this when I wanted to align proteins from the TAIR10 database to our reference genome. You would also probably want to split the file into considerably smaller chunks so that many faster individual alignments can be run in parallel before the results are merged; this way the alignment as a whole will be much quicker.
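The chunking step can be sketched in a few lines of plain Python. This is just one way to do it, assuming the protein FASTA fits comfortably in memory; `read_fasta` and `split_records` are hypothetical helper names, and records are dealt out round-robin so the chunks end up roughly the same size.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def split_records(records, n_chunks):
    """Distribute records round-robin into at most n_chunks non-empty lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]
```

Each chunk can then be written to its own file, aligned as a separate job, and the per-chunk outputs concatenated at the end.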