9.9 years ago
@Gjuggler1227
There really is no better advice than as posted above: use PAML!
To expand a bit: PAML is the accepted standard for codon-level selection analyses and is used in a sizable proportion of publications that mention either the words 'evolution' or 'selection'. The codeml
sub-program is what you'd want to use: it supports nearly everything you might want to test with respect to estimating dN/dS or detecting / quantifying purifying or positive selection in genes using codon alignments.
Common tasks performed with codeml
involve:
- Estimating a single dN/dS ratio (roughly equivalent to Ka/Ks) for a single alignment of coding sequences. This gives a ballpark figure for the amount of protein-level selective constraint experienced by a gene. But be warned, most genes do not evolve homogenously along their length -- different domains (extracellular vs. intracellular, for example) may experience drastically different levels of constraint.
- Identifying different dN/dS ratios in different branches of a phylogenetic tree
- Identifying genes with individual sites or regions under positive selection, sometimes in specific branches of a tree.
PAML and codeml are complicated (and sometimes finicky) programs, with dozens of analyses available depending on how you configure them. In order to use it well, you'll really want to either: - Read the documentation - Read one or two of Ziheng Yang's many publications about the models PAML implements and the applications it enables - Look at the examples in the PAML download package.
Regarding the 'true relation between conservation and selection': the codon models implemented by PAML are more sensible than amino acid-based measures for measuring selection pressures acting on proteins because they offer built-in 'correction' for synonymous mutation rates (as long as synonymous sites are evolving neutrally, which seems to be largely the case with smallish population sizes).
Amino-acid based measures of evolutionary conservation are not so well normalized, because the rate of amino acid change is the result of a combination of mutation rate _plus_ selection for conservation of protein structure and function. Codon models, on the other hand, use DNA alignments of coding regions and an explicit model incorporating the genetic code to estimate the amount of natural selection for or against protein-level changes. To paraphrase Ziheng Yang's Computational Molecular Evolution: dN/dS represents the ratio between the rate at which a protein is evolving to the rate at which it _would_ be evolving were it a non-structured, non-functional string of DNA characters.
Also.. If you wish to run dN/dS on 100-1000s of proteins me and a friend have created our own Python script that implements the the Nei-Gojobori algorithm (but uses approximate multiple-substitutions correction rather than exact) on large-scale protein sequence pair data (i.e. one pair of sequences per protein). it implements both whole-protein and sliding-window modes. Given two million base-pairs of sequence data it will finish the job in <15 seconds with ~1% error vs. MATLAB. If you ask, I can properly document it and publish on GitHub.
https://github.com/a1ultima/hpcleap_dnds Specifically you need: - [dnds.py][1] For more details, either [read this][1], or see the comment below. [1]: https://github.com/a1ultima/hpcleap_dnds/blob/master/README.md
Hi, can you send me that script too? Or publish on Github
Hi! Have you published the python script on GitHub? Or could you please send me the python? Thank you so much!
Thank you for the detailed answer. It helped me a lot. Could you send me the code, too, please? My email is kitaekim077@gmail.com. I will appreciate it.
Hi @a1ultima, I know I am little late to join this ticket, but did you ever make a github page for your program. If not, could you please send it to me through email, salamzader21@gmail.com Much appreciated.
Hi a1ultima, I know im very late to ask you the python script. Could you send me the link of GitHub or via email to vharshavardhanan@gmail.com. I will appreciate it. Thank you.
For all who have asked for the dN/dS Python file, it is finally on GitHub, I really do apologize for how long I have taken to get back to some of you. This Python file I am sharing is part of a bigger pipeline, but I hope I have documented it well enough now so that you can just take this part and play with it yourself:
https://github.com/a1ultima/hpcleap_dnds
Specifically you need:
Optionally:
We've made a GUI (in the form of a webpage: vg-genes.html), that works with a local Python server (start_server.py), so that means you don't have to mess around with the dnds.py script itself. I will post a URL to the GUI as soon as it is hosted, if you cannot wait, then read the README.md to set up the local python server yourself and also how to run the GUI. Though, it currently has restrictions: can only accept VectorBase gene ids as input (it needs to do a REST call to the vectorbase.org to retrieve sequence data). But later we'll try and make it more flexible. If you want to analyze pairs of protein sequences outside of VectorBase, then focus only on dnds.py (above).
For very fast computation dnds.py needs to work with changes.py, in that it imports changes.py in the same directory. But you don't need to really worry about it, since we decided the pre-pickled computations data was worth uploading to GitHub anyway, i.e. you will only ever have to read the changes.py code if you decide to delete, the pre-pickled computations.
If your pair of input sequences that are not already aligned, you will be faced with a performance bottleneck in the alignment process, a step that is necessary to calculate dN/dS via dnds.py. By default, dnds.py will try to align the inputs by importing align.py, but I have tried to write informative comments in dnds.py to help you turn this step off. The catch is that you would then need to have the inputs codon aligned prior to running dnds.py. Apart from boosted performance, another issue with keeping the align.py step running is that it needs to import an external Python module called Biopython (
import Bio
). If you do not want to install Biopython, then make sure you align the inputs yourself, turn off the alignment part in dnds.py, and run dnds.py.I am not sure this message will be seen by any of those who asked for it, will this pop up in their notifications? I will for now assume it will.
Hi a1ultima,
Hello, thanks for your tutorial. Could you kindly send me the R code as well, and current link for KaKs calculator. Email: ronoevan@gmail.com Thanks, Evans
I am really sorry about the embarassingly long wait before giving the R script for calculating error-bars for your dN/dS analyses.
For anybody who still needs it, I've put the script on GitHub with instructions on what you need to do to get it to work in the README.md:
https://github.com/a1ultima/kaks_error_bars
Assalm o alikum everyone,
I'm using ka ks calculator for ka ks ratio. But i'm too confused either it is working for multiple sequence or only for a pair of sequences ????
Hi Assalm, I've only ever got it to work with pairs