I am working on a project to detect positive selection in several different genes.
I am working with Colobus, Rhinopithecus, Macaca, Gorilla, Pan, Pongo, Homo, Mandrillus, and Papio, and have dogs and mice in there as well. We plan to separate these organisms into foreground and background clades as per Zhang et al. 2005 ("Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level"). I will then attempt to detect positive selection using the branch-site test described in that same manuscript.
I feel comfortable with the statistical and theoretical parts of the exercise; however, I could really benefit from expert guidance on selecting sequences. Specifically, I would like to ask:
- Beyond resources like GenBank, is there a good place to get sequences for several different genes in each of these species? Is it acceptable to retrieve sequences from a resource like that if you intend to publish a positive selection paper, or is there a recommended protocol? Or are there multiple databanks?
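For reference, the branch-site test named above comes down to a likelihood ratio test: the alternative model estimates the foreground omega freely, the null fixes it at 1, and twice the log-likelihood difference is commonly compared against a chi-square distribution with one degree of freedom. The following is a minimal sketch of that final step only, assuming you already have the two log-likelihoods from your fitting software. The function name and the example values are hypothetical.

```python
import math

def branch_site_lrt(lnl_null, lnl_alt):
    """Return (2 * delta lnL, p-value) for a one-degree-of-freedom LRT."""
    stat = 2.0 * (lnl_alt - lnl_null)
    # Optimisation noise can make the statistic slightly negative; clamp to 0.
    stat = max(stat, 0.0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods, for illustration only.
stat, p = branch_site_lrt(lnl_null=-12345.67, lnl_alt=-12341.20)
print(f"2*dlnL = {stat:.2f}, p = {p:.4f}")
```

If many genes or many foreground branches are tested, the resulting p-values would still need a multiple-testing correction, which this sketch does not attempt.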
- My second question concerns building the correct phylogenetic tree for the above organisms. I have obtained a tree from Clustal and sanity-checked it, and it agrees with what I understand to be true, but I think there is a standard way of making these trees in the literature. Could anyone briefly outline the best way to construct a tree for these organisms? With any luck it will validate what I already made, but I definitely want to check that.
- Last, I would like to call on the experience of those who have conducted tests for positive selection based on dN/dS statistics: what are some of the most common or sinister ways of generating false positive results? Is there anything I should double-check to make sure my sequences and alignments are solid, so that the statistics are measuring positive selection and not some artifact of the sequence selection or alignment protocol?
Thank you very much for your time and help.
Hi Brice,
Thank you very much for taking the time to help me out. Your response was great; I just want to clarify a few things:
With regard to 1), based on your response, it seems like you are saying that it is acceptable to pull down sequences from GenBank and use those sequences for an alignment that you trust and plan to publish on. Could you confirm that?
With regard to 3), I was not talking so much about statistical power. I understand at least the basics of how branch/site models were developed as tests that (greatly) increase power; I was looking for red flags based on practical experience. An example of what I mean: in the GWAS literature, you would never try to publish an analysis without controlling for population stratification, but there are many other pitfalls that those in the field know as well. I was looking for tips like that ("be sure to do YYYY so that you don't get false positives due to ZZZZZ").
Thanks again very much!
1. Yes. GenBank is one of the sources where data from publications are curated and made freely available to the public. Using data from there is completely acceptable.
2. The analysis is only as good as the multiple sequence alignment and the tree, so you'll want to make sure your data is in frame, stop codons have been removed, things translate appropriately, etc. Beyond this, there's not really any way for the analysis to be messed up - you're just fitting models to your data, and you'll need to appropriately select among those models.
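The alignment checks mentioned above (reading frame, stop codons, codons that survive alignment intact) are easy to automate before fitting any models. Here is a minimal sketch, assuming the alignment is already loaded as a dict mapping sequence names to aligned nucleotide strings and that the standard nuclear genetic code applies. The function names and the toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code (assumption)

def check_coding_sequence(name, seq):
    """Return a list of problems found in one aligned coding sequence."""
    problems = []
    seq = seq.upper()
    if len(seq) % 3 != 0:
        problems.append(f"{name}: length {len(seq)} is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    for index, codon in enumerate(codons, start=1):
        if codon in STOP_CODONS:
            problems.append(f"{name}: stop codon {codon} at codon {index}")
        elif "-" in codon and codon != "---":
            # A gap that covers only part of a codon shifts the reading frame.
            problems.append(f"{name}: gap breaks codon {index} ({codon})")
    return problems

def check_alignment(alignment):
    """Run the per-sequence checks and confirm all rows have equal length."""
    problems = []
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        problems.append(f"sequences have different lengths: {sorted(lengths)}")
    for name, seq in alignment.items():
        problems.extend(check_coding_sequence(name, seq))
    return problems

# Toy alignment, for illustration only.
toy = {
    "homo": "ATGGCC---TTT",
    "pan":  "ATGGC-A--TAA",
}
for line in check_alignment(toy):
    print(line)
```

A terminal stop codon is reported like any other, so it would need to be trimmed from every sequence before the alignment passes.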
Brice - thank you so much for your time and explanation. Based on your answer and clarification, I have accepted this answer.
I just read through your other post regarding question 2). It was very helpful; thank you very much for the referral.
With regard to point 4.:
4. Estimate a tree under that model using BEAST/MrBayes/PAUP*/Garli. If the dataset is too large and the run will not converge, try RAxML under GTR+G.
I will estimate a tree, then compare it to the literature and to the Clustal tree. If all three agree, I will run with it. Thank you!
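One way to make the "do the trees agree" check concrete is to compare unrooted topologies through their bipartitions (splits): two trees on the same taxa have the same unrooted topology exactly when their sets of non-trivial splits match. Below is a minimal sketch, assuming each tree is available as a Newick string with unique leaf names that contain no whitespace, parentheses, commas, colons, or semicolons. The function names are hypothetical, and the example trees are toy inputs rather than a statement about the true phylogeny.

```python
import re

def leaf_sets(newick):
    """Return (all_leaves, clades); clades holds the leaf set under each internal node."""
    text = re.sub(r"\s+", "", newick)
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    stack, clades, all_leaves = [], set(), set()
    previous = None
    for token in tokens:
        if token == "(":
            stack.append(set())
        elif token == ")":
            finished = stack.pop()
            clades.add(frozenset(finished))
            if stack:
                stack[-1].update(finished)
        elif token not in (",", ";") and previous != ")":
            # A leaf, possibly followed by ":branch_length".
            name = token.split(":")[0]
            all_leaves.add(name)
            if stack:
                stack[-1].add(name)
        # Tokens right after ")" are internal labels or lengths and are ignored.
        previous = token
    return all_leaves, clades

def splits(newick):
    """Return (leaves, non-trivial splits), each split stored as the side without the anchor leaf."""
    leaves, clades = leaf_sets(newick)
    anchor = min(leaves)
    result = set()
    for clade in clades:
        side = leaves - clade if anchor in clade else clade
        if 1 < len(side) < len(leaves) - 1:
            result.add(frozenset(side))
    return leaves, result

def rf_distance(newick_a, newick_b):
    """Robinson-Foulds distance: number of splits found in only one of the trees."""
    leaves_a, splits_a = splits(newick_a)
    leaves_b, splits_b = splits(newick_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same taxa")
    return len(splits_a ^ splits_b)

# Toy trees, for illustration only.
estimated = "((Homo:0.1,Pan:0.1):0.05,Gorilla:0.2,(Pongo:0.3,Macaca:0.5):0.1);"
reference = "(((Pan,Homo),Gorilla),(Macaca,Pongo));"
print(rf_distance(estimated, reference) == 0)
```

A distance of 0 means the two topologies agree once rooting, branch lengths, and the order of children are ignored. Any non-zero value points to at least one conflicting grouping worth inspecting by eye.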