I am working on a project to detect positive selection in several different genes.
I am working with Colobus, Rhinopithecus, Macaca, Gorilla, Pan, Pongo, Homo, Mandrillus, and Papio, and have dogs and mice in there as well. We plan to separate these organisms into foreground and background clades as per Yang et al. 2005 ("Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level"). I am then going to attempt to detect positive selection using the branch-site model test described in that same manuscript.
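For concreteness, here is a minimal sketch of the likelihood ratio test I have in mind, as I understand it from that paper: the alternative model (omega estimated on the foreground branches) is compared against the null with omega fixed at 1, and twice the log-likelihood difference is compared to a chi-square distribution with one degree of freedom (the conservative choice) or to a 50:50 mixture of a point mass at zero and that chi-square. The function name and the log-likelihood values are hypothetical placeholders, not output from a real run.

```python
import math


def branch_site_lrt(lnl_null, lnl_alt):
    """Likelihood ratio test for a branch-site comparison.

    lnl_null: log-likelihood with foreground omega fixed at 1 (null).
    lnl_alt:  log-likelihood with foreground omega estimated (alternative).
    Returns (statistic, p_chi2_df1, p_mixture).
    """
    # Numerical noise can make the statistic slightly negative; clamp at 0.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-square with 1 df, via the normal tail:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_chi2 = math.erfc(math.sqrt(stat / 2.0))
    # 50:50 mixture of a point mass at 0 and chi-square with 1 df.
    p_mix = 0.5 * p_chi2 if stat > 0 else 1.0
    return stat, p_chi2, p_mix


# Hypothetical log-likelihoods, for illustration only.
stat, p_chi2, p_mix = branch_site_lrt(-2345.67, -2342.10)
print(f"2*dlnL = {stat:.2f}, p (chi2, df=1) = {p_chi2:.4f}, p (mixture) = {p_mix:.4f}")
```

Each gene would get its own pair of log-likelihoods, so I assume some multiple-testing correction across genes is needed on top of this.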
I feel comfortable with the statistical and theoretical parts of the exercise, but I could really benefit from expert guidance on selecting sequences. Specifically, I would like to ask:
1) Beyond resources like GenBank, is there a good place to get sequences for several different genes in each of these species? Is it acceptable to retrieve sequences from a resource like that if you intend to publish a positive selection paper, or is there a recommended protocol? Or are there multiple databanks I should be drawing on?
2) My second question revolves around making the correct phylogenetic tree for the above organisms. I have obtained this tree from Clustal, have sanity checked it, and it makes sense with what I understand to be true, but I think that in the literature there is a standard way of making these trees. Could anyone briefly outline the best way to construct a tree for these organisms? With any luck it will validate what I already made, but I definitely want to check that.
3) Last, I would like to call on the experience of those who have conducted tests for positive selection based on dN/dS statistics: what are some of the most common or sinister ways of generating false positive results? Is there anything I should double-check to make sure my sequences and alignments are solid, so that the statistics are measuring positive selection and not some artifact of the sequence selection or alignment protocol?
Thank you very much for your time and help.