Question

Obtain sequences to test positive selection using dN/dS statistics?

2

8.2 years ago

LauferVA 4.2k

I am working on a project to detect positive selection in several different genes.

I am working with Colobus, Rhinopithecus, Macaca, Gorilla, Pan, Pongo, Homo, Mandrillus, and Papio, and have dogs and mice in there as well. We plan to separate these organisms into foreground and background clades as per Zhang, Nielsen, and Yang 2005 (Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level). I will then attempt to detect positive selection using the branch-site test described in that same manuscript.

I feel comfortable with the statistical and theoretical parts of the exercise, but I could really benefit from expert guidance on selecting sequences. Specifically, I would like to ask:

1. Beyond resources like GenBank, is there a good place to get sequences for several different genes in each of these species? Is it acceptable to retrieve sequences from a resource like that if you intend to publish a positive selection paper, or is there a recommended protocol? Or are there multiple databanks?
2. My second question is about making the correct phylogenetic tree for the above organisms. I obtained a tree from Clustal and sanity checked it, and it agrees with what I understand to be true, but I think the literature has a standard way of making these trees. Could anyone briefly outline the best way to construct a tree for these organisms? With any luck it will validate what I already made, but I definitely want to check.
3. Last, I would like to call on the experience of those who have conducted tests for positive selection based on dN/dS statistics: what are some of the most common or sinister ways of generating false positive results? Is there anything to double check so that my sequences and alignments are solid and the statistics measure positive selection, not some artifact of the sequence selection or alignment protocol?

Thank you very much for your time and help.

positive-selection evolution dN-dS • 2.9k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by LauferVA 4.2k

Ram · Accepted Answer · 2016-02-11

2

8.2 years ago

Brice Sarver ★ 3.8k

1. You can look at pulling down genes for sets of taxa using the PhyLoTA browser. Alternatively, you can try to get the genomes and extract the genes directly (not sure if you have genomes available for all of those). But the best way is probably to write some code to query GenBank and look for what you want, especially if you're focusing on generating a lot of single-locus datasets.
2. You want to thoroughly estimate a phylogenetic tree, not just use one that's probably being used as a guide tree for alignment in Clustal. This has been answered a lot on Biostars and other resources, but you can start with one of my older posts here: How to perform phylogeny analyses
3. The power to detect positive selection using tree-based approaches scales with the total amount of divergence among taxa in the tree. Roughly, you have more power to detect selection if your tree is 'deep' as opposed to 'shallow.' Check out some of Maria Anisimova's work and most model/PAML papers by Ziheng Yang and others.

Let me know if you have any other questions.
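
As a rough illustration of the suggestion above to write some code to query GenBank, here is a minimal sketch that only builds NCBI E-utilities `esearch` URLs, one per taxon, without sending any request. The gene symbol `TRIM5`, the species list, and the helper names are assumptions chosen for illustration, not something from this thread; the `[Gene Name]` and `[Organism]` field tags are standard Entrez search fields.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint for Entrez databases such as GenBank.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_term(gene, organism):
    """Entrez query restricting hits to one gene symbol in one organism."""
    return f"{gene}[Gene Name] AND {organism}[Organism]"


def build_esearch_url(gene, organism, retmax=20):
    """Return an esearch URL for the nucleotide database (no request is sent)."""
    params = {
        "db": "nucleotide",
        "term": build_search_term(gene, organism),
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical gene and a subset of the taxa from the question.
GENE = "TRIM5"
TAXA = ["Homo sapiens", "Pan troglodytes", "Macaca mulatta", "Papio anubis"]

urls = {taxon: build_esearch_url(GENE, taxon) for taxon in TAXA}
for taxon, url in urls.items():
    print(taxon, url)
```

Each URL returns a list of record IDs that could then be downloaded in a second step. If you loop over many genes, keep the request rate low, since NCBI throttles heavy unauthenticated use.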

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by Brice Sarver ★ 3.8k

0

Hi Brice,

Thank you very much for taking the time to help me out. Your response was great; I just want to clarify a few things:

With regard to 1), based on your response, it seems you are saying that it is acceptable to pull down sequences from GenBank and use those sequences for an alignment that you trust and plan to publish on. Could you confirm that?

With regard to 3), I was not talking so much about statistical power. I understand at least the basics of how branch and site models were developed as tests that (greatly) increase power; what I was looking for is red flags based on practical experience. For example, in the GWAS literature you would never try to publish an analysis without controlling for population stratification, and there are many other pitfalls that those in the field know as well. I was looking for tips like that ("be sure to do YYYY so that you don't get false positives due to ZZZZZ").

Thanks again very much!

ADD REPLY • link 8.2 years ago by LauferVA 4.2k

1

1. Yes. GenBank is one of the sources where data in publications are curated and made freely available to the public. Using data from there is completely acceptable.

2. The analysis is only as good as the multiple sequence alignment and the tree, so you'll want to make sure your data is in frame, stop codons have been removed, things translate appropriately, etc. Beyond this, there's not really any way for the analysis to be messed up - you're just fitting models to your data, and you'll need to appropriately select among those models.
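
The in-frame and stop-codon checks mentioned above could be sketched roughly as follows. This is a minimal sketch, assuming the alignment is already loaded as a dictionary of name to aligned coding sequence, that gaps are written as `-`, and that the standard nuclear genetic code applies; the function name and the toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code only


def check_codon_alignment(seqs):
    """Return a list of human-readable problems found in a codon alignment.

    seqs maps a sequence name to an aligned coding sequence (gaps as '-').
    """
    problems = []
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    for name, seq in seqs.items():
        seq = seq.upper()
        if len(seq) % 3 != 0:
            problems.append(f"{name}: length {len(seq)} is not a multiple of 3")
        # Split into complete codons, ignoring any trailing partial codon.
        codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
        for index, codon in enumerate(codons, start=1):
            if codon in STOP_CODONS:
                problems.append(f"{name}: stop codon {codon} at codon {index}")
            elif "-" in codon and codon != "---":
                problems.append(f"{name}: gap breaks codon {index} ({codon})")
    return problems


# Toy alignment: the third sequence has a stop codon and a frame-breaking gap.
alignment = {
    "homo": "ATGGCCAAATTT",
    "pan": "ATGGCC---TTT",
    "macaca": "ATGTAAAA-TTT",
}
for problem in check_codon_alignment(alignment):
    print(problem)
```

Terminal stop codons are flagged too, since they need to be removed before the analysis. An empty result only means these basic checks passed, not that the alignment itself is correct.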

ADD REPLY • link 8.2 years ago by Brice Sarver ★ 3.8k

0

Brice - thank you so much for your time and explanation. Based on your answer and clarification, I have accepted this answer.

ADD REPLY • link 8.2 years ago by LauferVA 4.2k

0

I just read through your other post regarding question 2). It was very helpful; thank you very much for the referral.

With regard to point 4.:

4. Estimate a tree under that model using BEAST/MrBayes/PAUP*/Garli. If the dataset is too large and the run will not converge, try RAxML under GTR+G.

I will estimate a tree, then compare it to the literature and to the Clustal tree. If all three agree, I will run with it. Thank you!
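
One way to make the three-way comparison concrete is to compare the sets of clades in each tree. The following is a minimal sketch, assuming all three trees are rooted the same way (for example on the outgroup), written in plain Newick without quoted labels, and using identical taxon names; the Newick strings are toy examples, not real results.

```python
def clades(newick):
    """Return the clades (frozensets of leaf names) of a rooted Newick tree.

    Branch lengths and internal node labels are ignored.
    """
    text = newick.strip().rstrip(";")
    found = set()
    stack = []  # one list of leaf names per currently open '('
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "(":
            stack.append([])
            i += 1
        elif ch == ")":
            leaves = stack.pop()
            found.add(frozenset(leaves))
            if stack:
                stack[-1].extend(leaves)
            i += 1
            # Skip an optional internal label or branch length after ')'.
            while i < len(text) and text[i] not in ",()":
                i += 1
        elif ch == ",":
            i += 1
        else:
            j = i
            while j < len(text) and text[j] not in ",()":
                j += 1
            name = text[i:j].split(":")[0].strip()
            if name and stack:
                stack[-1].append(name)
            i = j
    return found


def same_topology(tree_a, tree_b):
    """True if two rooted trees contain exactly the same clades."""
    return clades(tree_a) == clades(tree_b)


# Toy trees (hypothetical): literature, estimated, and alignment guide tree.
literature = "(((homo,pan),gorilla),pongo);"
estimated = "(((pan:0.01,homo:0.01):0.004,gorilla:0.02):0.02,pongo:0.04);"
guide_tree = "((homo,(pan,gorilla)),pongo);"

print(same_topology(literature, estimated))
print(same_topology(literature, guide_tree))
print(clades(literature) - clades(guide_tree))
```

The set difference on the last line lists the clades that one tree supports and the other does not, which points directly at where the trees disagree.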

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by LauferVA 4.2k