Biostar Beta. Not for public use.
What parameters should be used for Seq-Gen?
1
Entering edit mode
4.2 years ago
rzmn • 10

I am CS major with little biology background. I am doing some experiments, and at some point, I need to simulate DNA alignments along a phylogenetic tree (which is produced under Yule process using r8s program). I was able to produce "some" sequence using Seq-Gen, Seq-Gen manual, with the following parameters from that tree:

seq-gen -o p -m GTR -i 0.01 -f 0.3 0.2 0.2 0.3 -s 0.5 -z <some_seed> -l 2000

where

• -o"output_file_format"
• -m"MODEL"
• -i"PROPORTION_INVARIABLE"
• -f"STATE_FREQUENCIES"
• -s"BRANCH_LENGTH_SCALE"
• -z"RANDOM_NUMBER_SEED"
• -l"SEQUENCE_LENGTH"

The problem is that when I use RAxML to obtain a maximum likelihood tree from this sequence, the resulting tree is very different from the one from which the sequence data was generated, namely their RF distance is almost maximum possible. I suspect there is something wrong with my parameters of Seq-Gen. Probably they are not biologically sensible, or I need to use more parameters like -r option to specify substitution rates? (If that matters, I used GTRGAMMA model for RAxML)

Update:

Somehow I realized when I change branch lengths of the tree, the problem is mitigated. So my #0 question is how to set branch lengths of a (say ultrametric) phylogeny such that it biologically makes sense? Is there a standard way of doing this? Any software is available for this?

0
Entering edit mode

Just as a sanity check, could you generate a pairwise distance matrix for the sequences that were generated? Plus remember that seq-gen is probabilistic and a stochastic process may generate sequences that are different from the coalescent tree by pure chance.

1
Entering edit mode
16 months ago
Brice Sarver ♦ 2.6k
United States

Something to be aware of here that's unclear from your question: you give seq-gen a tree (with branches in units of substitutions/site), and then you can simulate sequences along that tree. You're doing this, right? As your update suggests, you are using the -s option to scale your branch lengths before substitution. You probably don't want to do this if you are expecting the -exact- same tree to come out the other end.

When I've used seq-gen for simulation studies, I often find a paper for something interesting and directly specify the rate matrix itself, including any gamma-distributed rate heterogeneity (the -r/a/g options). This ensures that my values are biologically realistic. Make sure you know what data you're getting your parameters from; mitochondrial sequences, for example, evolve more quickly than nuclear sequences.

I've simulated trees under Yule and Birth-Death processes and simulated sequences on these trees, too, and you might want to consider using Tanja Stadler's TreeSim (in R) to make sure you condition appropriately (taxa and/or age) for your speciation/extinction rates, as well.

Also, you need to remove the space between -o and the argument, and the call is something like:

seq-gen [arguments] < [tree.phy] > [sequences.nex]