Biostar Beta. Not for public use.
What parameters should be used for Seq-Gen?
Entering edit mode
4.2 years ago
rzmn • 10

I am CS major with little biology background. I am doing some experiments, and at some point, I need to simulate DNA alignments along a phylogenetic tree (which is produced under Yule process using r8s program). I was able to produce "some" sequence using Seq-Gen, Seq-Gen manual, with the following parameters from that tree:

seq-gen -o p -m GTR -i 0.01 -f 0.3 0.2 0.2 0.3 -s 0.5 -z <some_seed> -l 2000


  • -o"output_file_format"
  • -m"MODEL"

The problem is that when I use RAxML to obtain a maximum likelihood tree from this sequence, the resulting tree is very different from the one from which the sequence data was generated, namely their RF distance is almost maximum possible. I suspect there is something wrong with my parameters of Seq-Gen. Probably they are not biologically sensible, or I need to use more parameters like -r option to specify substitution rates? (If that matters, I used GTRGAMMA model for RAxML)


Somehow I realized when I change branch lengths of the tree, the problem is mitigated. So my #0 question is how to set branch lengths of a (say ultrametric) phylogeny such that it biologically makes sense? Is there a standard way of doing this? Any software is available for this?

Entering edit mode

Just as a sanity check, could you generate a pairwise distance matrix for the sequences that were generated? Plus remember that seq-gen is probabilistic and a stochastic process may generate sequences that are different from the coalescent tree by pure chance.

Entering edit mode
16 months ago
Brice Sarver ♦ 2.6k
United States

Something to be aware of here that's unclear from your question: you give seq-gen a tree (with branches in units of substitutions/site), and then you can simulate sequences along that tree. You're doing this, right? As your update suggests, you are using the -s option to scale your branch lengths before substitution. You probably don't want to do this if you are expecting the -exact- same tree to come out the other end.

When I've used seq-gen for simulation studies, I often find a paper for something interesting and directly specify the rate matrix itself, including any gamma-distributed rate heterogeneity (the -r/a/g options). This ensures that my values are biologically realistic. Make sure you know what data you're getting your parameters from; mitochondrial sequences, for example, evolve more quickly than nuclear sequences.

I've simulated trees under Yule and Birth-Death processes and simulated sequences on these trees, too, and you might want to consider using Tanja Stadler's TreeSim (in R) to make sure you condition appropriately (taxa and/or age) for your speciation/extinction rates, as well.

Also, you need to remove the space between -o and the argument, and the call is something like:

seq-gen [arguments] < [tree.phy] > [sequences.nex]
Entering edit mode

I am using 'simulate' command in r8s to construct a tree under Yule process. This generates a topology with all branch lengths the same. Then I used seq-gen to simulate sequences along this tree. Then the tree constructed from this sequence was not similar to the initial topology. What I did to lessen the problem was multiply branch lengths by random numbers between 0 and 1 'manually', and it helped. so now I have two (perhaps stupid) questions: 1- What is the standard practice to modify branch lengths when they are the same? 2- Is specifying -a and -r options in seq-gen eliminates the need to modify branch lengths in my case?


Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1