Hello,

I have an alignment between 3 species at the codon level and would like to calculate theta. The "ape" part is easy:

`a <- read.FASTA("M715_870002501/aligned_nt.fasta"`

`> class(a)

[1] "DNAbin"

methods(class = "haplotype")

[1] [.haplotypeplot.haplotypeprint.haplotypesort.haplotype`

` Non-visible functions are asterisked

class(a)

[1] "DNAbin"

s <- length(seg.sites(a))

s

[1] 547`

However, I am not sure how to import this DNAbin format into pegas for the `theta.s()`

function. The data function won't work:

```
> data(a)
Warning message:
In data(a) : data set \u2018a\u2019 not found
```

According to pegas docs, for theta.s:

```
Description
This function computes the population parameter THETA using the number of segregating sites
s - in a sample of
n - DNA sequences.
Usage
theta.s(s, n, variance = FALSE)
Arguments
s - a numeric giving the number of segregating sites.
n - a numeric giving the number of sequences.
variance - a logical indicating whether the variance of the estimated THETA should be
returned (TRUE), the default being FALSE.
```

Any ideas? I feel like solution is simple and I am missing it... do I just do theta.s(s, 3) ? Since there are 3 DNA sequences?

Adrian

Thanks a lot for the help! By the way, how do I get it per nucleotide, that was my next question. I need to compare 2 estimators and one of them spits out theta per nucleotide so it would make sense to normalize this one relative to length as well. This DNAbin format is driving me crazy :) I know it is a matter of just dividing by length of the alignment, but I cannot get the length into a variable. Also, I need to divide the variance by length as well, correct?

Just the normal -- , ncol (or dim). You should check how the other estimator deals with missing data (if you have any) if you wan to make a fair comparison.

I am looking at complete ORFs so there should not be any missing data. Oddly, the only way to get alignment length is

`length(nchar(as.character(a$ECU01_1130)))`

, which is pretty messy and inconvinient since one needs to supply at least one of the sequence names.`> a 3 DNA sequences in binary format stored in a list.`

`All sequences of same length: 1554`

`Labels: orf00001_C112950_F_131_1612 NBO_26g0012 ECU01_1130`

`Base composition:

a c g t

0.337 0.172 0.216 0.275

You are using read.FASTA to get the files in, which returns a list of characters . If you use read.dna( ... , format='"fasta") you should get a matrix that nrow/ncol/dim and the rest should work on.

In the ape manual, read.FASTA is given as an example in the read.dna function, which made me think read.FASTA is a shortcut for format="fasta". Can't believe this was the problem. Thank you so much for your help!