Biostar Beta. Not for public use.
Pegas import fasta and calculate theta
0
Entering edit mode
14 months ago

Hello,

I have an alignment between 3 species at the codon level and would like to calculate theta. The "ape" part is easy:

`> class(a)
[1] "DNAbin"

methods(class = "haplotype")
[1] [.haplotype plot.haplotype print.haplotype sort.haplotype`

` Non-visible functions are asterisked

class(a)
[1] "DNAbin"
s <- length(seg.sites(a))
s
[1] 547`

However, I am not sure how to import this DNAbin format into pegas for the theta.s() function. The data function won't work:

> data(a) Warning message: In data(a) : data set \u2018a\u2019 not found

According to pegas docs, for theta.s:

Description This function computes the population parameter THETA using the number of segregating sites s - in a sample of n - DNA sequences. Usage theta.s(s, n, variance = FALSE) Arguments s - a numeric giving the number of segregating sites. n - a numeric giving the number of sequences. variance - a logical indicating whether the variance of the estimated THETA should be returned (TRUE), the default being FALSE.

Any ideas? I feel like solution is simple and I am missing it... do I just do theta.s(s, 3) ? Since there are 3 DNA sequences?

1
Entering edit mode
4.8 years ago
David W 4.7k
New Zealand
Do I just do theta.s(s, 3) ? Since there are 3 DNA sequences

Yup. From ?theta.s

s    a numeric giving the number of segregating sites.

n    a numeric giving the number of sequences.

variance    a logical indicating whether the variance of the estimated THETA should be returned (TRUE), the default being FALSE.

You can make it a function on a dna bin:

theta_watterson <- function(d) length(seg.sites(d)) / sum(1/1:(nrow(d)-1))

Note, both approaches will give you Ne * mutation rate per _gene_ , not per nucleotide.

0
Entering edit mode

Thanks a lot for the help! By the way, how do I get it per nucleotide, that was my next question. I need to compare 2 estimators and one of them spits out theta per nucleotide so it would make sense to normalize this one relative to length as well. This DNAbin format is driving me crazy :) I know it is a matter of just dividing by length of the alignment, but I cannot get the length into a variable. Also, I need to divide the variance by length as well, correct?

0
Entering edit mode

Just the normal -- , ncol (or dim). You should check how the other estimator deals with missing data (if you have any) if you wan to make a fair comparison.

0
Entering edit mode

I am looking at complete ORFs so there should not be any missing data. Oddly, the only way to get alignment length is length(nchar(as.character(a\$ECU01_1130))), which is pretty messy and inconvinient since one needs to supply at least one of the sequence names.

> a 3 DNA sequences in binary format stored in a list.

All sequences of same length: 1554

Labels: orf00001_C112950_F_131_1612 NBO_26g0012 ECU01_1130

`Base composition:
a c g t
0.337 0.172 0.216 0.275

length(nchar(as.character(a\$ECU01_1130)))
[1] 1554
ncol(a)
NULL
dim(a)
NULL
`

0
Entering edit mode

You are using read.FASTA to get the files in, which returns a list of characters . If you use read.dna( ... , format='"fasta") you should get a matrix that nrow/ncol/dim and the rest should work on.