14 months ago
N.B., while I was writing this, you replied with a link to a paper doing something completely different, so this isn't likely relevant.
As https://www.biostars.org/u/191/ said, context is important.
In these studies, the goal is prenatal diagnosis of an autosomal recessive disorder. In short, a certain amount of fetal DNA is commonly found in the circulatory system of the mother (the amount changes over the duration of the pregnancy). Since the mother, in this case, is a carrier for the disorder, we need to somehow estimate whether a possible over-abundance of the mutant allele observed in her blood is significant, since such an over-abundance would come from a homozygous mutant child.
So, firstly, these are not RPKMs, but raw counts (at least in the ones I've seen, namely here).
r, then, is the ratio of fetal to maternal DNA that we get from the blood draw. We measure some total number of reads informatively covering a site, Nt, comprised of mutant (Nm) and wild-type (Nw) reads. If the fetus is homozygous mutant, then we'll have Nm-Nw=r*Nt, since the entire fetal contribution will be in Nm. If, on the other hand, the fetus is heterozygous, then reads originating from it will be (more or less) evenly split between the two counts, and Nm-Nw=0. We can measure Nm-Nw, but in order to get a statistic, we need to turn that into a Z-score by dividing by its standard deviation.
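To make the two cases concrete, here is a minimal sketch of the expected counts, assuming r is the fraction of the Nt reads that come from the fetus and that the carrier mother's own reads split evenly between the two alleles. The function name and the example numbers are mine, not from the papers.

```python
def expected_counts(nt, r, fetus_homozygous):
    """Expected (Nm, Nw) at a site covered by nt informative reads.

    Assumes r is the fetal share of the reads and the mother is a
    heterozygous carrier (her reads split 50/50).
    """
    if fetus_homozygous:
        fetal = r * nt            # all fetal reads carry the mutant allele
        maternal = nt - fetal     # maternal reads split evenly
        return maternal / 2 + fetal, maternal / 2
    # Heterozygous fetus: every source splits evenly, so no excess.
    return nt / 2, nt / 2


nm, nw = expected_counts(10000, 0.1, fetus_homozygous=True)
print(nm - nw)  # approximately r * Nt
```

With a heterozygous fetus the same call returns equal counts, so the expected difference is 0.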
So how, then, might we arrive at a standard deviation? Recall that we have raw counts (not RPKMs), which means that the technical variance is the same as the mean (we have Poisson variance). The standard deviation is the square root of the variance, so:
sd(Nm) = sqrt(E(Nm)) = sqrt(0.5*Nt + 0.5*r*Nt) in the homozygous case and sqrt(0.5*Nt) in the heterozygous case (E(Nm) is the expected value, and I'll leave demonstrating that (Nt + r*Nt)/2 is the expected value as an exercise for you).
sd(Nw) = sqrt(E(Nw)) = sqrt(0.5*Nt - 0.5*r*Nt) in the homozygous case and also sqrt(0.5*Nt) in the heterozygous case.
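As a rough illustration of those standard deviations, the sketch below takes the fetal excess to be r*Nt (consistent with Nm-Nw=r*Nt above) and uses the Poisson assumption that the variance equals the mean. The function name is hypothetical.

```python
import math


def poisson_sds(nt, r, fetus_homozygous):
    """Return (sd(Nm), sd(Nw)) under Poisson variance (variance == mean)."""
    # Half of the expected difference r*Nt is added to Nm and removed from Nw.
    shift = 0.5 * r * nt if fetus_homozygous else 0.0
    e_nm = 0.5 * nt + shift
    e_nw = 0.5 * nt - shift
    return math.sqrt(e_nm), math.sqrt(e_nw)


sd_nm, sd_nw = poisson_sds(10000, 0.1, fetus_homozygous=True)
print(sd_nm, sd_nw)
```

In the heterozygous case the shift is zero, so both standard deviations are sqrt(0.5*Nt).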
We then want the pooled deviation of the difference, which we can get by simply taking the square root of the pooled variance. The pooled variance ends up just being Nt, since you just take the sum of the variances (i.e., the sum of the squared standard deviations). Thus, the denominator is sqrt(Nt).
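Putting the pieces together, the statistic is just (Nm-Nw)/sqrt(Nt). A minimal sketch, assuming Nt = Nm + Nw (only informative reads are counted) and using a function name of my own choosing:

```python
import math


def z_score(nm, nw):
    """Z-score for an excess of mutant reads: (Nm - Nw) / sqrt(Nt)."""
    nt = nm + nw  # total informative reads at the site
    if nt <= 0:
        raise ValueError("need at least one informative read")
    return (nm - nw) / math.sqrt(nt)


print(z_score(5500, 4500))  # 10.0
print(z_score(5000, 5000))  # 0.0
```

The counts in the example are made up for illustration. A value near 0 is what you would expect from a heterozygous fetus, while a large positive value points toward the homozygous mutant case.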
Update: As mentioned at the top, it seems that you're talking about different papers (thanks for the links) where they actually do use RPKMs. The logic is generally the same as that presented above, and it should be noted that the denominator should be sqrt(RPKM_A + r*RPKM_B), since that's vaguely similar to sqrt(Nt). This is, however, a terrible way of doing things and should never be done. They were doing pair-wise comparisons, but one should really take a couple of different dishes of ES cells at different passages and then use them as replicates to actually gauge what the variance is.
Update2: By "Poisson distribution", they mean that the RPKM measurements themselves have Poisson variance. This is simply because the technical variance of the original counts is Poisson and an RPKM is just a transformation of the original count.