Question

Why Does Rna-Seq Read Count Fit Poisson Distribution?

48

Entering edit mode

10.5 years ago

sckinta ▴ 730

Although R packages like "edgeR" "DEseq" have been frequently used for RNA-seq expression study, I am still confused that why RNA-seq read count fits Poisson Distribution or more dispersed Negative Binomial Distribution.

As far as I know, Poisson distribution is used for the number of events in specified intervals such as time period, distance, area or volume. So I can understand number of mutated sites over a fragment of DNA fit Poisson. But how does count number of reads for a given gene across samples fit this model ?

By the way, I find it hard to fit certain data to an appropriate model. It is too abstract for me. It is particularly true when it comes to the biological data like sequences. Do you have any suggestion to help me overcome “disability of imagination on model" ? Any book or journal to recommend ?

Thank you.

rna-seq modeling • 40k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 10.5 years ago by sckinta ▴ 730

2

Entering edit mode

http://arxiv.org/abs/1104.3889 this paper has the answer in great detail, can get a bit technical but in my opinion is written very well.

ADD REPLY • link 10.5 years ago by dfernan ▴ 760

score 54 · Answer 1 · 2013-10-24

54

Entering edit mode

10.5 years ago

Chris Miller 22k

Let's forget the RNAseq complication for a minute and just picture a process whereby you take the genome and choose a location at random to produce a read. This is a Poisson process. If you plot the depth of sequence along this theoretical genome, it will be a poisson distribution. [1]

The same is true for RNAseq, only instead of a genome, you're choosing a read from the transcriptome. This is made a bit more complex by the fact that some transcripts are present at higher abundance than others, but at it's core, it's still a poisson process. In fact, you can think of a negative binomial distribution as a mixture of poisson distributions with different means, which is a reasonable model for sampling from RNA transcripts with different abundances.

[1] Data from real genomes aren't exactly poisson, due to biases related to the composition of the genome, chemistry of sequencing, assembly errors in the reference genome, and inability to map to repetitive regions.

ADD COMMENT • link 10.5 years ago by Chris Miller 22k

0

Entering edit mode

That is really helpful. Thank you, Chris :P

ADD REPLY • link 10.5 years ago by sckinta ▴ 730

0

Entering edit mode

Awesome answer. I hope you are a teacher or mentor to someone out there :)

ADD REPLY • link 9.2 years ago by gaelgarcia ▴ 260

0

Entering edit mode

Hi Chris, I want to ask a following question relating to the way DESeq and EdgeR model counts using Negative Binomial.

The way you answer above, to my understanding, only applicable when we are trying to model gene counts of many gene in the same model, therefore that NB model will be a mixture of Poisson(gene A) and Poisson(gene B).

However, the way NB is used in the DESeq paper is as a mixture of Poisson(gene A in condition a) and Poisson(gene A also in condition a yet slightly different because there is never a perfect replicate). The reason the replicates for a condition could not be modeled by Poisson is because there will never be perfectly identical replicates in biology, therefore a process, even though inherently Poisson, have to be modeled by NB to account for the overdispersion arising due to the slight difference that will always be present between technical replicates.

Then, to account for the difference between biological replicates, we allow the NB model to change by varying the mean count but keep the overdispersion constant.

That is how I currently understand DESeq and edgeR, could you comment on it?

ADD REPLY • link 3.0 years ago by Vu Nguyen • 0

score 29 · Answer 2 · 2013-10-24

29

Entering edit mode

10.5 years ago

Michele Busby ★ 2.2k

This is a good question and one you really need to understand to interpret your data.

The way I think about it is this:

If you are sampling something by a count, say the selection of blue marbles out of a bag of different colored marbles, the count for any given color of marble will fluctuate around the true value.

For example, if the true proportion of marbles that are blue in the bag is 4% and you select 100 marbles sometimes you will get 4, sometimes you will get 5 or 3. If you do that an infinite number of times and the bag is big enough the distribution you will get is a Poisson distribution. (This is the Poisson approximation to the binomial).

We use a Poisson distribution to describe RNA Seq data because we are selecting species of RNA out of a pool of RNA molecules. So instead of a blue and red marbles you have SUMO2 and BRCA1 RNA.

A poisson distribution is a one parameter distribution. The mean of the distribution is always equal to the variance. It can only describe the counting noise and nothing else.

However, counting noise won't be the only source of variance in your data. You will also have biological noise and technical artifact. Some genes may fluctuate more or less because of their inherent nature (e.g. heat shock genes have high variance in biological replicates because they bounce around if someone puts the heat up). If you measure the expression of these genes (and really all genes') in multiple replicates the variance of those measurements will be higher than the mean. So you don't want to use a Poisson distribution to describe that.

So what to do?

A negative binomial distribution can be thought of in lots of different ways, but one thing that happens is that if you sample from a gamma distribution using counting you get a negative binomial distribution. A gamma distribution kind of in the middle of a lognormal and a normal distribution.

So using a negative binomial distribution assumes that the noise from biological and technical variance is roughly described by a gamma distribution but then also accounts for the sampling noise.

I wrote this thing here that explains variance a little more and how to think about it. Maybe you'll find it useful.

http://gkno2.tumblr.com/post/24629975632/thinking-about-rna-seq-experimental-design-for

ADD COMMENT • link 10.5 years ago by Michele Busby ★ 2.2k

0

Entering edit mode

I think the marble metaphor should follow the hypergeometric distribution rather than Poisson distribution, isn't it? Maybe that is the reason you explained as " This is the Poisson approximation to the binomial". Chris above gave a more acceptable example "you take the genome and choose a location at random to produce a read", that meets the Poisson definition "number of events in specified intervals such as time period, distance, area or volume".

The variance analysis in the link is very helpful. Thank you very much. I should keep in mind that "Poisson variance is the inherent uncertainty that is present in any measurement made where something is sampled and counted".

One more thing to ask, you mentioned "sample from a gamma distribution using counting", what follows a gamma distribution ? I thought in NB distribution, the mean/variance of Poisson model follows the gamma distribution. Here you said "biological and technical variance is roughly described by a gamma distribution ". There are intersection among Poission Variance, biological variance and technical variance ??

ADD REPLY • link 10.5 years ago by sckinta ▴ 730

3

Entering edit mode

The sampling from the marble bag is hypogeometric if you don’t have replacement but is approximately binomial if the number of blue marbles is small relative to the size of the bag. The counts approximate a Poisson distribution but the Poisson approximation to the binomial (see Mathematical Statisicts by Miller and Miller, p 186). So I guess that should say Poisson approximation to the binomial approximation to the hypogeometric. For RNA Seq the number of reads you sequence relative to the number of RNA molecules in the sample is usually miniscule and the approximations are pretty robust when you are getting into sample sizes of millions. You can try it by throwing random numbers in R or Matlab and seeing how much you have to sample before they break. Empirically, the Poisson model fits very well with RNA Seq data and in very good complex replicates on the same sequencing run the variance you see in scatter plots almost follows the expectation of a Poisson sampling (i.e. there isn’t much extra technical variance and the main problem is sampling. These assumptions, however, do break down wildly in over-sequenced libraries because the PCR amplification does strange things and then the Poisson model becomes a pretty crappy model. So you have to look at your data. Plotting the Poisson on top of the technical replicates is the easiest way to see if something is amiss. I was over generalizing when I said counts ALWAYS follow Poisson by they usually follow Poisson so if you are building a model that is usually a good place to start. As for the gamma, I think what they are modeling there is the expression of a single gene in biological replicates. This distribution (the true expression on a continuous scale of measurement) is traditionally considered to be lognormal (from microarray and qPCR data) though this is difficult to confirm empirically. Theoretically, a mixture of a lognormal and a Poisson is a Poisson lognormal, but using that require integration which makes things run slow without really improving performance over approximating the lognormal with the gamma and using the negative binomial distribution.
Anyway, when I was working on this stuff I just assumed a normal distribution and in simulated data it worked pretty much as well as the negative binomial. In RNA Seq there are some biological constraints where you don’t really get genes expressed with really long tails. There is a wide dynamic range over all the genes, but most genes live in a pretty small portion of that dynamic range. Non-parametric statistics are the most theoretically correct way to go anyway.

ADD REPLY • link 10.5 years ago by Michele Busby ★ 2.2k

Ram · Answer 3 · 2014-06-21

16

Entering edit mode

9.8 years ago

Ann ★ 2.4k

Hi all,

This may help:

Videos by Rafael Irizarry from his EdX class on data analysis for genomics:

and

The first one explains why RNA-seq is like Poisson, using an example from playing the lottery. The second follows up on that idea.

-Ann

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by Ann ★ 2.4k

0

Entering edit mode

These videos really helped. Thanks Ann!

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by jollier.liu • 0

0

Entering edit mode

Rafa Izarry <3

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.2 years ago by gaelgarcia ▴ 260

0

Entering edit mode

Which edx course is that?

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 8.7 years ago by scchess ▴ 640

0

Entering edit mode

Are these videos still available?

ADD REPLY • link 2.5 years ago by xiaoleiusc ▴ 140

Ram · Answer 4 · 2013-10-24

11

Entering edit mode

10.5 years ago

Jeremy Leipzig 22k

I think revisiting Mark Robinson's answer from What Makes One Probability Distribution Better For Rna-Seq Than Another? is fitting here (no pun intended). He implies that negative binominal fits better because variability between biological replicates is more than Poisson can account for.

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 10.5 years ago by Jeremy Leipzig 22k

0

Entering edit mode

+1 for negative "binominal" :

ADD REPLY • link 10.5 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Yes, I read that post before I posted this question. I could understand why NB model fits better than Poisson. I am just confused by models---How to choose a model to fit our data. I think it is more general than that post.

ADD REPLY • link 10.5 years ago by sckinta ▴ 730

0

Entering edit mode

It's a dark art.

ADD REPLY • link 10.5 years ago by Michele Busby ★ 2.2k

score 1 · Answer 5 · 2019-05-19

if you think a read as a point, then in theory, for a region L, the probability for a read fall into region L is: L/G, here G is the genome size. if the depth is evenly distributed, which says that the the genomes of a batch of cells were randomly broken and the sequencing process was uniform, then the expected reads num fall in this region L is: L/G * N, N is the total reads num. based on this assumption, then you can model the distribution of Read count, RC, as a Possion probability distribution. hope this helps.