Applying the negative binomial distribution (NBD) to ribosome profiling data
8.7 years ago
xiangwulu ▴ 120

Hi,

I want to use the negative binomial distribution (NBD) in my ribo-seq data simulation so that the simulated data mimic real data.

The reason is that, for another part of my work, I want to compare the simulation against the analysis and results of real human ribo-seq data.

I have:

  • a number of RefSeq human transcripts (e.g. the NM_ ) as the source of simulation
  • read length distribution from 26bp-32bp (derived from real ribo-seq data)

Real ribo-seq data have the characteristic that the footprint on a transcript differs between the sub-codon positions and reflects the correct open reading frame (e.g. http://lapti.ucc.ie/bicoding/Known_frameshift/NM_001172437.png).

I thought the distribution would mainly reflect this.

But I am very confused about where to start, e.g. how to map the distribution model onto my case. I would appreciate any hints or advice on this, thanks.

negative-binomial-distribution ribo-seq • 2.4k views

The NBD is usually applied to the total read count found at a gene. You can define that gene however you like, but we shouldn't be talking about codons or read lengths or even the number of transcripts. I think you're conflating several issues. Which of these numbers did you mean to simulate?
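As a rough illustration of the gene-level use described above, here is a minimal sketch that draws one NBD count per gene using only the Python standard library. It samples the NBD as a gamma-Poisson mixture, so the variance is `mean + mean**2 / size`. The gene names, means, and the `size` (dispersion) value are made-up assumptions for illustration, not estimates from any data set.

```python
import math
import random

def sample_poisson(lam, rng):
    """Poisson draw via Knuth's multiplication method (suited to modest means)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def sample_nb(mean, size, rng):
    """Negative binomial draw as a gamma-Poisson mixture.

    The Poisson rate is drawn from Gamma(shape=size, scale=mean/size),
    which gives variance = mean + mean**2 / size.
    """
    lam = rng.gammavariate(size, mean / size)
    return sample_poisson(lam, rng)

rng = random.Random(0)
# Hypothetical expected read counts per gene (illustrative values only).
gene_means = {"geneA": 20.0, "geneB": 80.0, "geneC": 5.0}
size = 2.0  # assumed dispersion; smaller values mean more overdispersion
gene_counts = {gene: sample_nb(mean, size, rng) for gene, mean in gene_means.items()}
print(gene_counts)
```

In practice the mean and dispersion would be estimated from real counts rather than chosen by hand.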


@karl Hi, thanks for your reply. Maybe I didn't explain my problem clearly, sorry about that.

The read length and the number of transcripts are secondary; there is no need to apply the NBD to them.

I think the codons, or rather "the number of reads that fall in each reading frame", is the question I am thinking about.

If the reads are randomly sampled, then after aligning them to the reference, the read footprint could look like this:

https://www.dropbox.com/s/aia5tc5hzxbm21v/NM_01825.png?dl=0 (from the SAM file, the number of alignments at each position; the 3 colors mark the different reading frames: +1, +2, +3)

But ideally the reads do not just fall randomly across the transcript. They have high counts at some positions and low or zero counts at others, e.g.

http://lapti.ucc.ie/bicoding/Known_frameshift/NM_001172437.png

http://lapti.ucc.ie/bicoding/AT_AS/NM_000883.png

(from the SAM file, the number of alignments at each position; the 3 reading frames are shown in 3 separate plots)
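The per-position, per-frame counting described above could be sketched roughly as follows, parsing SAM text directly with the standard library. This assumes the reads were aligned to transcript sequences on the forward strand, and that the leftmost aligned position (SAM `POS`, 1-based) represents the read. The transcript name `NM_EXAMPLE`, the `cds_start` coordinate, and the `offset` parameter are hypothetical placeholders, and the sample records are invented for illustration.

```python
from collections import Counter

def frame_profiles(sam_lines, transcript, cds_start, offset=0):
    """Count alignments per position, split by reading frame.

    Returns three Counters (frame 0, 1, 2) mapping position -> read count.
    cds_start is the 1-based first nucleotide of the start codon (assumed known).
    offset shifts the read start, e.g. towards the P-site (assumption).
    """
    profiles = [Counter(), Counter(), Counter()]
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 4 or rname != transcript:  # skip unmapped / other transcripts
            continue
        site = pos + offset
        frame = (site - cds_start) % 3
        profiles[frame][site] += 1
    return profiles

# Invented example records, for illustration only.
example_sam = [
    "@SQ\tSN:NM_EXAMPLE\tLN:1000",
    "r1\t0\tNM_EXAMPLE\t101\t255\t28M\t*\t0\t0\t*\t*",
    "r2\t0\tNM_EXAMPLE\t101\t255\t30M\t*\t0\t0\t*\t*",
    "r3\t0\tNM_EXAMPLE\t102\t255\t29M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r5\t0\tNM_OTHER\t101\t255\t28M\t*\t0\t0\t*\t*",
]
profiles = frame_profiles(example_sam, "NM_EXAMPLE", cds_start=101)
for frame, counts in enumerate(profiles):
    print(frame, dict(counts))
```

Plotting the three Counters separately gives one track per reading frame, as in the linked figures.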

8.6 years ago
xiangwulu ▴ 120

Sorry for the confusion in my question; I was confused for a while too. Now I have figured it out.

Look at the plot: https://www.dropbox.com/s/wxrua0k52nbycm3/NM_005321.footprint.tiff?dl=0

It compares profiles from real human ribo-seq data and NBD-sampled variates. Typically, the footprint of real ribo-seq data (top plot) has zero counts at many positions, along with peaks and explicit (or implicit) triplet periodicity.

I want to run some tests with simulated ribo-seq data, and I want the profile of the simulated data to look like the real data (middle plot).

Not like this (data simulated with another RNA-seq simulator): https://www.dropbox.com/s/072ag1q9kwpcdqv/NM_005321.subcodon_simulated.tiff?dl=0

When I talked about ORFs and codons, I meant that the profiles of the 3 separate frames in an ORF differ depending on whether it is translated (top plot: red, green, blue). So in the simulation, the data should be simulated separately for each individual frame (bottom plot), to reflect the real data (ideally).
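One possible way to sketch the per-frame idea in code, using only the standard library: draw an NBD count for every nucleotide position in the ORF, with a separate mean for each of the 3 frames. The frame means and the `size` (dispersion) value below are invented for illustration and are not fitted to real data. A small `size` produces many zero positions and occasional tall peaks.

```python
import math
import random

def sample_nb(mean, size, rng):
    """Negative binomial draw as a gamma-Poisson mixture (mean must be > 0)."""
    lam = rng.gammavariate(size, mean / size)
    # Poisson draw via Knuth's multiplication method.
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def simulate_orf_profile(n_codons, frame_means, size, rng):
    """Return per-nucleotide counts; position i belongs to frame i % 3."""
    profile = []
    for _ in range(n_codons):
        for frame in range(3):
            profile.append(sample_nb(frame_means[frame], size, rng))
    return profile

rng = random.Random(1)
# Assumed values: frame 0 is the translated frame and gets most footprints.
frame_means = (3.0, 0.6, 0.4)
profile = simulate_orf_profile(200, frame_means, size=0.3, rng=rng)

frame_totals = [sum(profile[f::3]) for f in range(3)]
zero_fraction = profile.count(0) / len(profile)
print(frame_totals, round(zero_fraction, 2))
```

The three slices `profile[0::3]`, `profile[1::3]`, and `profile[2::3]` can then be plotted as separate tracks and compared with the real profile.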
