Question

Tool:Rlsim, A Package For Simulating Rna-Seq Library Preparation With Parameter Estimation

14

11.0 years ago

Botond Sipos ★ 1.7k

What is the rlsim package?

The rlsim package is a collection of tools for simulating RNA-seq library construction, aiming to reproduce the most important factors which are known to introduce significant biases in the currently used protocols: hexamer priming, PCR amplification and size selection.

It allows for a systematic exploration of the effects of the individual biasing factors and their interactions on downstream applications by simulating data under a variety of parameter sets.

The implicit simulation model implemented in the main tool (rlsim) is inspired by actual library preparation protocols. It is more general than the models used by the bias correction methods, and hence allows for a fair assessment of their performance.

Although the simulation model was kept as simple as possible in order to aid usability, it still has too many parameters to be inferred from data produced by standard RNA-seq experiments. However, simulating datasets with properties similar to specific datasets is often useful. To address this, the package provides a tool (effest) implementing simple approaches for estimating the parameters which can be recovered from standard RNA-seq data (GC-dependent amplification efficiencies, fragment size distribution, relative expression levels).

The latest release and the package source are available from the rlsim GitHub repository: https://github.com/sbotond/rlsim

Citing the rlsim package

The rlsim manuscript is now on arXiv, with the analysis pipeline at https://github.com/sbotond/paper-rlsim:

Botond Sipos, Greg Slodkowicz, Tim Massingham, Nick Goldman (2013). Realistic simulations reveal extensive sample-specificity of RNA-seq biases. arXiv:1308.3172

Getting more help

Please consult the package documentation for more help on the tools and the technical background. Also feel free to ask questions on BioStars; I will monitor the rlsim tag.

simulation rna-seq pcr illumina • 6.1k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 11.0 years ago by Botond Sipos ★ 1.7k

score 2 · Answer 1 · 2013-04-08

2

11.0 years ago

Istvan Albert 100k

Great stuff, we need tools like this so much!

I firmly believe that just about any bioinformatics training needs to start with teaching people how to generate data with certain properties.

Only by knowing exactly what is inside a simulated dataset can one understand how well some other methodology performs.

ADD COMMENT • link 11.0 years ago by Istvan Albert 100k

0

Thanks! I hope that it will be useful for the RNA-seq community.

ADD REPLY • link 11.0 years ago by Botond Sipos ★ 1.7k

score 1 · Answer 2 · 2013-04-16

This is a really interesting tool. I have a couple of questions concerning PCR duplicates and strand-oriented libraries. Regarding PCR duplicates, can I simulate libraries with different percentages of PCR duplicates? I normally observe high duplication rates and I would like to understand their main effects on gene expression estimates. Since real data comes with duplicates, can I infer the PCR duplicate rate (from SAM files) and use that value in rlsim? Regarding strand-oriented libraries, which parameter should I use? A last question concerns the number of fragments to simulate. My aim is to simulate a full HiSeq 2000 lane (single or paired-end reads). How big should the library be?

score 1 · Answer 3 · 2013-04-17

@Ernesto: That is quite a complex question, deserving multiple answers:

1) Regarding PCR duplicates:

First, let’s consider the major factors influencing the number of PCR duplicates in a simulated (or real) RNA-seq experiment:

The amount of starting material - or in other words the absolute transcript-wise expression levels. This cannot be estimated from standard RNA-seq data, so the effest tool will give you only rough estimates of the relative expression levels.
The PCR amplification efficiencies - effest will give you an estimate of these as a function of the fragment GC content, assuming an average efficiency of 0.87.
The number of sampled reads (which is usually fixed by the capacity of the sequencing machine).
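To illustrate why the amplification efficiency matters so much, here is a minimal Python sketch. It assumes a simple textbook PCR model in which every copy is duplicated with probability equal to the efficiency in each cycle, so the expected number of copies after c cycles is (1 + efficiency)^c. This is not necessarily the exact model implemented in rlsim, and the cycle count and the efficiencies other than 0.87 are made-up values for illustration.

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a fragment (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def expected_copies(efficiency: float, cycles: int) -> float:
    """Expected copies of one molecule after `cycles` PCR cycles, assuming
    each copy is duplicated with probability `efficiency` in every cycle."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be in [0, 1]")
    return (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    cycles = 12  # hypothetical cycle count, not taken from any protocol
    # 0.87 is the average assumed by effest; 0.80 and 0.94 are illustrative.
    for eff in (0.80, 0.87, 0.94):
        copies = expected_copies(eff, cycles)
        print(f"efficiency={eff:.2f}  expected copies={copies:.0f}")
```

Under this model, small GC-dependent differences in efficiency compound over the cycles, which is why they can noticeably change how often fragments of a given GC content are sampled.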

So one can measure the frequency of PCR duplicates in a real dataset; however, this is not enough to parameterise a realistic simulation, as it does not inform us about the absolute expression levels or the average amplification efficiency. Hence the parameters must be tuned by trial and error.
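Measuring the duplicate frequency itself is straightforward. The sketch below is a rough, assumption-laden way to do it from SAM text: it treats mapped primary alignments that share the same reference name, leftmost position and strand as duplicates. Dedicated duplicate-marking tools use more careful criteria (mate positions, clipping), and in RNA-seq highly expressed transcripts also produce identical fragments that are not PCR duplicates, so treat the result as an upper bound. The function name is hypothetical and not part of rlsim.

```python
from collections import Counter
from typing import Iterable


def duplicate_fraction(sam_lines: Iterable[str]) -> float:
    """Fraction of mapped primary alignments that repeat an already seen
    (reference, leftmost position, strand) combination."""
    counts: Counter = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        strand = "-" if flag & 0x10 else "+"
        counts[(fields[2], int(fields[3]), strand)] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every alignment beyond the first at a given key counts as a duplicate.
    return (total - len(counts)) / total


if __name__ == "__main__":
    tail = "\t60\t4M\t*\t0\t0\tACGT\tIIII"
    demo = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100" + tail,
        "r2\t0\tchr1\t100" + tail,   # same key as r1: one duplicate
        "r3\t16\tchr1\t100" + tail,  # reverse strand: different key
        "r4\t0\tchr1\t200" + tail,
    ]
    print(duplicate_fraction(demo))
```

The same function can be applied to reads simulated from rlsim fragments after alignment, which gives comparable numbers for the real and the simulated dataset.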

I recommend the following strategy to tune the frequency of PCR duplicates:

Use the effest tool in order to estimate the relative expression levels, insert size distribution, amplification efficiencies and number of reads. Please do not forget to specify the list of single isoform genes through the -i flag! The effest tool will produce a JSON file with the parameter estimates and a fasta file with the estimated relative expression levels - these can be used to parameterise the rlsim runs.
The absolute expression levels can be tuned through a multiplier specified by the -m rlsim flag. Run repeated simulations with increasing/decreasing multipliers until the simulated dataset has the desired frequency of PCR duplicates. Then you can further tune the number of PCR duplicates through the -m flag.
There are other parameters which can be used to influence the frequency of PCR duplicates, such as the assumed mean efficiency in effest (-m, not to be confused with the rlsim flag of the same name) and the fragment loss probability flag (-flg) in rlsim. However, I recommend using the rlsim expression level multiplier (-m).
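The trial-and-error loop over multipliers can be automated. The following Python sketch shows one possible way, using a bisection search on a log scale. Everything in it is an assumption for illustration: `tune_multiplier` is a hypothetical helper, and `toy_dup_rate` is a toy stand-in for a full run (rlsim with a given -m, read simulation, alignment, duplicate counting), whose command lines are deliberately not shown here. The search only assumes that the duplicate rate changes monotonically with the multiplier between the two starting values.

```python
import math
from typing import Callable


def tune_multiplier(dup_rate: Callable[[float], float], target: float,
                    lo: float, hi: float, tol: float = 0.005,
                    max_runs: int = 40) -> float:
    """Search [lo, hi] (both > 0) for a multiplier whose duplicate rate
    is within `tol` of `target`; `dup_rate` runs one simulation."""
    rate_lo, rate_hi = dup_rate(lo), dup_rate(hi)
    if not min(rate_lo, rate_hi) <= target <= max(rate_lo, rate_hi):
        raise ValueError("target is not bracketed by the rates at lo and hi")
    increasing = rate_hi > rate_lo  # direction of the response
    mid = math.sqrt(lo * hi)
    for _ in range(max_runs):
        # Geometric midpoint, since multipliers may span orders of magnitude.
        mid = math.sqrt(lo * hi)
        rate = dup_rate(mid)
        if abs(rate - target) <= tol:
            break
        if (rate < target) == increasing:
            lo = mid  # a larger multiplier moves the rate towards the target
        else:
            hi = mid
    return mid


def toy_dup_rate(multiplier: float, base_fragments: float = 1_000_000,
                 reads: float = 2_000_000) -> float:
    """Toy stand-in for a simulation: expected duplicate fraction when
    `reads` reads are drawn uniformly from base_fragments * multiplier
    distinct fragments (Poisson approximation). Not the rlsim model."""
    unique = base_fragments * multiplier
    expected_distinct = unique * (1.0 - math.exp(-reads / unique))
    return 1.0 - expected_distinct / reads


if __name__ == "__main__":
    m = tune_multiplier(toy_dup_rate, target=0.30, lo=0.1, hi=100.0)
    print(f"multiplier={m:.3f}  duplicate rate={toy_dup_rate(m):.3f}")
```

In practice each call of `dup_rate` is an expensive simulation, so a coarse tolerance and a handful of runs are usually a sensible starting point.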

2) Regarding strandedness:

The effest tool has no special options for stranded data. The rlsim tool has a couple of options which come in handy when simulating stranded data:

Use the strand bias parameter (-b) to tune the strandedness of the simulated fragments (0 means all fragments are sampled from the forward direction).
Run the pb_plot tool on your stranded data. If sequence biases are strong only at the beginning of the fragment, then you can use the after_prim fragmentation method instead of the default after_prim_double.
Note that rlsim simulates fragments only and the actual reads are simulated by the simNGS tool. Hence you can save the simulated fragments in a fasta file and use it to simulate both paired-end and single-end reads from the same library by changing the -p simNGS flag.
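As a rough illustration of what a strand bias parameter does, here is a small Python sketch. It assumes that the bias is the probability of sampling a fragment from the reverse strand, which is consistent with 0 meaning "all fragments from the forward direction" but is otherwise an assumption about the semantics of -b, not a description of the rlsim implementation. The function names are hypothetical.

```python
import random

# Translation table for complementing DNA bases (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_fragment(seq: str, strand_bias: float, rng: random.Random) -> str:
    """Return the fragment as-is (forward) or reverse-complemented.
    ASSUMPTION: strand_bias is the probability of the reverse strand."""
    if rng.random() < strand_bias:
        return reverse_complement(seq)
    return seq


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for reproducibility
    frags = [orient_fragment("AACG", 0.5, rng) for _ in range(10_000)]
    reverse_share = sum(f == "CGTT" for f in frags) / len(frags)
    print(f"share of reverse-strand fragments: {reverse_share:.3f}")
```

With a bias of 0 every fragment keeps the forward orientation, which corresponds to a fully stranded library under this assumption.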

3) Regarding the size of the library:

The answer is pretty much the same as for the PCR duplicates - repeated simulations are required to tune the absolute expression levels so you get a large enough library in order to simulate a dataset with the desired properties. Please note that if your library is not large enough then you might not be able to sample from the desired size distribution and you will end up with “missing fragments”, which usually comes with an increased magnitude of “size selection biases”.

I hope this helps and that you will be able to tune the parameters by simulating with a couple of different values of -m.

score 0 · Answer 4 · 2013-04-23

0

11.0 years ago

Allen Kao ▴ 10

An issue regarding a segmentation fault while running effest was posted at: Please help for effest "Segmentation fault (core dumped)" in the rlsim pacages.

ADD COMMENT • link 11.0 years ago by Allen Kao ▴ 10

0

Could you ask this as a separate question? This thread is getting a bit messy.

ADD REPLY • link 11.0 years ago by Botond Sipos ★ 1.7k

0

I see, I have posted it here: Please help for effest "Segmentation fault (core dumped)" in the rlsim pacages

ADD REPLY • link 11.0 years ago by Allen Kao ▴ 10