Tool: rlsim, a package for simulating RNA-seq library preparation with parameter estimation
11.0 years ago
Botond Sipos ★ 1.7k

What is the rlsim package?

The rlsim package is a collection of tools for simulating RNA-seq library construction, aiming to reproduce the most important factors which are known to introduce significant biases in the currently used protocols: hexamer priming, PCR amplification and size selection.

It allows for a systematic exploration of the effects of the individual biasing factors and their interactions on downstream applications by simulating data under a variety of parameter sets.

The implicit simulation model implemented in the main tool (rlsim) is inspired by actual library preparation protocols. It is more general than the models used by the bias correction methods, and hence allows for a fair assessment of their performance.

Although the simulation model was kept as simple as possible in order to aid usability, it still has too many parameters to be inferred from data produced by standard RNA-seq experiments. However, simulating datasets with properties similar to specific datasets is often useful. To address this, the package provides a tool (effest) implementing simple approaches for estimating the parameters which can be recovered from standard RNA-seq data (GC-dependent amplification efficiencies, fragment size distribution, relative expression levels).

The latest release and the package source are available from the rlsim GitHub repository: https://github.com/sbotond/rlsim

Citing the rlsim package

The rlsim manuscript is now on arXiv, with the analysis pipeline at https://github.com/sbotond/paper-rlsim:

  • Botond Sipos, Greg Slodkowicz, Tim Massingham, Nick Goldman (2013) Realistic simulations reveal extensive sample-specificity of RNA-seq biases arXiv:1308.3172

Getting more help

Please consult the package documentation for more help on the tools and the technical background. Also feel free to ask questions on BioStars; I will monitor the rlsim tag.

simulation rna-seq pcr illumina • 6.1k views
11.0 years ago

Great stuff, we need tools like this so much!

I firmly believe that just about any bioinformatics training needs to start with teaching people how to generate data with certain properties.

Only by knowing what is inside a simulated dataset can one actually understand how well some other methodology performs.


Thanks! I hope that it will be useful for the RNA-seq community.

11.0 years ago

This is a really interesting tool. I have a couple of questions concerning PCR duplicates and strand-oriented libraries. In the case of PCR duplicates, can I simulate libraries that include different percentages of PCR duplicates? I normally observe high duplication rates and I would like to understand their main effects on gene expression estimates. Since real data comes with duplicates, can I infer the PCR duplicates (from SAM files) and use such values in rlsim? In the case of strand-oriented libraries, which parameter should I use? A last question concerns the number of fragments to simulate. My aim is to simulate a full HiSeq 2000 lane (single or paired-end reads). How big should the library be?

11.0 years ago
Botond Sipos ★ 1.7k

@Ernesto: That is quite a complex question, deserving multiple answers:

1) Regarding PCR duplicates:

First, let’s consider the major factors influencing the number of PCR duplicates in a simulated (or real) RNA-seq experiment:

  • The amount of starting material - or in other words the absolute transcript-wise expression levels. This cannot be estimated from standard RNA-seq data, so the effest tool will give you only rough estimates of the relative expression levels.
  • The PCR amplification efficiencies - effest will give you an estimate of this as a function of the fragment GC content, assuming an average efficiency of 0.87.
  • The number of sampled reads (which is usually fixed by the capacity of the sequencing machine).

So one can measure the frequency of PCR duplicates in a real dataset; however, this is not enough to parameterise a realistic simulation, as it does not inform us about the absolute expression levels or the average amplification efficiency. Hence the parameters must be tuned by trial and error.
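As a rough illustration of measuring the duplicate frequency in a real dataset, here is a minimal Python sketch that counts position-based duplicates in SAM text. It is not part of rlsim or effest; the function name `duplicate_fraction` is made up for this example. It assumes single-end reads and treats any two primary alignments with the same reference, leftmost position and strand as duplicates, so it cannot tell true PCR duplicates apart from independent fragments that happen to start at the same position (common for highly expressed transcripts).

```python
def duplicate_fraction(sam_lines):
    """Fraction of mapped primary alignments that repeat an already seen
    (reference, leftmost position, strand) combination.

    Hypothetical helper for illustration only; assumes single-end SAM text.
    """
    seen = set()
    total = 0
    duplicates = 0
    for line in sam_lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:            # SAM has 11 mandatory fields
            continue
        flag = int(fields[1])
        if flag & 0x4:                  # read is unmapped
            continue
        if flag & 0x900:                # secondary (0x100) or supplementary (0x800)
            continue
        # RNAME, POS (leftmost mapped base) and strand (0x10 = reverse)
        key = (fields[2], int(fields[3]), bool(flag & 0x10))
        total += 1
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / total if total else 0.0
```

Something like `duplicate_fraction(open("aln.sam"))` would then give a number to compare between the real and the simulated data, provided both are processed the same way.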

I recommend the following strategy to tune the frequency of PCR duplicates:

  • Use the effest tool in order to estimate the relative expression levels, insert size distribution, amplification efficiencies and number of reads. Please do not forget to specify the list of single isoform genes through the -i flag! The effest tool will produce a JSON file with the parameter estimates and a fasta file with the estimated relative expression levels - these can be used to parameterise the rlsim runs.
  • The absolute expression levels can be tuned through a multiplier specified by the -m rlsim flag. Run repeated simulations with increasing/decreasing multipliers until the simulated dataset has the desired frequency of PCR duplicates. Then you can further tune the number of PCR duplicates through the -m flag.
  • There are other parameters which can be used to influence the frequency of PCR duplicates, such as the assumed mean efficiency in effest (-m) and the fragment loss probability flag (-flg) in rlsim; however, I recommend the expression level multiplier (-m).

2) Regarding strandedness:

The effest tool has no special options for stranded data. The rlsim tool has a couple of options which come in handy when simulating stranded data:

  • Use the strand bias parameter (-b) to tune the strandedness of the simulated fragments (0 means all fragments are sampled from the forward direction).
  • Run the pb_plot tool on your stranded data. If sequence biases are strong only at the beginning of the fragment, then you can use the after_prim fragmentation method instead of the default after_prim_double.
  • Note that rlsim simulates fragments only and the actual reads are simulated by the simNGS tool. Hence you can save the simulated fragments in a fasta file and use it to simulate both paired-end and single-end reads from the same library by changing the -p simNGS flag.

3) Regarding the size of the library:

The answer is pretty much the same as for the PCR duplicates - repeated simulations are required to tune the absolute expression levels so you get a large enough library in order to simulate a dataset with the desired properties. Please note that if your library is not large enough then you might not be able to sample from the desired size distribution and you will end up with “missing fragments”, which usually comes with an increased magnitude of “size selection biases”.

I hope this helps and that you will be able to tune the parameters by simulating with a couple of different values of -m.


@Botond, thanks a lot for your explanation. Concerning PCR duplicates, you suggest using the -m parameter. At this stage I don't want to infer parameters with effest, so imagine having N transcripts with a given expected expression level. Do I understand correctly that running rlsim on this dataset gives me fragments without PCR duplicates? Then, if I want to add duplicates, I can use the -m parameter, which is a multiplier. However, the duplication rate should be equal for each transcript since we multiply the expression levels by the same number, right? Looking at the rlsim parameters, what is the meaning of the -c option? PCR duplicates may be created at this stage: if you perform many cycles, you should also expect many PCR duplicates. So I'm wondering about the effect of the -c parameter. Could you please clarify the expression "repeated simulations are required to tune the absolute expression levels"? Should I perform diverse simulations to get the expected results? Thank you very much in advance.


@Ernesto: I suggested using the -m flag for tuning the frequency of PCR duplicates because I assumed that you are trying to replicate the properties of a real dataset generated by a specific experiment (with a fixed number of cycles). But in a more general case you of course have more control:

  • The -c parameter is the number of PCR cycles. If we fix all other parameters and increase the number of cycles, then of course the frequency of duplicates will increase. However, the increase is not necessarily dramatic if the amount of starting material is very high.
  • Note that the -m parameter multiplies the expression levels, hence increasing it will decrease the frequency of PCR duplicates.
  • In your example case (transcripts with equal expression levels) you need to tune other parameters as well in order to get an equal rate of duplicates: transcripts have to have the same length, you have to use equal amplification efficiencies (-e 1.0) and you have to use a fragmentation method without priming simulation (e.g. after_noprim_double).
  • Also, the only way to guarantee that there are no PCR duplicates is to set the cycle number to zero (by default 11 cycles are simulated). This of course will also result in a much smaller library size.
  • And yes, my advice is that you run diverse simulations in order to tune the properties of your dataset. For example, you could search for suitable parameter ranges using a strategy similar to binary search.
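The binary-search style tuning mentioned above could be sketched as follows. This is only an outline under stated assumptions: `toy_duplicate_rate` is a hypothetical, closed-form stand-in for "run a simulation with this multiplier and measure the duplicate fraction", and in practice that step would be a full rlsim run followed by read simulation and duplicate counting. The search itself only assumes that the duplicate fraction decreases as the multiplier grows.

```python
import math


def toy_duplicate_rate(multiplier, base_molecules=1000, n_reads=5000):
    """Hypothetical stand-in for simulating with `multiplier` and measuring duplicates.

    Uses the expected number of distinct molecules seen when drawing
    n_reads uniformly (with replacement) from multiplier * base_molecules.
    """
    unique = multiplier * base_molecules
    expected_distinct = unique * -math.expm1(-n_reads / unique)
    return 1.0 - expected_distinct / n_reads


def tune_multiplier(measure, target, lo, hi, tol=0.005, max_iter=50):
    """Bisection on a log scale for a multiplier giving roughly `target` duplicates.

    Assumes measure(multiplier) decreases monotonically with the multiplier.
    """
    if not (measure(lo) >= target >= measure(hi)):
        raise ValueError("target is not bracketed by the lo/hi multipliers")
    mid, rate = lo, measure(lo)
    for _ in range(max_iter):
        mid = math.sqrt(lo * hi)        # geometric midpoint of the bracket
        rate = measure(mid)
        if abs(rate - target) <= tol:
            break
        if rate > target:
            lo = mid                    # too many duplicates: more starting material
        else:
            hi = mid                    # too few duplicates: less starting material
    return mid, rate


if __name__ == "__main__":
    multiplier, rate = tune_multiplier(toy_duplicate_rate, target=0.30,
                                       lo=0.01, hi=1000.0)
    print(f"multiplier ~ {multiplier:.3g}, duplicate fraction ~ {rate:.3f}")
```

Searching on a log scale is a design choice for this sketch, since a multiplier can plausibly span several orders of magnitude. Each evaluation of `measure` would correspond to one complete simulation, so a loose tolerance keeps the number of runs small.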

My general advice is that you read the full documentation (or at least the “Background” sections) before you run any simulations, in order to become familiar with the rlsim simulation architecture.

11.0 years ago
Allen Kao ▴ 10

An issue regarding "Segmentation fault while running effest" was posted at: Please help for effest "Segmentation fault (core dumped)" in the rlsim package.


Could you ask this as a separate question? This thread is getting a bit messy.
