NGS data simulation: VarSim or BAMSurgeon?
7.2 years ago
user230613 ▴ 360

Hi there,

I want to generate NGS data to test and benchmark both germline and somatic variant calling. I've read a lot of papers about the different tools and their benchmarks, but I would like your feedback. After reading the papers, I have narrowed it down to two tools: VarSim and BAMSurgeon.

  • BAMSurgeon takes pre-existing BAM files and adds new variants to them. It has been widely used in the DREAM challenge for testing variant calling algorithms, so I assume it works well. The advantage of starting from pre-existing BAM files is that you can use real data and then introduce new variants for benchmarking.
  • On the other hand, VarSim generates read files from a reference genome and a set of variants. All the data here is purely simulated (although the variants can be either random or previously described ones), and the advantage is that you can control different types of error, such as sequencing errors. Also, since it produces FASTQ files, it is possible to test a full alignment + variant calling workflow.

In the end, what I would like to have is a set of tumor/normal paired FASTQ files with a truth set (true.vcf), and then be able to adjust different parameters such as _clonality, heterogeneity, contamination, sequencing error..._
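To make the parameters above concrete: clonality and contamination mostly show up in the data as a shift in the expected variant allele fraction (VAF) of each somatic variant. The snippet below is a minimal sketch of the usual mixture formula, not something taken from VarSim or BAMSurgeon. The function name `expected_vaf` and all of its parameter names are illustrative assumptions.

```python
def expected_vaf(purity, ccf, mutant_copies=1, tumor_cn=2, normal_cn=2):
    """Expected allele fraction of a somatic variant in a bulk tumor sample.

    All names here are hypothetical, chosen for illustration only.
    purity        : fraction of cells in the sample that are tumor cells
                    (1 - purity is the normal contamination)
    ccf           : fraction of tumor cells carrying the variant (clonality)
    mutant_copies : copies of the mutant allele in each mutated tumor cell
    tumor_cn      : total copy number at the locus in tumor cells
    normal_cn     : total copy number at the locus in normal cells
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if not 0.0 <= ccf <= 1.0:
        raise ValueError("ccf must be between 0 and 1")
    # Mutant alleles contributed per cell, averaged over the whole sample.
    mutant = purity * ccf * mutant_copies
    # All alleles (mutant or not) per cell, averaged over the whole sample.
    total = purity * tumor_cn + (1.0 - purity) * normal_cn
    return mutant / total


# A clonal heterozygous variant in a pure diploid tumor.
clonal_pure = expected_vaf(purity=1.0, ccf=1.0)
# The same variant in a subclone (half of the tumor cells) at 60% purity.
subclonal_mixed = expected_vaf(purity=0.6, ccf=0.5)
```

Whatever simulator is used, these expected fractions are what the spiked-in or simulated variants would need to follow, and they give a quick sanity check on the resulting true.vcf.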

Sorry if the question is too open or broad. I'd like to receive suggestions and personal experiences about the best way to generate this kind of data. If it is specific to exome/targeted sequencing, even better.

Thank you in advance,

simulation varsim bamsurgeon • 6.4k views
7.2 years ago
d-cameron ★ 2.9k

For somatic SV simulation, I have yet to find a tool that can generate realistic data. The problem with simulating reads from the reference genome is that you present your variant caller with a much easier problem than actual data. Real data is much messier (especially in repetitive sequence), so by simulating reads from the reference you will overestimate your variant caller's performance.

BAMSurgeon probably comes the closest to realistic data since it uses existing sequencing data, but the types of SV events it can simulate are very limited, and it does not handle some important classes of cancer driver mutations such as inter-chromosomal gene fusions. Additionally, the alignment-based event insertion approach taken by BAMSurgeon is not appropriate for repetitive regions, because it assumes that the reads originating from the region where the event is to be simulated are correctly mapped to that region.

That said, I've used ART for SV simulation off hg19, but as you can see from my benchmarking results (http://shiny.wehi.edu.au/cameron.d/sv_benchmark/), ROC curves for the simulated variants are vastly better than the ROC curves for real data. The simulations are useful for determining best-case variant caller performance (e.g. the smallest event size detectable by SV caller X), but should not be taken as reflecting performance on actual data.

These issues may be less problematic for SNV and small indel variants.
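
For reference, the gap between simulated and real performance described above can be measured by scoring a call set against a truth set. The snippet below is only a minimal sketch with a greedy position match, not the method behind the linked benchmark. The function name `score_calls`, the `(chrom, pos)` tuple representation and the `tolerance` parameter are assumptions made for illustration.

```python
def score_calls(truth, calls, tolerance=0):
    """Greedy one-to-one matching of calls to truth variants.

    truth, calls : lists of (chrom, pos) tuples (hypothetical representation)
    tolerance    : maximum distance in bp for a call to count as a match;
                   0 suits SNVs, a larger value suits SV breakpoints
    """
    unmatched = list(truth)
    tp = 0
    for chrom, pos in calls:
        # First still-unmatched truth variant on the same contig within range.
        hit = next(
            (t for t in unmatched
             if t[0] == chrom and abs(t[1] - pos) <= tolerance),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)  # each truth variant can be matched only once
            tp += 1
    fp = len(calls) - tp
    fn = len(unmatched)
    precision = tp / len(calls) if calls else 0.0
    recall = tp / len(truth) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


truth = [("chr1", 100), ("chr1", 500), ("chr2", 50)]
calls = [("chr1", 102), ("chr2", 50), ("chr3", 10)]
result = score_calls(truth, calls, tolerance=5)
```

Running the same scoring on a simulated data set and on a real sample with a curated truth set shows how optimistic the simulation is for a given caller.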

Do you mean VarSim+Art when you say that you used Art?

Just ART from FASTA files. I created a script to generate the FASTA files since VarSim only supports simple ins/del/dup/inv SVs.

Entire classes of somatic mutations (gene fusion, chromoplexy/chromothripsis/breakage-fusion-bridge, double minutes, ...) were missing from the simulators the last time I checked. By far the biggest issue I had with somatic simulations was the lack of aneuploidy and inter-chromosomal rearrangements. The majority of the cancers I've analysed were most definitely not simple diploid genomes with some SNVs and simple local rearrangements thrown in. 50+ copies of an unmutated oncogene is not unexpected for cancers showing signs of chromothripsis/breakage-fusion-bridge.
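
The FASTA-generating script mentioned above is not shown in this thread. As a rough sketch of the general idea only, a rearranged contig can be assembled by concatenating reference segments, optionally reverse-complemented, and then passed to a read simulator. Every name below (`build_derivative`, `to_fasta`, the segment tuple layout) is a hypothetical choice for illustration, and coordinates are assumed to be 0-based and half-open.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_derivative(reference, segments):
    """Concatenate reference segments into one rearranged contig.

    reference : dict mapping contig name to sequence
    segments  : list of (contig, start, end, strand) tuples, with 0-based
                half-open coordinates and strand "+" or "-"
    """
    parts = []
    for contig, start, end, strand in segments:
        piece = reference[contig][start:end]
        if strand == "-":
            piece = reverse_complement(piece)
        elif strand != "+":
            raise ValueError("strand must be '+' or '-'")
        parts.append(piece)
    return "".join(parts)


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [">" + name]
    lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy reference; a real run would load hg19 contigs instead.
reference = {"chrA": "AAAACCCC", "chrB": "GGGGTTTT"}

# Inter-chromosomal fusion: start of chrA joined to the end of chrB.
fusion = build_derivative(reference, [("chrA", 0, 4, "+"), ("chrB", 4, 8, "+")])

# Inversion of chrA[2:6] within its own contig.
inversion = build_derivative(
    reference,
    [("chrA", 0, 2, "+"), ("chrA", 2, 6, "-"), ("chrA", 6, 8, "+")],
)

record = to_fasta("der_fusion", fusion, width=4)
```

Listing the same segment several times gives a crude amplification, and writing several derivative contigs to one FASTA file is one way to approximate an aneuploid genome before simulating reads from it.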

Is http://shiny.wehi.edu.au/cameron.d/sv_benchmark/ still available? I'm not able to see the results.

Unfortunately not. We do have a benchmarking paper with more comprehensive results coming out soon.

7.2 years ago
Joseph Hughes ★ 3.0k

Here is a recent paper that reviews different NGS read simulators. I think the decision tree figure is useful.

I had a related question and ended up using ART.
