Question

Is there good software for generating test genomics data?

0

5 months ago

Mark • 0

For example, if I input a reference genome FASTA, can I get simulated FASTQ files for ONT or PacBio sequencing runs that could have produced that data?

I'm trying to migrate from Snakemake to Nextflow, but from what I understand there is no option to perform dry runs in Nextflow, so having a small dataset becomes a necessity rather than a recommendation. I'm wondering if there are any tools to help generate such data.

genomics testing benchmarking • 697 views

updated 5 months ago by Brian Bushnell 20k • written 5 months ago by Mark • 0

2

I usually just subset an existing large dataset to a few reads, peaks, genes, or whatever is needed. You would need to add some details about what exactly you want to simulate. From what I hear, people generally use https://github.com/bcgsc/NanoSim for ONT data.

5 months ago by ATpoint 82k

0

If the purpose is just to see whether the pipeline runs from start to finish, why don't you just downsample the real dataset? By the way, the dry-run option of Snakemake is one of the features I like the most, since even creating and running toy datasets may be a lot of work if you just want to check that the input/output dependencies are correct.
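A minimal sketch of the simplest kind of downsampling, taking the first few reads of a FASTQ file. The file names and the tiny demo input are hypothetical and only there to make the snippet self-contained. It assumes each record spans exactly 4 lines (no wrapped sequences). Note that this takes the first reads rather than a random sample.

```shell
# Hypothetical demo input: 5 tiny reads, 4 lines per record.
printf '@r%s\nACGTACGT\n+\nIIIIIIII\n' 1 2 3 4 5 > full.fq

# Number of reads to keep (assumption: unwrapped 4-line FASTQ records).
n_reads=2

# Keep the first n_reads records, i.e. the first 4 * n_reads lines.
head -n "$((4 * n_reads))" full.fq > subset.fq
```

For a gzipped input, the same idea works by decompressing to a pipe first, e.g. `gzip -cd full.fq.gz | head -n 8 > subset.fq`.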

5 months ago by dariober 14k

score 2 · Answer 1 · 2023-11-22

Using BBTools:

#If you have a genome you can skip this step
randomgenome.sh gc=0.5 len=4m out=genome.fa

#Make synth reads
randomreads.sh ref=genome.fa out=reads.fq reads=10k pacbio gaussianlength minlength=500 maxlength=20000

You should also set the flags below as needed. The defaults generate simulated raw reads (high error rate), so for CCS reads pbmin and pbmax should be more like 0.0001 and 0.01.

pbmin=0.13      Minimum rate of PacBio errors for a read.
pbmax=0.17      Maximum rate of PacBio errors for a read.
minlength=150   Generate reads of at least this length.
maxlength=150   Generate reads of at most this length.
midlength=-1    Gaussian curve peaks at this point.  Must be between
                minlength and maxlength, in Gaussian mode.
readlengthsd=-1 Standard deviation of the Gaussian curve.  Note that the
                final curve is a sum of multiple curves, but this will affect
                overall curve width.  By default this is set to 1/4 of range.
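Putting the CCS suggestion together with the command above, a run might look like the sketch below. The function name `simulate_ccs_reads` and the file names are hypothetical. The flags are the ones already shown in this answer, with pbmin and pbmax lowered to the CCS-like values mentioned.

```shell
# Hypothetical wrapper: simulate CCS-like PacBio reads with BBTools.
# Assumes randomreads.sh is on PATH.
simulate_ccs_reads() {
    ref=$1   # reference FASTA, e.g. genome.fa
    out=$2   # output FASTQ, e.g. ccs_reads.fq
    randomreads.sh ref="$ref" out="$out" reads=10k pacbio gaussianlength \
        minlength=500 maxlength=20000 pbmin=0.0001 pbmax=0.01
}

# Example call (not executed here):
# simulate_ccs_reads genome.fa ccs_reads.fq
```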

If you want, you can also run "mutate.sh" on the genome to make a slightly different genome and generate reads from that instead, so that the reads contain variants.

score 0 · Answer 2 · 2023-11-24

Badread by Ryan Wick is a good read simulator for nanopore data: https://github.com/rrwick/Badread

Another approach is to align a real read set, then take all reads from the BAM that are aligned to a certain region (e.g. mitochondrion, part of chr1) and create a BAM from that. That's more useful if you need a certain coverage, e.g. for SNP calling.
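A rough sketch of the region-based approach with samtools, assuming you already have a coordinate-sorted BAM of real reads. The function name, file names and the region name `chrM` are hypothetical. The region must match a sequence name in your BAM header.

```shell
# Hypothetical helper: pull the reads aligned to one region out of a sorted BAM.
# Assumes samtools is on PATH and the BAM is coordinate-sorted.
extract_region_reads() {
    bam=$1      # e.g. aligned.sorted.bam
    region=$2   # e.g. chrM, or chr1:1-1000000
    prefix=$3   # output prefix, e.g. chrM_subset

    samtools index "$bam"                                # region queries need an index
    samtools view -b -o "$prefix.bam" "$bam" "$region"   # keep only reads overlapping the region
    samtools fastq "$prefix.bam" > "$prefix.fq"          # back to FASTQ for use as pipeline input
}

# Example call (not executed here):
# extract_region_reads aligned.sorted.bam chrM chrM_subset
```

The resulting FASTQ keeps the real error profile and roughly the original coverage over that region, which is the point of this approach.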