Question

randomreads.sh simulated mates are far away

0

Entering edit mode

5.8 years ago

grant.hovhannisyan ★ 2.6k

I am simulating test reads with randomreads.sh from BBMap package.

My command is

/Software/bbmap/randomreads.sh ref=C_glabrata_CBS138_current_chromosomes.fasta out1=read1.fastq out2=read2.fastq build=1 length=10 reads=2 coverage=-1 replacenoref=t simplenames=t seed=-1 paired=t metagenome=t addpairnum=t

The output is:

for read1

@0_+106297_106306_ChrM_C_glabrata_CBS138 (1402899 nucleotides) 1:
AGGTTTTAAT
+
989945?446
@1-_438159_438168_ChrJ_C_glabrata_CBS138 (1195129 nucleotides) 1:
ATATCTTCCT
+
AA?CB?@bcb`

for read2 :

@0_-761178_761187_ChrM_C_glabrata_CBS138 (1402899 nucleotides) 2:
AGCAGAGAGA
+
?>64844:64
@1+_646646_646655_ChrJ_C_glabrata_CBS138 (1195129 nucleotides) 2:
TGCCAGTTTC
+
??B@A?><@>

As far as I understand the reads @0 from read1 and @0 from read2 correspond to the read pair. However based on their coordinates they are super far away form each other. Is this a normal behaviour of the software?

Thanks

P.S. Hope Brian Bushnell will see this post :)

bbmap • 1.2k views

ADD COMMENT • link updated 5.8 years ago by h.mon 35k • written 5.8 years ago by grant.hovhannisyan ★ 2.6k

0

Entering edit mode

Is there a reason you have metagenome=t set? 10 bp reads don't make much sense either.

ADD REPLY • link 5.8 years ago by GenoMax 142k

0

Entering edit mode

in this test case - no, there is no reason. But actually for my pipeline I will need that option to simulate RNASeq reads distribution. By the way, in case of large transcriptomes (like for human), that option extremely slows down the simulation speed. I guess it worth for a separate question.

ADD REPLY • link 5.8 years ago by grant.hovhannisyan ★ 2.6k

1

Entering edit mode

Trying to simulate a metagenome (while providing a single reference) may be causing the strange behavior. Can you remove that option and see if this look more reasonable?

@Brian had these notes on metagenome simulations:

metagenome will assign each scaffold a random exponential variable, which decides the probability that a read be generated from that scaffold. So, if you concatenate together 20 bacterial genomes, you can run randomreads and get a metagenomic-like distribution. It could also be used for RNA-seq when using a transcriptome reference.

The coverage is decided on a per-reference-sequence level, so if a bacterial assembly has more than one contig, you may want to glue them together first with fuse.sh before concatenating them with the other references.

ADD REPLY • link 5.8 years ago by GenoMax 142k

0

Entering edit mode

Thanks @genomax. It seems that mininsert=0 solved the issue. From your comment, it also makes sense for me why it takes so long to simulate data for human transcriptome with metagenome=t. Do you have experience with large transcriptomes and randomreads.sh? For me it takes ca 15 min to simulate 100 reads.

ADD REPLY • link 5.8 years ago by grant.hovhannisyan ★ 2.6k

0

Entering edit mode

Did you try to simulate longer reads, like 100bp or 250bp?

ADD REPLY • link 5.8 years ago by h.mon 35k

0

Entering edit mode

For the question of my original post, so yes - with longer reads there are no problems, because apparently you always get a positive values for insert sizes. But for speed of simulating large transcriptomes, so I am trying now with longer reads and larger number of reads.

ADD REPLY • link 5.8 years ago by grant.hovhannisyan ★ 2.6k

score 1 · Answer 1 · 2018-07-16

1

Entering edit mode

5.8 years ago

h.mon 35k

Indeed the paired reads seem really far apart. randomreads.sh help does not explain how read length controls default insert size, but you can control insert size with the parameters:

mininsert=          Controls minimum insert length.  Default depends on read length.
maxinsert=          Controls maximum insert length.  Default depends on read length.
triangle=t          Make a triangular insert size distribution.
flat=f              Make a roughly flat insert size distribution..
superflat=f         Make a perfectly flat insert size distribution.
gaussian=f          Make a bell-shaped insert size distribution, with 
                    standard deviation of (maxinsert-mininsert)/6.

ADD COMMENT • link 5.8 years ago by h.mon 35k

0

Entering edit mode

It seems I have found a reason. When you specify a very short read length (for example 10), the software prints to stdout this line insert size=-80-120. I think this negative insert size messes up everything. And you are right, it is not mentioned how the insert size is calculated based on read length.

EDIT: OK, it seems that setting mininsert=0 fixes the issue. Seems that its a bug.

ADD REPLY • link 5.8 years ago by grant.hovhannisyan ★ 2.6k