I am simulating test reads with randomreads.sh from the BBMap package.
My command is:
/Software/bbmap/randomreads.sh ref=C_glabrata_CBS138_current_chromosomes.fasta out1=read1.fastq out2=read2.fastq build=1 length=10 reads=2 coverage=-1 replacenoref=t simplenames=t seed=-1 paired=t metagenome=t addpairnum=t
The output is:
for read1
@0_+106297_106306_ChrM_C_glabrata_CBS138 (1402899 nucleotides) 1:
AGGTTTTAAT
+
989945?446
@1-_438159_438168_ChrJ_C_glabrata_CBS138 (1195129 nucleotides) 1:
ATATCTTCCT
+
AA?CB?@bcb`
for read2:
@0_-761178_761187_ChrM_C_glabrata_CBS138 (1402899 nucleotides) 2:
AGCAGAGAGA
+
?>64844:64
@1+_646646_646655_ChrJ_C_glabrata_CBS138 (1195129 nucleotides) 2:
TGCCAGTTTC
+
??B@A?><@>
As far as I understand, reads @0 from read1 and @0 from read2 form a read pair. However, based on their coordinates, they are very far away from each other. Is this normal behaviour for the software?
Thanks
P.S. Hope Brian Bushnell will see this post :)
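To put a number on how far apart the mates are, the coordinates in the read headers can be compared directly. Below is a minimal Python sketch, not part of BBMap: the names `parse_header` and `pair_span` are hypothetical, and the header layout (read number, strand, start, stop, contig name) is only assumed from the output shown above. It also assumes both mates report inclusive coordinates on the same strand of the reference.

```python
import re
from typing import NamedTuple, Optional

# Assumed header layout, inferred only from the output in this post:
#   @<read number>[_]<strand>[_]<start>_<stop>_<contig name> ...
# The underscores around the strand sign are optional because the pasted
# headers are inconsistent about them.
HEADER_RE = re.compile(r"^@(\d+)_?([+-])_?(\d+)_(\d+)_(\S+)")


class ReadInfo(NamedTuple):
    read_id: int
    strand: str
    start: int
    stop: int
    contig: str


def parse_header(header: str) -> ReadInfo:
    """Extract read number, strand, coordinates and contig from a header line."""
    m = HEADER_RE.match(header)
    if m is None:
        raise ValueError(f"unrecognised header: {header!r}")
    read_id, strand, start, stop, contig = m.groups()
    return ReadInfo(int(read_id), strand, int(start), int(stop), contig)


def pair_span(header1: str, header2: str) -> Optional[int]:
    """Return the outer distance covered by both mates (inclusive),
    or None if the mates come from different contigs."""
    a, b = parse_header(header1), parse_header(header2)
    if a.read_id != b.read_id:
        raise ValueError("headers do not belong to the same pair")
    if a.contig != b.contig:
        return None
    return max(a.stop, b.stop) - min(a.start, b.start) + 1


if __name__ == "__main__":
    r1 = "@0_+106297_106306_ChrM_C_glabrata_CBS138 (1402899 nucleotides) 1:"
    r2 = "@0_-761178_761187_ChrM_C_glabrata_CBS138 (1402899 nucleotides) 2:"
    print(pair_span(r1, r2))
```

For the two pairs above this gives spans of 654891 and 208497 bases, far more than a typical paired-end insert size, which is why the output looks wrong.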
Is there a reason you have metagenome=t set? 10 bp reads don't make much sense either.

In this test case, no, there is no reason. But for my pipeline I will need that option to simulate an RNA-Seq read distribution. By the way, for large transcriptomes (such as human), that option slows the simulation down enormously. I guess that is worth a separate question.
Trying to simulate a metagenome (while providing a single reference) may be causing the strange behavior. Can you remove that option and see if the output looks more reasonable?
@Brian had these notes on metagenome simulations:
Thanks @genomax. It seems that mininsert=0 solved the issue. From your comment, it also makes sense to me why it takes so long to simulate data for the human transcriptome with metagenome=t. Do you have experience with large transcriptomes and randomreads.sh? For me it takes about 15 min to simulate 100 reads.

Did you try to simulate longer reads, like 100 bp or 250 bp?
As for the question in my original post: yes, with longer reads there are no problems, because apparently you always get positive values for the insert size. As for the speed of simulating large transcriptomes, I am now trying longer reads and a larger number of reads.