Biostars beta testing.
Question: How to increase the speed of subread for DNA whole genome sequence
0
Entering edit mode

Hi All,

I am trying to align one DNA whole genome sequence file using subread, the sequence reads are paired end, 150 bp reads. I used 'thread=64' and 216 core (9 nodes, 24 core each). but after two days it processed only 15% and I had to kill the job.

I am using the university quest, so basically I can get as many nodes/cores as I want. Is there any way by which I can accelerate the alignment.

>align(index="my_index",readfile1=c("141023_H0E72_113-JG4_L003_R1.fastq.gz"),readfile2=c("141023_H0E72_113-JG4_L003_R2.fastq.gz"), output_file=c("subread_jg4.sam"),nthreads=64,unique=TRUE,type=1)

  ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.22.3

//========================== subread-align setting===========================\\
||                                                                            ||
|| Function      : Read alignment (DNA-Seq)                                   ||
|| Input file 1  : 141023_H0E72_113-JG4_L003_R1.fastq.gz                      ||
|| Input file 2  : 141023_H0E72_113-JG4_L003_R2.fastq.gz                      ||
|| Output file   : subread_jg4.sam (BAM)                                      ||
|| Index name    : my_index                                                   ||
||                                                                            ||
||                       Threads : 40                                         ||
||                  Phred offset : 33                                         ||
||       # of extracted subreads : 10                                         ||
||                Min read1 vote : 3                                          ||
||                Min read2 vote : 1                                          ||
||             Max fragment size : 600                                        ||
||             Min fragment size : 50                                         ||
||    Maximum allowed mismatches : 3                                          ||
||   Maximum allowed indel bases : 5                                          ||
|| # of best alignments reported : 1                                          ||
||                Unique mapping : yes                                        ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ==============

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.22.3

Thanks

Tarek

ADD COMMENTlink 3.4 years ago tarek.mohamed • 250 • updated 21 months ago andysw90 • 20
Entering edit mode
0
1) How many reads do you have (or, how big are your input files in GB)?
2) What reference are you aligning them to?
ADD REPLYlink 3.4 years ago
Brian Bushnell
16k
Entering edit mode
0

Hi Brian,

Thanks for the reply,

I have two files for paired end sequence, each is 35 GB. I am aligning to the human ref genome NCBI.GRch38.

Tarek

ADD REPLYlink 3.4 years ago
tarek.mohamed
• 250
Entering edit mode
0

OK, that's going way too slow and I have no idea why. I have not used Subread, though; are you sure it supports multiple nodes? Most software doesn't.

You could try BBMap (which I wrote) like this:

bbmap.sh -Xmx31g ref=humanGenome.fasta in=141023_H0E72_113-JG4_L003_R1.fastq.gz in2=141023_H0E72_113-JG4_L003_R2.fastq.gz out=mapped.sam maxindel=100

For your data I estimate it would finish in around 16 hours on one 24-core node . If you want to run on multiple nodes, you can first partition the data with partition.sh (included with BBMap), for example:

partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8

...then run each pair on a different node and merge the bam files.

ADD REPLYlink 3.4 years ago
Brian Bushnell
16k
Entering edit mode
0

Hello tarek.mohamed!

It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/87097/

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink 3.4 years ago
WouterDeCoster
39k
Entering edit mode
0

Hi wouterDeCoster

I did not know that I am not allowed to post same post in 2 different site.

Tarek

ADD REPLYlink 3.4 years ago
tarek.mohamed
• 250
Entering edit mode
0

It is not that you are not allowed to cross-post but people who frequently contribute answers on Biostars frown upon this practice since it may unnecessarily lead to duplication of effort on part of those who provide answers.

ADD REPLYlink 3.4 years ago
genomax
68k
Entering edit mode
0

I see.

Thanks

ADD REPLYlink 3.4 years ago
tarek.mohamed
• 250
0
Entering edit mode

Outputting to bam rather than sam might help -- smaller files = less time writing.

ADD COMMENTlink 21 months ago andysw90 • 20
Entering edit mode
0

Doesn't writing BAM takes longer due to the compression time?

ADD REPLYlink 21 months ago
ATpoint
17k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0