Dear all,
I have a .bam
, as output from STAR aligner, from which I need to extract some info using RSeQC while using all the computational resources available to increase speed. To accomplish that I am using GNU parallel. Currently, I am doing it like this:
Split the main .bam per chromosome, generating files with the suffix Aligned.sortedByCoord.out.REF
bamtools split -in Aligned.sortedByCoord.out.bam -reference
Calculate the clipping profile
ls Aligned.sortedByCoord.out.REF_* | parallel --joblog clipping_profile.log --eta --bar --max-procs 20 --jobs 20 'clipping_profile.py -i {} -s "SE" -o RSeQC.clipping.profile/clipping_profile.{.}'
This will generate a .r
, .pdf
and a .xlxs
for each chromosome, thus ending up with a lot of files. I am forced to split the .bam
per chromosome as I am not able to find a --recstart
nor a --recend
in the compressed .bam
.
Question 1
Is there a better way to split a .bam
file using GNU parallel and RSeQC' programs (e.g. split_bam.py
), such that one gets the outputs from a single input file?
Inspired by ole.tange's thread, I have also tried this, though unsuccessfully:
cat star_outputAligned.sortedByCoord.out.bam | parallel --block 1G -k --joblog splitbam3.log --max-procs 20 --jobs 20 --pipe 'split_bam.py -i {} -r ../../My_genomes/GCA.GRCh38/GRCh38_wgEncodeGencodeAttrsV28_all_rRNA.bed12 -o splitbam3'
Question 2
If using the split option in the first step, is there a way to merge all output to get the results of the whole file (as if never had been split)?
I assume you are referring to getting a single output file set. You can make your own judgement by looking at this but it would not be a good thing to do since you are looking to generate specific stats on individual chromosomes. You may end up with an interleaved file (assuming the writing to single file works).
It would be safe to merge the resulting files after all processing is done. The process is described here for Excel files (and there are other solutions you can find by searching).
Hello @genomax
Yes. The output should be as if I had run, but faster:
Right, then no merging while writing the output file.
This will be a nice solution but, probably, the time I saved by using parallel will be spent in the merging step. Is there a way to use, for instance, parallel's blocks to do it in fewer steps?
Thanks for replying.