Biostar Beta. Not for public use.
How to correctly parallelise RSeQC scripts with GNU parallel?
1
Entering edit mode
14 months ago
Colombia, Bogotá, Pontifical Xavierian …

Dear all,

I have a .bam, as output from STAR aligner, from which I need to extract some info using RSeQC while using all the computational resources available to increase speed. To accomplish that I am using GNU parallel. Currently, I am doing it like this:

  1. Split the main .bam per chromosome, generating files with the suffix Aligned.sortedByCoord.out.REF

    bamtools split -in Aligned.sortedByCoord.out.bam  -reference
    
  2. Calculate the clipping profile

    ls Aligned.sortedByCoord.out.REF_* | parallel --joblog clipping_profile.log --eta --bar --max-procs 20 --jobs 20 'clipping_profile.py -i {} -s "SE" -o RSeQC.clipping.profile/clipping_profile.{.}'
    

This will generate a .r, .pdfand a .xlxs for each chromosome, thus ending up with a lot of files. I am forced to split the .bam per chromosome as I am not able to find a --recstart nor a --recend in the compressed .bam.

Question 1

Is there a better way to split a .bam file using GNU parallel and RSeQC' programs (e.g. split_bam.py), such that one gets the outputs from a single input file?

Inspired by ole.tange's thread, I have also tried this, though unsuccessfully:

cat star_outputAligned.sortedByCoord.out.bam | parallel --block 1G -k --joblog splitbam3.log --max-procs 20 --jobs 20 --pipe 'split_bam.py -i {} -r ../../My_genomes/GCA.GRCh38/GRCh38_wgEncodeGencodeAttrsV28_all_rRNA.bed12 -o splitbam3'

Question 2

If using the split option in the first step, is there a way to merge all output to get the results of the whole file (as if never had been split)?

ADD COMMENTlink
0
Entering edit mode

such that one gets the outputs from a single input file?

I assume you are referring to getting a single output file set. You can make your own judgement by looking at this but it would not be a good thing to do since you are looking to generate specific stats on individual chromosomes. You may end up with an interleaved file (assuming the writing to single file works).

It would be safe to merge the resulting files after all processing is done. The process is described here for Excel files (and there are other solutions you can find by searching).

ADD REPLYlink
0
Entering edit mode

Hello @genomax

I assume you are referring to getting a single output file set.

Yes. The output should be as if I had run, but faster:

clipping_profile.py -i  Aligned.sortedByCoord.out.bam -s "SE" -o RSeQC.clipping.profile/clipping_profile

You may end up with an interleaved file (assuming the writing to single file works).

Right, then no merging while writing the output file.

The process is described here for Excel files

This will be a nice solution but, probably, the time I saved by using parallel will be spent in the merging step. Is there a way to use, for instance, parallel's blocks to do it in fewer steps?

Thanks for replying.

ADD REPLYlink
1
Entering edit mode
2 days ago
ATpoint 17k
Germany

I cannot comment on the RseQC, but splitting a BAM into its chromosomes using parallel could be done like this. It makes use of the samtools syntax samtools view options BAM <chromosome>

samtools idxstats in.bam | cut -f1 | grep -v '\*' | parallel "samtools view -b -o out_{}.Aligned.sortedByCoord.out.bam in.bam {}"
ADD COMMENTlink
0
Entering edit mode

Actually this is a nice solution. Usually, I have to wait until bamtools split -in file.bam -reference finish, so there is some time that I can definitely save by using your code.

Thanks for replying.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1