Why can't we split and distribute BAM for parallel processing like in HTC?
6.5 years ago
javaclinic • 0

Back in the CERN era, it was common to distribute large data as multiple chunks and process them in parallel on different nodes.

Why can't we do this with BAM? Please correct me if I am wrong; someone told me that BAM is monolithic and cannot be split and distributed as such.

Please advise.

/Zee

Bam HPC • 1.5k views
6.5 years ago

Why can't we do this with BAM?

It depends on your analysis, but many tools can work on just one genomic region of the BAM, e.g. --intervals / -L in GATK: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_engine_CommandLineGATK.php#--intervals or samtools, etc. This works because a coordinate-sorted, indexed BAM supports random access by region. Most of the time, you don't have to split the BAM, just split your genome.

Or, if your BAM is not visible from the nodes, you can always split the BAM per chromosome (with samtools view).
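As a rough sketch of the per-chromosome split, assuming in.bam is coordinate-sorted and indexed and that samtools is on the PATH: the helper name print_split_commands and the output file names are hypothetical, and the function only prints the commands so they can be inspected or handed to a scheduler.

```shell
# Print one "samtools view" command per reference sequence name read from stdin.
# Hypothetical helper: nothing is executed here, the commands are only printed.
print_split_commands() {
    bam="$1"
    while read -r chrom; do
        # Skip empty lines and the "*" row that idxstats emits for unplaced reads.
        [ -z "$chrom" ] && continue
        [ "$chrom" = "*" ] && continue
        # -b writes BAM output, -o names the per-chromosome output file.
        printf 'samtools view -b -o %s.bam %s %s\n' "$chrom" "$bam" "$chrom"
    done
}

# Typical use (needs samtools and an indexed BAM):
#   samtools idxstats in.bam | cut -f 1 | print_split_commands in.bam
```

Each printed line can then run as its own job on a different node. Note that unplaced unmapped reads belong to no chromosome, so they would need separate handling if your analysis requires them.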


Thanks Pierre, what do you mean by "only use a genomic-section of the BAM"?


Again, it depends on your analysis, but it could be something like this. Say you're counting the number of reads:

process 1: samtools view -c in.bam chr1 > chr1.count
process 2: samtools view -c in.bam chr2 > chr2.count
process 3: samtools view -c in.bam chr3 > chr3.count
process N: samtools view -c in.bam chrN > chrN.count
process N+1: blablabala_for_sum chr1.count chr2.count (...) chrN.count > sum.count
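The blablabala_for_sum placeholder could be as simple as adding up the numbers in the count files. Here is a minimal sketch, assuming each .count file holds a single integer as written by samtools view -c; the name sum_counts is hypothetical.

```shell
# Add up the integers stored in the given count files (one number per file).
sum_counts() {
    # "total + 0" makes awk print 0 instead of an empty line when there is no input.
    cat "$@" | awk '{ total += $1 } END { print total + 0 }'
}

# Typical use, once every per-chromosome job has finished:
#   sum_counts chr*.count > sum.count
```

The per-chromosome counts are independent of each other, so only this last step has to wait for all the other processes.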
