Question

Why can't we split and distribute a BAM for parallel processing, like in HTC?

6.5 years ago

javaclinic • 0

Back in the CERN era, it was common to split large data sets into multiple chunks and process them in parallel on different nodes.

Why can't we do this with BAM? Please correct me if I am wrong; someone told me that BAM is monolithic and cannot be split and distributed as such.

Please advise.

/Zee

Bam HPC • 1.5k views

score 2 · Answer 1 · 2017-10-24

6.5 years ago

Pierre Lindenbaum 161k

why cannot we do this on BAM ?

It depends on your analysis, but many tools can work on just a genomic section of the BAM, e.g. --intervals / -L in GATK: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_engine_CommandLineGATK.php#--intervals or samtools, etc. Most of the time you don't have to split the BAM; just split your genome.
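As a rough sketch of "split your genome": the contig names and lengths can be read from `samtools idxstats` and cut into fixed-size regions that each node then passes to its tool. The function name `regions_from_idxstats` and the window-size parameter are made up for this illustration, and it assumes the BAM is coordinate-sorted and indexed.

```shell
# Turn `samtools idxstats` output (name, length, mapped, unmapped; tab-separated)
# into fixed-size regions, one "contig:start-end" per line.
# $1: window size in bases (hypothetical parameter chosen for this sketch).
regions_from_idxstats() {
    awk -F '\t' -v window="$1" '
        $1 != "*" && $2 > 0 {          # skip the "*" line (reads with no contig)
            for (start = 1; start <= $2; start += window) {
                end = start + window - 1
                if (end > $2) end = $2  # clip the last window to the contig length
                printf "%s:%d-%d\n", $1, start, end
            }
        }'
}

# Usage (assumes in.bam has an index next to it):
#   samtools idxstats in.bam | regions_from_idxstats 10000000 > regions.txt
```

One caveat: a read that overlaps a window boundary is returned for both neighbouring windows, so for simple counting it is safer to use whole contigs as the regions.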

Or, if your BAM is not visible from the nodes, you can always split the BAM per chromosome (with samtools view).
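A minimal sketch of that per-chromosome split, assuming an indexed, coordinate-sorted BAM. The function name `split_bam_by_contig` and the output layout (`<outdir>/<contig>.bam`) are only illustrative choices, not a fixed convention.

```shell
# Write one BAM per contig into an output directory.
# $1: input BAM (must be indexed), $2: output directory, remaining args: contig names.
split_bam_by_contig() {
    split_bam=$1
    split_outdir=$2
    shift 2
    mkdir -p "$split_outdir" || return 1
    for split_contig in "$@"; do
        # -b writes BAM output; the trailing argument restricts output to that contig.
        samtools view -b -o "$split_outdir/$split_contig.bam" "$split_bam" "$split_contig" || return 1
    done
}

# Usage:
#   split_bam_by_contig in.bam per_chrom chr1 chr2 chr3
```

Each resulting file can then be copied to a node and processed independently. Contig names containing characters that are awkward in file names would need extra handling.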

Thanks Pierre. What do you mean by "only use a genomic-section of the BAM"?

6.5 years ago by javaclinic • 0

Again, it depends on your analysis, but it could be something like this. Say you're counting the number of reads:

process 1: samtools view -c in.bam chr1 > chr1.count
process 2: samtools view -c in.bam chr2 > chr2.count
process 3: samtools view -c in.bam chr3 > chr3.count
process N: samtools view -c in.bam chrN > chrN.count
process N+1: blablabala_for_sum chr1.count chr2.count (...) chrN.count > sum.count
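The same scatter/gather idea can be sketched on a single machine, with background processes standing in for cluster jobs and `awk` standing in for the summing step. The function name `count_reads_parallel` and the output directory argument are made up for this example; on a real cluster each `samtools view -c` line would be submitted through your scheduler instead.

```shell
# Count reads per contig in parallel, then sum the per-contig counts.
# $1: input BAM (must be indexed), $2: directory for the .count files,
# remaining args: contig names. Prints the total on stdout.
count_reads_parallel() {
    count_bam=$1
    count_outdir=$2
    shift 2
    mkdir -p "$count_outdir" || return 1

    # Scatter: one background process per contig.
    for count_contig in "$@"; do
        samtools view -c "$count_bam" "$count_contig" > "$count_outdir/$count_contig.count" &
    done
    wait  # block until every per-contig count has finished

    # Gather: add up the single number stored in each .count file.
    for count_contig in "$@"; do
        cat "$count_outdir/$count_contig.count"
    done | awk '{ total += $1 } END { print total + 0 }'
}

# Usage:
#   count_reads_parallel in.bam counts chr1 chr2 chr3 > sum.count
```

Note that unmapped reads with no contig assigned are not reached by any per-contig query, so this total can be lower than `samtools view -c in.bam` on the whole file.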

6.5 years ago by Pierre Lindenbaum 161k