pysam + multiple_iterators = TypeError during multiprocessing
6.5 years ago

I would like to parse a BAM file in parallel using pysam and multiple_iterators.

Here is my code:

import pysam
import sys
from multiprocessing import Pool
import time
def countReads(chrom, Bam):
    count = 0
    # Itr = Bam.fetch(str(chrom), multiple_iterators=False)
    Itr = Bam.fetch(str(chrom), multiple_iterators=True)
    for Aln in Itr:
        count += 1
    return count

if __name__ == '__main__':
    start = time.time()
    chroms=[x+1 for x in range(22)]
    cpu=6
    BAM = sys.argv[1]
    bamfh = pysam.AlignmentFile(BAM)
    pool = Pool(processes=cpu)
    for x in range(len(chroms)):
        pool.apply_async(countReads,(chroms[x],bamfh,))
        #countReads(chroms[x],bamfh)
    pool.close()
    pool.join()
    end = time.time()
    print(end - start)

I get this error when I run it:

TypeError: _open() takes at least 1 positional argument (0 given)

That is just the final line; it actually spits out a whole series of tracebacks. Can anyone help me use multiprocessing to read a BAM file in parallel with pysam?

Thanks

python • 4.6k views

Fixed it. I was going off an online blog that was wrong: the AlignmentFile handle has to be opened inside each worker process, not opened once in the parent and passed to apply_async.

import pysam
import sys
from multiprocessing import Pool
import time
def countReads(chrom, BAM):
    count = 0
    # here's the fix: open a separate file handle inside each worker process
    bam = pysam.AlignmentFile(BAM, 'rb')
    Itr = bam.fetch(str(chrom), multiple_iterators=True)
    for Aln in Itr:
        count += 1
    return count

if __name__ == '__main__':
    start = time.time()
    chroms=[x+1 for x in range(22)]
    cpu=6
    BAM = sys.argv[1]
    pool = Pool(processes=cpu)
    for x in range(len(chroms)):
        pool.apply_async(countReads, (chroms[x], BAM,))
        # countReads(chroms[x], BAM)
    pool.close()
    pool.join()
    end = time.time()
    print(end - start)

This makes sense to me - except, do you really need the multiple_iterators = True here, then?

From the pysam documentation for fetch():

multiple_iterators (bool) – If multiple_iterators is True, multiple iterators on the same file can be used at the same time. The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file. Re-opening a file creates some overhead, so beware.

It is my understanding that in the code above, you moved the opening of the file (bam = pysam.AlignmentFile(BAM, 'rb')), and therefore the creation of a separate file handle, into each worker process. Do you then also need multiple_iterators=True? That sounds like doing the same thing twice.
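For contrast, the single-process case is what multiple_iterators is actually for: holding two live fetch() iterators on the same AlignmentFile at once. This is only a sketch (bam_path and the region tuples are placeholders, and pysam is imported lazily inside the function so the sketch can be defined without it):

```python
def overlap_two_iterators(bam_path, chrom, region_a, region_b):
    """Hold two live fetch() iterators on the same AlignmentFile at once.

    Within one process this is exactly the case the docs describe:
    each iterator needs its own re-opened copy of the file handle,
    hence multiple_iterators=True on both fetch() calls.
    """
    import pysam  # deferred so defining this sketch does not require pysam

    bam = pysam.AlignmentFile(bam_path, 'rb')
    it_a = bam.fetch(str(chrom), *region_a, multiple_iterators=True)
    it_b = bam.fetch(str(chrom), *region_b, multiple_iterators=True)
    # Both iterators can now be advanced in any order without clobbering
    # each other's position in the file.
    return zip(it_a, it_b)
```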

I am asking because I'd like to use something very similar, but the countReads() function would look something like this instead:

def countReads(regions, chrom, BAM):
    count = 0
    bam = pysam.AlignmentFile(BAM, 'rb')

    for start, stop in regions:
        Itr = bam.fetch(str(chrom), start, stop, multiple_iterators = True)
        for Aln in Itr:
            count += 1
    return count
Including multiple_iterators = True here would reopen the file for every region of the chromosome, which would make this a much slower process.

EDIT: I believe that this issue thread on pysam's GitHub repo confirms the claim above: multiple_iterators = True is only needed when using multiple iterators simultaneously in the same process; when each process opens its own file handle, multiple_iterators = True should not be necessary.
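To make that concrete, the region-based worker could then drop the flag entirely. A sketch under the same assumptions (one call per process, bam_path hypothetical, pysam imported lazily inside the function):

```python
def count_reads_in_regions(bam_path, chrom, regions):
    """Count alignments over a list of (start, stop) regions of one chromosome.

    The function opens its own file handle and only ever consumes one
    iterator at a time, so the default multiple_iterators=False is
    sufficient even though many fetch() calls are made, and no file
    re-opening overhead is incurred per region.
    """
    import pysam  # deferred so defining this sketch does not require pysam

    count = 0
    bam = pysam.AlignmentFile(bam_path, 'rb')
    for start, stop in regions:
        for _ in bam.fetch(str(chrom), start, stop):
            count += 1
    bam.close()
    return count
```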

