How to parse BAM files without requiring huge memory?
1
1
Entering edit mode
4.9 years ago

Hello,

Is it possible to parse the read entries in an (unaligned) BAM file one by one, without requiring huge amounts of memory?

I've tried samtools view <bamfile> | <python script> as well as using the pysam.AlignmentFile() parser from inside the script, but both solutions on our cluster end up taking over 60GB of RAM for a 6GB BAM. I do believe we have nodes that can handle a lot more RAM than that, but I'm still annoyed by requirements that wouldn't run on a laptop if needed.
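
To show the shape of the piped approach: the script just iterates over stdin one SAM line at a time, so in principle nothing should be buffered. Rough sketch only; the pattern and the per-match handling here are placeholders, not my actual code:

import re
import sys

# usage: samtools view unaligned.bam | python script.py
pattern = re.compile("ACGT")              # placeholder pattern

for line in sys.stdin:                    # one SAM record per line, streamed
    if line.startswith("@"):              # skip header lines if 'samtools view -h' was used
        continue
    fields = line.rstrip("\n").split("\t")
    seq = fields[9]                       # SEQ is the 10th column of a SAM record
    for match in pattern.finditer(seq):
        pass                              # do regex stuff with match.start(), match.group(), ...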

I've briefly tried to look around, but nobody seems to be asking this question with regards to simply parsing a BAM. Most memory-related topics for samtools seem to revolve around sorting.

So, is there a more resource-efficient way to parse BAMs progressively, or does the whole thing need to be decompressed into memory first (presumably that's what's happening) before the entries can be accessed sequentially?

Thanks!

software error sequence samtools • 2.7k views
2
Entering edit mode

Parsing a BAM file, read by read, should only require the memory needed for the read currently open. Neither of the approaches you've outlined should necessarily mean a decompression of the whole file into memory. Two possibilities:

  • At some point the input iterator in your code is being converted to a list, for example by using .readlines() or list().

  • Something in your code means that more information is being stored than is necessary for each read. Of course, just storing every read name from a file is going to take a lot of memory for a big file, so it's imperative that only the absolute minimum information is retained about each read (see the sketch below).
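
To make the first point concrete, here is a rough sketch of the difference (the file name and the per-read work are placeholders):

import pysam

# lazy iteration: only the record currently being processed is held in memory
with pysam.AlignmentFile("reads.bam", "rb", check_sq=False) as bam:
    for read in bam:
        pass                       # look at read.query_name, read.query_sequence, ...

# eager materialisation: every record in the file is kept alive at once
with pysam.AlignmentFile("reads.bam", "rb", check_sq=False) as bam:
    all_reads = list(bam)          # this kind of call is what blows the memory up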

Perhaps you'd be able to share some of your code with us?

0
Entering edit mode

It would be difficult to post the whole working code as it is part of a bigger thing. But I am not converting the iterator to a list:

import pysam

# check_sq=False lets pysam open an unaligned BAM with no @SQ lines in the header
samin = pysam.AlignmentFile(bam, 'rb', check_sq=False)
for line in samin:
    seq = line.get_forward_sequence()
    # do regex stuff on seq
samin.close()

What I'm saving is lists of match positions and short substrings. At the end I make collections.Counter() objects out of those and print them out.

In retrospect, the collective text representation of the counters is ~130MB. So I did a rough calculation: if I were saving just the 50bp read sequence from each entry, that alone would easily add up to some tens of GB. I'm not saving the whole reads or the read names, but the amount of smaller stuff I do save probably adds up to the equivalent in the end.

So it is actually plausible that it's my script that's being the memory hog. Thanks for pointing that out!

The reason I thought it would be samtools is that I've gotten used to even pure samtools operations like merge being memory hogs.

I guess now I have to figure out a way to do my stats on the fly instead of keeping it all in lists the whole time. Maybe dictionaries of counters, instead of lists of repeated values.
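
Something along these lines is what I'm picturing: a dictionary of Counters keyed by pattern, updated read by read instead of building lists first (sketch only, the pattern names and sequences are made up; bam is the same path variable as in the snippet above):

import collections
import re
import pysam

# made-up patterns, just to illustrate the structure
patterns = {"patternA": re.compile("ACGTACGT"), "patternB": re.compile("TTAGGG")}

# one Counter of match positions per pattern, updated as each read streams past
position_counts = collections.defaultdict(collections.Counter)

with pysam.AlignmentFile(bam, 'rb', check_sq=False) as samin:
    for line in samin:
        seq = line.get_forward_sequence()
        if seq is None:
            continue
        for name, pattern in patterns.items():
            position_counts[name].update(m.start() for m in pattern.finditer(seq))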

0
Entering edit mode

What do you want to do with the BAM file?

0
Entering edit mode

I want to get my hands on the read sequences to collect statistics on the presence and position of certain sequence patterns.

2
Entering edit mode
4.9 years ago

It turns out that @i.sudbery was indeed correct. The memory-greedy bit was a dumb design choice in my script: I was storing information from each BAM entry, only to count up the occurrences at the end. With 200,000,000 entries, the little bits of stored info added up... My goal is much more efficiently achieved by starting all the counters right at the beginning and updating them entry by entry. This is really basic stuff; I'm not even sure why I didn't do it that way from the start.
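
For anyone landing here with the same problem, the change boils down to the difference below (sketch with a placeholder pattern, not the actual script; reads is any iterable of pysam records):

import collections
import re

pattern = re.compile("ACGT")   # placeholder pattern

def count_at_the_end(reads):
    # before: store every match position and build the Counter once at the end;
    # the list grows with the total number of matches across all reads
    positions = []
    for read in reads:
        seq = read.get_forward_sequence()
        if seq:
            positions.extend(m.start() for m in pattern.finditer(seq))
    return collections.Counter(positions)

def count_on_the_fly(reads):
    # after: update the Counter per read; memory is bounded by the number of
    # distinct positions seen, not by the number of reads
    counts = collections.Counter()
    for read in reads:
        seq = read.get_forward_sequence()
        if seq:
            counts.update(m.start() for m in pattern.finditer(seq))
    return counts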

