Fast datastructure to get counts of overlapping reads per nucleotide (!) in Python?
2
4
Entering edit mode
7.9 years ago
endrebak ▴ 960

I want to keep track of how many reads are overlapping each nucleotide in my genome, so that I can quickly get the number of overlapping reads with dict[chromosome][strand][nucleotide].

This is possible with intervaltrees from bx-python, but afair, that module did not work with multiprocessing, and it might not be the fastest for this specific task.

Are there other modules/packages that do this?

python data-structure • 2.5k views
ADD COMMENT
0
Entering edit mode

You already have a read count for each chromosome/strand/position (a wig/bedgraph file)? How much memory do you have on your machine and how big is your genome? You don't really need to do anything too clever if you have enough hardware.

ADD REPLY
0
Entering edit mode

Thanks. I'm trying to revive and old peak caller algorithm that performed better than macs2 in many benchmarks. Problem is (apart from it being brittly coded) that it uses a lot of comparisons between #reads in nucleotide positions, so I better have to come up with something clever :) I think bx-python/cython and some thinking will save me though.

ADD REPLY
0
Entering edit mode

I don't think you'll get lookups significantly faster than a dict, unless you go with another language. Dicts are pretty well optimized in python since so much depends on them.

I don't see how interval trees will be faster than dict lookups either unless you are looking for range overlaps?

Looking at your post history, you are trying to write a more performant version of SICER? DeepTools (http://deeptools.readthedocs.io/en/latest/index.html) does a lot of these types of genome wide coverage operations. And it is written in python. Maybe you can get some ideas from them.

ADD REPLY
0
Entering edit mode

Nah, it is a 2012 method called triform, the new sicer is already out at https://github.com/endrebak/epic

ADD REPLY
6
Entering edit mode
7.9 years ago

With python you have a few options depending on your expertise. The numbering does not reflect my perferences. The ease and speed of each option I listed is dependent on the size of your input and the number of regions to parse. If you have a lot of regions I would recommend using samtools.

  1. pybedtools is a great wrapper of Bedtools in python. It can get slow depending on how many times you call the functions and how large your data are. However if you're familiar with intersection positions in Bedtools this should be a breeze!
  2. Cython. You can use C types and np.ndarrays to iterate much faster in Cython
  3. Use samtools: For your question

    how many reads are overlapping each nucleotide

    samtools depth is perfect for this!

  4. pysam: same thing a 3. But now you are using a htslib API in python.

ADD COMMENT
3
Entering edit mode

Pysam can be pretty fast. Since it's for an entire genome, you'd just iterate over the bam file.

ADD REPLY
3
Entering edit mode

Just don't open and close the bam file handle and iterate using fetch I <3 pysam

ADD REPLY
2
Entering edit mode

But these are not dict-like structures that allow me nucleotide lookups. Upvote, since my answer was probably unclear and you answered it.

ADD REPLY
3
Entering edit mode

I see, well in that case would it be easier for you to break it up into two scripts? You first generate the per base pair coverage and then in a different script you match the coverage to the nucleotide? Cython can make it run faster if you index based on position and have a cdef char for your nucleotide.

With Bedtools/pybedtools

Your nucleotide position file will look like this chrom position position strand nucleotide

Output from samtools depth (You will need to reformat it by duplicating the position column for bedtools)

chrom position position depth-of-coverage

Then $ intersectBed -a POSITION_FILE -b DEPTH_FILE -wa -wb

You're output should look like this

chrom position position strand nucleotide chrom position position depth-of-coverage

And then it's a simple parse. However this is not allelic depth but total read coverage over the position

Another option with Haplotype Caller

It sounds like you may be interested in allelic depth? Are you trying to count the coverage of a heterozygous SNP? In that case using HaplotypeCaller (assuming you are academic) with a list of regions you are interested might be the quickest route. In the end you would just need to parse through the VCF.

Example output for HaplotypeCaller

10      68533   rs35819232      T       G       826.77  .       <INFO OMITTED>     GT:AD:DP:GQ:PL  1/1:0,26:26:78:855,78,0
10      69071   rs146345459     G       A       1286.77 .       <INFO OMITTED>    GT:AD:DP:GQ:PL 0/1:16,47:63:99:1315,0,306

The first entry is ALT homozygous with 26 supporting reads. Allelic depth (AD in the FORMAT column) is 0,26

The second entry is heterozygous with 16 reads supporting the REF allele and 47 reads supporting the ALT allele (you expect a near 1:1 ratio for good quality HET snps, this may not be one, or it's in a CNV)

ADD REPLY
1
Entering edit mode

Thanks for the hints. I'm not sure that dicts are faster in cython though. They are not restricted to a single type even there.

ADD REPLY
0
Entering edit mode

It will be slightly faster if you use ctypes for positions and nucleotides. It will be noticeably faster if you break up your dict into a np.ndarray with ctypes. Good luck!

ADD REPLY
1
Entering edit mode

Is using a numpy array as a lookup-table really faster than a dict? Did not know. Sounds strange, since they are made for vectorized operations, but thanks, will try.

ADD REPLY
0
Entering edit mode

certain parts of numpy is C optimized. If you're going to do a lot of for: loops in python, which are notoriously slow, you can speed it up using numpy vectors and Cython. You will have to cimport numpy as py This blog helped me, but I was not able to run code near the bottom of the page without errors.

Really sped up my recursive script!

ADD REPLY

Login before adding your answer.

Traffic: 1607 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6