Effectively search for positions in large bedgraph files using python (or other language if needed)
1
0
Entering edit mode
6.8 years ago
DVA ▴ 630

I guess it is more of an algorithm question. I have a group of large bedgraph files (shown below, contains every single position of the genome), each is about 20-30Gb in size. I need to look into these files and locate the lines having locations match another file (small, < 100kb) (e.g. file test.txt, see below). How should I do it efficiently please?

I read posts such as: https://stackoverflow.com/questions/6219141/searching-for-a-string-in-a-large-text-file-profiling-various-methods-in-pytho but it is more for a general situation.

I think for my case I should take advantage of the fact that my files are "sorted", and that each time I use a line in "text.txt", to locate a line (line X for example) in bedgraph, anything above lineX is no longer relevant. I have some draft ideas about how to write it out in python, but also would like to hear suggestions from you.

I don't mind calling other languages in my python code. If there are existing software doing similar things, please let me know also. Thank you very much.

#largeFile.bedgraph (0 system)
chrM    0       1       5183
chrM    1       2       5299
chrM    2       3       5344
chrM    3       4       5439
chrM    4       5       5525
chrM    5       6       5579
...
chr19    3       4       6
chr20    4       5       35

#test.txt (1 system)
chrM    3
chrM    4
chr20   5

I would like to output:

chrM    3       4       5439
chrM    4       5       5525
chr20    4       5       35

because they are the record match test.txt.

bedgraph python • 2.3k views
ADD COMMENT
2
Entering edit mode
6.8 years ago

Sort your first file with sort-bed:

$ sort-bed --max-mem 2G --tmpdir $PWD largeFile.bedgraph > largeFile.bed

You only need to sort (resort?) once. Also, specifying --tmpdir ensures that you don't fill up /tmp with intermediate data, given the size of your input file.

Convert your second file to a sorted BED file:

$ awk '{ print $0"\t"($1+1); }' test.txt | sort-bed - > test.bed

Do a set operation on the two sorted BED files with bedops:

$ bedops --element-of 1 largeFile.bed test.bed > answer.bed

You could probably do this much faster with bedextract:

$ bedextract largeFile.bed test.bed > answer.bed

This is because largeFile.bed does not appear to contain nested elements, which allows use of bedextract to gain some significant optimizations to be done on a BED search.

ADD COMMENT
0
Entering edit mode

Hi @AlexReynolds Thank you so much for the reply and introducing bedops! I will give a short:)

ADD REPLY

Login before adding your answer.

Traffic: 2441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6