I guess this is more of an algorithm question. I have a group of large bedgraph files (shown below; each contains every single position of the genome), each about 20-30 GB in size. I need to search these files and locate the lines whose locations match entries in another, much smaller file (< 100 KB; e.g. file test.txt, see below). How should I do this efficiently?
I read posts such as https://stackoverflow.com/questions/6219141/searching-for-a-string-in-a-large-text-file-profiling-various-methods-in-pytho, but they address the more general situation.
I think for my case I should take advantage of the fact that my files are sorted: each time a line in test.txt locates a line in the bedgraph (say line X), everything above line X is no longer relevant. I have some draft ideas about how to write this in Python, but I would also like to hear your suggestions.
I don't mind calling other languages from my Python code. If there is existing software that does something similar, please let me know as well. Thank you very much.
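Since test.txt is far smaller than the bedgraph files, one simple option is to load its (chromosome, position) pairs into a set and stream each large file exactly once, with an O(1) lookup per line and without ever holding the big file in memory. This is only a sketch under assumptions: it matches the raw start column as-is, so the `(0 system)` / `(1 system)` difference between the two files may require subtracting 1 from the test.txt positions first.

```python
import sys

def load_targets(lines):
    """Collect (chrom, position) keys from the small query file."""
    targets = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and not line.startswith("#"):
            # Adjust here (e.g. str(int(fields[1]) - 1)) if the two files
            # use different coordinate bases (0-based vs 1-based).
            targets.add((fields[0], fields[1]))
    return targets

def match_lines(bedgraph_lines, targets):
    """Yield bedgraph lines whose (chrom, start) pair is a target."""
    for line in bedgraph_lines:
        fields = line.split()
        if len(fields) >= 4 and (fields[0], fields[1]) in targets:
            yield line

if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python match.py test.txt largeFile.bedgraph > matched.bedgraph
    with open(sys.argv[1]) as small:
        targets = load_targets(small)
    with open(sys.argv[2]) as big:
        sys.stdout.writelines(match_lines(big, targets))
```

This does not exploit the sort order at all, but for a one-shot scan of a 20-30 GB file the single sequential pass is usually I/O-bound anyway.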
#largeFile.bedgraph (0-based system)
chrM 0 1 5183
chrM 1 2 5299
chrM 2 3 5344
chrM 3 4 5439
chrM 4 5 5525
chrM 5 6 5579
...
chr19 3 4 6
chr20 4 5 35
#test.txt (1-based system)
chrM 3
chrM 4
chr20 5
I would like to output:
chrM 3 4 5439
chrM 4 5 5525
chr20 4 5 35
because these are the records that match test.txt.
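The "everything above line X is no longer relevant" idea can be sketched as a two-pointer merge over the two sorted files. This sketch makes strong assumptions: both files list chromosomes in the same order, positions are ascending within each chromosome, and every chromosome named in test.txt actually appears in the bedgraph. It also matches on the raw start column, so an off-by-one adjustment may be needed for the 0-based/1-based difference noted in the headers.

```python
def merge_match(query_lines, bedgraph_lines):
    """Walk both sorted inputs in lockstep; each bedgraph line is read once.

    Yields bedgraph lines whose (chrom, start) equals a query's (chrom, pos).
    """
    bg = iter(bedgraph_lines)
    line = next(bg, None)
    for q in query_lines:
        q_chrom, q_pos_s = q.split()[:2]
        q_pos = int(q_pos_s)
        on_chrom = False  # have we reached the query's chromosome yet?
        while line is not None:
            chrom, start_s = line.split()[:2]
            if chrom != q_chrom:
                if on_chrom:
                    break               # left the chromosome: position absent
                line = next(bg, None)   # not there yet: keep scanning forward
                continue
            on_chrom = True
            start = int(start_s)
            if start < q_pos:
                line = next(bg, None)   # bedgraph is behind: keep scanning
            elif start == q_pos:
                yield line              # match; advance past it
                line = next(bg, None)
                break
            else:
                break                   # passed the query position: no match
```

Because the bedgraph iterator only ever moves forward, feeding it `open("largeFile.bedgraph")` directly keeps memory flat regardless of file size.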
Hi @AlexReynolds Thank you so much for the reply and for introducing bedops! I will give it a shot :)