Hi all, I am trying to create a tool which utilises BLAT v35 to pairwise-align all transcripts within each gene, for every gene in the genome. I am using the command-line BLAT, which requires a .fasta file per gene and produces a .psl per gene; this is a lot of IO (I produce 20,000 fasta files and 20,000 psl files, for instance).
My starting point is a .fasta file with all the transcripts in the entire genome and a .txt file describing which transcripts belong to which gene. Since there are hundreds of thousands of transcripts, blatting the fasta file against itself is very time-consuming and very inefficient (I only need to pairwise-align transcripts within the same gene, not against the other 219,999 transcripts), so I want to split this into one BLAT instance per gene. But to do so I need a fasta file per gene, and therein lies the problem.
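One way to sidestep per-gene files at the source is to load everything into memory once. A minimal sketch, assuming the cluster .txt has one whitespace-separated `transcript_id gene_id` pair per line (adjust the parsing to your actual layout; `read_fasta` and `group_by_gene` are illustrative names):

```python
from collections import defaultdict

def read_fasta(path):
    """Parse a fasta file into a dict of {transcript_id: sequence}."""
    seqs, name, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(chunks)
                # take the first token after '>' as the transcript id
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def group_by_gene(seqs, cluster_path):
    """Group sequences per gene using the clustering file.

    Assumed layout: one 'transcript_id gene_id' pair per line.
    Returns {gene_id: [(transcript_id, sequence), ...]}.
    """
    genes = defaultdict(list)
    with open(cluster_path) as fh:
        for line in fh:
            tx, gene = line.split()[:2]
            if tx in seqs:
                genes[gene].append((tx, seqs[tx]))
    return genes
```

With the transcripts grouped in memory like this, each gene's list can be handed to whatever alignment step you settle on without re-reading the genome-wide fasta.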
My tool is written in Python (I run BLAT as a shell command from within it), and I am trying to speed up my program by bypassing all the IO. Ideally, for each gene I would pass a list in which each element is a transcript sequence, blat the list against itself pairwise, and store the output entirely within Python (without needing to create a .fasta and a .psl file per gene).
My plan is one of the following:
1) Modify the BLAT source code so that I can simply input arrays of strings and output strings, and incorporate this function into my Python code using Cython.
2) Use a pairwise aligner native to Python (I have tested this already; they are much slower than BLAT).
3) Bypass BLAT altogether: write/import my own aligner in C/C++ and use that instead.
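A lower-effort variant worth considering alongside these: keep BLAT unmodified but point its temporary files at a RAM-backed filesystem, so the per-gene fasta/psl never touch the disk. A sketch, assuming a standard Linux /dev/shm tmpfs and the plain `blat database query output.psl` invocation (`blat_gene` and its parameters are illustrative names, not BLAT's API):

```python
import os
import subprocess
import tempfile

def blat_gene(transcripts, blat_bin="blat", tmp_dir="/dev/shm"):
    """Self-vs-self BLAT one gene's transcripts via a RAM-backed tmpdir.

    transcripts: list of (transcript_id, sequence) tuples.
    tmp_dir: /dev/shm is tmpfs on most Linux systems, so the
    intermediate fasta and psl live in RAM and are deleted on exit.
    Returns the raw psl text for downstream parsing.
    """
    with tempfile.TemporaryDirectory(dir=tmp_dir) as d:
        fa = os.path.join(d, "gene.fa")
        psl = os.path.join(d, "gene.psl")
        with open(fa, "w") as fh:
            for tx, seq in transcripts:
                fh.write(f">{tx}\n{seq}\n")
        # blat database query output: align the gene against itself
        subprocess.run([blat_bin, fa, fa, psl],
                       stdout=subprocess.DEVNULL, check=True)
        with open(psl) as fh:
            return fh.read()
```

This keeps BLAT's speed and output format while removing the on-disk file churn, at the cost of still paying BLAT's per-invocation startup.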
I was simply wondering whether anyone else had thought about or implemented a solution, and which of the above the Biostars community thinks would be best?
Thanks for your time, Anthony
As @Devon says, IO is not the issue here. How are you doing this? Have you tried multiprocessing, i.e. dividing your list into chunks and passing them to a Python method via the multiprocessing module so they all run at the same time?
Because inevitably someone will try to literally run 20,000 processes at once, I'll note that the idea is to create a pool of workers and use the map-reduce functionality provided by the multiprocessing module to go through them.

Thanks :) I was just giving the idea of using multiprocessing, but sure, not to run 20,000 processes.
Thanks for the help, medhat and Devon. I am indeed using multiprocessing to split the job per gene, so that I run a BLAT instance for each gene in parallel (I am working with, say, 16 cores). I pool my jobs too; this was my first time using multiprocessing, so I hope I have it set up well:
where fnames is a list of fasta files (one for each gene) and worker is a small function that applies my code (including the BLAT part).
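For reference, a pool set up as described might look like the following sketch (`worker` here is a stand-in that just echoes its argument; the real per-gene BLAT logic is not shown):

```python
from multiprocessing import Pool

def worker(fname):
    """Per-gene job: run BLAT on one fasta and parse its psl (sketched).

    Placeholder body: echoes the filename so the pool wiring is visible.
    """
    return fname

def run_all(fnames, n_procs=16):
    # A pool of n_procs workers; map distributes fnames across them in
    # chunks, rather than spawning one process per gene.
    with Pool(processes=n_procs) as pool:
        return pool.map(worker, fnames, chunksize=32)
```

The key point, as Devon notes, is that the pool size is fixed (here 16) regardless of how many genes there are; `map` queues the 20,000 jobs onto those workers.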
I simply (perhaps naively) thought that making a fasta file per gene and producing a psl file was the slow part (writing and reading lots of small files), so I wanted to avoid that; but if you think it isn't too time-consuming, perhaps I cannot really optimise my code any more in that respect. From your experience, how much does the default BLAT printout slow things down? I.e. printing to screen "Loaded 1345 letters in 1 sequences" and "Searched 16629 bases in 8 sequences" 20,000 times. As far as I can see, there is no way around that in the BLAT options...
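On the printout question: if BLAT is launched via `subprocess`, its console chatter can be discarded without touching BLAT itself by redirecting stdout and stderr to `DEVNULL`. A sketch (the self-vs-self `blat fasta fasta psl` invocation is assumed; `run_blat_quiet` is an illustrative name):

```python
import subprocess

def run_blat_quiet(fasta, psl, blat_bin="blat"):
    """Run BLAT while discarding its progress messages.

    Redirecting stdout/stderr to DEVNULL avoids 20,000 writes to the
    terminal; the alignments still land in the psl file as usual.
    """
    subprocess.run([blat_bin, fasta, fasta, psl],
                   stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL,
                   check=True)
```

Terminal writes are usually cheap relative to the alignment itself, but when many parallel workers all print, silencing them is free and keeps logs readable.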
Again, thanks so much for your input
Are you really sure that IO is rate limiting here? We're likely talking about at most a few gigs, which isn't going to be a real bottleneck.