Question

popoolation parallelization problem

6.9 years ago

dovi ▴ 60

Hi all,

I am using popoolation on my pool-seq data to calculate Fst values with the fst-sliding.pl script. As the synchronized file is ~2 TB, I thought it would be better to parallelize the process. I have therefore split the synchronized file into small chunks, and I run multiple jobs (a maximum of 50) at the same time on a cluster. Each job should take ~1 day. However, when I start running multiple jobs, I see that only ~20 output files start being filled (the popoolation .fst output files grow in size), while the others remain empty until the first ones have finished. I know that all of the jobs actually started because I can see the .params file for each of them. I have no clue why this is happening. I thought maybe it is because all jobs call the same Perl script, and it somehow can't handle so many calls? Has anyone experienced this before, or does anyone have a clue how to solve it?

Thanks!

popoolation parallelize cluster • 2.1k views

ADD COMMENT • link 6.9 years ago by dovi ▴ 60

For those wondering (like I did), popoolation is an NGS data analysis pipeline for comparing population allele frequencies.

ADD REPLY • link 6.9 years ago by GenoMax 141k

Have you tried looking at the temporary files that popoolation generates or, better yet, at the code to see how it generates them?

ADD REPLY • link 6.9 years ago by LauferVA 4.2k

It doesn't seem to generate a temporary file. From what I see in the code, it prints the results directly into the output file.

ADD REPLY • link 6.9 years ago by dovi ▴ 60

Hmm. If the problem isn't with file handling in their scripts, my next most likely guesses are:

1) system resource requests on your cluster
2) file handling in your scripts
3) system resource allocation on your cluster

ADD REPLY • link 6.9 years ago by LauferVA 4.2k

"I thought maybe it is because all jobs call the same Perl script, and it somehow can't handle so many calls?"

Sounds unlikely. But perhaps the load on the server got too big?

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

The cluster uses the MOAB workload manager, which schedules and manages all jobs, so my understanding is that if a job is running, it is not overloading the server and can run properly (otherwise it stays as 'waiting'). Or is that not what you mean?

ADD REPLY • link 6.9 years ago by dovi ▴ 60

I'm not sure. I'm not familiar with popoolation or MOAB.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

I know nothing about popoolation, but are you sure you are able to do this sort of brute-force parallelization? You may want to ask the program authors just to be sure.

ADD REPLY • link 6.9 years ago by GenoMax 141k

Just following up on this thread: have you figured it out? I'm looking to do something similar. Can you explain how you implemented the parallelisation? Did you break up the sync file into chunks and submit individual jobs?

ADD REPLY • link 6.8 years ago by hern.moral • 0

Well, I think my problem was related to the cluster I was using. Sometimes jobs ran as expected (in terms of time), and sometimes they were 'delayed' and I had to run them again (as they exceeded 3 days, which is the time limit on my cluster). I did the parallelisation as you say: I split the sync file into chunks and then submitted several individual jobs.
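
The splitting step described above could be sketched roughly as follows. This is an illustration only, not the script used in this thread. It assumes the sync file is tab-separated with the reference contig in the first column, and it closes a chunk only when the contig changes, on the assumption that cutting in the middle of a contig would truncate the sliding windows that span the cut. The function name `split_sync_file`, the `chunk_NNNN.sync` naming, and the `target_lines` default are all hypothetical.

```python
from pathlib import Path


def split_sync_file(sync_path, out_dir, target_lines=1_000_000):
    """Split a sync file into chunk files, cutting only at contig boundaries.

    A new chunk starts when the contig (first column) changes AND the
    current chunk already holds at least target_lines lines, so a chunk
    can exceed target_lines if a single contig is large.
    Returns the list of chunk paths in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    out = None
    written = 0
    previous_contig = None

    with open(sync_path) as handle:
        for line in handle:
            contig = line.split("\t", 1)[0]
            contig_changed = contig != previous_contig
            if out is None or (contig_changed and written >= target_lines):
                if out is not None:
                    out.close()
                path = out_dir / f"chunk_{len(paths):04d}.sync"
                paths.append(path)
                out = open(path, "w")
                written = 0
            # Lines are streamed straight to disk, so memory use stays small.
            out.write(line)
            written += 1
            previous_contig = contig

    if out is not None:
        out.close()
    return paths
```

Each resulting chunk can then be given to its own fst-sliding.pl job. Because chunks end at contig boundaries, their sizes are uneven, so job run times will vary with contig length.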

ADD REPLY • link 6.8 years ago by dovi ▴ 60