Fastq file corruption when downloading from SRA Explorer using a SLURM job
3 months ago
biotrekker

Hello, I am trying to download fastq files (both single-cell and bulk RNA-seq) remotely. I first create a screen session and then use SLURM to start a job. Within that job I run the bash script generated by SRA Explorer (which I save as a bash script) to fetch the fastq files. It seems that some files come out corrupt: I can see this from downstream processing with other tools, or sometimes simply by checking file sizes. Each time I rerun the same SRA Explorer script, a DIFFERENT file is corrupted. When I individually run the curl command (from the bash script) for a corrupted fastq file, the file downloads properly.

What is going on? And what steps should I take to prevent this from happening?

Thank you

EDIT: Walltime used was >200 hours, which is sufficient for the job. It's not that the samples at the end were corrupt or incomplete; it seems to be a random sample each time.

SRAExplorer corruption rna-seq SLURM
3 months ago

My guess is that when you submit the SLURM job, you (accidentally?) specify a fairly short walltime and SLURM kills the job after a few minutes/hours, which leaves behind an incomplete or corrupt file. That would explain why the download works fine outside of SLURM. You can check how much of your walltime was used with the seff command: seff JOBID. It will probably say that the job used something like 2 out of 2 allocated hours. The solution would be to request a longer walltime when submitting the SLURM job, or perhaps to submit one SLURM job per file to download.
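As a sketch of the one-job-per-file idea (the file names, array size, and walltime here are hypothetical, not from the original post), a SLURM job array gives each download its own walltime budget, so one slow transfer can't starve the rest:

```shell
#!/bin/bash
#SBATCH --job-name=sra-dl
#SBATCH --time=24:00:00            # generous walltime per file
#SBATCH --array=1-48               # one array task per download URL
#SBATCH --output=sra-dl_%A_%a.log

# urls.txt is a hypothetical file with one download URL per line,
# extracted from the SRA Explorer script.
URL=$(sed -n "${SLURM_ARRAY_TASK_ID}p" urls.txt)
curl -fL --retry 5 -O "$URL"
```

Afterwards, `seff JOBID` (or `seff JOBID_TASKID` for a single array task) reports how much of the allocated walltime each task actually used.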


Seconding this. A downside of tools like curl and wget is that they write directly to the final output file. That means if the output is foo.fq.gz, they write to that file and it simply grows until the download finishes. Hence, if the download fails prematurely, the file is present but incomplete. Without an md5sum this is hard to diagnose ad hoc. So, what you can do is:

1) Check your SLURM logs. There should be a log output file per job, and if you got timed out there will be a message for that, so you know which job went incomplete.

2) Instead of curl or wget, use the Aspera download links from sra-explorer.info. Aspera only creates the visible final output file if the download finished successfully. For setup, see: Setting up Aspera Connect (ascp) on Linux and macOS

3) Not sure I should recommend this, but I usually run downloads (ones that finish within minutes, not downloads that take hours) on the head node. It doesn't consume notable memory or CPU, and I don't get automatically killed on our cluster for doing it. It's probably bad advice because the head node should be taboo, but as long as it works... so what. But anyway, better to try the other suggestions first.
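You can also get curl closer to Aspera's "only appears when complete" behaviour yourself. A minimal sketch (the wrapper name and the checksum file are hypothetical): download to a temporary name and rename only on success, so a killed job never leaves behind a file that looks finished; then verify against md5sums where the archive provides them (ENA publishes them for fastq files):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Download to "<name>.part" and rename only on success. mv is atomic on
# the same filesystem, so an interrupted job never leaves a file under
# the final name that looks complete but isn't.
fetch () {
  local url=$1 out=$2
  curl -fSL --retry 5 -o "${out}.part" "$url"
  mv "${out}.part" "$out"
}

# Given a checksum list ("checksums.md5", hypothetical, with lines of
# "<md5>  <filename>"), print the files that fail verification:
# md5sum -c checksums.md5 | awk -F': ' '$2 == "FAILED" {print $1}'
```

With this pattern, any file that exists under its final name is known to have downloaded to completion; partial transfers are visible as leftover `.part` files.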


I think running pure I/O jobs on login nodes is OK on most clusters, and there isn't any advantage in moving the download to a compute node. Some clusters explicitly state that the login nodes are for compiling, moving data, etc. So I think it's OK to use nohup to download raw data.
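For completeness, the nohup pattern being discussed looks like this (the script name is hypothetical):

```shell
# Run the SRA Explorer script detached from the terminal: output goes to
# a log file, and the download survives logging out of the login node.
nohup bash sra_download.sh > download.log 2>&1 &
echo "started download as PID $!"
```

`tail -f download.log` then lets you watch progress without being attached to the process, so an accidental Ctrl-C only stops the viewer, not the download.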


GNU screen over nohup all day :)


I guess it's all a matter of taste; running plain nohup is more "conservative". It wouldn't be the first time I cancelled a long-running process by accidentally pressing Ctrl-C to "get out" of the "log viewer", or forgot to turn on logging; but then again there are always pros and cons: https://stackoverflow.com/questions/20766300/nohup-vs-screen-which-is-better-for-long-running-process

The only case where one must use screen, in my understanding, is if you have to interact with the process later on, like entering passwords and so on.

