Bulk download of entire BioProject SRA
5.3 years ago
Anand Rao ▴ 630

I am trying to download the entire dataset for a BioProject using esearch and efetch from the Entrez Utilities.

My command is based on the one posted by @Istvan Albert in "C: How to download raw sequence data from GEO/SRA", which is:

esearch -db sra -query PRJNA40075 | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -5 | xargs fastq-dump -X 10 --split-files

For the BioProject I am interested in, PRJNA269201, the slightly truncated command shown below creates 144 empty files, as expected:

esearch -db sra -query PRJNA269201  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | xargs touch

However, when I try the full-length syntax, it behaves differently from what I expected under both scenarios 1 and 2 detailed below:

Scenario 1. On head-node of a cluster:

esearch -db sra -query PRJNA269201  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -2 | xargs fastq-dump --split-files

One file finished downloading, but it is 5.5G, which is far larger than the 1.2GB I expected based on the info at this link. Is this difference due to file compression? How can I download a more compressed version, both for storage and for downstream RNA-Seq analyses?

-rw-rw-r-- 1 aksrao aksrao 1.1G Jan 19 19:47 SRR1726554_1.fastq

-rw-rw-r-- 1 aksrao aksrao 5.5G Jan 19 19:44 SRR1726553_1.fastq

Scenario 2. When I submit this as a shell script (SLURM queue management on an Ubuntu cluster), the STDERR stream captures the following error message:

2019-01-20T02:28:55 fastq-dump.2.8.2 err: param empty while validating argument list - expected accession

The same problem was reported on the original post by user @bandanaschapagain, but it does not appear to have been answered or resolved, so I am posting it afresh. Could someone please help me? Thank you!

SRA NCBI sratoolkit entrez utilities BioProject • 9.3k views
5.2 years ago

Download the RunInfo table and use parallel to download multiple files at once.

#!/bin/bash
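# Change the number after -j to set how many files are processed in parallel.
# cut -f5 assumes the run accessions are in the fifth tab-separated column of SraRunTable.txt.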
parallel --verbose -j 20 prefetch {} ::: $(cut -f5 SraRunTable.txt ) >>sra_download.log
wait
parallel --verbose -j 20 fastq-dump --split-files {} ::: $(cut -f5 SraRunTable.txt ) >>sra_dump.log
wait
exit
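
The script above picks the accessions by column position (cut -f5), which breaks silently if the table layout differs. As a minimal sketch, assuming the table has a header row with a column named "Run" (the helper name run_column is hypothetical, and quoted fields containing the delimiter are not handled), the column can be looked up by name instead:

```shell
#!/bin/bash
# Hypothetical helper: print every value of the column whose header is "Run".
# Usage: run_column FILE [DELIMITER]
# DELIMITER defaults to a tab; pass ',' for a comma-separated table.
run_column() {
    local file=$1
    local delim=${2:-'\t'}   # awk interprets the two characters \t as a tab
    awk -F"$delim" '
        # First line: remember which field is named "Run".
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Run") col = i; next }
        # Remaining lines: print that field when it is present and non-empty.
        col && $col != "" { print $col }
    ' "$file"
}
```

The output can then replace the cut call, for example: parallel --verbose -j 20 prefetch {} ::: $(run_column SraRunTable.txt). If no "Run" header is found, the function prints nothing, so it is worth checking the output before starting the downloads.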

I would always avoid using fastq-dump to load files directly from the SRA, as it tends to be unstable. It is better to download the SRA files to disk with prefetch and then run fastq-dump on them, assuming the data are not available directly in FASTQ format from the ENA.
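
A minimal sketch of that two-step approach, one accession at a time. The function name fetch_and_dump and the DRY_RUN switch are hypothetical, and it assumes fastq-dump resolves the accession to the copy that prefetch just downloaded, which depends on how your SRA Toolkit is configured:

```shell
#!/bin/bash
# Hypothetical helper: download each run with prefetch, then convert it to FASTQ.
# Set DRY_RUN=1 to print the commands instead of running them.
fetch_and_dump() {
    local acc
    for acc in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "prefetch $acc"
            echo "fastq-dump --split-files --gzip $acc"
        else
            # Convert only if the download succeeded.
            prefetch "$acc" && fastq-dump --split-files --gzip "$acc"
        fi
    done
}
```

For example, DRY_RUN=1 fetch_and_dump SRR1726553 SRR1726554 prints the four commands without touching the network, which is a cheap way to check the accession list first.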


No need for wait. GNU Parallel does that for you.

5.2 years ago

I don't think you're doing anything wrong; the first run (SRR1726554) matches what's on SRA (1.1G). I downloaded SRR1726553 myself and also got a .fastq file of 5.6G. It could be that the SRA metadata is wrong; I would contact them and ask for more information (e-mail at sra@ncbi.nlm.nih.gov).

You can get a compressed version by passing --gzip to your fastq-dump calls. Most aligners will accept gzipped fastq files as input. My full fastq-dump command is:
fastq-dump $SRA_FILE --outdir $SRA_DIR --gzip --skip-technical --readids --dumpbase --split-files --clip

I would recommend reading the Edwards lab's fastq-dump article to learn more about some useful options.

For the error message in Scenario 2, I suspect your accession is not being passed correctly (based on the "expected accession" part). Maybe add some print calls to your shell script (for example, wrap the fastq-dump part in an echo and write the output to a file) to see what command it is actually trying to execute. The error message seems rather rare, so it may be worth asking SRA support about that too.
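
One way to do that check is sketched below. Both function names are hypothetical. The idea rests on an assumption that is not confirmed in this thread: that the esearch/efetch part of the pipeline returns nothing inside the batch job. GNU xargs still runs its command once when it receives empty input (unless -r is given), so fastq-dump would then be called with no accession at all.

```shell
#!/bin/bash
# Hypothetical helper: read a RunInfo CSV on stdin and print the SRR accessions
# from the first column (the header line is dropped by the grep).
runinfo_to_accessions() {
    cut -d ',' -f 1 | grep '^SRR'
}

# Hypothetical helper: print the fastq-dump command that would be run for each
# accession listed in a file, and fail loudly if the file is missing or empty.
preview_dump_commands() {
    local list=$1
    if [ ! -s "$list" ]; then
        echo "no accessions in $list: the esearch/efetch step returned nothing" >&2
        return 1
    fi
    xargs -n 1 echo fastq-dump --split-files < "$list"
}

# Intended use inside the batch script (left commented out; needs network access):
# esearch -db sra -query PRJNA269201 | efetch --format runinfo | runinfo_to_accessions > accessions.txt
# preview_dump_commands accessions.txt > commands.log
```

If accessions.txt turns out to be empty on the compute node but not on the head node, the problem is upstream of fastq-dump (for example, network access or PATH on the compute nodes) rather than in the fastq-dump call itself.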
