Downloading SAM/BAM files from SRA takes ages!
4
1
6.7 years ago

Dear all,

I have tried

 sam-dump SRR3330607 | samtools view -bS - > 42RF.bam

and

 sam-dump SRR3330607 > SRR3330607.sam

Both are taking hours to download. Is there a more efficient way to download these files from SRA?

I want to use the bam files, and I have many other samples to download.

Thanks for your help. Alejandro

SRA bam • 9.1k views
ADD COMMENT
0

Are you able to use FASTQ files (which you can then align yourself)? If so, get them from EBI-ENA (example).
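In case it helps, here is a rough sketch of how one might look up the FASTQ locations for a run through the ENA Portal API. The endpoint and field names reflect my understanding of ENA's public "filereport" service, and `fastq_urls_from_report` is a hypothetical helper of my own, so please check both against the ENA documentation before relying on them.

```shell
#!/bin/sh
# Build the ENA Portal API "filereport" URL for one run accession.
# Assumption: the endpoint and the field names below match ENA's current API.
ena_filereport_url() {
  printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=run_accession,fastq_ftp,fastq_md5\n' "$1"
}

# Hypothetical helper: read the tab-separated report on stdin and print one
# ftp:// URL per FASTQ file. Assumes fastq_ftp is column 2 and that multiple
# files are separated by semicolons.
fastq_urls_from_report() {
  awk -F'\t' 'NR > 1 && $2 != "" {
    n = split($2, a, ";")
    for (i = 1; i <= n; i++) print "ftp://" a[i]
  }'
}

# Usage (needs network access, so it is left commented out):
# curl -fsS "$(ena_filereport_url SRR3330607)" | fastq_urls_from_report | xargs -n 1 curl -fsSLO
```

The report also carries MD5 sums in the third column, which are worth checking after a large download.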

ADD REPLY
0

It's quite likely that what you're getting is an unaligned BAM file, which is largely useless.
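One quick way to test this without waiting for the whole dump is to count mapped versus unmapped records in the first part of the SAM stream. This is only a sketch: `count_mapped` is a name I made up, and it relies on nothing more than bit 0x4 of the SAM FLAG column marking a read as unmapped.

```shell
#!/bin/sh
# Count mapped vs. unmapped records in SAM text read from stdin.
# A record is unmapped when bit 0x4 of the FLAG field (column 2) is set.
count_mapped() {
  awk -F'\t' '
    /^@/ { next }                                   # skip header lines
    { if (int($2 / 4) % 2 == 1) unmapped++; else mapped++ }
    END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
  '
}

# Usage (needs network access, so it is left commented out):
# sam-dump SRR3330607 | head -n 100000 | count_mapped
```

If nearly everything comes back unmapped, downloading FASTQ and aligning yourself is probably the better route.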

ADD REPLY
0

You can easily check for alignment information in the SRA Run Browser.

I had a conversation with the SRA team once where they explained to me that they really optimized SRA for generating FASTQ and running BLAST queries and not for generating SAM. I’ve noticed that SAM dump is usually slower than I’d like, but if you're truly getting alignment info I’m sure you’re saving time over aligning the FASTQ.

ADD REPLY
0

Nice, I'd never noticed the alignment window there before (it's unfortunate that this dataset used NCBI "chromosome names"). I guess I've never been trusting enough of what other people did to want to actually use their alignments...

ADD REPLY
0

Maybe you can try aspera.

ADD REPLY
1
6.7 years ago

To download SRA files I always use ascp; there's a manual here.

It's ridiculously fast (the example command has a bandwidth request of 100 Mb/s, but I've used 400 Mb/s before; it depends on your local setup). You can then dump the FASTQ from the downloaded .sra file using the toolkit's fastq-dump --split-3.

ADD COMMENT
0

I've been looking for a way to increase the bandwidth beyond the usual 100 Mb/s. How did you do it?

ADD REPLY
1

Do you have more than 100Mb/s available? Aspera will happily use all the bandwidth it can lay its hands on (up to 10 Gbps) as long as the source supports it (NCBI does).

ADD REPLY
2

I once broke the university's connection to the backbone when downloading many files simultaneously with unthrottled ascp.
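To avoid that kind of accident, it may be worth always passing an explicit target rate with `-l`. Below is a small dry-run sketch that only prints the command; `build_ascp_cmd` is a hypothetical wrapper, and the key path and remote source in the example call are placeholders rather than real locations.

```shell
#!/bin/sh
# Print (rather than run) an ascp command with an explicit target-rate cap.
#   -i    private key file
#   -k 1  resume partially transferred files
#   -T    disable encryption for higher throughput
#   -l    maximum target rate, e.g. 100m or 400m
# Arguments: key file, rate, remote source, local destination.
build_ascp_cmd() {
  printf 'ascp -i %s -k 1 -T -l %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Example with placeholder values (hypothetical key path and remote source).
build_ascp_cmd /path/to/asperaweb_id_dsa.openssh 200m user@host:/path/to/run.sra ./downloads/
```

When running several transfers at once, the caps add up, so divide the bandwidth you are allowed to use by the number of parallel jobs.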

ADD REPLY
0

Uni of Sheffield?

ADD REPLY
0

No, Uni where I was a postdoc.

ADD REPLY
0

If you have aspera installed on your system, newer versions of SRA toolkit's prefetch command will automatically do an ascp transfer: https://github.com/ncbi/sra-tools/wiki/HowTo:-Access-SRA-Data

ADD REPLY
1

My biggest problem with this is that, at least for FASTQ data, the rate-limiting step is generally the dump rather than the actual download.

ADD REPLY
0

Nothing you can do about it. If you have access to an SSD, it will speed things up, but fastq-dump will always be slow, especially on GPFS, where random access slows the system down a lot. If your file system is already slow in the first place, you will have a hard time. See if the data are mirrored at the European Nucleotide Archive (ENA), which also supports Aspera download of FASTQ instead of SRA.

ADD REPLY
0

I don't think random access is the problem. The SRA format is a column-oriented database, so there should be very little seeking when you're dumping FASTQ. I think the slowdown when dumping SAM format comes from the SAM fields (alignment information) not being retrieved as efficiently as the sequence and quality scores.

ADD REPLY
0

The solution is to use ENA rather than SRA. Everything apart from the controlled-access data is mirrored across, and ENA stores the raw FASTQ, which can be downloaded directly with ascp.
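For what it's worth, here is a sketch of how the ENA FASTQ directory for a run can be derived from its accession. The directory layout is my understanding of how ENA organises `vol1/fastq`, and `ena_fastq_dir` is a hypothetical helper; the safer option is to read the Aspera paths from the run's ENA file report and use this only as a convenience.

```shell
#!/bin/sh
# Derive the ENA fastq directory for a run accession.
# Assumed layout: vol1/fastq/<first 6 chars>/[<zero-padded digits after the 9th char>/]<accession>
ena_fastq_dir() {
  printf '%s\n' "$1" | awk '{
    prefix = substr($0, 1, 6)
    extra  = substr($0, 10)              # digits beyond the 9th character, if any
    if (extra == "") {
      printf "vol1/fastq/%s/%s\n", prefix, $0
    } else {
      padded = "00" extra                # left-pad to three characters
      printf "vol1/fastq/%s/%s/%s\n", prefix, substr(padded, length(padded) - 2), $0
    }
  }'
}

# Usage (needs network access and an Aspera key, so it is left commented out;
# the _1/_2 file names assume a paired-end run):
# ascp -QT -l 300m -P 33001 -i /path/to/asperaweb_id_dsa.openssh \
#   "era-fasp@fasp.sra.ebi.ac.uk:$(ena_fastq_dir SRR3330607)/SRR3330607_1.fastq.gz" .
```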

ADD REPLY
0

Is there a solution to this though?

ADD REPLY
1

You can dump “spots” 1 through n using one process, and n+1 through k using another process. Basically, run fastq-dump on the same SRA archive several times, with each process exporting a different chunk of the FASTQ file. This will scale until you run out of disk I/O or CPU threads.
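A minimal sketch of that idea, assuming you already know the total number of spots (the run browser shows it). `spot_ranges` is a hypothetical helper that only computes non-overlapping ranges; the actual dump relies on fastq-dump's `-N` (first spot) and `-X` (last spot) options.

```shell
#!/bin/sh
# Print "first last" spot ranges that split spots 1..total into at most n
# non-overlapping chunks.
spot_ranges() {
  awk -v total="$1" -v n="$2" 'BEGIN {
    size = int((total + n - 1) / n)      # ceiling division
    for (first = 1; first <= total; first += size) {
      last = first + size - 1
      if (last > total) last = total
      print first, last
    }
  }'
}

# Usage (needs the SRA toolkit and a local .sra file, so it is left commented out).
# Each chunk goes to its own directory because the output file names are identical.
# spot_ranges "$TOTAL_SPOTS" 4 | xargs -P 4 -n 2 sh -c \
#   'fastq-dump -N "$1" -X "$2" --split-files -O "chunk_$1" SRR3330607.sra' _
```

Afterwards the per-chunk files need to be concatenated in spot order, once for each read file.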

ADD REPLY
0

I never thought of that.... what a great idea!

ADD REPLY
0

Did you check if the data are mirrored at the ENA?

ADD REPLY
1
6.2 years ago
sutturka ▴ 190

I would like to share my twists and turns with SRA download. It took several interactions with NCBI staff to figure this out. Below are the steps:

Prerequisite:
sra-toolkit and the Aspera plugin installed. The instructions are specific to a Linux environment.

Steps:

  1. Configure workspace. A cache workspace for the downloaded SRA data must be configured. Although SRA-toolkit is installed centrally, this needs to be set manually for every user. Please follow this link and navigate to the section Configuring the Toolkit.

Assuming sra-toolkit is installed or loaded, run the following command and complete setup as mentioned in the link.

vdb-config -i

  2. Download SRA file

prefetch -X 200G SRR2095320 -a "/depot/bioinfo/apps/apps/aspera-connect-3.1.1.70545/bin/ascp|/depot/bioinfo/apps/apps/aspera-connect-3.1.1.70545/etc/asperaweb_id_dsa.putty"

where "-a" specifies the paths to the Aspera binary and the private key file. Prefetch will download the SRA data as well as all needed references to your local cache. This prevents sending multiple requests to NCBI servers and saves substantial time.

  3. Demultiplex with sra-toolkit

fastq-dump -I --split-files ./SRR2095320.sra

I found the above approach extremely fast. With it, the 44 GB SRR2095320.sra file was prefetched and converted to FASTQ (151 GB R1 + 151 GB R2) in about 25 hours using 10 processors. By contrast, the standard fastq-dump alone had converted only 80 GB R1 + 80 GB R2 after more than 70 hours, and the job then failed because of a connection issue.

ADD COMMENT
0
6.7 years ago
smho ▴ 40

If you are willing to download the FASTQ files instead and run the alignment yourself, here is a nice tutorial for using fastq-dump correctly:

https://edwards.sdsu.edu/research/fastq-dump/

You can even choose to download as FASTA (without quality scores) with the --fasta option to reduce the download volume; however, that's probably not recommended! :-)

ADD COMMENT
0
5.5 years ago
shengwei ▴ 40

Check here for an example using Aspera Connect (ascp):

  1. To download

    prefetch --max-size 100G --transport ascp --ascp-path "/path/to/aspera/3.6.2/bin/ascp|/path/to/aspera/3.6.2/etc/asperaweb_id_dsa.openssh" --output-file $OUTPUT_DIR/$SRA_ID.sra $SRA_ID

  2. To extract

    fastq-dump --split-files --origfmt --gzip $OUTPUT_DIR/$SRA_ID.sra
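Since the original question mentions many samples, here is a hedged sketch for batching the two steps over a list of accessions. It only prints the commands so they can be inspected first; `batch_commands` and the `accessions.txt` file name are hypothetical, and the flags are limited to ones that already appear in this thread plus fastq-dump's `--outdir`.

```shell
#!/bin/sh
# Emit one "prefetch, then fastq-dump" command line per accession read from stdin.
# Blank lines and lines starting with # are skipped.
# Argument: output directory used for both the .sra file and the FASTQ files.
batch_commands() {
  outdir="$1"
  while IFS= read -r acc; do
    case "$acc" in ''|'#'*) continue ;; esac
    printf 'prefetch --max-size 100G --output-file %s/%s.sra %s && fastq-dump --split-files --gzip --outdir %s %s/%s.sra\n' \
      "$outdir" "$acc" "$acc" "$outdir" "$outdir" "$acc"
  done
}

# Usage (dry run first, then pipe to sh once the commands look right):
# batch_commands sra_out < accessions.txt
# batch_commands sra_out < accessions.txt | sh
```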

ADD COMMENT
1

This solution has been posted and upvoted in this thread twice already ;-)

ADD REPLY
