How to download FASTQ files from the European Nucleotide Archive (ENA) to use them with FastQC etc.
6 months ago
sil_bioinfo ▴ 40

Hello,

I would like to download FASTQ files from the European Nucleotide Archive (ENA) to use them with FastQC, kallisto, etc. In particular, this project: https://www.ebi.ac.uk/ena/browser/view/PRJEB31975 Since it's a huge amount of data, how could I do this if I don't think I can even download it to my PC? (I'm using a Mac.)

Thank you in advance!

ENA fastq R FastQC python • 1.3k views
6 months ago
ATpoint 82k

The easiest way is to enter the accession (PRJEB...) at https://sra-explorer.info/ and get download links that you can then fetch with something like wget.

This returns the list below; save it as files.txt and then use either a loop or something like GNU parallel.

cat files.txt | parallel -j 2 "wget {}"

This will download 2 files at a time. Increase that if you have the I/O and bandwidth.

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR325/001/ERR3258091/ERR3258091.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR325/002/ERR3258092/ERR3258092.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR325/006/ERR3258096/ERR3258096.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR325/000/ERR3258090/ERR3258090.fastq.gz
(...)
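
If GNU parallel is not installed (it is not part of a default macOS install), roughly the same thing can be sketched with xargs, whose -P option exists in both the BSD and GNU versions. This is only a sketch: the function name download_parallel is made up for illustration, and it assumes wget is available and that files.txt holds one URL per line.

```shell
# Hypothetical helper: download every URL in a list, N at a time.
# Assumes wget and an xargs that supports -P (BSD and GNU xargs both do).
download_parallel() {
    list="$1"          # text file with one URL per line
    jobs="${2:-2}"     # number of simultaneous downloads (default: 2)
    # -n 1 passes one URL to each wget call; -c resumes partial files.
    xargs -n 1 -P "$jobs" wget -c < "$list"
}

# Usage (not run here):
#   download_parallel files.txt 2
```

The second argument plays the same role as -j in the parallel command.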

Hello,

Thank you for your quick response. I have the files.txt now (attached image), but I'm not sure what this means:

This will download 2 files at a time. Increase that if you have the I/O and bandwidth.

Could you explain a bit, please? Thank you in advance!


The -j option of parallel sets how many jobs run at the same time. Downloading many files at once only makes sense if you have a fast internet connection and a hard drive that can keep up with the incoming writes. Otherwise, just download sequentially and wait.
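
For the sequential case, a plain loop over the list is enough. A minimal sketch, assuming files.txt has one URL per line and wget is installed; download_sequential is a made-up name:

```shell
# Hypothetical helper: download the URLs in a list one after another.
download_sequential() {
    list="$1"                      # text file with one URL per line
    while IFS= read -r url; do
        [ -n "$url" ] || continue  # skip empty lines
        wget -c "$url"             # -c resumes a partially downloaded file
    done < "$list"
}

# Usage (not run here):
#   download_sequential files.txt
```

This behaves like running the parallel command with -j 1.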


Understood, thank you so much for your help!!


There is a total of 580+ GB of raw data associated with this accession: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJEB31975&o=acc_s%3Aa

Make sure you have enough space/bandwidth before trying to download the entire set.
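
A quick way to check the space part before starting is to compare the free space reported by df with what you expect to need. The sketch below is one possible way to do it; have_space_gb is a made-up helper, and the 580 in the usage line is just the figure quoted above.

```shell
# Hypothetical helper: succeed if DIR has at least NEEDED_GB gigabytes free.
have_space_gb() {
    needed_gb="$1"
    dir="${2:-.}"      # directory to check (default: current directory)
    # df -P prints a stable POSIX layout; -k reports sizes in 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    avail_gb=$((avail_kb / 1024 / 1024))
    echo "Available in $dir: ${avail_gb} GB (needed: ${needed_gb} GB)"
    [ "$avail_gb" -ge "$needed_gb" ]
}

# Usage (not run here):
#   have_space_gb 580 . || echo "Not enough free space for the full set"
```

Keep in mind that downstream tools write their own output too, so the real requirement is larger than the raw download.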


So there is no way to use this kind of dataset without downloading it to the PC, right? (Sorry if this is a very silly question.)


It is a valid question. To analyze the raw sequence data you do need to download it, either locally, to a cloud computing environment, or to a compute resource at your institution. There was a web-based tool that allowed direct use of SRA accessions, but the name of the tool/lab escapes me right now. If I remember or find it, I will post it.

This appears to be a lot of samples. If you just want to learn, you don't need to download all of the data. A couple of samples would be fine for FastQC/kallisto/Salmon.
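
Starting with a couple of samples can be as simple as trimming the link list before downloading. A sketch, assuming a files.txt with one URL per line; take_first_samples is a made-up name. The links shown in this thread have one file per run; for paired-end data you would need to keep both files of each run.

```shell
# Hypothetical helper: keep only the first N download links from a list.
take_first_samples() {
    list="$1"        # full list of URLs, one per line
    n="${2:-2}"      # how many links to keep (default: 2)
    head -n "$n" "$list"
}

# Usage (not run here):
#   take_first_samples files.txt 2 > subset.txt
```

The resulting subset.txt can then be downloaded in place of the full files.txt.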


I'm going to use the CESGA (supercomputing center) services for this, but since it's my first time doing this type of thing, I have a lot of questions. I think I will start with a few samples. Thank you so much for your help.


It depends on what you mean by "use". You could download each FASTQ file, process it, and delete it once the processing is done.
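
As a rough sketch of that download, process, delete pattern, here with FastQC as the processing step: process_and_delete and the default output directory name are made up for illustration, and wget and fastqc are assumed to be on the PATH.

```shell
# Hypothetical helper: download one FASTQ, run FastQC on it, then delete it.
process_and_delete() {
    url="$1"
    outdir="${2:-fastqc_results}"   # where the FastQC reports are kept
    file=$(basename "$url")
    mkdir -p "$outdir"
    # Only delete the FASTQ if both the download and FastQC succeeded.
    wget -c "$url" &&
        fastqc --outdir "$outdir" "$file" &&
        rm -f "$file"
}

# Usage (not run here): process every link in files.txt, one at a time.
#   while IFS= read -r url; do process_and_delete "$url"; done < files.txt
```

Only one FASTQ sits on disk at any time, at the cost of downloading again if you later need the raw reads for another step.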


By "use" I mean that I want to build an ML model with this dataset to correctly classify aTB and LTBI patients. Since preprocessing the raw data (fastqc --> kallisto --> ...) involves a huge amount of data, I was wondering how to go about it. I think I will start with a few samples. Thank you so much for your help.

6 months ago

Use nf-core/fetchngs: https://nf-co.re/fetchngs/1.11.0
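
A minimal sketch of what a run could look like, assuming Nextflow and Docker are installed. The wrapper name run_fetchngs, the file name ids.csv, and the output directory are made up for illustration; the two run accessions are taken from the links earlier in the thread, so only those runs would be fetched and not the whole project.

```shell
# Write a small ID list for fetchngs: one accession per line.
printf '%s\n' "ERR3258091" "ERR3258092" > ids.csv

# Hypothetical wrapper; assumes Nextflow and Docker are installed.
run_fetchngs() {
    nextflow run nf-core/fetchngs -r 1.11.0 -profile docker \
        --input ids.csv --outdir fetchngs_results
}

# Usage (not run here):
#   run_fetchngs
```

Check the pipeline documentation at the link above for the parameters and profiles that match your setup, especially on a shared cluster.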
