Question

How to download FASTQ files from the European Nucleotide Archive (ENA) for use with FastQC, etc.

1

6 months ago

sil_bioinfo ▴ 40

Hello,

I would like to download FASTQ files from the European Nucleotide Archive (ENA) to use them with FastQC, kallisto, etc. In particular, I am interested in https://www.ebi.ac.uk/ena/browser/view/PRJEB31975 (project PRJEB31975). Since it is a huge amount of data, how can I work with it if I don't think I can even download it to my PC? (I'm using a Mac.)

Thank you in advance!

ENA fastq R FastQC python • 1.3k views

1

6 months ago

Pierre Lindenbaum 161k

Use nf-core/fetchngs: https://nf-co.re/fetchngs/1.11.0
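A minimal sketch of how that could look on the command line, assuming Nextflow and a container engine are already installed. The helper names write_ids and run_fetchngs, the file name ids.csv, and the output directory are illustrative and not part of the pipeline; check the flags against the fetchngs documentation for the version you run.

```shell
# Write one accession per line into an ID list file.
# Usage: write_ids ids.csv PRJEB31975
write_ids() {
    wi_file=$1
    shift
    printf '%s\n' "$@" > "$wi_file"
}

# Launch the pipeline on that list.
# $1 = ID list, $2 = output directory, $3 = profile (defaults to docker here)
run_fetchngs() {
    nextflow run nf-core/fetchngs -r 1.11.0 \
        -profile "${3:-docker}" \
        --input "$1" \
        --outdir "$2"
}
```

For example, write_ids ids.csv PRJEB31975 followed by run_fetchngs ids.csv fastq_out singularity would be a plausible choice on a cluster without Docker.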

score 4 · Accepted Answer · 2023-11-17

4

6 months ago

ATpoint 82k

The easiest way is to enter the accession (PRJEB...) at https://sra-explorer.info/ and get download links that you can fetch with something like wget.

This returns the list shown further down. Save it as files.txt and then use either a loop or something like GNU parallel:

cat files.txt | parallel -j 2 "wget {}"

This will download 2 files at a time. Increase that if you have the I/O and bandwidth.

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR325/001/ERR3258091/ERR3258091.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR325/002/ERR3258092/ERR3258092.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR325/006/ERR3258096/ERR3258096.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR325/000/ERR3258090/ERR3258090.fastq.gz
(...)
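If GNU parallel is not installed, xargs can do a similar job. A rough sketch, assuming files.txt holds one URL per line and that wget is available; fetch_parallel is a made-up helper name, and any other downloader can be passed after the list file.

```shell
# Run up to N downloads at once with xargs instead of GNU parallel.
# $1 = number of simultaneous jobs, $2 = URL list, rest = downloader command
fetch_parallel() {
    fp_jobs=$1
    fp_list=$2
    shift 2
    # Fall back to "wget -c" (resume partial files) when no command is given.
    [ "$#" -gt 0 ] || set -- wget -c
    # Keep only lines that look like URLs, then hand them out one per job.
    grep -E '^(ftp|https?)://' "$fp_list" | xargs -n 1 -P "$fp_jobs" "$@"
}
```

Calling fetch_parallel 2 files.txt would then mirror the parallel -j 2 command above.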

0

Hello,

Thank you for your quick response. I have files.txt now (see attached image), but I'm not sure what this means:

This will download 2 files at a time. Increase that if you have the I/O and bandwidth.

Could you explain a bit, please? Thank you in advance!

[attached image]

6 months ago by sil_bioinfo ▴ 40

1

The -j option of parallel sets how many jobs run at the same time. Downloading many files at once only makes sense if you have a good internet connection and a hard drive that can sustain the write throughput. Otherwise, just download sequentially and wait.
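For the sequential route, a plain loop is enough. A small sketch, assuming files.txt contains one URL per line and wget is installed; fetch_sequential is just an illustrative name, and the downloader can be swapped by passing another command after the list file.

```shell
# Download the URLs in a list one after another.
# $1 = URL list, rest = downloader command (defaults to "wget -c")
fetch_sequential() {
    fs_list=$1
    shift
    [ "$#" -gt 0 ] || set -- wget -c
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r fs_url || [ -n "$fs_url" ]; do
        case $fs_url in
            ftp://*|http://*|https://*) "$@" "$fs_url" ;;
            *) ;;  # skip blank lines and anything that is not a URL
        esac
    done < "$fs_list"
}
```

Running fetch_sequential files.txt fetches one file at a time; the -c flag lets wget resume a partial file if the loop is restarted.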

6 months ago by ATpoint 82k

0

Understood, thank you so much for your help!!

6 months ago by sil_bioinfo ▴ 40

1

There is a total of 580+ GB of raw data associated with this accession: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJEB31975&o=acc_s%3Aa

Make sure you have enough space/bandwidth before trying to download the entire set.
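One way to check the space side before starting, sketched with standard df and awk. The function names free_gb and enough_space are invented for this example, and the number you compare against is whatever total you expect to download.

```shell
# Print the free space (whole GB, rounded down) of the filesystem
# that holds the given directory (defaults to the current one).
free_gb() {
    # -P forces one line per filesystem, -k reports 1024-byte blocks;
    # column 4 of the second line is the available space.
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if at least $1 GB are free in directory $2.
enough_space() {
    [ "$(free_gb "${2:-.}")" -ge "$1" ]
}
```

For instance, enough_space 600 . || echo "not enough room here" would flag a disk that cannot hold the full set mentioned above.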

6 months ago by GenoMax 142k

0

So there is no way to use this kind of dataset without downloading it to the PC, right? (Sorry if this is a very silly question.)

6 months ago by sil_bioinfo ▴ 40

1

It is a valid question. To analyze the raw sequence data you do need to download it, either locally, to a cloud computing environment, or to a compute resource at your institution. There was an online web-based tool that allowed the use of SRA accessions. Unfortunately, the name of the tool/lab escapes me for now. If I recall or find it, I will post it.

This appears to be a lot of samples. If you just want to learn, you don't need to download all of the data. A couple of samples would be fine for FastQC/kallisto/Salmon.
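Picking a couple of samples can be as simple as trimming the link list before downloading. A sketch, assuming the links were saved as files.txt with one URL per line; first_n_urls and subset.txt are illustrative names.

```shell
# Print the first N download links from a URL list.
# $1 = how many links to keep, $2 = URL list
first_n_urls() {
    # Drop anything that is not a URL, then keep the first N lines.
    grep -E '^(ftp|https?)://' "$2" | head -n "$1"
}
```

Something like first_n_urls 2 files.txt > subset.txt gives a short list that can be downloaded instead of the full one.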

6 months ago by GenoMax 142k

0

I'm going to use the services of CESGA (a supercomputing center) for this, but since it's my first time doing this kind of thing, I have a lot of questions. I think I will start with a few samples. Thank you so much for your help!

6 months ago by sil_bioinfo ▴ 40

0

Depends on what you mean by "use". You could download each FASTQ file, process it, and delete it afterwards.
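A rough sketch of that download, process, delete cycle, assuming wget and fastqc are on the PATH and files.txt holds one URL per line. The names fetch_one, process_one and stream_process are hypothetical; the two hooks can be redefined to use another downloader or another tool.

```shell
# Hooks: how to fetch one URL and how to process one downloaded file.
fetch_one()   { wget -c "$1"; }
process_one() { fastqc "$1"; }

# For each URL: download, process, then delete the FASTQ file so that
# only one raw file sits on disk at any time.
stream_process() {
    sp_list=$1
    # Read the list on file descriptor 3 so the tools keep their own stdin.
    while IFS= read -r sp_url <&3 || [ -n "$sp_url" ]; do
        case $sp_url in
            ftp://*|http://*|https://*) ;;
            *) continue ;;  # skip blank or non-URL lines
        esac
        sp_file=${sp_url##*/}  # file name = last path component of the URL
        if fetch_one "$sp_url" && process_one "$sp_file"; then
            rm -f "$sp_file"
        else
            # Keep the file for inspection if something went wrong.
            echo "failed: $sp_url" >&2
        fi
    done 3< "$sp_list"
}
```

With the default hooks, stream_process files.txt leaves the FastQC reports behind and removes each FASTQ file once its report has been written.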

6 months ago by Pierre Lindenbaum 161k

0

By "use" I mean that I want to build an ML model with this dataset to correctly classify aTB and LTBI patients. Since it's a huge amount of data, I was wondering how to preprocess the raw data (fastqc --> kallisto --> ...). I think I will start with a few samples. Thank you so much for your help!

6 months ago by sil_bioinfo ▴ 40