Biostar Beta
What is the computation requirement to process whole genome paired-end data, with each fastq file being 170 GB?
15 months ago
shuksi1984 • 50

How much HDD, RAM, and internet speed is required to process whole genome paired-end data, with each fastq file being 170 GB? My machine has the following configuration:



But it's taking >1 hr for the following simple command to finish:

 wc -l SRR6876052_1.fastq

Also, please brief me on online server options.

9 weeks ago
genomax 68k
United States

Your question could use some clarity. How many of these files are you referring to? Just two for one sample, or more?

I am not sure why you need internet connectivity to process genome data (assuming you have the reference downloaded and indexed).

RAM is going to be the limiting factor if this is human genome (or similar-sized) data. You need ~30 GB of free RAM with many of the aligners. Your best bet may be bwa, which has one of the lightest memory requirements among aligners (~6 GB free for human data).

Counting lines in a fastq file can't realistically be considered processing the data, and it doesn't give you any idea of how long it may take to scan/trim/align the data. You should also keep the fastq files compressed to save space. Most NGS programs understand compressed data and will work with it seamlessly.
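For example, a gzipped fastq can be line-counted by streaming, without ever decompressing it to disk. A minimal sketch using a tiny generated file; substitute your real SRR6876052_1.fastq.gz:

```shell
# Generate a toy 2-read fastq and compress it (stand-in for the real file).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip > demo.fastq.gz

# Stream-decompress and count lines; a fastq record is 4 lines, so
# reads = lines / 4.
lines=$(gzip -cd demo.fastq.gz | wc -l)
echo "reads: $(( lines / 4 ))"   # prints: reads: 2
```

The same pattern (`gzip -cd file.fastq.gz | wc -l`) works on the full-size file, and the compressed copy will typically be several times smaller on disk.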

You can look into Amazon AWS and Google Compute to get an idea of pricing for online compute resources.


It is paired-end genomic data for a single human sample, thus two fastq files of 170 GB each. Internet connectivity is required only to download the dataset.

12 months ago
WCIP | Glasgow | UK

Just a rough estimate... You have 340 GB of fastq (170 x 2; I assume this is uncompressed). Aligned and in BAM format this may be ~50 GB. To sort it you need another ~50 GB for the temporary files and ~50 GB for the final sorted BAM. You could pipe the output of the aligner (say bwa mem) into samtools sort, which saves time and ~50 GB of disk space. Once done, you can delete the fastq files and the unaligned BAM, if any, leaving you with ~50 GB of BAM and ~1/2 TB of peak disk usage. Of course, the 340 GB of uncompressed fastq could be reduced to maybe 1/10 of that size with gzip.
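A minimal sketch of that pipe, with assumed names (SRR6876052_*.fastq.gz inputs and an hg38.fa reference you have already indexed with `bwa index`); printed here as a dry run rather than executed, since the 170 GB inputs are not present:

```shell
# Assumed inputs (not part of the original post):
REF=hg38.fa                # reference, pre-indexed with `bwa index hg38.fa`
R1=SRR6876052_1.fastq.gz   # gzipped paired-end reads
R2=SRR6876052_2.fastq.gz

# Stream bwa mem straight into samtools sort, so no intermediate unsorted
# BAM/SAM ever touches the disk; `-` tells samtools sort to read stdin.
CMD="bwa mem -t 8 $REF $R1 $R2 | samtools sort -@ 8 -o aligned.sorted.bam -"
echo "$CMD"
```

The exact thread counts and sort memory would need tuning to the machine at hand; the point is only that the aligner's stdout feeds the sorter's stdin, avoiding the ~50 GB intermediate file.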

5 weeks ago
ATpoint 17k

For a very rough estimate:

On a Broadwell Xeon node (2.4 GHz, I think) with 128 GB RAM, processing a 2x100bp WGS sample with 635,658,231 read pairs, using BWA mem with 24 threads piped into SAMBLASTER for duplicate marking and a SAMBAMBA sort with 30 GB of memory, takes 7-9 hours. As you are limited to 16 GB RAM, you'll probably need to limit BWA to 8 threads or so, if your machine has that capacity. Still, it will probably take an entire day.

