How to work with 2.6 TB of genome data deposited on a server?
5.8 years ago
greyman ▴ 190

A bulk of fastq.gz files is stored on a server, and I need to download them to my work folder/laptop. However, I only have an 800 GB external hard disk and a laptop with a 500 GB HDD. I have never handled data this big before. According to my colleagues, the previous user ran this crazy 2.6 TB of data on her laptop, which sounds impossible to me... Does anyone have any suggestions?

sequencing exome fastq • 1.8k views

This isn't really a bioinformatics research question and may be closed for that reason.

2.6 TB of raw data is conceivable for a large consortium project, but sounds a bit suspicious for an individual project. Perhaps you are misinformed about the real size of the data. Have you actually checked it?

If the data is already on a server somewhere, then your best bet is to do all the computing right there. It would be a fool's errand to try to move that data to a local laptop.

You are right, thank you. Someone used du -sh to tell me the file size; I am wondering if it could be wrong.
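
As a quick sanity check you can sum the sizes of just the fastq.gz files yourself on the server; a minimal sketch, assuming they all live under one directory (the path below is only a placeholder):

# What du -sh reports: the total size of everything under the directory
du -sh /data/project/fastq

# Sum only the fastq.gz files, in case other large files inflate the number
find /data/project/fastq -name '*.fastq.gz' -printf '%s\n' \
    | awk '{ total += $1 } END { printf "%.2f TB\n", total / 1e12 }'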

What is the data - TCGA (cancer data)? I'm currently downloading 19 TB of data, but I have gone through the correct procedure of contacting the head of IT to let them know and to get advice. I'm certainly not going to analyse it on a laptop.

I had the fun of analyzing terabytes of data like this recently, and even considering doing that on a laptop is insane (sorry, no offence to OP). It is not only the amount of raw data, but also the amount of intermediate files you produce before you have your final output. Also, aligning these data with the 2, 4 or 8 threads of a laptop and maybe 8 or 16 GB of RAM will take ages.

As a crude benchmark, on our HPC cluster, trimming and aligning a paired-end WGS sample (human, 500 million read pairs, 2 x 100 bp, each fastq.gz about 80 GB at compression level 5) with 32 cores @ 2.4 GHz and 120 GB RAM, piped directly into sambamba sort and a duplicate-marking tool, takes about 8 hours, if I remember correctly. On top of that, you have to add the time to process the data for your actual scientific question. I am also not sure it is healthy for your laptop to run at full load for weeks, because that is probably how long it will take. Get in contact with your IT department and ask for advice. You need a proper server node for this and adequate storage solutions.
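
For a concrete picture, here is a rough sketch of that kind of streaming workflow; the benchmark above does not name the trimmer or aligner, so fastp, bwa mem and the exact flags here are only illustrative assumptions:

# Trim, align, convert and coordinate-sort in one stream so that no
# uncompressed intermediate files hit the disk (tool choices illustrative)
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz --stdout -w 8 \
    | bwa mem -t 32 -p ref.fa - \
    | sambamba view -S -f bam /dev/stdin \
    | sambamba sort -t 32 -m 64GB -o sample.sorted.bam /dev/stdin

# Duplicate marking as a second pass over the sorted BAM
sambamba markdup -t 32 sample.sorted.bam sample.dedup.bam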

Hi, thank you so much for your reply. I was asked to submit a grant proposal covering CPU, RAM and HDD, and probably a desktop, within two days. As the desktop is going to be placed at my personal workbench, with a budget of 7,000 dollars, would it be possible to get some advice from you on choosing this hardware?

Does your institution own a high-performance cluster or do you really have to buy a desktop?

They don't own an HPC, but they are interested in modifying the Lenovo desktop that I am currently using. The 7k dollars would be invested in the modification; however, I will request more funding if buying a new desktop is necessary. They expect to see some fruitful output from the 2.6 TB before I can ask for more next year, and there are another 4 TB of data coming soon... Thank you so much for the prompt reply.

You should at least invest in a decent dual-socket Xeon workstation with plenty of RAM. $7K may get you something, but these workstations can become pricey and can cost a lot more than $7K. Do you ever hope to work on all of that data at the same time? Because if you do, it would not be a great use of $7K to upgrade an existing desktop.

You should really be looking at cloud providers or other infrastructure to do this analysis properly.

Thanks genomax! I would like to work on all the data at once, but it seems I will have to split it into batches. I have started looking at dual-socket workstations like the HP Z820 Tower Workstation (16-core E5-2670, 15 TB hard disk storage, 32 GB RAM), and it looks like I will have to add more RAM to it. The ITS people at my place couldn't give me advice on this, but they said they will help me set it up. Also, did you mean looking for cloud providers to ask for their advice on this purchase, or for the data analysis?

Thanks for the reply, will contact ITS asap!

That is a good move. Even if you have to spend a couple of days bargaining with these guys, it is time well invested and will save you a lot of time and nerves in the end!

Thank you, it is population disease data

Obviously you cannot download the whole dataset until you have a larger hard disk.

As always, the question is: what do you want to do with this data? Depending on that, and on which tools are needed for the task, it might be possible to download only a subset of the data, or to access it remotely.

fin swimmer

I want to start with FastQC to check the quality, so the subset sounds like a plan... Remote access to the data wouldn't work because they don't want people to get too close to the server; some weird regulation.
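
A minimal sketch of that subset idea (host, paths and sample names are placeholders; it assumes you can scp individual files and have FastQC installed locally):

# Pull a couple of files from the server and run FastQC on them locally
mkdir -p subset fastqc_reports
scp user@server:/data/project/fastq/sampleA_R1.fastq.gz ./subset/
scp user@server:/data/project/fastq/sampleA_R2.fastq.gz ./subset/
fastqc -t 4 -o fastqc_reports ./subset/*.fastq.gz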

An external hard drive is cheap. One can easily get 4 TB of storage for under $100 and 8 TB for $150.

While that is true on paper, the question is whether it makes sense to download 2.6 TB of data to an external drive and then try to do the analysis with it... on a laptop.

greyman: You don't need to close a post once you have received satisfactory answers. Closing posts is an action used by mods for posts that may be off-topic for this forum.

thank you, still learning...

5.8 years ago
ATpoint 82k

What you could do, and I have not thoroughly tested this at all (I just had the idea), is to process the data on your machine (if it has the memory and CPU power for it), while reading the data from scp/stdin.

sshpass -p 'password' scp user@your.server.whatever.org:/scratch/tmp/input.fastq.gz /dev/stdout | bwa mem -p idx /dev/stdin | (...)

I gave this a quick test with an interleaved paired-end fastq file on our HPC server, processing it on my local machine, and it worked (and "worked" means it gave the expected alignments for the 30 seconds I ran it). I don't know whether scp is stable enough to run for several hours, but maybe this is an option for you. Using the highest compression level before writing to disk should also help, but it will of course increase the processing time.
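
To make the compression point concrete, the elided end of that pipeline could, for example, write a heavily compressed, sorted BAM straight to disk. This is a sketch only; samtools sort and the compression level are my additions, not part of the original command:

# Same streaming idea, but finishing with maximum BAM compression (-l 9)
sshpass -p 'password' scp user@your.server.whatever.org:/scratch/tmp/input.fastq.gz /dev/stdout \
    | bwa mem -p idx /dev/stdin \
    | samtools sort -@ 4 -l 9 -o aln.sorted.bam -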

While this is a creative solution (seriously), there are a few considerations, starting with the stability of the network, the traffic you cause on the network affecting all other users, and the speed of the network compared to a local external drive. Analyzing 2.6 TB without the proper resources should include planning with IT first (see Kevin Blighe's comment above).
