How to work with 2.6 TB of genome data deposited on a server?
5.8 years ago
greyman ▴ 190

A bulk of fastq.gz files is stored on a server, and I need to download them to my work folder/laptop. However, I only have an 800 GB external hard disk and a laptop with a 500 GB HDD. I have never handled data this big before. According to my colleagues, the previous user ran this crazy 2.6 TB of data on her laptop, which sounds impossible to me... Does anyone have any suggestions?

sequencing exome fastq • 1.8k views

This isn't really a bioinformatics research question and may be closed for that reason.

2.6 TB of raw data is conceivable for a large consortium project, but sounds a bit suspicious for an individual project. Perhaps you are misinformed about the real size of the data. Have you actually checked it?

If the data is already on a server somewhere, then your best bet is to do all the computing right there. It would be a fool's errand to try to move that data to a local laptop.

You are right, thank you. Someone used du -sh to tell me the file size; I am wondering if it could be wrong.
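
As a quick sanity check you can sum the sizes of just the fastq.gz files yourself on the server; a minimal sketch, assuming they all live under one directory (the path below is only a placeholder):

# What du -sh reports: the total size of everything under the directory
du -sh /data/project/fastq

# Sum only the fastq.gz files, in case other large files inflate the number
find /data/project/fastq -name '*.fastq.gz' -printf '%s\n' \
    | awk '{ total += $1 } END { printf "%.2f TB\n", total / 1e12 }'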

What is the data - TCGA (cancer data)? I'm currently downloading 19 TB of data, but I have gone through the correct procedure of contacting the head of IT to let them know and to get advice. I'm certainly not going to analyse it on a laptop.

I had the fun of analyzing terabytes of data like this recently, and even considering doing that on a laptop is insane (sorry, no offence to OP). It is not only the amount of raw data, but also the amount of intermediate files you produce before you have your final output. Also, aligning these data with the 2, 4 or 8 threads of a laptop and maybe 8 or 16 GB of RAM will take ages.

As a crude benchmark, on our HPC cluster, trimming and aligning a paired-end WGS sample (human, 500 million read pairs, 2 x 100 bp, each fastq.gz about 80 GB at compression level 5) with 32 cores @ 2.4 GHz and 120 GB RAM, piped directly into sambamba sort and a duplicate-marking tool, takes about 8 hours, if I remember correctly. On top of that, you have to add the time to process the data for your actual scientific question. I am also not sure it is healthy for your laptop to run at full load for weeks, because that is probably how long it will take. Get in contact with your IT department and ask for advice. You need a proper server node for this and adequate storage solutions.
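
For a concrete picture, here is a rough sketch of that kind of streaming workflow; the benchmark above does not name the trimmer or aligner, so fastp, bwa mem and the exact flags here are only illustrative assumptions:

# Trim, align, convert and coordinate-sort in one stream so that no
# uncompressed intermediate files hit the disk (tool choices illustrative)
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz --stdout -w 8 \
    | bwa mem -t 32 -p ref.fa - \
    | sambamba view -S -f bam /dev/stdin \
    | sambamba sort -t 32 -m 64GB -o sample.sorted.bam /dev/stdin

# Duplicate marking as a second pass over the sorted BAM
sambamba markdup -t 32 sample.sorted.bam sample.dedup.bam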

Hi, thank you so much for your reply. I was asked to submit a grant proposal covering CPU, RAM and HDD, and probably a desktop, within two days. As the desktop is going to be placed at my personal workbench, with a budget of 7,000 dollars, would it be possible to get some advice from you on choosing this hardware?

Does your institution own a high-performance cluster or do you really have to buy a desktop?

They don't own an HPC, but they are interested in modifying the Lenovo desktop that I am currently using. The 7k dollars would be invested in the modification; however, I will request more funding if buying a new desktop is necessary. They expect to see some fruitful output from the 2.6 TB before I can ask for more next year, and there are another 4 TB of data coming soon... Thank you so much for the prompt reply.

You should at least invest in a decent dual-socket Xeon workstation with plenty of RAM. $7K may get you something, but these workstations can become pricey and can cost a lot more than $7K. Do you ever hope to work on all of that data at the same time? Because if you do, it would not be a great use of $7K to upgrade an existing desktop.

You should really be looking at cloud providers or other infrastructure to do this analysis properly.

Thanks genomax! I would like to work on all the data at once, but it seems I will have to split it into batches. I have started looking at dual-socket workstations like the HP Z820 Tower Workstation (16-core E5-2670, 15 TB hard disk storage, 32 GB RAM), and it looks like I will have to add more RAM to it. The ITS people at my place couldn't give me advice on this, but they said they will help me set it up. Also, did you mean looking for cloud providers to ask for their advice on this purchase, or for the data analysis?

Thanks for the reply, will contact ITS asap!

That is a good move. Even if you have to spend a couple of days bargaining with these guys, it is time well invested and will save you a lot of time and nerves in the end!

Thank you, it is population disease data

Obviously you cannot download the whole dataset until you have a larger hard disk.

As always, the question is: what do you want to do with this data? Depending on that, and on which tools are needed for the task, it might be possible to download only a subset of the data, or to access it remotely.

fin swimmer

I want to start with FastQC to check the quality, so the subset sounds like a plan... Remote access to the data wouldn't work because they don't want people to get too close to the server; some weird regulation.
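
A minimal sketch of that subset idea (host, paths and sample names are placeholders; it assumes you can scp individual files and have FastQC installed locally):

# Pull a couple of files from the server and run FastQC on them locally
mkdir -p subset fastqc_reports
scp user@server:/data/project/fastq/sampleA_R1.fastq.gz ./subset/
scp user@server:/data/project/fastq/sampleA_R2.fastq.gz ./subset/
fastqc -t 4 -o fastqc_reports ./subset/*.fastq.gz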

An external hard drive is cheap. One can easily get 4 TB of storage for under $100 and 8 TB for $150.

While that is true on paper, the question is whether it makes sense to download 2.6 TB of data to an external drive and then try to do the analysis with it... on a laptop.

greyman: You don't need to close a post once you have received satisfactory answers. Closing posts is an action used by mods for posts that may be off-topic for this forum.

thank you, still learning...

5.8 years ago
ATpoint 82k

What you could do, and I have not thoroughly tested this at all (I just had the idea), is to process the data on your machine (if it has the memory and CPU power for it), while reading the data from scp/stdin.

sshpass -p 'password' scp user@your.server.whatever.org:/scratch/tmp/input.fastq.gz /dev/stdout | bwa mem -p idx /dev/stdin | (...)

I gave this a quick test with an interleaved paired-end fastq file on our HPC server, processing it on my local machine, and it worked (and "worked" means it gave the expected alignments for the 30 seconds I ran it). I don't know whether scp is stable enough to run for several hours, but maybe this is an option for you. Using the highest compression level before writing to disk should also help, but it will of course increase the processing time.
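
To make the compression point concrete, the elided end of that pipeline could, for example, write a heavily compressed, sorted BAM straight to disk. This is a sketch only; samtools sort and the compression level are my additions, not part of the original command:

# Same streaming idea, but finishing with maximum BAM compression (-l 9)
sshpass -p 'password' scp user@your.server.whatever.org:/scratch/tmp/input.fastq.gz /dev/stdout \
    | bwa mem -p idx /dev/stdin \
    | samtools sort -@ 4 -l 9 -o aln.sorted.bam -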

While this is a creative solution (seriously), there are a few considerations, starting with the stability of the network, the traffic you cause on the network affecting all other users, and the speed of the network compared to a local external drive. Analyzing 2.6 TB without the proper resources should include planning with IT first (see Kevin Blighe's comment above).
