Amazon EC2 for bioinformatics, genomics, NGS analysis
4
Entering edit mode
8.1 years ago
katie.duryea ▴ 40

Hi,

I am starting a new thread on this topic because I can't find one that is less than a year old. Does anyone have advice on using Amazon EC2 for bioinformatics analysis, specifically RNA-seq data? It seems there are many instance options these days, and I would appreciate some feedback on how to choose among them and whether there are any instances specifically tailored to this type of analysis.

Thanks in advance for any input!

Katie

RNA-Seq next-gen • 7.5k views
0
Entering edit mode

You should specify the scale of the problem you wish to solve. What type of data, and how much of it?

4
Entering edit mode
8.1 years ago
John 13k

Warning, AWS rant approaching: I've been using EC2 for about 6-7 years, and before that I had 4 servers colocated across the UK for some websites that I managed. EC2 provided me with what looked like a really nice set of value-add features, at roughly the same price as my existing hosting costs, which is why I made the switch. For example, installing a new server to cope with the Christmas holiday load is something I'll never forget as being the most stressful time of my early twenties, because it meant new hardware. Do I lease it or buy it? If I buy it, I have to source compatible parts and make sure it's under the wattage requirements. Then I have to physically get it there, get security clearance, wire it up without unplugging anyone else's stuff, configure the network/firewall, updates, etc. And then maybe the site gets no traffic on top of what we expected. Or maybe it still all chokes up and falls over. Or maybe that new heatsink wasn't fitted right. Or maybe someone else in the rack adds another server and pulls out my ethernet cable. Managing real hardware was a huge pain bundled with a lot of uncertainty. The most consistent thing about hosting was the price - you paid for the physical space and the size of your network pipe.

Then EC2 came along and took all those troubles away for me, and instead I just get billed for what I use - which seemed like a fair deal and massively simplified my life. Sure, the month-on-month hosting costs are higher, but you don't have to buy hardware and EC2 is so flexible for scaling up or down. Also, I can do all my IP/domain name/load balancing through their GUI, which was and still is so much better than anything built into a Cisco router or anything networksolutions.com has to offer. Spreading static content over their CDN is easy-peasy. High availability means customers are never left staring at an empty page, and I never get calls at 3am. Making secure databases for customer credit card numbers and the like is hassle-free, and I know if anything goes wrong I can say I did my bit and blame Amazon. They now do SSL certification too, so that's something I'll be moving over soon as well.

So what does any of this have to do with bioinformatics? Well... nothing. And that's my point, really. Most of the services AWS provides have no relevance to how bioinformatic data is processed, but those features jack up the prices, because people like myself are willing to pay a little extra on the hosting/storage costs for convenience. If you're not using those services, you think you're not paying for them -- but it's factored into your hosting costs for sure. Yes, you can spin up a compute server and "only pay for what you use" - but factored into that cost is the convenience of having immediate access in the first place. I say immediate; it's as immediate as transferring your data to the server from a local source at 5 MB/s allows.

But there is another way to have truly immediate access, with no costs for usage or storage and a simpler interface: buy a small compute server with 16 CPUs, 124 GB of RAM, 1 TB of SSD space and 10 TB of HDD space. The cost of all of this, measured in equivalent AWS spend, is about 1.5 years' worth. If you plan on doing bioinformatics for longer than that, in my opinion you'd be a lot better off just buying the hardware and sticking it under your desk.
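
As a rough Python sketch of that break-even arithmetic (both prices below are assumptions for illustration, not real quotes):

    # Break-even point for buying hardware vs. renting a comparable EC2 instance.
    # Both figures are assumptions -- plug in the actual price of the box you'd
    # buy and the on-demand hourly rate of the instance you'd otherwise rent.
    hardware_cost = 8000.0    # assumed one-off cost of the server (USD)
    ec2_hourly_rate = 0.60    # assumed on-demand rate of a comparable instance (USD/hour)

    hours_per_year = 24 * 365
    break_even_years = hardware_cost / (ec2_hourly_rate * hours_per_year)
    print(f"Break-even after ~{break_even_years:.1f} years of continuous use")  # ~1.5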

There are exceptions. If your usage is very 'spiky' - you need 100 cores NOW and none tomorrow - then EC2 will give you that. If you need 50 cores today and tomorrow, well, you'll be paying the same amount as the other guy and not benefiting from any of that convenience.

In my opinion, the next big breakthrough in bioinformatics will have nothing to do with clouds. It's far more likely to be something to do with clusters of small, cheap CPUs (like the PINE A64) - doing for compute what RAID did for hard drives - or with modern languages like Julia or Rust that can exploit all the low-level vector instructions the CPUs could use, if only the software knew how. Honestly, I have no idea why anyone would pay Amazon for a service that you can provide yourself with at home.

0
Entering edit mode

Honestly, I have no idea why anyone would pay Amazon for a service that you can provide yourself with at home.

The problem with this is that an organization should not run anything of value off of someone's home computer.

There are a lot of risks associated with that - it only works short term, when you are the sole person responsible and no one else needs to access that data from elsewhere (including yourself).

Home internet is also quite unreliable and many times slower than an Amazon instance (I just measured 800 Mbit/s in and out of Amazon; see what you get at home), not to mention that home internet is throttled and limited in all manner of ways. If you use a lot of data you may simply be cut off, or your costs may spike. The average internet user transfers only about 30 GB per month; if you start using substantially more, your provider will probably prefer you gone, replaced with a cheaper customer. They may refuse to provide you service at home-internet prices.
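
To put rough numbers on the speed argument (the 800 Mbit/s figure is the measurement above; the home upload speed and dataset size are assumptions), a quick Python sketch:

    # How long does it take to move a dataset at each end of the pipe?
    ec2_mbit_s = 800     # measured in/out of an EC2 instance (see above)
    home_mbit_s = 20     # assumed typical home upload speed
    data_gb = 100        # assumed example dataset size

    def hours(gb, mbit_per_s):
        # GB -> megabits, then seconds at the given rate, then hours
        return gb * 8 * 1000 / mbit_per_s / 3600

    print(f"EC2:  {hours(data_gb, ec2_mbit_s):.1f} h")    # ~0.3 h
    print(f"Home: {hours(data_gb, home_mbit_s):.1f} h")   # ~11 h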

Nowadays I buy my own hardware, but the university pays for the internet, hardware hosting and security costs - and those probably add up to more than the cost of the hardware itself.

0
Entering edit mode

So what does any of this have to do with bioinformatics? Well... nothing. And that's my point, really. Most of the services AWS provides have no relevance to how bioinformatic data is processed, but those features jack up the prices, because people like myself are willing to pay a little extra on the hosting/storage costs for convenience.

This is an old thread, but I'm just curious about this statement - what kinds of services does AWS provide that you see as unnecessary for bioinformatics analysis? Thanks!

0
Entering edit mode

The vast majority of what AWS offers is unrelated to bioinformatic analysis. I give a lot of examples in the first two paragraphs above, like DNS, SSL, etc. These are features that businesses and SMEs will pay a premium for, because setting that stuff up on your own is a real hassle. I was also trying to explain how "peace of mind" is a service AWS provides that is unnecessary for bioinformatic analysis, but adds significant cost, because people in business will pay a lot for it. To see how peace of mind costs money, though, you have to appreciate that AWS prices its services at what the market will bear, and not, as I think some people assume, as low as humanly possible.

AWS now does grants for publicly funded research (https://aws.amazon.com/grants/), so there is finally a two-tier system for non-profit and for-profit use. As a result, almost everything I previously said no longer applies if you get a grant :-)

...I'd still get an under-the-desk compute server in 2017, though, just so you have somewhere to call /home

3
Entering edit mode
8.1 years ago

This all depends on your problem, I guess. If you're performing RNA-seq analysis, then decide what pipeline you're going to follow and build from there. I'd start with a generic EC2 Ubuntu instance, install the tools, and then save that image to use for your analysis.

It depends how generic you want this to be, really. For the sake of filling in the gaps, say you have a single RNA-seq experiment to carry out - 30 million paired-end reads, two conditions in triplicate - and the aim is a differential expression analysis. I'd start by creating an instance with Salmon, Kallisto, STAR and htseq-count installed. For each sample, I'd align with STAR, quantify with Salmon and Kallisto, and run htseq-count post-alignment, the end game being gene-level counts from htseq-count plus Salmon and Kallisto counts per sample (see the sketch below). From there I'd work with that data in an R session to test the hypothesis. This is very much a toy example, but my point is that it all depends on how you want to analyse your data.
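
As an illustration only, here is a minimal Python sketch of that per-sample toy pipeline; the sample names, file names, index/reference paths and thread counts are placeholders you would swap for your own:

    # Minimal per-sample sketch of the toy pipeline above. All paths, index
    # locations and thread counts are assumptions -- adjust to your instance.
    import subprocess

    def run(cmd):
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)

    sample = "cond1_rep1"                                     # placeholder sample name
    r1, r2 = f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"

    # 1) Align with STAR, producing a coordinate-sorted BAM
    run(["STAR", "--runThreadN", "8",
         "--genomeDir", "star_index/",
         "--readFilesIn", r1, r2,
         "--readFilesCommand", "zcat",
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outFileNamePrefix", f"{sample}_"])

    # 2) Quantify directly from the reads with Salmon
    run(["salmon", "quant", "-i", "salmon_index", "-l", "A",
         "-1", r1, "-2", r2, "-p", "8", "-o", f"salmon_{sample}"])

    # 3) Quantify directly from the reads with Kallisto
    run(["kallisto", "quant", "-i", "kallisto_index.idx",
         "-o", f"kallisto_{sample}", "-t", "8", r1, r2])

    # 4) Gene-level counts from the STAR alignment with htseq-count
    with open(f"{sample}_counts.txt", "w") as out:
        subprocess.run(["htseq-count", "-f", "bam", "-r", "pos",
                        f"{sample}_Aligned.sortedByCoord.out.bam",
                        "annotation.gtf"],
                       stdout=out, check=True)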

1
Entering edit mode

Andrew, I'm curious why you recommend running three different quantification methods/tools. Is it to see if the results are consonant, or for some other reason?

1
Entering edit mode

Fair point. Kallisto and Salmon + sleuth are the tools we're testing out for differential transcript expression at the moment. I say both because they do have differences in their methodologies, so getting a consensus is generally reassuring. I'd also take the transcript-level counts and feed them into DESeq2, to compare against the straightforward and reliable methodology of htseq-count gene-level counts. It's all personal preference, but generally I like to have a cross-tool consensus to add to my confidence in what I'm seeing.
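
As a rough illustration of that cross-tool sanity check (assuming the standard Salmon quant.sf and Kallisto abundance.tsv outputs for one sample, with the directory names below as placeholders), a quick Python/pandas snippet gives an agreement measure:

    # Compare per-transcript counts from Salmon and Kallisto for one sample.
    # File paths are placeholders; column names are the tools' standard outputs.
    import pandas as pd

    salmon = pd.read_csv("salmon_cond1_rep1/quant.sf", sep="\t",
                         index_col="Name")         # has TPM and NumReads columns
    kallisto = pd.read_csv("kallisto_cond1_rep1/abundance.tsv", sep="\t",
                           index_col="target_id")  # has est_counts and tpm columns

    merged = salmon[["NumReads"]].join(kallisto[["est_counts"]], how="inner")
    print(merged.corr(method="spearman"))          # how well do the two tools agree?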

0
Entering edit mode

Excellent answer. I'd like to add that all of your installation should take place on an Amazon EC2 Free Tier instance. You can then create an image from it and move over to your actual work instance. This will save you money when you're just starting out.
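
For instance, with boto3 that image-then-upsize step looks roughly like the sketch below; the instance ID, AMI name, instance type and region are placeholders, not real values:

    # Sketch of "build on the Free Tier, image it, launch a bigger work instance".
    # All identifiers are placeholders -- substitute your own.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Create an AMI from the small instance where the tools were installed
    image = ec2.create_image(InstanceId="i-0123456789abcdef0",
                             Name="rnaseq-tools-v1")

    # Wait until the new AMI is available before launching from it
    ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

    # Launch a larger instance from the image for the real work
    ec2.run_instances(ImageId=image["ImageId"],
                      InstanceType="r4.4xlarge",   # placeholder; size to your data
                      MinCount=1, MaxCount=1)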

1
Entering edit mode
8.1 years ago
William ★ 5.3k

You could also provision a bcbio cloud/cluster on top of Amazon.

This has the advantage that all necessary (RNA-seq) tools are installed and that you can run common RNA-seq pipelines directly.

http://bcbio-nextgen.readthedocs.org/en/latest/contents/cloud.html

http://bcbio-nextgen.readthedocs.org/en/latest/contents/pipelines.html#rna-seq

I haven't used this myself, but it's what I would consider if I had to run an RNA-seq analysis that's too large to run on local compute.

1
Entering edit mode
8.1 years ago

I mainly use EC2 for jobs where I don't want to use my local HPC for whatever reason. I set up an ad hoc cluster with StarCluster (http://star.mit.edu/cluster/) and run my jobs either using the pre-installed Sun Grid Engine or with GNU Parallel across nodes.
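
For example, farming out per-sample jobs through the cluster's grid engine can be as simple as the Python sketch below (the sample names and the run_sample.sh wrapper script are hypothetical placeholders):

    # Submit one SGE job per sample from the StarCluster head node.
    # "run_sample.sh" is a hypothetical per-sample wrapper script.
    import subprocess

    samples = ["cond1_rep1", "cond1_rep2", "cond2_rep1", "cond2_rep2"]

    for s in samples:
        # -b y: submit the command directly (not as a job script);
        # -cwd: run in the current directory; -N: job name.
        # The scheduler then spreads the jobs across the cluster's nodes.
        subprocess.run(["qsub", "-b", "y", "-cwd", "-N", f"job_{s}",
                        "bash", "run_sample.sh", s],
                       check=True)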

I generally use it for:

1) Large eukaryotic genome assembly jobs where distributed memory is needed.

2) Tasks that might take multiple days locally, but that I want done in a few hours.

Even using spot instances, the costs can add up, so buyer beware.

Pro tip: cc2.8xlarge instances (not available in newer regions) are usually under-utilized because they are older machines.

