Question

How To Access Bam Files Directly Via Hadoop?

0

11.6 years ago

jtal04 • 0

Has anyone used the 1000 Genomes public data set available on Amazon S3?

Or, I should ask: has anyone used the BAM files directly via an AWS service such as Elastic MapReduce?

I can download the files to EBS, unpack them, and re-upload them to S3, but that is more expensive (and more work) than using the free public copy.

Thank you for any insight, Justin

Edit: I am currently looking into Hadoop-BAM: http://sourceforge.net/projects/hadoop-bam/

bam 1000genomes • 3.1k views

updated 11.6 years ago by Mikael Huss 4.8k • written 11.6 years ago by jtal04 • 0

score 1 · Answer 1 · 2012-09-10

1

11.6 years ago

Mikael Huss 4.8k

You already answered your own question: you could use Hadoop-BAM. You might also want to check out SeqPig, which lets you run Pig queries against your BAM files (and other things).

score 0 · Answer 2 · 2012-09-10

0

11.6 years ago

JC 13k

Did you read and try the tutorial? http://www.1000genomes.org/using-1000-genomes-data-amazon-web-service-cloud

0

Yes. It says to access the data the generic way you access S3 data. Next, it describes how to start an EC2 instance from their AMI, and the remainder is a tutorial that is not specific to AWS. I guess the title of my question is misleading: my real problem is trying to use BAM files in Elastic MapReduce, which doesn't know how to split records in a BAM file.
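
To illustrate why a generic splitter fails here: a BAM file is BGZF-compressed, meaning it is a series of concatenated gzip blocks, and each block stores its own size in a `BC` extra subfield of the gzip header. A splitter therefore has to cut on block boundaries (and then find record boundaries inside the blocks), which is the kind of work a BAM-aware input format such as Hadoop-BAM takes on. Below is a minimal Python sketch of the first half only, locating block boundaries. The helper names `make_bgzf_block` and `bgzf_block_offsets` are hypothetical, the input is synthetic rather than a real BAM file, and none of this uses Hadoop-BAM's API.

```python
import gzip
import struct
import zlib


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block: a gzip member carrying a 'BC' extra subfield.

    Hypothetical helper, used only to create test input for this sketch.
    """
    # Raw deflate stream (negative wbits means no zlib/gzip wrapper).
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    cdata = compressor.compress(payload) + compressor.flush()
    # BSIZE is the total block size minus 1:
    # 12-byte gzip header + 6-byte extra field + data + 8-byte trailer.
    bsize = 12 + 6 + len(cdata) + 8 - 1
    # ID1, ID2, CM=8 (deflate), FLG=4 (FEXTRA), MTIME, XFL, OS, XLEN=6
    header = struct.pack("<4BI2BH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # SI1='B', SI2='C', SLEN=2, BSIZE
    extra = struct.pack("<2BHH", ord("B"), ord("C"), 2, bsize)
    # CRC32 and length of the uncompressed payload
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF, len(payload))
    return header + extra + cdata + trailer


def bgzf_block_offsets(data: bytes) -> list:
    """Return the byte offset at which each BGZF block starts."""
    offsets = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 4] != b"\x1f\x8b\x08\x04":
            raise ValueError(f"no BGZF block header at offset {pos}")
        xlen = struct.unpack_from("<H", data, pos + 10)[0]
        sub = pos + 12          # first extra subfield
        end = sub + xlen        # end of the extra field
        bsize = None
        while sub < end:
            si1, si2, slen = struct.unpack_from("<2BH", data, sub)
            if (si1, si2, slen) == (ord("B"), ord("C"), 2):
                bsize = struct.unpack_from("<H", data, sub + 4)[0]
            sub += 4 + slen
        if bsize is None:
            raise ValueError(f"block at offset {pos} has no BC subfield")
        offsets.append(pos)
        pos += bsize + 1        # jump straight to the next block
    return offsets


if __name__ == "__main__":
    blocks = [make_bgzf_block(b"read-1\n"), make_bgzf_block(b"read-2\n" * 100)]
    data = b"".join(blocks)
    print(bgzf_block_offsets(data))
    # The blocks are ordinary gzip members, so the stdlib can decompress them.
    print(len(gzip.decompress(data)))
```

The offsets only say where a split may safely begin reading compressed data. An alignment record can still straddle two blocks, so a real splitter needs further logic to find the first complete record in each split.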