Why are there so few Hadoop/Spark-based de novo assemblers?
7.5 years ago
saranpons3 ▴ 70

Since the read data set that a de novo assembler has to process is large, the assembly problem can be considered a big data problem (if that is wrong, please correct me). Nowadays, people have started using big data technologies such as Hadoop and Spark to deal with big data problems. When I started looking for Hadoop/Spark-based de novo genome assemblers, I came across Contrail (developed in 2010), CloudBrush (2012) and Spaler (2015). To the best of my knowledge, there are no other Hadoop/Spark-based de novo genome assemblers besides these three. The fact that there are only three raises a question for me: is Hadoop/MapReduce/Spark a good technology for de novo genome assembly? I'd be grateful if somebody could throw some light on this. Thanks in advance.

Hadoop Spark Big data De novo Assembly • 1.9k views

See also this: Why is Hadoop not used a lot in bio-informatics? I do think Hadoop-based de novo assembly makes sense. It is good for a proof-of-concept type of project, but it will be used by no one in practice because 1) single-machine assemblers are already good enough and 2) few labs that actually do such assembly use Hadoop.

7.5 years ago
Asaf 10k

Because you don't need distributed, fast storage; for assembly you want to keep all your data in RAM.
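
For context, here is a minimal sketch (my own illustration, not code from any of the assemblers mentioned) of the distributed k-mer counting stage that Hadoop/Spark assemblers such as Contrail and Spaler are built around, written against the PySpark RDD API; the input path, output path and k value are placeholders. The reduceByKey step shuffles k-mers across the network, which is exactly the work a single-machine assembler avoids by keeping its k-mer table in RAM.

```python
# Illustrative only: distributed k-mer counting with the PySpark RDD API.
# "reads.txt" (one read per line), "kmer_counts" and K are placeholder values.
from pyspark import SparkContext

K = 31  # k-mer length used for the de Bruijn graph

def kmers(read):
    # emit every K-mer of one read
    read = read.strip().upper()
    for i in range(len(read) - K + 1):
        yield read[i:i + K]

sc = SparkContext(appName="kmer-count-sketch")

counts = (sc.textFile("reads.txt")               # reads distributed across the cluster
            .flatMap(kmers)                      # map: one record per k-mer
            .map(lambda kmer: (kmer, 1))
            .reduceByKey(lambda a, b: a + b))    # reduce: shuffle k-mers over the network

counts.saveAsTextFile("kmer_counts")
sc.stop()
```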
