Picking A Programming Language And Where To Begin
11
13
12.7 years ago
Travis ★ 2.8k

Hi,

I traditionally use Perl but as I am exposed to more and more NGS type work it seems that it may be time to learn a 'programming language'.

Does C++ seem like a good option?

If so can anyone recommend a good starting point? I've done Perl for the past 8 years and had a little experience in Java prior to that so the starting point should be tuned to that...

programming next-gen sequencing c subjective • 8.5k views
31
12.7 years ago
brentp 24k

As an alternative to the mostly logical viewpoints already posted, I'll add this whimsical one:

Use whatever language makes you happy.

You'll likely spend a lot of time writing, re-writing, reading and WTF'ing through your code, so if Java via NetBeans or Eclipse does it for you, then go that way. If C via Vim makes you giggle, then that might be a good choice.

In my experience, most "slow" languages will offer you a way to improve speed using libraries or language constructs.
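As a small illustration of that point, here is a hedged sketch in Python: the same GC-content calculation written once as an explicit per-character loop and once using built-in string methods, which run in the interpreter's compiled code. The function names are made up for this example, and no timings are claimed; the only point is that both give the same answer while the second leans on a language construct.

```python
def gc_content_loop(seq):
    """GC fraction computed with an explicit Python-level loop."""
    if not seq:
        return 0.0
    count = 0
    for base in seq:
        if base in "GCgc":
            count += 1
    return count / len(seq)


def gc_content_builtin(seq):
    """Same result, but the counting is delegated to built-in str methods."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)
```

Both functions take a plain string such as "ACGT" and return a float between 0 and 1.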

Since you already have Perl as a scripting language, it's often good to follow the libraries. For example, R has very good libraries for microarrays; if you do a lot of that, using R+Bioconductor will save you from reinventing a complicated wheel.

In the long-run, it can't hurt to learn more languages.

4

can I vote twice ?

2

You mean computing is meant to be fun?!? x_o --> +1

1

.............+1

1

Inspiring! From now on, I'm writing all my programs in なでしこ! ;)

0

Claiming that programming should be related to the programmer's happiness is a TOTAL lack of respect to a lot of people dedicated to computer programming pedagogy. And YES it hurts to learn programming languages, because 1) you need TIME, and 2) you must un-learn all the bad habits of crappy programming languages. It is really SAD how expert bioinformaticians support this naive opinion that it's ok to learn everything or anything. For the love of all gods, take seriously what you put in your head, and learn how and why some programming languages were invented.

0

@michaelfaraday, I can't tell if you're agreeing or disagreeing (likely the latter). Can you elaborate on what you think is SAD? in the context of what you're replying to above?

0

@michaelfaraday What languages do you use that are so perfect and serious?

Why would people DO programming if it didn't make them happy? What is so wrong with enjoying what you're doing?

"it's ok to learn everything or anything" Sounds like a good life to me.

22
12.7 years ago
lh3 33k

If you want to go into the area of algorithm development, learning a compiled language is necessary; otherwise, you can stick to Perl. When you come to compiled languages, C/C++ and Java are the only viable options for beginners.

Personally, my worry about Java was: what if some day I had to write an extremely fast program requiring huge memory? When that day came, I would have to fall back to C/C++ because Java is still slower, even though not by much, and is a famous memory hog. Thus several years ago, I recommended C/C++ to all my friends. Nowadays, I realize that >95% of people never need to write something extremely fast or something that takes huge memory. Furthermore, although development time in C/C++ and Java does not differ much once we are proficient, it takes much longer to become a reasonably good C/C++ programmer than a Java programmer. Many people stop before they get there. Now I think 95% of people who want to learn a compiled language should go for Java. C/C++ is only preferable for those <5% of people who care about extreme efficiency.

Below are some of my other views towards programming languages, which may help or may not.

  1. C/C++ vs. D/Go. C/C++ is still the best choice if you really want something efficient. However, its nastiest "feature" is unsafe memory access. Even experienced programmers with an appropriate tool may spend a long time pinpointing a tiny out-of-bounds bug. Memory leaks are also frequently annoying, though easier to solve. D and Go largely resolve this major issue in C/C++. They are much richer in features and easier to work with. They have learned from the advantages of most modern programming languages. However, because they are relatively new, there are essentially no 3rd-party libraries in bioinformatics. You have to do everything from scratch. If you are choosing your first compiled language, D and Go are not an option. They are more suitable for experienced programmers who want to explore something exciting.

  2. C vs. C++. I prefer C over C++. The latter I think is unnecessarily complicated. Nonetheless, I do agree that C++ helps to organize a large project. C++ has friendlier library interfaces, and this is why you tend to have more well-formed libraries in C++ than in C. Also, even if you choose C, you should learn the basics of C++. This is necessary when you want to understand C++ programs written by others and may help you to design your own projects.

  3. Java vs. C++. If we look at the language implementation itself, Java is about 1.8X as slow as C/C++, which is consistent across a couple of benchmarks. Being twice as slow is not a big issue for most tasks. In addition, several benchmarks have pointed out that it is easier to write inefficient programs in C++ than in Java because C++ has more pitfalls related to the use of libraries: if you learn C++ and Java each for one month, your Java programs may look more professional and actually be faster. However, the major problem with Java is its memory management. Java is a memory hog, especially when you are not very good at it yet. When your program requires a lot of memory, Java will make it much worse. That is why almost no NGS read mappers are written in Java.

  4. C/C++/Java vs. Python/Perl. For text processing, Python and Perl are not far from those compiled languages in speed. Another post in this forum (I could not find the link) shows that when you do not use C++ properly, your long C++ program can be several times slower than a much simpler Python/Perl script. Furthermore, Python has a JIT-compiling implementation, PyPy, which may make your Python script 10-100X faster for computation-intensive tasks (e.g. Smith-Waterman alignment and numerical things).
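
To make the computation-intensive case in point 4 concrete, here is a minimal sketch of Smith-Waterman local alignment scoring in plain Python. The function name and the scoring values (match=2, mismatch=-1, linear gap=-2) are illustrative assumptions, not something taken from the posts above. Tight nested loops like this are the sort of code a JIT is generally expected to help with, but no timings are claimed here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a vs. b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best = curr[j]
        prev = curr
    return best
```

Only two rows of the matrix are kept, so memory grows with the length of `b` rather than with the product of both lengths. The sketch returns the score only and does no traceback.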

EDIT: just saw brentp's answer. He does have a good point which I second. Picking up a programming language is sort of like a marriage: you choose the language and the language chooses you, too. If you do not fit a programming language or vice versa, go for something else and do not push yourself too hard.

4

Have a primary language and keep the rest as mistresses (or paramours) :)

2

I like the marriage metaphor ... but just one problem, I am in love with quite a few languages :)

12
12.7 years ago

I have found that there are no slow languages, only ineffectively written programs. I don't think C++ is more of a programming language than Perl, and in the majority of cases you will get the job done faster with Perl than with C++ (total time = development + runtime).

I would advise doing the exact opposite: learn concepts at a higher level, become familiar with libraries, packages, samtools, bedtools, and various bridges to R, and learn how to interface and integrate these into Perl. I would steer away from trying to implement things in a lower-level language.

1

Every answer is different and every answer makes sense :) I guess R would make sense too since I don't know a statistical language.

0

If you go with R, check out the Rcpp and inline packages, they are "easy" bridges to mix R with c/c++ code which can prove to be very powerful.

11
12.7 years ago
Farhat ★ 2.9k

What kind of work are you doing with NGS? C++ is a good option, as is C, but I would suggest Python if speed is not an absolute concern. Python is syntactically cleaner to work with compared to Perl and some of the object-oriented concepts from Java will translate to Python (though not all).

4

I'd agree. The speed-up you might get from C++ comes at the cost of having to learn, write, and maintain C++ code. Also, C++ is zero fun when it comes to tasks like text manipulation (something you will no doubt be doing constantly with NGS data). If you really need a compiled language, go with Java. The included data structures, string processing tools, and memory management will make development go much faster. The idea that Java is too slow is an old myth that just won't die. Picard, GATK, snpEff, IGVTools - all NGS tools in Java. Download Eclipse, google 'learning java', and get going.

2

The GATK, which is a widely used pipeline for SNP detection, is written in Java/Scala. http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

0

Currently working on SNP/Indel and RNA-Seq type work. Using existing tools only but I would like to give myself the option of editing existing programs or creating my own. I would like to try something which is a complete departure from Perl. Python seems similar in some ways. Any good starting points for C/C++?

9
12.7 years ago
Rm 8.3k

Apart from C++, as suggested above, I would suggest picking up "awk" skills for easy handling/analysis of NGS data.

8
12.2 years ago
Gmoney ▴ 220

As mentioned above, use the language that you feel most comfortable with.

That being said, I have become a generalist as time goes on.

I use R to make nice graphs, Python for automation/text sifting, and PHP for webifying my scripts so other scientists can use them. By using many tools, you can lean on each language's strengths and avoid using it where it is weak.

R's I/O speed can be dreadful, so beforehand I would use python to reformat/clean data, then feed it to R for pretty picturing. I have called C/C++ programs from python to benefit from their speed if some serious computation needs to be done.

In bioinformatics, the results are what matter, not the language used. To me, a stack of programs each with a strength is also good for maintainability. You can make sure the formats farther up the pipeline stay the same, so that you can replace parts as they get better.
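
A rough sketch of that kind of stack, in Python: one small function cleans messy whitespace-separated text into tab-separated rows (the stable format handed to the next stage), and another hands the result to an external program over stdin. The function names `clean_rows` and `run_tool` are invented for this example, and the external program could be any compiled tool or an Rscript call; nothing here is specific to a particular tool.

```python
import subprocess
import sys


def clean_rows(lines):
    """Drop blank and '#' comment lines; normalise field separators to tabs."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            yield "\t".join(line.split())


def run_tool(cmd, input_text):
    """Run an external program with input_text on stdin and return its stdout.

    check=True raises CalledProcessError if the program exits non-zero.
    """
    result = subprocess.run(
        cmd, input=input_text, capture_output=True, text=True, check=True
    )
    return result.stdout
```

For a quick try without any compiled tool installed, the Python interpreter itself can stand in for the external program, e.g. `run_tool([sys.executable, "-c", "import sys; print(len(sys.stdin.readlines()))"], data)` counts the lines it receives. As long as the tab-separated format between stages stays fixed, either side can be swapped out later.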

A one size fits all is definitely no good for a field moving so fast.

Disclaimer: I'm a C guy who used to use Perl, and now uses Python.

0

Completely agree. Learn programming logic, not syntax. If you understand the logic, syntax is pretty easy to pick up.

7
12.7 years ago
Samuel Lampa ★ 1.3k

Very recently having been in a similar situation as you, I've decided to:

  1. Use Java for the time being
  2. At the same time, keep an eye on the progress of the D programming language.

Why?

I'm going with Java, since as somebody pointed out, there's really a lot of tools already available in Java, Java has over the years come to have very good speed (I've heard things like 90% of the speed of C/C++ code), and you get going much faster since you have garbage collection and a much cleaner language than C/C++. (Also Eclipse is a great platform, on which you can build interesting, very modular stuff, such as Bioclipse)

I keep an eye on D, since from almost everybody I hear it is a great, clean and powerful language that fixes the annoying flaws in e.g. C++, while still being a compiled (and thus potentially fast) language. Like Java, it has garbage collection, but the syntax is said to be even cleaner than that of Java, plus you can do really close-to-the-system programming/optimization (even write inline assembler code) if needed.

(Much of this, you can read in this question on D, here on BioStar)

Finally a word on python: I think python's ability to compile to C-code with just the addition of typing the variables, with the Cython package, is very neat, and might solve many performance-demanding tasks, if you have reasons to stay in the scripting-world.

4
12.7 years ago
Pasta ★ 1.3k

If I were you I would forget about Java: too slow ! My own opinion of course.

C++ seems to be a better option. As a Java-guy initially I found C++ easy to learn, though more complicated than Java.

Moreover you will find plenty of C++ biology-oriented libraries like:

4

can you support your opinion -java being too slow- by numbers? http://shootout.alioth.debian.org/u32/which-programming-languages-are-fastest.php

Java certainly lacks genomics libraries.

1

@Travis: Start slowly with the "Hello world" program. After that you can have a look at other developers' programs; that really helps. I used the tutorials on http://www.learncpp.com/

@Ido : Sorry, nothing to show, just my own experience ;)

0

Thanks. Any tips on a good starting point e.g. a relevant book or tutorial?

0

@pasta If your program is limited in speed by the startup time (e.g. HelloWorld), then the performance of the language you choose will likely matter very little. thoughts?

0

It is a common approach to use two languages: one for heavy lifting (C, C++, FORTRAN; it used to be assembler, but that is now uncommon) and another just to connect algorithms and do various computationally non-intensive tasks. The idea is that a lot of short-lived, single-use throwaway code must be written in that "glue" language, so it must be highly productive even at the expense of performance (Perl, Python, Ruby and of course Java can also do that for you). Probably C++ could play both roles at once, who knows. Java currently is not actually slow, but of course you can squeeze more out of C.

4
12.7 years ago
ff.cc.cc ★ 1.3k

I suggest C++. Python is nice and elegant, excellent for prototyping or linking software modules written in different languages, but C++ is much more EFFECTIVE.

I would also suggest keeping away from GUI programming in C++ (at least in the beginning...): write good, reusable command-line tools and leave the graphical interface to Java or a web server.

Good starting points are the books by Deitel and by Eckel, but there are many others too. The C bible is the book by Kernighan & Ritchie. The C++ bible is the text by Stroustrup.

Do not write everything completely from the ground up! Learn how to link libraries and/or template libraries (e.g. getopt for command-line parsing, GSL or Eigen for matrices and linear algebra).

The next step will be STL and boost.org

4
12.7 years ago
Marvin ▴ 890

Learn both C (without the ++) and a truly high level language (e.g. Haskell). You will want the high level language most of the time, lest you get bogged down in chasing down segfaults, but you might occasionally need C, both to write that one tight inner loop, and to understand how to call into libraries written in C.

Forget about C++. It is a step up from C, but that tiny step isn't worth the complexity of this Rube Goldberg language. In the end, you'll write C++ programs and combine them with a scripting language, probably bash. While the idea is right, it makes more sense to use a higher-level scripting language and a lower-level workhorse.

If Haskell is too alien for your taste, try Lua for a scripting language that combines well with stuff written in C. Oh, and all that is just my personal opinion. YMMV.

0

I was thinking this way too ... even borrowing my colleague's C book ... but then found out about D, and realized that will be the way to avoid segfaults (D has GC), while still writing fast code :)

3
12.2 years ago

I am also a bioinformatics programmer who started with Perl. Currently I am using Ruby, which was created with inspiration from Perl.

Currently, I am working with genome data. For processing bulk data, Ruby is better in terms of memory management.

I tried three different programming languages (Perl, Python and Ruby) for parsing XML-formatted genome data. Ruby showed a better response.

It has several features similar to Perl's, plus many more advanced features and add-on modules.

Learning C and C++ is very good for developing applications with binary-level usage and considerations. Ruby has default features for memory management and binary usage.

Ruby supports multiple programming paradigms, including functional, object oriented, imperative and reflective. It also has a dynamic type system and automatic memory management; it is therefore similar in varying respects to Smalltalk, Python, Perl, Lisp, Dylan, Pike, and CLU.

http://en.wikipedia.org/wiki/Ruby_(programming_language)
