Question

Alignment Free Expression Quantification Using Sailfish: Any Experience?

8

Entering edit mode

10.7 years ago

johnstantongeddes ▴ 410

A paper for RNAseq alignment-free isoform quantification, with the software called sailfish, was recently posted to arXiv. Discussion here, here and here. The method is "18 to 29 times faster than the next fastest method while providing expression estimates of equal accuracy" by using counts of k-mers instead of mapping reads.

Looks promising - any thoughts or experience with this? How quickly (if at all?) should we be switching from our current expression quantification pipelines?

rnaseq isoform • 4.2k views

ADD COMMENT • link updated 10.6 years ago by Madelaine Gogol 5.3k • written 10.7 years ago by johnstantongeddes ▴ 410

0

Entering edit mode

Well, I thought I'd try it. Then I started installing cmake. Then I found out I needed a newer version of g++. I'll get back to you.

ADD REPLY • link 10.6 years ago by Madelaine Gogol 5.3k

1

Entering edit mode

Well, I gave up on trying to build from source... But I just ran the executable on the sample data.

ADD REPLY • link 10.6 years ago by Madelaine Gogol 5.3k

Ram · Answer 1 · 2013-09-05

3

Entering edit mode

10.6 years ago

Madelaine Gogol 5.3k

Well, I tried it out! It still looks promising. I just ran it on some yeast data and compared it a bit to manually calculated RPKMs and some from cufflinks. It was really fast and the RPKMs look good -- cufflinks is the oddball here, while the sailfish RPKMs are closer to the manually calculated ones (from tophat counts with uniquely aligned reads). The bias correction looks a bit odd for one sample, so I'd look into that a bit more before I used that.

I assume the researchers we work with might feel more comfortable using something like this once it's actually published and everything, but it seems worth a try to me. Especially for cases where people are interested in a fairly simple known gene expression type of analysis, this could offer a nice (quick) quantification.

Of course, anytime you switch to a new method, it may be good to do both for awhile just so you notice if anything is really different.

Edited to add -- The following was my fault, do not worry about it. Note: I initially tried it on some human data and building the index was taking too long, so I switched to the yeast data set. Perhaps I was being too polite with our shared server and it would have sped up more if I asked for more threads. But I let it run for 5 days or so before I gave up. Of course, this step only has to be done once for a given transcriptome and k-mer size.

ADD COMMENT • link updated 5.2 years ago by Ram 43k • written 10.6 years ago by Madelaine Gogol 5.3k

0

Entering edit mode

Hi Madelaine, I'm one of authors of Sailfish. Thanks for trying the software! We'd be interested to hear about any experiences that you had running Sailfish. As you know, it's in beta so we're still trying to squash bugs and work out kinks. Would you mind mentioning some info on the transcriptome for which you were building the index? Building the index is linear in the number of distinct kmers, but we've noticed that if you have a very large transcriptome, the memory requirements for building the perfect hash can climb pretty high and any paging out to disk could slow things down a lot. The reason I ask is because the transcriptome we used for the synthetic tests in our paper contained about 105K transcripts and building the index only took 5-10 minutes. The memory overhead of the hash itself, once built, is fairly small, but there is a significant constant-factor memory overhead involved in constructing a minimum perfect hash. We're currently looking into an out-of-memory index construction procedure that is potentially more suitable to large transcriptomes and could be used in such cases. Thanks again for giving our software a try!

ADD REPLY • link 10.6 years ago by Rob 6.5k

0

Entering edit mode

Thanks Madelaine and Rob for responding! I'm keen to evaluate it on my data but have been side-tracked...will report my experiences hopefully soon.

ADD REPLY • link 10.6 years ago by johnstantongeddes ▴ 410

0

Entering edit mode

Hey, thanks for responding! I was using ensembl 72 human transcripts, there are ~ 194k transcripts in my fasta file. I was working on a server with a ton of memory... After I posted this, I saw your google group and read you were working on large transcriptomes.

I just took a look and my transcript fasta is 6 gigs. Didn't realize it was that beefy... I think this must be a pretty inclusive transcriptome, someone else in my group actually constructed it. Here it is if you want to take a look.

ADD REPLY • link updated 5.2 years ago by Ram 43k • written 10.6 years ago by Madelaine Gogol 5.3k

0

Entering edit mode

Ah... I think my "transcriptome" actually includes introns. That would perhaps explain some of the hugeness.

ADD REPLY • link 10.6 years ago by Madelaine Gogol 5.3k

0

Entering edit mode

Hi Madelaine. Yes, I was just going to e-mail you regarding this as I was perplexed by it. It looks like the file you linked has the full (unspliced) genomic DNA of all of the transcripts. If you grab the ensemble 72 human reference cDNA (same set of transcripts), it's about 317Mb; it shouldn't take more than 10-20 minutes to build the index on that.