Main difference between RSEM and k-mer based quantification methods
uki_al ▴ 50 · 7.2 years ago

Hi, I've been researching quantification methods and have two questions I could use some help with:

  1. What is the main difference, precision-wise, between alignment-based and alignment-free methods for RNA-seq quantification? I really like some of the methods implemented in tools like Salmon and Kallisto, so I was wondering whether any accuracy is lost by dropping full alignment. Does quasi-mapping or pseudo-alignment (roughly sketched after these questions) lose any important information? From the various papers I've seen, RSEM seems to have a small advantage in accuracy (though I could be misinterpreting those results), and I was wondering why that is.

  2. RSEM does not work with gapped alignments: why is that, and what does it mean in practice? If there is a potential indel in the sample, does the aligner have to not report it in order for RSEM to work? Does that affect the accuracy of the results in some way? Do k-mer based methods not have this problem?
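
To make sure I understand what pseudo-alignment keeps and what it throws away, here is my own rough toy sketch of the idea (a deliberate simplification, not how kallisto or Salmon are actually implemented): each read is reduced to the set of transcripts compatible with all of its k-mers, with no base-level alignment detail retained.

```python
# Toy illustration of pseudo-alignment (my own simplification): a read is
# mapped to the set of transcripts that contain all of its k-mers; positions,
# mismatches, and indels are not recorded.

from functools import reduce

K = 5  # toy k-mer size; real tools use k around 25-31

transcripts = {
    "tx1": "ACGTACGTGGAACCTTGG",
    "tx2": "ACGTACGTGGAACCTAGG",
}

# Build a k-mer -> {transcripts containing it} index.
index = {}
for name, seq in transcripts.items():
    for i in range(len(seq) - K + 1):
        index.setdefault(seq[i:i + K], set()).add(name)

def pseudo_align(read):
    """Return the set of transcripts compatible with all of the read's k-mers."""
    kmer_hits = [index.get(read[i:i + K], set())
                 for i in range(len(read) - K + 1)]
    return reduce(set.intersection, kmer_hits) if kmer_hits else set()

print(pseudo_align("ACGTACGTGG"))  # {'tx1', 'tx2'}: ambiguous between both
print(pseudo_align("GAACCTTGG"))   # {'tx1'}: only tx1 contains all these k-mers
```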

RNA-Seq salmon rsem kallisto

The answer to (1) is covered in the salmon/kallisto papers.


Hi, thanks. Yes, I've seen in the kallisto paper that the accuracy of kallisto is close to that of RSEM: the median relative difference in the estimated TPM values of each transcript, averaged across 20 RSEM simulations of reads based on the TPM estimates and error profile of Geuvadis sample NA12716, is 0.16 for kallisto and 0.11 for RSEM. From this comparison you could conclude that the accuracy is similar. But I was more curious about the subtle differences, e.g. why is there still even this small gap between the two (0.16 for kallisto vs. 0.11 for RSEM)? It would imply that some small information loss occurs with kallisto. Does this come from the alignment vs. pseudo-alignment part, from the particular EM algorithm used, or from something else?
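
For reference, this is roughly how I understand that metric (my own sketch and my own guess at the definition of "relative difference"; the paper's exact formula may differ):

```python
# Sketch of the accuracy metric described above: the median, over transcripts,
# of the relative difference between estimated and ground-truth TPM.
# The definition of relative difference here is my own assumption.

import numpy as np

def median_relative_difference(tpm_est, tpm_true):
    """Median over transcripts of |est - true| / mean(est, true),
    skipping transcripts that are zero in both."""
    tpm_est = np.asarray(tpm_est, dtype=float)
    tpm_true = np.asarray(tpm_true, dtype=float)
    denom = 0.5 * (tpm_est + tpm_true)
    keep = denom > 0
    rel_diff = np.abs(tpm_est[keep] - tpm_true[keep]) / denom[keep]
    return float(np.median(rel_diff))

print(median_relative_difference([10.0, 0.0, 5.0], [12.0, 0.0, 5.0]))
```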

Rob 6.5k · 7.2 years ago

Just to comment a bit on Devon's (accurate & ideal) response: we've actually been looking into the fine-grained particulars of this and have a paper under submission that explores the nature of these small differences in some detail. There are at least two potential reasons for the differences one observes in practice. By running Salmon in alignment-based mode, we can set aside the alignment vs. "mapping" difference and look just at how accuracy (as judged by the useful but less-than-perfect metric of accuracy on data simulated using RSEM-derived quantifications) varies between methods.

One hypothesis that withstood our testing is that the difference can be explained by the fact that certain methods factorize the likelihood being optimized (i.e., when running their EM / VBEM procedures, they do not consider each fragment independently, but group certain fragments together for the purposes of quantification). We were also able to derive new factorizations with very similar performance (in terms of time / memory requirements) to the ones currently used. These new factorizations, nonetheless, don't exhibit easily measurable differences from methods, like RSEM, that optimize an un-factorized (or full) likelihood. That is, there are groupings that factorize the likelihood in a different (and more data-dependent) way, and that are more faithful to the un-factorized likelihood. I'll note here that there are also small differences attributable to traditional alignment vs. fast mapping strategies. We are investigating these further as well, though one must be particularly careful here not to bias the validation, considering that simulated data is often generated using a model that incorporates (and encodes important information in) alignment characteristics.
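
To make the factorization idea concrete, here is a deliberately simplified toy sketch (my own example, not Salmon's or kallisto's actual code) of an EM that works over equivalence classes of fragments rather than over individual fragments; effective lengths, fragment-level alignment probabilities, and other model terms are ignored:

```python
# Toy EM over equivalence classes: fragments compatible with the same set of
# transcripts are collapsed into a single count, so the E-step touches one
# term per class instead of one term per fragment. Grouping this way is what
# "factorizing the likelihood" refers to above.

import numpy as np

# Equivalence classes: (tuple of compatible transcript ids, fragment count).
eq_classes = [((0,), 120), ((0, 1), 300), ((1, 2), 80), ((2,), 40)]
n_tx = 3

abund = np.full(n_tx, 1.0 / n_tx)            # initial abundance estimates
for _ in range(100):                          # EM iterations
    expected = np.zeros(n_tx)
    for tx_ids, count in eq_classes:          # E-step: one term per class
        idx = list(tx_ids)
        weights = abund[idx]
        expected[idx] += count * weights / weights.sum()
    abund = expected / expected.sum()         # M-step: renormalize

print(np.round(abund, 3))
```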

Regarding (2): the reasoning behind this is likely more historical than anything else. RSEM incorporates a model of alignment that simply doesn't model insertions or deletions, though there is nothing inherent about the method itself that precludes this. For example, Salmon (in alignment-based mode), Tigar, eXpress, BitSeq and many other tools support insertions and deletions in the alignments. RSEM will simply not process samples with indels in the alignments (that's why, if you use the built-in wrapper scripts to process the reads, RSEM runs Bowtie2 in a manner that disallows insertions and deletions in the mappings). There has not, to my knowledge, been a detailed study of the effect this has on accuracy in different cases. In most common cases one would expect indels to be rather rare and, therefore, the effect of ignoring them to be rather small. On the other hand, it certainly seems possible that, if important (unknown) indels exist, allowing reads to align / map over them could improve quantification. Existing alignment-free methods will map (and account for in quantification) reads that exhibit indels with respect to the reference.
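
If you want to see how much of your own data this affects, something along these lines (the BAM path is just a placeholder; assumes pysam is installed) would count the aligned reads whose CIGAR contains an insertion or deletion, i.e. the records RSEM's alignment model does not accept:

```python
# Sketch (hypothetical file path): count aligned reads whose CIGAR contains
# an insertion (op code 1) or deletion (op code 2) with respect to the
# transcriptome reference.

import pysam

INS, DEL = 1, 2  # BAM CIGAR operation codes for insertion / deletion

with pysam.AlignmentFile("aligned_to_transcriptome.bam", "rb") as bam:
    total = with_indels = 0
    for read in bam:
        if read.is_unmapped or read.cigartuples is None:
            continue
        total += 1
        if any(op in (INS, DEL) for op, _ in read.cigartuples):
            with_indels += 1

print(f"{with_indels} of {total} aligned reads contain an indel")
```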


Hi Rob, thanks for the answer; your answers are always incredibly appreciated!
