Question

How Are Rna-Seq Transcripts Assigned?

10

Entering edit mode

12.2 years ago

Dave Bridges ★ 1.4k

I am looking at some RNA-seq data analyzed by cufflinks using standard conditions, and trying to determine the relative ratios of the transcripts of this gene:

sample transcripts

Since some transcripts are entirely contained within longer transcripts, how does cufflinks know if a particular read is to be assigned to a shorter or a longer version of that transcript. As an example, a read towards the 3' end could be assigned to 30958 or 30955. I ask because I noticed for this gene the shorter 3' version (30958) had much higher FPKM than the longer version and I was worried that this is an implicit bias of the software.

rna cufflinks isoform rna-seq • 16k views

ADD COMMENT • link updated 9.7 years ago by Malachi Griffith 19k • written 12.2 years ago by Dave Bridges ★ 1.4k

score 41 · Answer 1 · 2012-01-25

I would urge caution when interpreting isoform specific expression values derived from RNA-seq data, whether you are using Cufflinks or any other RNA-seq assembly/expression package. When you see nice long, assembled transcripts like those in your screenshot it is easy to forget what RNA-seq data actually represents.

In RNA-seq we are not sequencing full-length transcripts or anything even close to them. The average human protein coding transcript has ~8-10 exons and is ~2,000 bp in length. An RNA-seq library is made up of fragments of cDNA of ~200-400 bp in length which are sequenced partially from each end. Furthermore, the strand the original mRNA sequence was transcribed from is unknown in most library preparation strategies. Sometimes we can infer the strand by examining splice site spanning reads. We can also infer local structural information about the transcript a cDNA fragment may have been derived from. But putting together the large transcripts depicted in your screenshot takes a lot more inference. Cufflinks seems to have some clever modeling and inference built into it, but the problem being addressed is very complicated. The larger the transcripts and the more transcripts expressed from a single locus, the harder it is to tell what the full-length transcripts really look like and how abundant each one is. Looking at your example, you can easily see that the vast majority of short reads will correspond to multiple transcripts.

Each transcript has very few (or 0) exons and exon-exon junctions that are unique to the transcript. As an experiment, try focusing on those features that are unique to a single transcript. For example, determine the reads that map to these exons and junctions and see how the relative ratio of each transcript compares to what Cufflinks tells you. This approach still has the caveat that there may be some other isoform we don't yet know about that shares our supposedly unique features. Again, this isn't a criticism of the Cufflinks approach but rather a reminder of the difficulty of the problem and the potential value of performing isoform specific follow-up experiments (either in silico, in the wet lab, or preferably both). Some approaches (e.g. ALEXA-seq), dodge this problem by focusing on individual sequence features without attempting to infer too much about the structure of full length transcripts. The idea being that if you can identify differentially/alternatively expressed exons or junctions first, you can then follow up in the lab with a technique more suitable to resolving full-length cDNA structures. The gold standard (IMHO) being full-length sequencing and finishing of a large pool of cloned cDNAs generated by reverse transcription with a pool of gene specific primers representing the diversity of known 3' transcript ends at the locus. This is labor intensive...

Perhaps one day we will be sequencing full length cDNA libraries in a massively parallel fashion without the need to fragment. Until then we can easily keep busy working on this problem... :)

A list of alternative splicing tools and resources can be found in this post: Recommended tools for alternative splicing detection from RNA-seq data

score 18 · Answer 2 · 2012-01-26

malachig gave a very good answer but just to address to question of how Cufflinks assigns reads to transcripts, it uses (at its core, though with many bells and whistles) an EM (expectation maximization) algorithm to estimate ML (maximum-likelihood) probabilities of each read coming from a certain isoform.

Imagine that you have a set of reads mapping to the gene in question, and a set of isoforms for that gene. If you knew the abundance of each isoform, you could probabilistically "distribute" each read across the isoforms that it could possibly have come from (that is, assign a probability for that read coming from each of the matching isoforms), using the abundance of the isoforms. (Of course, you don't have the abundances.) And if you knew in advance the probability of each read coming from each of the isoforms, you could calculate the resulting estimated isoform abundances. Of course, you don't know these either. How to solve this conundrum? By iteration!

The EM algorithm in Cufflinks basically iterates a procedure where it guesses the probabilities of each read coming from each isoform (based on the abundances of the isoforms that are compatible with the read), then re-calculates the abundances with the updated read<->isoform assignment, then again re-assigns the probabilities based on the newly updated abundance estimates, and so on until convergence. (The probabilities and abundances are probably initialized either to be uniform across the reads/isoforms or randomized in some way.) The determination of whether a read is compatible with an isoform can use information from spliced and paired-end alignments.

I may or may not have the steps in the wrong order, but I believe this is essentially how Cufflinks works (in the quantification mode - it also has reference-based transcript assembly modes, of course). Cufflinks has a lot of added features in addition to what I outlined, like a Monte Carlo-based calculation of confidence intervals, bias correction, etc.

For a relatively gentle mathematical description of the basic idea, see the "ancestral paper" An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs (not about Cufflinks but describes the EM framework in simple terms), and for a very good albeit more challenging description of Cufflinks and many other isoform quantification procedures, see Models for transcript quantification from RNA-Seq (by Lior Pachter, one of the people behind Cufflinks.)

Ram · Answer 3 · 2012-01-25

From 'How Cufflinks works':

To estimate isoform-level abundances, one must assign fragments to individual transcripts, which may be difficult because a read may align to multiple isoforms of the same gene. Cufflinks uses a statistical model of paired-end sequencing experiments to derive a likelihood for the abundances of a set of transcripts given a set of fragments. This likelihood function can be shown to have a unique maximum, which Cufflinks finds using a numerical optimization algorithm. The program then multiplies these probabilities to compute the overall likelihood that one would observe the fragments in the experiment, given the proposed abundances on the transcripts.

Presumably more details would be found in one of the cufflinks publications (e.g., http://www.nature.com/nbt/journal/v28/n5/full/nbt.1621.html)...

score 0 · Answer 4 · 2013-08-07

Thank you Malachi, Mikael and Andrew for the excellent input. I have a very similar problem in my research: I want to calculate the single-nucleotide resolution abundance for each isoforms, so I need to assign fragments to individual isoforms. It sounds to me that Cufflinks may have obtained this information in calculating transcripts abundance (my understanding from Andrew's answer). However, this information seems lost and Cufflinks only outputs transcripts abundance. So I am wondering whether there is a way to find out this information, either through Cufflinks or from other tools. If no such tool is available, would someone suggest an algorithm?