Question

Where is the (hg19) reference data for transcript isoforms' relative abundances for each gene?

5.0 years ago

Joel Wallenius ▴ 210

Hello!

I've been googling and have learned that to estimate effective transcript lengths (as part of RNA expression analyses), one should use a weighted average of the lengths of a gene's many transcript isoforms. But for the life of me I can't find a site that houses this relative isoform abundance information. I have the RefSeq transcriptome ready, but it has only the isoforms, not their (average/approximate) relative abundances. What do I do?

Big thanks in advance!

RNA-Seq transcriptomics isoforms • 1.4k views

ADD COMMENT • link updated 5.0 years ago by i.sudbery 19k • written 5.0 years ago by Joel Wallenius ▴ 210

You may want to check Illumina's human bodymap data available from ArrayExpress.

ADD REPLY • link 5.0 years ago by GenoMax 141k

score 2 · Accepted Answer · 2019-05-02

5.0 years ago

i.sudbery 19k

As far as I am aware, no such "reference" dataset exists.

The relative abundance of transcripts varies between cell types and individuals, and even between individual cells. It is highly context dependent. Thus, in order to answer your question you need to think about what cell type/tissue/situation you are interested in.

The following projects house large amounts of expression data for different tissues/cell types:

Genotype-Tissue Expression project (GTEx)

EBI Expression Atlas

The Cancer Genome Atlas Project (TCGA) Firehose portal

Unfortunately many of these will be on hg38, and you may have to do some conversion.

ADD COMMENT • link 5.0 years ago by i.sudbery 19k

OK, I suspected as much... thank you!

But wait - how then do various software tools calculate FPKM values using effective lengths derived from the weighted average of each gene's isoforms?

ADD REPLY • link 5.0 years ago by Joel Wallenius ▴ 210

Software like Salmon or kallisto will:

1) Estimate the expression of each isoform in TPM.
2) Sum these to get the expression of the gene.
3) Calculate an effective length for the gene as the average of the isoform lengths, weighted by each isoform's expression.
4) Use this effective length as an offset in DE analysis.

So you see that the software calculates effective length from expression, not vice versa. That said, a tool that works in FPKM (there are not many of those around now, and I would hesitate to suggest any unless you have a very good reason) will, I guess, need to calculate the effective length in order to calculate the gene FPKM (as opposed to TPM, where it's a simple sum).

ADD REPLY • link 5.0 years ago by i.sudbery 19k

But how exactly do they estimate the expression of each isoform? EM? Is it much more than a pure guess? Reads can't typically tell them apart... or are they looking at reads that overlap the exon boundaries as indicators maybe?

TPM still uses the gene length in its calculation, no way around it. If you want the number of mRNA molecules, rather than the amount of mRNA material, you must divide by the length... am I wrong?

ADD REPLY • link 5.0 years ago by Joel Wallenius ▴ 210

How the expression level is estimated depends on the algorithm, but, yes, Salmon, and I believe RSEM and kallisto, all use some variation on EM to estimate isoform levels. At least in Salmon and kallisto, reads are divided into compatibility classes - a compatibility class is a set of reads that could come from a given set of transcripts. So if a gene has three transcripts A, B and C, there might be a compatibility class for reads that could come from either A or B, a class for those that could come from A or C, a class for reads that could have come from any of the three, and a class for reads that could only have come from B. EM is then applied to figure out the most likely expression value for each isoform. Obviously reads that overlap exon boundaries, or that fall in unique exons, are the most informative.
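To make the compatibility-class idea concrete, here is a toy EM sketch in Python. It is not the actual Salmon or kallisto implementation: the read counts per class are made up, and it ignores transcript lengths, fragment length distributions and bias models, all of which the real tools account for. It only estimates each transcript's share of the reads.

```python
def em_abundances(class_counts, n_iter=200):
    """Toy EM over compatibility classes.

    class_counts maps a frozenset of transcript names to the number of
    reads compatible with exactly that set of transcripts.
    Returns each transcript's estimated share of the reads.
    """
    transcripts = sorted(set().union(*class_counts))
    # Start from a uniform guess of each transcript's share.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    total = sum(class_counts.values())
    for _ in range(n_iter):
        assigned = {t: 0.0 for t in transcripts}
        # E-step: split each class's reads among its compatible
        # transcripts in proportion to the current estimates.
        for members, count in class_counts.items():
            if count == 0:
                continue
            denom = sum(theta[t] for t in members)
            for t in members:
                assigned[t] += count * theta[t] / denom
        # M-step: re-estimate shares from the fractionally assigned reads.
        theta = {t: assigned[t] / total for t in transcripts}
    return theta


# Hypothetical read counts for the four classes described above.
classes = {
    frozenset({"A", "B"}): 300,
    frozenset({"A", "C"}): 200,
    frozenset({"A", "B", "C"}): 400,
    frozenset({"B"}): 100,
}
shares = em_abundances(classes)
for name in sorted(shares):
    print(name, round(shares[name], 3))
```

Note that only B has reads that are unique to it in this made-up example, which is why classes like that are so informative: they pin down at least part of one isoform's abundance directly.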

The TPM calculation doesn't use gene length; it uses transcript length. Going back to our example, let's say our three transcripts have 100, 200 and 300 reads respectively, and they are 1kb, 1.5kb and 2kb long. Thus,

Isoform A has 100 reads per kb,
Isoform B has about 133 reads per kb,
Isoform C has 150 reads per kb.

The next step is to convert this to TPM. To do this we need to sum the reads-per-kb values across all transcripts in the transcriptome. Let's say that when we do this we get a total of 2,000,000. To get the TPM for each isoform we divide by 2,000,000 (that gives the fraction of transcripts) and then multiply by 1,000,000 (to give Transcripts Per Million). This amounts to dividing by 2 in this example.

So the expression of each isoform is:

Isoform A: 50 transcripts per million
Isoform B: about 66.7 transcripts per million
Isoform C: 75 transcripts per million.

Now, to calculate the expression of the gene, we simply add these together, so about 191.7 transcripts in every million are from our gene (i.e. the TPM of the gene is about 191.7).

To calculate the effective length of the gene, we take the weighted average of the isoform lengths:
(50 × 1kb + 66.7 × 1.5kb + 75 × 2kb)/191.7 = 1.57kb.
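If it helps, the same arithmetic can be written as a few lines of Python. This is just a sketch of the worked example: the transcriptome-wide total of 2,000,000 is the assumed figure from above, and nothing is rounded along the way, so the printed values can differ slightly from hand-rounded ones.

```python
# Reads and lengths for the three isoforms in the example.
reads = {"A": 100, "B": 200, "C": 300}
length_kb = {"A": 1.0, "B": 1.5, "C": 2.0}

# Assumed for the example: the sum of reads-per-kb over every
# transcript in the transcriptome, not just these three.
transcriptome_rpk_total = 2_000_000

# Step 1: reads per kb of transcript.
rpk = {t: reads[t] / length_kb[t] for t in reads}

# Step 2: scale so that the whole transcriptome sums to one million.
tpm = {t: rpk[t] / transcriptome_rpk_total * 1_000_000 for t in reads}

# Step 3: gene-level TPM is the simple sum of the isoform TPMs.
gene_tpm = sum(tpm.values())

# Step 4: effective gene length is the TPM-weighted mean isoform length.
gene_eff_len_kb = sum(tpm[t] * length_kb[t] for t in reads) / gene_tpm

for t in sorted(tpm):
    print(f"Isoform {t}: {rpk[t]:.1f} reads per kb, {tpm[t]:.1f} TPM")
print(f"Gene TPM: {gene_tpm:.1f}")
print(f"Effective gene length: {gene_eff_len_kb:.2f} kb")
```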

ADD REPLY • link 4.9 years ago by i.sudbery 19k

Sorry I forgot about this one. Thank you for spelling that out! It makes a lot of sense to me. I don't think I can make much use of it though, since my PI is forcing me to use HTSeq, which doesn't count at the transcript level. Hmm... so the "TPM" I'm calculating (the lengths I use are just the sum of all exon lengths minus the average read length) is more like a GPM, I suppose. ^_^
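For the record, this is roughly the gene-level calculation I mean, as a Python sketch. The gene names, counts and lengths are made up, and the function name is just something I picked for illustration; the length adjustment (summed exon length minus mean read length) is my own convention rather than anything HTSeq provides.

```python
def gene_level_tpm(counts, exon_length_bp, mean_read_length_bp):
    """Gene-level 'TPM' from per-gene read counts.

    counts: gene -> read count (e.g. from a gene-level counting tool).
    exon_length_bp: gene -> summed length of all its exons, in bp.
    The effective length is the summed exon length minus the mean
    read length, which is an assumption of this sketch.
    """
    rate = {}
    for gene, n in counts.items():
        eff_len = exon_length_bp[gene] - mean_read_length_bp
        if eff_len <= 0:
            raise ValueError(f"non-positive effective length for {gene}")
        rate[gene] = n / eff_len
    total = sum(rate.values())
    if total == 0:
        return {gene: 0.0 for gene in rate}
    # Scale so that the values sum to one million across all genes.
    return {gene: r / total * 1_000_000 for gene, r in rate.items()}


# Hypothetical counts and exon lengths for three genes.
counts = {"geneX": 500, "geneY": 500, "geneZ": 1000}
exon_length_bp = {"geneX": 1100, "geneY": 2100, "geneZ": 4100}
gpm = gene_level_tpm(counts, exon_length_bp, mean_read_length_bp=100)
for gene in sorted(gpm):
    print(gene, round(gpm[gene], 1))
```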

Thanks again!

ADD REPLY • link 4.9 years ago by Joel Wallenius ▴ 210