Question

A very short gene with very high TPM

0

Entering edit mode

6.2 years ago

corend ▴ 70

I assembled my RNA-seq data using cufflinks. This computed FPKM for each gene assembled, and I converted it manually to TPM using: [FPKM/sum(FPKM)]*1e6

I have a gene in all replicates that reaches 600k TPM, meaning that 60% of my transcripts are coming from this gene. I checked his length, 64nt in all of my replicates. And the number of reads matching it (20), in the first replicate.

Is it possible that this transcript is creating a biais in my data? Is it normal that only few reads can create a so high TPM? Maybe miss-mapped reads are creating a huge biais?

EDIT: Some more informations about my data:

RNA-seq, polyA library, unstranded, 3 ovary replicates, 3 testis replicates from my favorite species.

I cleaned my data, removed adapters, bad quality reads etc... The FastQC report is nice.

I just blasted my short sequence and I have 2 hits, only on my species. No predicted genes from ensembl or NCBI at this position. In my 3 ovary replicates I have this overTPMed gene (600k, 400k, 400k). And it is not in my testis replicates. When I check my bams with IGV, I see reads mapping this region in testes replicates too (still fewer than in ovaries), even if they are not assembled in the cufflinks output.

I just ordered my TPM table, I saw many other genes with anormal high TPM. I filtered out assembled genes with < 200nt. And rebuild my TPM table. Now the highest gene is 17k TPM. Is it a good thing to do in that case ?

RNA-Seq Assembly • 3.7k views

ADD COMMENT • link updated 6.2 years ago by Friederike 8.9k • written 6.2 years ago by corend ▴ 70

0

Entering edit mode

bad quality reads etc... The FastQC report is nice

What criteria was used to define "bad" quality? It is not necessary to have nice FastQC report to do further analysis. One can be overzealous in the "cleanup" that can introduce some other bias in the data.

Since you are working with reproductive organs perhaps the gene (fragment) you are seeing may really be overexpressed?

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

Entering edit mode

I did not asked myself those questions, I used fastQC on raw reads, seeing that quality is not optimal and that I had adapters.

Then I used a cleaner (UrQt), and running fastQC on those cleaned reads I had no low quality bases, and no adapters anymore. I assumed that my data was ok for further analyses.

I expect overexpressed genes. But a genes producing 50% of my mRNA. It is too much in my opinion. Checking the real number of reads, which was pretty low, I came here to understand what's going on.

ADD REPLY • link 6.2 years ago by corend ▴ 70

2

Entering edit mode

6.2 years ago

Friederike 8.9k

Does your favorite species have an annotated transcriptome?

The reason I ask is because I think I remember that Cufflinks has a problem with single-exon, short transcripts (see, for example, this paper, from which this quote was taken:

All transcripts with length less than 200nt were extracted, and their corresponding accuracy metrics were calculated using counts and TPM values respectively (...) The Pearson coefficient was negative for Cufflinks (...).

(from the legend of Supplemental Figure 4))

So, maybe it's worth taking the generally more recommend route of kallisto/salmon + tximport.

In addition, I highly recommend you visually check the reads that align to that locus, e.g. using IGV

ADD COMMENT • link 6.2 years ago by Friederike 8.9k

0

Entering edit mode

My species does have an annotated transcriptome. But it is not comprehensive. I would to discover new transcript, and get the TPM expression for all of them. And on the one hand, use kallisto + tximport to find DE genes on my new transcriptome. And on the other hand have the TPM values for all genes to assess which one are the more expressed.

Is there another way to get TPM values from those newly assembled transcripts without using cufflinks?

I checked my BAMS with IGV, and there are few reads mapping those short regions (around 20reads for 60bp).

ADD REPLY • link 6.2 years ago by corend ▴ 70

0

Entering edit mode

Is there another way to get TPM values from those newly assembled transcripts without using cufflinks?

Does cufflinks return the gene models, e.g. in gtf file format that you could use to build a transcriptome index for kallisto or to use with featureCounts?

Apart from that, there seems to be no shortage of transcriptome assemblers, I vaguely remember that RSEM tends to do well in benchmark papers.

I have no experience with transcriptome assembly, but I assume that the good old rule "the more the better" holds true as well, so you may want to consider using all the reads from all your samples and generate a transcriptome model that you then use with the usual reference-based methods. There may be good reasons not to do that, so reading up a bit more on the intricacies of transcriptome assembly may be warranted.

ADD REPLY • link 6.2 years ago by Friederike 8.9k

1

Entering edit mode

6.2 years ago

h.mon 35k

Is it possible that this transcript is creating a biais in my data?

Yes, it is possible. Due to its high counts, the other transcripts will be less abundant in your sequencing, and there will be less power to estimate their abundance. And if this transcript is present in some samples and not others, this may create a bias which will affect differential expression analyses as well.

Is it normal that only few reads can create a so high TPM?

It is possible, though not common. However, we know nothing about your experiment, so we don't have enough information to give reasonable answers.

Maybe miss-mapped reads are creating a huge biais?

Blast the short contig to see if you can find its origin. Maybe it is adapters? Maybe ribosomal RNA? You didn't tell us what pre-processing steps you performed, nor the kind of sample / tissue / library you have.

ADD COMMENT • link 6.2 years ago by h.mon 35k

0

Entering edit mode

Just edited my post, thanks

ADD REPLY • link 6.2 years ago by corend ▴ 70

0

Entering edit mode

6.2 years ago

harold.smith.tarheel ★ 4.9k

The only short sequence string that's likely to present in 60% of reads is the adapter.

ADD COMMENT • link 6.2 years ago by harold.smith.tarheel ★ 4.9k

1

Entering edit mode

I don't understand. 60% of my reads are not mapping to this gene. Only few of them are. But after TPM calculation it is 600k.

Am I wrong?

ADD REPLY • link 6.2 years ago by corend ▴ 70

1

Entering edit mode

You're correct. The TPM feature length scaling is taking a few reads and assigning most of the TPM space to them.

ADD REPLY • link 6.2 years ago by Matt Shirley 10k

0

Entering edit mode

I'm not convinced it's solely a problem of TPM since that should take care of the relative abundance issue. 20 reads for 60 bp does not seem like it should yield a massive enrichment, but my gut feeling could, of course, be completely off (since the result does depend on the rest of the transcriptome, sequencing depth etc.)

In fact, you could calculate the TPM yourself based on the number of reads that you see in IGV etc. (e.g., following Harold Pimentel's R code)

I think the issue is that Cufflinks does some magic when calculating its FPKMs which seems to yield a weird result for the described cases.

ADD REPLY • link 6.2 years ago by Friederike 8.9k

score 4 · Accepted Answer · 2018-01-22

4

Entering edit mode

6.2 years ago

Matt Shirley 10k

Your observation of very short transcripts receiving a disproportionate amount of the TPM is consistent with my observations. It's clear why this is the case. TPM is roughly:

(feature counts / feature length)
/ (sum of [all feature counts / feature lengths]) * 1e6

and so for features with a small length you will get a much larger numerator even will small counts.

One way to mitigate this issue is by using TMM scaling on your matrix of TPM values to remove the contribution of extreme values, maintaining the hypothesis that most genes are not differentially expressed in an experiment.

ADD COMMENT • link 6.2 years ago by Matt Shirley 10k

0

Entering edit mode

What I am reading right now is that TMM scaling is usable on expected read count and not on TPM values. You think that I can use TMM on my TPM matrix?

TMM should be computed in terms of expected read counts.

I tried TMM with edgeR:

Input TPM Table: o

            o1         o2      o3         t1       t2       t3
XLOC_000001 14.290500  8.78600 14.069866  1.376451 3.24118  2.40424
XLOC_000002  5.644613 25.43497 13.819135 13.292400 9.45567  9.58778
XLOC_000003 10.140201  9.85379  7.777245  9.308872 8.46203  8.29616

d = DGEList(counts=o)
d = calcNormFactors(d,method="TMM")
counts.per.m <- as.data.frame(cpm(d, normalized.lib.sizes=TRUE))

Output TMM Table: counts.per.m

            o1         o2       o3        t1        t2       t3
XLOC_000001 17.001266  8.05838  7.289297  1.285580  4.84603  3.0483
XLOC_000002  6.715847 23.32895  7.159485 12.413405 14.12459 12.1217
XLOC_000003 12.072991  9.03748  4.029581  8.692491 12.66794 10.5005

I checked if I still have those superhigh TPM values, and yes, my anormal TPM values are still anormal in TMM table.

ADD REPLY • link 6.2 years ago by corend ▴ 70

1

Entering edit mode

I should have been more clear. The issue I solve by using TMM scaling is that because TPM is a unity measure (the sum of all values must be 1e6), highly expressed transcripts/genes in a single sample will cause all other genes to appear down regulated, when in fact they are not. TMM normalization scaling factors will actually make your issue worse, by scaling highly expressed genes to an even higher value. In your case it may be that features smaller than the insert size of your library cannot be accurately measured using TPM, and you may have better luck using counts data as input for differential expression analysis. From the paper I linked above:

The number of tags expected to map to a gene is not only dependent on the expression level and length of the gene, but also the composition of the RNA population that is being sampled. Thus, if a large number of genes are unique to, or highly expressed in, one experimental condition, the sequencing 'real estate' available for the remaining genes in that sample is decreased. If not adjusted for, this sampling artifact can force the DE analysis to be skewed towards one experimental condition.

ADD REPLY • link 6.2 years ago by Matt Shirley 10k