TPM values of expressed genes
5.1 years ago
Bogdan ★ 1.4k

Dear all,

considering an RNA-seq experiment and analysis that provides the expression values as TPM, could you please let me know what minimum TPM value is needed in order to consider a gene to be expressed?

Talking about RPKM/FPKM units, I remember that a gene was considered expressed if RPKM (or FPKM) > 1 ... thanks a lot,

-- bogdan

RNA-Seq • 16k views

Thank you very much for your comments and insights ;)

5.1 years ago

I do not believe there is any definitive answer. So many factors go into each experiment that it is difficult to pick a single value. An RPKM / FPKM value of 1 seems quite low to me, i.e., in 'error' territory. What you have to consider is the distribution of your data and its suitability for whatever downstream tools you will use. If including low-count / lowly expressed genes is going to distort your data distribution and introduce biases, then you need to remove them; check via histograms.
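A minimal sketch of the histogram check, assuming TPM values are held in a plain dictionary keyed by gene. The gene names, the example values, the pseudocount of 1, and both function names are made up for illustration; nothing here is a recommended cut-off.

```python
import math
from collections import Counter


def log2_tpm_histogram(tpm_values, bin_width=1.0, pseudocount=1.0):
    """Count genes per bin of log2(TPM + pseudocount).

    Keys of the returned dict are the left edges of the bins.
    """
    counts = Counter()
    for tpm in tpm_values:
        bin_index = math.floor(math.log2(tpm + pseudocount) / bin_width)
        counts[bin_index * bin_width] += 1
    return dict(sorted(counts.items()))


def filter_by_tpm(tpm_by_gene, min_tpm):
    """Keep only genes whose TPM is at or above a cut-off you chose yourself."""
    return {gene: tpm for gene, tpm in tpm_by_gene.items() if tpm >= min_tpm}


# Hypothetical toy data, not from any real experiment.
example = {"geneA": 0.0, "geneB": 0.4, "geneC": 1.0, "geneD": 7.0, "geneE": 120.0}

hist = log2_tpm_histogram(example.values())
for left_edge, n_genes in hist.items():
    # Crude text histogram: one '#' per gene in the bin.
    print(f"[{left_edge:4.1f}, {left_edge + 1.0:4.1f})  {'#' * n_genes}")

kept = filter_by_tpm(example, min_tpm=1.0)
```

With real data you would look at where the low-expression bulk sits before and after filtering, and re-check the histogram once a cut-off is applied.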

From RNA-seq, most genes are lowly expressed, possibly due to transcriptional 'noise' more than anything else. I say 'noise' knowing that these reads may reflect genuine transcription that has no regulatory function and is an artifact of other transcriptional processes that have occurred. They may also reflect regions where TF binding and/or promoter activity was weak.

So, you have the liberty to choose your own cut-off for TPM and state it in the methods. :)

Please take the time to read Gordon's answer, here: https://support.bioconductor.org/p/98820/#98875

Kevin

Edit: another interesting discussion: https://www.researchgate.net/post/How_to_determine_whether_a_gene_is_expressed_in_RNA-seq2


Naive cutoffs will probably miss lowly expressed but important genes. See the last paragraph of this post by Obi Griffith: How Much Coverage Do We Need For An Rna-Seq Experiment?

5.1 years ago
igor 13k

As already pointed out, there is no ideal cutoff. However, there is at least one method, zFPKM, that tries to define an expression cutoff.

BioC: https://bioconductor.org/packages/release/bioc/vignettes/zFPKM/inst/doc/zFPKM.html

Publication: https://www.ncbi.nlm.nih.gov/pubmed/24215113

the community adopted several heuristics for RNA-seq analysis, most notably an arbitrary expression threshold of 0.3 - 1 FPKM for downstream analysis. However, advances in RNA-seq library preparation, sequencing technology, and informatic analysis have addressed many of the systemic sources of uncertainty and undermined the assumptions that drove the adoption of these heuristics. ... We use ENCODE data on chromatin state to show that ultralow-expression genes are predominantly associated with repressed chromatin; we provide a novel normalization metric, zFPKM, that identifies the threshold between active and background gene expression; and we show that this threshold is robust to experimental and analytical variations.
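A rough sketch of the idea described in the abstract, as I understand it: take log2(FPKM), locate the main peak of the distribution, estimate a standard deviation from the right-hand half of that peak (half-Gaussian fit), and express each gene as a z-score. This is not the Bioconductor package's code. The function name `zfpkm_sketch`, the `bin_width` parameter, and the use of a simple histogram mode in place of a kernel density estimate are all simplifying assumptions of mine; for real work use the zFPKM package linked above.

```python
import math
from collections import Counter


def zfpkm_sketch(fpkm_values, bin_width=0.5):
    """Simplified zFPKM-style transform for one sample.

    Assumption: the tallest histogram bin of log2(FPKM) stands in for the
    density peak of the 'active' gene population. Real data can be bimodal,
    and peak selection then needs more care than this.
    """
    logs = [math.log2(f) for f in fpkm_values if f > 0]
    if not logs:
        raise ValueError("need at least one FPKM value above zero")

    # Mode of log2(FPKM) via a histogram; ties go to the rightmost bin.
    bins = Counter(math.floor(x / bin_width) for x in logs)
    top_bin = max(bins, key=lambda b: (bins[b], b))
    mu = (top_bin + 0.5) * bin_width  # centre of the tallest bin

    # Half-Gaussian fit on the right side: for a half-normal distribution,
    # mean deviation from mu equals sd * sqrt(2 / pi).
    right = [x for x in logs if x > mu]
    if not right:
        raise ValueError("no values to the right of the peak")
    sd = (sum(right) / len(right) - mu) * math.sqrt(math.pi / 2)

    # Genes with FPKM of zero have no finite log and get -inf.
    return [(math.log2(f) - mu) / sd if f > 0 else float("-inf")
            for f in fpkm_values]


# Hypothetical toy values, for illustration only.
toy_fpkm = [0, 1, 8, 8, 8, 16, 64]
toy_z = zfpkm_sketch(toy_fpkm)
```

Genes would then be called active or background by comparing the z-score to a cut-off; to my recollection the package documentation suggests a value around -3, but check the vignette before relying on that.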


I have been using zFPKM more and more in situations where I have encountered FPKM data. I believe I saw that you mentioned it in another post a few months back. Thanks Igor.

5.1 years ago

There is no such thing as a cut-off, because there is no such thing as not expressed - the whole genome is transcribed at some level in any given cell type. However, this doesn't stop us from sometimes needing to make a decision: which genes to include in a metagene analysis, for example.

On a purely technical note, one could define a cut-off as the point at which we can't distinguish between low expression levels and technical noise. It seems like zFPKM is doing something similar to this, but hand-rolled versions I've seen shift exons into nearby, but unexpressed, GC-matched genome regions and then quantify them to get an average signal for strictly unexpressed sequence. I guess the difference between this and zFPKM is that zFPKM gives you a level for "unexpressed genes", whereas this gives a baseline level for "not genes".

A different approach is to think about what the level actually means. This is next to impossible with FPKM, but FPKM is readily translatable into TPM, and then the meaning is quite concrete. For example, if the average cell has 200,000 mRNA molecules in it at any one time, then a TPM of 5 would translate to 1 molecule per cell on average at any one time.
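The arithmetic above can be written out as a small sketch. The 200,000 mRNA molecules per cell is the figure assumed in this answer, not a measured constant, and both function names are made up for illustration. The FPKM to TPM step is the usual within-sample rescaling so that values sum to one million.

```python
def fpkm_to_tpm(fpkm_values):
    """Rescale one sample's FPKM values so that they sum to 1e6 (TPM)."""
    total = sum(fpkm_values)
    if total == 0:
        raise ValueError("all FPKM values are zero")
    return [f / total * 1e6 for f in fpkm_values]


def tpm_to_molecules_per_cell(tpm, mrna_per_cell=200_000):
    """Expected transcript copies per cell for a given TPM.

    mrna_per_cell is an assumption about the cell type, not a known value.
    """
    return tpm * mrna_per_cell / 1e6


# 5 transcripts per million, times 200,000 transcripts per cell:
copies = tpm_to_molecules_per_cell(5)  # 1.0 molecule per cell on average

# Hypothetical three-gene sample, for illustration only.
toy_tpm = fpkm_to_tpm([1.0, 1.0, 2.0])
```

Changing `mrna_per_cell` shifts the interpretation proportionally, so the "1 molecule per cell" reading of TPM 5 only holds under the 200,000 assumption.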

Finally, you could think in a distributional sense. I think in the last sample I looked at, a TPM of 5 put you in the top 10% of most highly expressed genes.
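To see where a candidate threshold falls in your own sample, something like the following sketch would do. The function name and the toy TPM values are invented for illustration and do not reproduce the 10% figure mentioned above.

```python
def fraction_at_or_above(tpm_values, threshold):
    """Fraction of genes whose TPM is at or above the threshold."""
    values = list(tpm_values)
    if not values:
        raise ValueError("no TPM values given")
    return sum(1 for v in values if v >= threshold) / len(values)


# Hypothetical sample of ten genes.
toy_sample = [0.0, 0.0, 0.2, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 50.0]
top_fraction = fraction_at_or_above(toy_sample, threshold=5.0)  # 2 of 10 genes
```

Running this per sample for a few candidate thresholds shows quickly how many genes each choice would keep.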

In the end it depends on what the purpose of the threshold is.
