Biostar Beta. Not for public use.
To What Degree Could We Determine Whether A Gene Is Expressed?
7
Entering edit mode
13 months ago
liupfskygre • 190
United States

Hi all, A question came to my mind when analysis gene expression using RNA-seq. After using DEseq2, a large number of genes were listed as differential expressed (e.g. FDR<0.01 or 0.05, absolute of log2 fold change >1). When took look at the FPKM of those genes, lots of them were very small. I think took all of them for further analysis is time consuming and also included a bunch of noises.

I saw a paper use the FPKM of one housekeeping gene they used to do QPCR as the cutoff. (http://aem.asm.org/content/79/7/2397.abstract)

In my opinion, use the mean FPKM (total FPKM of genes/gene number) of each condition as a cutoff/threshold might be a choice.

Any other thoughts on this issue are greatly appreciated!

fpkm rna-seq • 5.3k views
ADD COMMENTlink
0
Entering edit mode

Are you removing genes with very low overall read counts from your analysis?

ADD REPLYlink
10
Entering edit mode
2.2 years ago
Ryan Dale 4.8k
Bethesda, MD

This 2013 BMC Genomics paper describes zFPKM, which they use to set a single threshold for yes/no expression across a bunch of experiments. Their methods are clear and straightforward to implement -- here's how you might calculate zFPKM in Python:

from scipy.stats import gaussian_kde
import numpy as np

# assume "fpkm" is a NumPy array of log2(fpkm) values.

kernel = gaussian_kde(fpkm)
xi = np.linspace(fpkm.min(), fpkm.max(), 100)
yi = kernel.evaluate(xi)
mu = xi[np.argmax(yi)]
U = fpkm[fpkm > mu].mean()
sigma = (U - mu) * np.sqrt(np.pi / 2)
zFPKM = (fpkm - mu) / sigma

Then you can choose a zFPKM value to use as your threshold. They chose zFPKM = -3 for human RNA-seq after comparing with active/repressive histone mod data.

ADD COMMENTlink
0
Entering edit mode

thank you! this is what I am looking for!

ADD REPLYlink
0
Entering edit mode

Hi Daler! I am new to programming and am having some trouble creating a NumPy array. When I use kernel=gaussian_kde(fpkm), I get this error:

>>> kernel=gaussian_kde(fpkm1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/stats/kde.py", line 188, in __init__
    raise ValueError("`dataset` input should have multiple elements.")
ValueError: `dataset` input should have multiple elements.

I was wondering if you could please explain how to create a NumPy array from an Excel, .txt or .csv file? I appreciate your help so much!

ADD REPLYlink
0
Entering edit mode

It's impossible to tell what happened here without knowing how you created fpkm1. But you might want to take a look at the NumPy genfromtxt() docs or the Pandas IO tools documentation.

ADD REPLYlink
0
Entering edit mode

Okay thank you! I'll take a look at those.

ADD REPLYlink
3
Entering edit mode
4.0 years ago
cankutcubuk • 170
Spain

Hi,

Can I apply the zFPKM approach by RPKM data(zRPKM)?

Or which kind of normalization methods are compatible with this method?

By the way, here you can find the R script for zFPKM normalization.

I inspired from the python code which has given above.

install.packages("ks","pracma")


library(ks)


library(pracma)

/ fpkm is an example data /

fpkm <- c(1,2,3,4,5,6,7,8,4,5,6,5,6,5,6,5,5,5,5,6,6,78,8,89,8,8,8,2,2,2,1,1,4,4,4,4,4,4,4,4,4,4,4,3,2,2,3,23,2,3,23,4,2,2,4,23,2,2,

24,4,4,2,2,4,4,4,2,2,4,4,2,2,4,2,45,5,5,5,3,2,2,4,4,4,4,4,4,4,4,4,3,2,2,3,23,2,3,23,4,2,2,4,23,2,2,24,4,4,2,2,4,4,4,2,2,4,4

,2,2,4,2,45,5,5,5,3,2,2)

xi = linspace(min(fpkm),max(fpkm),100)

fhat = kde(x=fpkm,gridsize=100,eval.points=xi) 

/ here I put digits=0. if I you do not round the numbers(yi) the results are a little bit changing./

yi = round(fhat$estimate,digits=0) 

mu = xi[which.max(yi)]

U = mean(fpkm[fpkm>mu])

sigma = (U-mu)* (sqrt(pi/2))

zFPKM = (fpkm - mu) / sigma

Cankut CUBUK

Computational Genomics Program - Systems Genomics Lab

Centro de Investigacion Principe Felipe (CIPF)

C/ Eduardo Primo Yufera nº3
46012 Valencia, Spain
http://bioinfo.cipf.es

ADD COMMENTlink
0
Entering edit mode

Hi,

I was trying to use your script in R, but maybe there are some problem with infinite values.
As far as I understood, the idea is use logFPKM as input, isn't it?

when I runned the script I have this error:

Error en seq.default(x1, x2, length.out = n) :
'from' cannot be NA, NaN or infinite

Maybe you can help me to resolve it? thanks!!

ADD REPLYlink
0
Entering edit mode

I already resolve the problem of infinite values, however the zFPKM that I obtained are all positive... so I consider that there is other problem...

ADD REPLYlink
1
Entering edit mode
9 months ago
Duarte, CA

I use log2(RPKM + 0.1) values for differential expression to minimize the number of low coverage genes with unreasonably high differential expression values.

You can see the relevant benchmarks for various rounding values here:

http://bioinfo.aizeonpublishers.net/content/2013/6/285-292.html

The portions of that paper comparing differential expression algorithms are also summarized in this blog post:

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

ADD COMMENTlink
0
Entering edit mode
11 months ago
jockbanan • 380
Czech Republic

From my (limited) experience: If you want more conservative results, try the newest approach from EdgeR package - function glmQLFTest(). It always gave me smaller number of significant genes than DEseq2 and classic EdgeR (in the form that is described in manual), as well as other tools like CLCBio. In addition, no or almost no genes reported by glmQLFTest() were tool-specific. It seems to report almost exactly genes reported by all the other tools I tried and no more, which makes me believe in these results a bit more. It may of course not work this way with your data and you will probably need to set some cutoff even if it will, but it is something to start with.

ADD COMMENTlink
0
Entering edit mode

thanks for this suggestion. I will try it later!

ADD REPLYlink
0
Entering edit mode
12 months ago
Michael Love ♦ 1.8k
United States

Can you clarify, is your question how to tell if a gene is expressed, or differentially expressed?

In the latest release of DESeq2, we have implemented functionality to test for differential expression above a critical threshold, as sometimes users are not interested in _all_ statistically significant genes, but actually wanted to find large fold changes. This can be accomplished in with:

results(dds, lfcThreshold=x, altHypothesis="greaterAbs")

Where x is the critical threshold specified by the user. The fold changes are in the log2 space, so x=1 finds fold changes > 2 and < 1/2, x=2 finds fold changes > 4 and < 1/4, etc. Note that this will provide p-values testing the null hypothesis that |ß| < x, which is not the same as testing the null of |ß| = 0 and filtering out |ß| < x.

ADD COMMENTlink
0
Entering edit mode

https://www.biostars.org/u/8394/ was hoping for a method to determine expression (presumably akin to that used by mas5 on affy arrays), mostly as a method for filtering DE results. Of course there's no great biological definition to go by there, so any answer would be a bit hand wavy. BTW, welcome to biostars.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1