Biostar Beta. Not for public use.
Question: Fold Change In R
0
Entering edit mode

Hi guys, Brand new to going through RNA-seq data, and learning on the go. I have been trying to make a volcano plot of some example data from GEO, to determine up-regulated genes in the sample. I am having problems with getting reasonable outputs for my fold-change values, with insanely high values over 5000 and below -10000. I am not sure if this is a result of just the data or something I have been doing myself. The reason for the volcano plot is so I can then determine the upregulated genes, usually done by looking at the genes over expressed (>=1).

I have been getting the average of the expression data for the "diabetic" and "non-diabetic". To determine the fold change I have then just subtracted the average of the diabetics from the average of the health. This is based off an example with a different dataset.

avg.t2d <- apply(exdat[,t2d.index==" diabetic"], 1, mean)
avg.healthy <- apply(exdat[,t2d.index==" non-diabetic"], 1, mean)
foldch <- avg.t2d - avg.healthy

logt2dp <- -log(t2dpvals, base=10)
plot(foldch, logt2dp, pch=1, xlab="Log2 Fold Change (L/S)",
     ylab="-Log10(P-val)", main="Volcano Plot", col = ifelse(adj_pvals<0.05, 'green', 'black'))

Am I missing something big here or am I over thinking the problem?

ADD COMMENTlink 2.1 years ago MarjoryMollusc • 40 • updated 2.1 years ago seidel 6.8k
Entering edit mode
1

Use log2 for exploring the fold changes. You could do something like log2(avg.t2d)-log2(avg.healthy)

ADD REPLYlink 2.1 years ago
kautilya
• 380
Entering edit mode
0

Thanks for the reply! I have tried this already, however I had the warnings that NaNs were produced. Is there any way to cut the data that give that output? Or is that a result of errors somewhere else?

ADD REPLYlink 2.1 years ago
MarjoryMollusc
• 40
Entering edit mode
0

The warning indicates that you have markers which have avg.t2d AND avg.healthy as 0. One way of handling this is adding a small pseudo count of eg. 0.0001 to all the mean values. This will ensure that there are no 0's in your data and NANs will be 0 foldch as expected.

ADD REPLYlink 2.1 years ago
kautilya
• 380
Entering edit mode
0

Thanks for explaining the error. Despite addin the pseudocount, it still came up with the error. Would this be because the 0's in the dataset would be in there as Na instead of 0? If so how would you change those Na's to 0's?

ADD REPLYlink 2.1 years ago
MarjoryMollusc
• 40
Entering edit mode
0

Can you provide some examples of the mean values for which you are getting this error?

ADD REPLYlink 2.1 years ago
kautilya
• 380
Entering edit mode
0

I disagree. Don't take logs after taking the mean, Take means after taking the logs, or take logs after taking the geometric means

ADD REPLYlink 2.1 years ago
russhh
♦ 4.4k
Entering edit mode
1

While seidel has addressed your actual question, I just wanted to take a step back and ask: have you considered using any of the established R packages for differential gene expression analysis, such as DESeq2, edgeR, limma? These packages will calculate the log-fold-changes for you (they'll even do it in a somewhat more sophisticated manner taking the quirks of RNA-seq data into account) and they will even offer you (adjusted) p-values to let you make an educated guess as to which genes might be worth looking at.

ADD REPLYlink 2.1 years ago
Friederike
4.2k
Entering edit mode
0

Yeah I have used limma later on. This exercise was for me to try an get a deeper understanding of what is actually going on behind the scenes when using those packages.

ADD REPLYlink 2.1 years ago
MarjoryMollusc
• 40
Entering edit mode
0

That's great! If you want to know about the nitty-gritty details, the introduction of the limma-voom paper mentions a lot of the statistical details that are applied in those packages (which is the reason why simply taking means and log-fold changes won't give you the same results)

ADD REPLYlink 2.1 years ago
Friederike
4.2k
Entering edit mode
0
          NA         NA.1         NA.2         NA.3         NA.4 
9.861355e+03 2.121977e+04 2.202288e+00 1.079073e+01 1.132874e+00 
        NA.5 
1.479524e-01

That is the head of avg.t2d With the pseudocount:

> head(avg.t2dps)
          NA         NA.1         NA.2         NA.3         NA.4 
9.861356e+03 2.121977e+04 2.203288e+00 1.079173e+01 1.133874e+00 
        NA.5 
1.489524e-01
ADD REPLYlink 2.1 years ago
MarjoryMollusc
• 40
2
Entering edit mode

You have to know what's going on with your data. It looks like it's already in log format because you're taking the ratio as: foldch <- avg.t2d - avg.healthy

A few things to try:

# are there NA values in the original data?
sum( is.na(avg.t2d) )
sum( is.na(avg.healthy) )

# are there other problematic values in the original data?
sum(is.finite(avg.t2d))
sum(is.finite(avg.healthy))

# for two vectors, examine their comparability
complete.cases(avg.t2d, avg.healthy)

# this also works for rows of a matrix
# (first use cbind to combine your vectors as matrix coloumns)
x <- cbind(avg.t2d, avg.healthy)
complete.cases(x)

# examine cases where the value is not defined for both variables
x[!complete.cases(x),]

# omit cases where the value is not defined for both variables
x <- x[complete.cases(x),]

# you can also use na.omit for this
foo <- na.omit(cbind(avg.t2d, avg.healthy))

# converting to data frame allows you to use your column names again
foo <- as.data.frame(foo)

# gene expression ratio
foldch <- foo$avg.t2d - foo$avg.healthy
ADD COMMENTlink 2.1 years ago seidel 6.8k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0