Entering edit mode

Hi guys, Brand new to going through RNA-seq data, and learning on the go. I have been trying to make a volcano plot of some example data from GEO, to determine up-regulated genes in the sample. I am having problems with getting reasonable outputs for my fold-change values, with insanely high values over 5000 and below -10000. I am not sure if this is a result of just the data or something I have been doing myself. The reason for the volcano plot is so I can then determine the upregulated genes, usually done by looking at the genes over expressed (>=1).

I have been getting the average of the expression data for the "diabetic" and "non-diabetic". To determine the fold change I have then just subtracted the average of the diabetics from the average of the health. This is based off an example with a different dataset.

```
avg.t2d <- apply(exdat[,t2d.index==" diabetic"], 1, mean)
avg.healthy <- apply(exdat[,t2d.index==" non-diabetic"], 1, mean)
foldch <- avg.t2d - avg.healthy
logt2dp <- -log(t2dpvals, base=10)
plot(foldch, logt2dp, pch=1, xlab="Log2 Fold Change (L/S)",
ylab="-Log10(P-val)", main="Volcano Plot", col = ifelse(adj_pvals<0.05, 'green', 'black'))
```

Am I missing something big here or am I over thinking the problem?

Entering edit mode

You have to know what's going on with your data. It looks like it's already in log format because you're taking the ratio as: foldch <- avg.t2d - avg.healthy

A few things to try:

```
# are there NA values in the original data?
sum( is.na(avg.t2d) )
sum( is.na(avg.healthy) )
# are there other problematic values in the original data?
sum(is.finite(avg.t2d))
sum(is.finite(avg.healthy))
# for two vectors, examine their comparability
complete.cases(avg.t2d, avg.healthy)
# this also works for rows of a matrix
# (first use cbind to combine your vectors as matrix coloumns)
x <- cbind(avg.t2d, avg.healthy)
complete.cases(x)
# examine cases where the value is not defined for both variables
x[!complete.cases(x),]
# omit cases where the value is not defined for both variables
x <- x[complete.cases(x),]
# you can also use na.omit for this
foo <- na.omit(cbind(avg.t2d, avg.healthy))
# converting to data frame allows you to use your column names again
foo <- as.data.frame(foo)
# gene expression ratio
foldch <- foo$avg.t2d - foo$avg.healthy
```

Loading Similar Posts

Use log2 for exploring the fold changes. You could do something like

`log2(avg.t2d)-log2(avg.healthy)`

Thanks for the reply! I have tried this already, however I had the warnings that NaNs were produced. Is there any way to cut the data that give that output? Or is that a result of errors somewhere else?

The warning indicates that you have markers which have

`avg.t2d AND avg.healthy`

as 0. One way of handling this is adding a small pseudo count of eg. 0.0001 to all the mean values. This will ensure that there are no 0's in your data and NANs will be 0 foldch as expected.Thanks for explaining the error. Despite addin the pseudocount, it still came up with the error. Would this be because the 0's in the dataset would be in there as Na instead of 0? If so how would you change those Na's to 0's?

Can you provide some examples of the mean values for which you are getting this error?

I disagree. Don't take logs after taking the mean, Take means after taking the logs, or take logs after taking the geometric means

While seidel has addressed your actual question, I just wanted to take a step back and ask: have you considered using any of the established R packages for differential gene expression analysis, such as DESeq2, edgeR, limma? These packages will calculate the log-fold-changes for you (they'll even do it in a somewhat more sophisticated manner taking the quirks of RNA-seq data into account) and they will even offer you (adjusted) p-values to let you make an educated guess as to which genes might be worth looking at.

Yeah I have used limma later on. This exercise was for me to try an get a deeper understanding of what is actually going on behind the scenes when using those packages.

That's great! If you want to know about the nitty-gritty details, the introduction of the limma-voom paper mentions a lot of the statistical details that are applied in those packages (which is the reason why simply taking means and log-fold changes won't give you the same results)

That is the head of avg.t2d With the pseudocount: