Biostar Beta. Not for public use.
It's about the journey, and not the destination
1
Entering edit mode
2.0 years ago
theobroma22 ♦ 1.1k

I chose the famous quote "It's about the journey and not the destination" as an analogy to "It's about the coding and not the result." It seems many posts these days are about geneN (the result) and not the path taken to use basic coding programs to work with geneN, understand geneN, or even how one got to geneN in the first place. I'm not being critical of these posts and they are quite amusing, and my constructive-self saw this as an opportunity to write a tutorial.

When I take a long moment to acknowledge the math, statistic and code going on in the journey the end result is of course much more satisfying, more complete way to experience gaining wisdom. In addition, afterward, even for someone like myself my shallow knowledge had to go very, very deep at times it felt like I was hitting my head against a rock. To confront this, when I read the math I had to visualize my data in the X dimensional space which allowed me to see how the functions/algorithms fit my data structure, and vice versa. Like the difference in the equation's properties used in for example partial least squares (PLS) versus principal components analysis (PCA). PLS is a linear equation and can be considered an extension of PCA since two matrices are used to do matrix multiplication on the data and so rotates one of the matrices, rather than in PCA only using one matrix that contains both the continuous and categorical data types and trying to rotate data this way to make predictions about the data. As such, below I demonstrate using PLS on a dataset to form a correlation network using the 'mixOmics' R package available on CRAN.

Let's say I have a matrix of all expressed features of the control samples from a timecourse experiment with the expression values as RNA-seq-derived count data. This matrix represents the features or independent variables in the control samples collected at two week intervals, or an N x P matrix, like this:

``````#Often times a data sample is needed when seeking help, but it is also good practice to work with your own random data matrix.
#A little background of our control samples
CON = data.frame(Type=rep("RA", length=7), Time=c("0", "2", "4", "6", "8", "10", "12"))
CON = data.frame(Sample=paste(CON\$Type, CON\$Time, sep = "_"))
CON
#The Y matrix which describes the experimental design aka the metadata.
CON_Y <- as.matrix(data.frame(Duration = c(0, 2, 4, 6, 8, 10, 12)))

#The X matrix of their "raw" or non-normalized count values
set.seed(0408)
CON_X <- matrix(rnorm(17500, mean=5000, sd=500), ncol=7, nrow=2500)
min(CON_X)
max(CON_X)

#Add the samples to Y and X
#Let's tighten it up with the N and P descriptors for each matrix, respectively.
rownames(CON_Y) = CON\$Sample
colnames(CON_X) = CON\$Sample
rownames(CON_X) = paste0("Gene", 1:2500)
tail(CON_X)

#Logbase10 scale the raw count data
logCON <- log(CON_X, 10)
#DO PLS
library(mixOmics)
plsCON <- spls(t(logCON), CON_Y, ncomp = 6)
#notice how the logtransformed CON_X matrix was transformed 90 degrees, and I set the number of components to two.
attributes(plsCON)
plsCON\$ncomp

set.seed(2017)
RA.valid <- perf(plsCON, validation = "loo")
attributes(RA.valid)
RA.valid\$Q2.total
RA.valid\$R2
#Most of the variation was explained in the first components
Let's see what happens when I do this:
plsCON <- spls(t(logCON), CON_Y, ncomp = 2)
set.seed(2017)
RA.valid <- perf(plsCON, validation = "loo")
RA.valid\$Q2.total
RA.valid\$R2

#The individuals plot display
#The variable plot display
plotVar(plsCON, comp = 1:2, plot=TRUE, Y.label = "D1", X.label = "D2", cex = c(0.5, 0.8))
#The biological network
network(plsCON, comp = 1:2, cutoff = 0.95, shape.node = c("rectangle", "rectangle"),
color.node = c("white", "pink"))

#Since we are exploring here, let's reduce the cutoff to a lower value, circle the meta data
network(plsCON, comp = 1:2, cutoff = 0.75, shape.node = c("rectangle", "circle"),
color.node = c("white", "pink"))

#A little background
TMT = data.frame(Type=rep("TA", length=7), Time=c("0", "2", "4", "6", "8", "10", "12"))
TMT = data.frame(Sample=paste(TMT\$Type, TMT\$Time, sep = "_"))

#The Y matrix which describes the experimental design aka the metadata.
TMT_Y <- as.matrix(data.frame(Duration = c(0, 2, 4, 6, 8, 10, 12)))

#The X matrix of their "raw" or non-normalized count values
set.seed(0408)
TMT_X <- matrix(rnorm(17500, mean=10000, sd=1000), ncol=7, nrow=2500)
min(TMT_X)
max(TMT_X)

#Add the samples to Y and X
#Let's tighten it up with the N and P descriptors for each matrix, respectively.
rownames(TMT_Y) = TMT\$Sample
colnames(TMT_X) = TMT\$Sample
rownames(TMT_X) = paste0("Gene", 1:2500)
tail(TMT_X)

#Logbase10 scale the raw count data
logTMT <- log(TMT_X, 10)
#DO PLS
plsTMT <- spls(t(logTMT), TMT_Y, ncomp = 2)
set.seed(2017)
TA.valid <- perf(plsTMT, validation = "loo")
TA.valid\$Q2.total
TA.valid\$R2

#The individuals plot display
plotIndiv(plsTMT, comp = 1:2)
#The variable plot display
plotVar(plsTMT, comp = 1:2, plot=TRUE, Y.label = "D1", X.label = "D2", cex = c(0.5, 0.8))
#The biological network
network(plsTMT, comp = 1:2, cutoff = 0.95, shape.node = c("rectangle", "rectangle"),
color.node = c("white", "pink"))

#Since we are exploring here, let's reduce the cutoff to a lower value, circle the meta data
network(plsTMT, comp = 1:2, cutoff = 0.75, shape.node = c("rectangle", "circle"),
color.node = c("white", "pink"))

Well, since we have both CON and TMT log10 normalized data let's build a bigger network.
X <- cbind(logCON, logTMT)
class(X)
Y <- as.matrix(data.frame(Duration = rep(c(0, 2, 4, 6, 8, 10, 12), times=2),
Treatment=rep(c(0,1), each=7),
#and let's add in a third variable to account for the different freezer space used to store each sample
Storage=rep(c(1,2), each=7)))

#uh oh, I need to name the rows...!!??
rnms <- rbind(CON, TMT)
rownames(Y) = rnms\$Sample

plsX <- spls(t(X), Y, ncomp = 6)

#The individuals plot display
plotIndiv(plsX, comp = 1:2)
#The variable plot display
plotVar(plsX, comp = 1:2, plot=TRUE, Y.label = "D1", X.label = "D2", cex = c(0.5, 0.8))
plotVar(plsX, comp = 2:3, plot=TRUE, Y.label = "D2", X.label = "D3", cex = c(0.5, 0.8))
#Oh, Ok, so most of the variation is explained in the first and second dimensions
#The biological network
network(plsX, comp = 1:2, cutoff = 0.65, shape.node = c("rectangle", "rectangle"),
color.node = c("white", "pink"))

#Since we are exploring here, let's increase the cutoff to a higher value
network(plsX, comp = 1:2, cutoff = 0.989999, shape.node = c("rectangle", "rectangle"),
color.node = c("white", "pink"))
``````

At the end of the day, as I reflect on the journey it took to make this network I learned things like "what did it mean to take two different random distributions using the same set.seed()?" ; "how these two random data distributions behaved (similarly) based on the plotIndiv() and plotVar() functions and graphical displays?" and "how can I manipulate objects I have stored in R to help me annotate or complete dataframes I need to use as input for the various functions?" as well as "looking at the attributes of the function outputs to possibly use these in other analyses and graphical displays. These are just some examples of what I was able to gain when working with a simulated dataset and the package functions. Now, carrying a little bit more confidence in myself I can apply what I learned above to my real dataset which I've been salivating over. In addition, since I have a better or improved basic understanding of what the 'mixOmics' package has to offer as well as understand what the underlying PLS algorithm is doing I hopefully can easily set-up my data matrices to work toward my "destination of genes."

1
Entering edit mode

Thank you for taking the time to write the tutorial. Consider the following comments for improving the content.

Aren't tutorials supposed to be simple to understand (not everyone on Biostars is a native english speaker)? I found it hard to figure out what the starting material is. The choice of the title makes it impossible to figure out what this tutorial is about.

You could also insert the expected plots in right spots so people can get an idea of what the expected results are going to look like.

0
Entering edit mode

I feel this is more of a blog than a tutorial...

0
Entering edit mode

Considering the (nice but aspecific) title of this tutorial this will not be found by people in need of this work. It looks good, but it's unclear what it's about.

"It's about the coding and not the result."

I don't agree with this, you always code in function of the result.