How to compare 2 separate coexpression networks each with the same genes?
1
3
Entering edit mode
6.1 years ago
O.rka ▴ 710

I have 2 coexpression networks: (1) control phenotype; and (2) treatment phenotype. They are using the same subset of genes so they are directly comparable (n=1700). I used WGCNA to create 2 separate coexpression networks using the same soft power threshold (beta=10) for both. This question is more conceptual than actual commands or tools to use but I am looking for methods that can allow me to compare the differences in network topology between the 2 phenotypes. My first idea was to calculate the differential coexpression network as suggested in this paper https://www.nature.com/articles/srep13295 but it's a little weird to work with the resulting matrix. The only other approach I could think of was to have a pairwise analysis of modules between each network to get an overlap measure.

Does anyone know of any methods, tools, or approaches that work well for datasets like this?
My main research question is to identify the biggest changes in network topology between the 2 phenotypes.

Dataset description:

It's a dataset with 49 samples of patients without a disease and 34 with a disease. 28 of the patients have a twin in the dataset (n=28*2=56) and 27 do not have a twin account for in the dataset for a total of 56 + 27 = 83 samples/patients. The gene subset I'm investigating is 1700 so each network consist of a 1700x1700 symmetrical similarity matrix. There is a separate network for the healthy state and a separate one for the diseased state.

coexpression network gene phenotype • 6.9k views
ADD COMMENT
1
Entering edit mode

I have noticed somewhere people topologically analyze each network independently (for example node degree, betweeness centrality, etc) and remove common the highest degrees (for instance) and keep only the genes with the highest centrality value as treatment specific hub genes (differential connectivity). In WGCNA, once I noticed people used control network as a reference network to calculate z-score preservation that will find modules those are more related to the treatment.

ADD REPLY
0
Entering edit mode

That's pretty interesting to use the control network as a base for looking at distributions. Thanks! What about module comparison?

ADD REPLY
1
Entering edit mode

As I mentioned WGCNA by using control network as reference can tell us which module is being more dysregulated in treatment. Higher z score, more a module related to the treatment

ADD REPLY
0
Entering edit mode

I'm having trouble finding references for this approach but it seems very useful. I have been having difficulty relating modules to categorical metadata. Is this what you are referring to above? Or do you mean investigating a particular attribute, e.g. intramodular connectivity of genes, and use all of the values from the control network as a background distribution and do a mannwhitney (or similar) for the module of interest against the background distribution? Sorry to ask so many questions but I just want to make sure I understand what you are suggesting.

ADD REPLY
0
Entering edit mode

Exactly I mean what you mentioned " investigating a particular attribute, e.g. intramodular connectivity of genes, and use all of the values from the control network as a background distribution and do a Mann-Whitney (or similar) for the module of interest against the background distribution". However, the outcome of this procedure just would be; for instance, if the blue module in your network seem to be correlated to the treatment significantly, by this procedure (using control network as reference) you can state this correlation is meaningful not just by chance (higher z-score). Please notice that " relating modules to categorical metadata" step is performed beforehand to find correlated modules to your interesting trait. Personally for my purpose as I don't have access to a trait file (fore example measuring cholesterol, etc) I have to prepare a binary trait file (to show which module or genes are related to the treatment). However, I have already accumulated WGCNA codes for each step and if you want I can share with you but you must change the input, etc based on your purpose.

ADD REPLY
0
Entering edit mode

If you have code snippets that can illustrate how to correlated the eigengenes with categorical data like treatment that would be very useful. I'm doing most of the computation outside of R in Python so a good understanding of the methods is crucial. For example, if I have continuous data then I just calculate the 1st principal component of the module and then pearson correlation with the continuous metadata but you can't do that with the categorical data.

ADD REPLY
0
Entering edit mode

My codes are in R My traits is binary, I then made a 1 and 0 trait file and use

moduleTraitCor = cor(MEs, datTraits, use= "p")

for correlation as mentioned in WGCNA; although I am not sure if I am correct

Please consider this

https://ibb.co/mftzq7

ADD REPLY
0
Entering edit mode

I'm getting values for this but I don't know if it makes sense since we wouldn't necessarily expect there to be a linear correlation between a continuous variable and a categorical variable right?

ADD REPLY
0
Entering edit mode

Actually this is also my question I read somewhere people use biweight midcorrelation (bicor) function instead of Pearson cor function Or if I am not wrong take log of binary trait file. However I got confused which is correct... please share with me if you find the answer Thanks

ADD REPLY
0
Entering edit mode

As the same with module preservation (comparing treatment network with control network), comparing modules with each other (conservation) will give you conserved genes between two interesting modules; implemented in WGCNA

ADD REPLY
4
Entering edit mode
6.1 years ago

You could simultaneously cluster the two graphs using a tensor factorization approach. For more details, see some notes from a workshop I taught.

ADD COMMENT
0
Entering edit mode

Looking forward to looking going very deep into this monday morning when I get to lab. Thanks for posting this and making such a detailed tutorial. Do you think it would work well if there are only 2 networks? I'm looking at the examples I think the test case is (30,30,6)?

ADD REPLY
0
Entering edit mode

Actually, when I post a question I used to describing the experimental design, number of samples, what the treatment is, ect, This Monday morning if you could, that would be great to describe all. By the way I would be so grateful to share whatever I know with you.

Step 4 in the below link is about comparing networks

https://labs.genetics.ucla.edu/horvath/htdocs/CoexpressionNetwork/JMiller/Tutorial%20document.pdf

ADD REPLY
0
Entering edit mode

I will give more details and input on monday but I've updated the question to include the dataset description. Thanks

ADD REPLY
0
Entering edit mode

It would work even with just one graph. The question is whether there is enough structure in the data for the factorization to reveal it.

ADD REPLY
0
Entering edit mode

Do you know of any tutorials that explain the resulting dimensionality from the transformations? Before I go deep into the wikipedia page.

ADD REPLY
1
Entering edit mode

I am not sure what you mean by "explain the resulting dimensionality" but I am assuming your question is about how to interpret the resulting factor matrices. For a genes x genes x conditions tensor, a CP factorization with k components will give you two matrices (actually three but two are the same) of dimensions genes x k and conditions x k. The first one can be interpreted as a gene membership/activity in each of the k clusters and the second can be interpreted as the presence/activity of the k clusters in the different conditions.

ADD REPLY
0
Entering edit mode

When you say "2 are the same", do you mean they are supposed to be identical? I tried using sktensor but I wasn't sure if the implementation was similar to rTensor in Python and R, respectively. I am using the parafac method in tensorly with a subset of the iris dataset to test. Using sertosa [0:50] for network A and versicolor [50:100] for network B. In the case below, I would have 4 "genes", 2 conditions, and 2 components/clusters. How can I interpret these matrices? The first 2 matrices are the same shape but they are not equal to each other. Also, the last matrix has all negative values whose magnitudes do not scale to one. Should I start a new thread for this conversation?

df_sim_A = X_iris.iloc[0:50,:].corr()
df_sim_B = X_iris.iloc[50:100,:].corr()

df_sim_A.shape, df_sim_B.shape, df_sim_A.equals(df_sim_B)

((4, 4), (4, 4), False)

import tensorly as tl
from tensorly.decomposition import parafac
T = tl.tensor(np.stack([df_sim_A, df_sim_B]).transpose(1,2,0))
P = parafac(T, rank=2)

print([x.shape for x in P])
# [(4, 2), (4, 2), (2, 2)]

for x in P:
    print(x)

# [[-1.64835237 -0.42324339]
#  [-1.54643444 -0.47574374]
#  [-1.96478944  0.24035087]
#  [-1.90195131  0.14699458]]
# <NDArray 4x2 @cpu(0)>

# [[ 0.46201557  0.66333921]
#  [ 0.43344127  0.74557299]
#  [ 0.55074777 -0.37620981]
#  [ 0.53313271 -0.22997329]]
# <NDArray 4x2 @cpu(0)>

# [[-0.53452396 -1.54660926]
#  [-0.81901771 -0.55149847]]
# <NDArray 2x2 @cpu(0)>
ADD REPLY
1
Entering edit mode

The first two factor matrices should be identical up to scaling which is the case in your example. You can rescale any mode as long as the inverse scaling is applied to another mode, this doesn't change the solution to the factorization problem. Different libraries may handle the scaling differently. For example, you could also normalize the vectors in the factor matrices to length one and absorb the weights into the superdiagonal of the core tensor.

ADD REPLY
0
Entering edit mode

Thanks for being helpful with all of those questions. I still need to look more into the details but it's a tool that I didn't even know existed. Really inspiring.

ADD REPLY
0
Entering edit mode

I asked our mutual uncertainty here,

C: WGCNA modules and categorical traits relationship

A biostar thinks that no problem with Pearson correlation even for non-quantitative traits

ADD REPLY
0
Entering edit mode

Interesting answers. I will take a look at the regression. However, at that point it wouldn't be correlation values? Also, if you get some extra time can you take a look at this question about modulePreservation https://support.bioconductor.org/p/106661/

ADD REPLY
0
Entering edit mode

Kevin had used correlation itself instead of principal components in our case.

ADD REPLY
0
Entering edit mode

Sorry, may I please access to a sample of input file for tensor factorization as come in the workshop?

I tried to adopt your steps on my data but error says that;

> for (i in 1:length(fileList)){
+   
+   A[,,i] <- as.matrix(read.delim(file.path(dataDir,fileList[i]), sep = ';', header=TRUE, row.names=1))
+   
+ }
Error in A[, , i] <- as.matrix(read.delim(file.path(dataDir, fileList[i]),  : 
  incorrect number of subscripts
ADD REPLY
0
Entering edit mode

I think this is a programming error. How did you define A ? If you're using triple indexing, you need A to be a 3d array and assigning a matrix to an array slice works only if the array has been pre-allocated.

ADD REPLY
0
Entering edit mode

Thank you for paying attention,

I extracted two matrices from my data

 source("genie3.R")
    library(igraph)

    expr.matrix <- read.expr.matrix("bigm.txt", form="rows.are.genes")
    dim(expr.matrix)
    head(expr.matrix[,1:4])
                    X1       X2       X3       X4
    AT1G01010 10.85486 11.72960 10.74060 10.69040
    AT1G01020 11.04532 10.90914 11.15791 11.00391
    AT1G01030 10.43474 10.62670 10.35366 10.54900
    AT1G01040 11.21519 11.00323 11.27382 11.28544
    AT1G01050 12.11774 11.53287 11.82480 11.78413
    AT1G01060 10.38341 10.43773 10.42992 11.22040
    > 
> dim(expr.matrix)
[1] 4030  205
    expr.matrix=expr.matrix[1:100,1:10]
    weight.matrix <- get.weight.matrix(expr.matrix)
    link.list <- get.link.list(weight.matrix, report.max=1000)
    edge_listsi <- link.list[!duplicated(link.list),]

    Gsi <- graph.data.frame(edge_listsi,directed = F)

    Asi <- get.adjacency(Gsi,sparse = F,attr = "im",type = "both")

Then I wrote my adjacency to a csv file, I did this procedure two times to have two matrices for treatment and control

Then I run your code

library(rTensor)

 ## Where the data is

    ## This is a simulated dynamic graph with 100 nodes and 10 time points

dataDir <- "~/Documents/Bio-IT/courses/Data_integration_and_mining_with_graphs/dynamic_graph"

Get list of adjacency matrices

Each matrix is in a csv file

fileList <- dir(path=dataDir,pattern = ".csv")

Read all adjacency matrices into an array

A <- array(as.numeric(NA),dim=c(100,100,2)) # There are 2 matrices of size 100 x 100

for (i in 1:length(fileList)){

    A[,,i] <- as.matrix(read.delim(file.path(dataDir,fileList[i]), sep = ';', header=TRUE, row.names=1))

}
ADD REPLY
0
Entering edit mode

Dear Dr. Hériché I need your comment. I would like to compare two co-expression network belong two different cancers, Lung and Breast. can I map your propose approach(Tensor factorizations for dynamic graphs) to my research target? I appreciate if you share your comment with me.

Best Regards,

ADD REPLY

Login before adding your answer.

Traffic: 1709 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6