Tutorial:Network plot from expression data in R using igraph
1
128
Entering edit mode
6.4 years ago

[last update: January 27, 2020]

igraph allows you to generate a graph object and search for communities (clusters or modules) of related nodes / vertices. igraph is utilised in the R implementation of the popular Phenograph cluster and community detection algorithm (used in scRNA-seq and mass cytometry), and also in the popular scRNA-seq package Seurat.

The cluster detection algorithm in these packages is default to Louvain. To utilise this with this tutorial (Step 3), use cluster_louvain() in place of edge.betweenness.community()).

This tutorial will allow you to:

  1. create graph and spanning tree objects from a data-frame or -matrix of numerical data
  2. identify community structure (clusters or modules) in your graph and tree objects
  3. perform some basic downstream analyses on the tree objects

For simple testing, we'll use the estrogen microarray data-set in R and follow the tutorial from R Gentleman / W Huber. This dataset was looking at genes that respond to estrogen stimulation.

NB - the method can be applied to any type of numerical data, be it microarray, RNA-seq, metabolomics, qPCR, etc.




Step 1: read in, normalise, and identify genes with significant effects

NB - IF YOU ALREADY HAVE YOUR OWN DATA, PROCEED TO STEP 2


library(affy)
library(estrogen)
library(vsn)
library(genefilter)

datadir <- system.file("extdata", package="estrogen")
dir(datadir)
setwd(datadir)

# Read in phenotype data and the raw CEL files
pd <- read.AnnotatedDataFrame("estrogen.txt", header=TRUE, sep="", row.names=1)
a <- ReadAffy(filenames=rownames(pData(pd)), phenoData=pd, verbose=TRUE)

# Normalise the data
x <- expresso(
  a,
  bgcorrect.method="rma",
  normalize.method="constant",
  pmcorrect.method="pmonly",
  summary.method="avgdiff")

# Remove control probes
controlProbeIdx <- grep("^AFFX", rownames(x))
x <- x[-controlProbeIdx,]

# Identify genes of significant effect
lm.coef <- function(y) lm(y ~ estrogen * time.h)$coefficients
eff <- esApply(x, 1, lm.coef)
effectUp <- names(sort(eff[2,], decreasing=TRUE)[1:25])
effectDown <- names(sort(eff[2,], decreasing=FALSE)[1:25])
main.effects <- c(effectUp, effectDown)

# Filter the expression set object to include only genes of significant effect
estrogenMainEffects <- exprs(x)[main.effects,]

We now have the top 25 genes showing statistically significant effects in both directions and have filtered our expression object to include only these genes. The object that we'll use for further analysis is estrogenMainEffects

head(estrogenMainEffects)
          low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
36785_at    4257.2678   4127.1452     5783.745     4888.527   5936.3714
39755_at     699.1178    658.3222     2026.508     1869.705    468.5581
151_s_at    1517.0178   1378.5044     2892.287     2852.310   1372.7407
32174_at     441.2678    416.9485     1442.827     1253.596    257.8186
39781_at     536.2678    508.1147     1590.378     1340.251    472.4352
1985_s_at    750.1178    643.6604     1612.051     1406.858    306.4513



Step 2: create the graph and tree objects

In this example, we now have our data-frame/-matrix stored as estrogenMainEffects and will construct a graph adjacency object from this. estrogenMainEffects can be any numerical data-frame/-matrix from you own experiment(s).

library(igraph)

# Create a graph adjacency based on correlation distances between genes in  pairwise fashion.
g <- graph.adjacency(
  as.matrix(as.dist(cor(t(estrogenMainEffects), method="pearson"))),
  mode="undirected",
  weighted=TRUE,
  diag=FALSE
)

NB - if you have trouble creating the correlation matrix due to its size, take a look at 2 (b), here: Parallel processing in R

NB - Euclidean distances can also be used instead of correlation, but this changes the tutorial slightly:

#g <- graph.adjacency(
#  as.matrix(dist(estrogenMainEffects, method="euclidean")),
#  mode="undirected",
#  weighted=TRUE,
#  diag=FALSE
#)

# Simplfy the adjacency object
g <- simplify(g, remove.multiple=TRUE, remove.loops=TRUE)

# Colour negative correlation edges as blue
E(g)[which(E(g)$weight<0)]$color <- "darkblue"

# Colour positive correlation edges as red
E(g)[which(E(g)$weight>0)]$color <- "darkred"

# Convert edge weights to absolute values
E(g)$weight <- abs(E(g)$weight)

# Change arrow size
# For directed graphs only
#E(g)$arrow.size <- 1.0

# Remove edges below absolute Pearson correlation 0.8
g <- delete_edges(g, E(g)[which(E(g)$weight<0.8)])

# Remove any vertices remaining that have no edges
g <- delete.vertices(g, degree(g)==0)

# Assign names to the graph vertices (optional)
V(g)$name <- V(g)$name

# Change shape of graph vertices
V(g)$shape <- "sphere"

# Change colour of graph vertices
V(g)$color <- "skyblue"

# Change colour of vertex frames
V(g)$vertex.frame.color <- "white"

# Scale the size of the vertices to be proportional to the level of expression of each gene represented by each vertex
# Multiply scaled vales by a factor of 10
scale01 <- function(x){(x-min(x))/(max(x)-min(x))}
vSizes <- (scale01(apply(estrogenMainEffects, 1, mean)) + 1.0) * 10

# Amplify or decrease the width of the edges
edgeweights <- E(g)$weight * 2.0

# Convert the graph adjacency object into a minimum spanning tree based on Prim's algorithm
mst <- mst(g, algorithm="prim")

# Plot the tree object
plot(
  mst,
  layout=layout.fruchterman.reingold,
  edge.curved=TRUE,
  vertex.size=vSizes,
  vertex.label.dist=-0.5,
  vertex.label.color="black",
  asp=FALSE,
  vertex.label.cex=0.6,
  edge.width=edgeweights,
  edge.arrow.mode=0,
  main="My first graph")

j

Note that layout can be modified to various other values, including:

  • layout.fruchterman.reingold
  • layout.kamada.kawai
  • layout.lgl
  • layout.circle
  • layout.graphopt

For further information, see Graph layouts




Step 3: identify communities in the tree object based on 'edge betweenness'

mst.communities <- edge.betweenness.community(mst, weights=NULL, directed=FALSE)
mst.clustering <- make_clusters(mst, membership=mst.communities$membership)
V(mst)$color <- mst.communities$membership + 1

par(mfrow=c(1,2))
plot(
  mst.clustering, mst,
  layout=layout.fruchterman.reingold,
  edge.curved=TRUE,
  vertex.size=vSizes,
  vertex.label.dist=-0.5,
  vertex.label.color="black",
  asp=FALSE,
  vertex.label.cex=0.6,
  edge.width=edgeweights,
  edge.arrow.mode=0,
  main="My first graph")

plot(
  mst,
  layout=layout.fruchterman.reingold,
  edge.curved=TRUE,
  vertex.size=vSizes,
  vertex.label.dist=-0.5,
  vertex.label.color="black",
  asp=FALSE,
  vertex.label.cex=0.6,
  edge.width=edgeweights,
  edge.arrow.mode=0,
  main="My first graph")

aaa

Other community identification methods, including fastgreedy.community, can be found at Detecting Community Structure.




Step 4: further analyses

# Check the vertex degree, i.e., number of connections to each vertex
degree(mst)

# Output information for each community, including vertex-to-community assignments and modularity
commSummary <- data.frame(
  mst.communities$names,
  mst.communities$membership,
  mst.communities$modularity)
colnames(commSummary) <- c("Gene", "Community", "Modularity")
options(scipen=999)
commSummary
         Gene Community   Modularity
1    39755_at         2 -0.027161612
2    151_s_at         3 -0.006790403
3    32174_at         4  0.014033499
4    39781_at         2  0.034631055
5   1985_s_at         5  0.055454957
....

# Compare community structures using the variance of information (vi) metric (not relevant here)
# Community structures that are identical will have a vi=0
compare(mst.communities, mst.communities, method="vi")
[1] 0

Other metrics that one can observe include (check the igraph documentation):

  • hub score
  • closeness centrality
  • betweenness centrality

In a case-control study, a useful approach is to generate separate networks for cases and controls and then compare the genes in these based on these metrics.

microarray RNA-Seq R network • 50k views
ADD COMMENT
1
Entering edit mode

Suggestion: add microarray tag? Also include that in the title? Or are you using the microarray data only becuase it is easily available?

ADD REPLY
1
Entering edit mode

Thanks genomax. I've added microarray and rna-seq. Technically, the method can be applied to any type of data though. I originally pieced together the pipeline on metabolomics data.

ADD REPLY
1
Entering edit mode

Thanks @ Kevin.

ADD REPLY
0
Entering edit mode

Dear @Kevin Blighe, Hi and thank you. Can I use igraph for my Trinity de novo RNA-seq assembly? I have assembled my reads then performed DEG analysis using Trinity. which file I can use as input of igraph? Thanks again.

ADD REPLY
1
Entering edit mode

Dear Farbod,

All that you require is a data-frame or data-matrix of numerical data, preferably binomial distributed, as the method is based on Pearson correlation.

If your data-frame/-matrix is MyData, then the starting point is:

library(igraph)

#Create a graph adjacency based on correlation distances between genes in  pairwise fashion.
g <- graph.adjacency(as.matrix(as.dist(cor(t(MyData), method="pearson"))), mode="undirected", weighted=TRUE, diag=FALSE)

You can also use Euclidean distances instead of correlation, with:

g <- graph.adjacency(as.matrix(dist(MyData, method="euclidean")), mode="undirected", weighted=TRUE, diag=FALSE)
ADD REPLY
0
Entering edit mode

Thank you

Trinity produces several kind of matrix. for example the first line of "Trans_counts.TMM.EXPR.matrix" is as:

                             J1     J2        J3      M1        M2     M3

TRINITY_DN190458_c7_g4_i9    0.188   0.020   0.098   0.223   0.121   0.092

can I use that ?

ADD REPLY
1
Entering edit mode

It looks fine.

The code in the tutorial above can be used in very diverse ways and the meaning/significance of the graph will change depending on the data that you provide. If you are supplying it data from Trinity, then I presume that it is de novo transcriptome assembly transcripts? The graph of these would then show transcripts whose expression is highly positively or inversely correlated.

ADD REPLY
0
Entering edit mode

Yes it is de novo but

Please tell me what do you mean by "ranscripts whose expression is highly positively or inversely correlated"? are you telling me that the result graph would have some sort of skew or error? or it would be just a great candidate for my data?

ADD REPLY
1
Entering edit mode

Hey Farbod, the tutorial first generates a square correlation matrix of your data. It answers the question: 'Which genes are highly correlated with each other?' The graph object is then created from this correlation matrix. Thus, the edges in your graph will connect genes that are highly positively or inversely correlated. Closely related genes will also be grouped together in the same community.

You should probably ask yourself this question: why do you want to perform network analysis on your data and what do you believe it will add to your project?

In the example that I provide in the tutorial, I first filter the example dataset to include only genes that are statistically significantly related to treatment group and time (n=50). The resulting plot, then, shows the potential relationships between just these significant genes.

ADD REPLY
0
Entering edit mode

Thank you Kevin, I got some errors in step1 data normalization

x <- expresso(
+   a,
+   bg.correct=FALSE,
+   normalize.method="vsn",
+   normalize.param=list(subsample=1000),
+   pmcorrect.method="pmonly",
+   summary.method="medianpolish"
+ )

normalization: vsn

PM/MM correction : pmonly

expression values: medianpolish

normalizing...Error in do.call(method, alist(object, ...)) : could not find function "normalize.AffyBatch.vsn"

ADD REPLY
1
Entering edit mode

try loading library "vsn" (library(vsn))

ADD REPLY
0
Entering edit mode

Hi Mike, did you install the vsn package? - Does library(vsn) return an error?

If you already have your own data, then you can start from the second step 'Create the graph and tree objects'

ADD REPLY
0
Entering edit mode

yes I installed vsn and loaded library(vsn)

then I check normalize.AffyBatch.vsn, but I couldnt get it...

?normalize.AffyBatch.vsn

normalize.AffyBatch.vsn package:vsn R Documentation

Wrapper for vsn to be used as a normalization method with expresso

Description:

 Wrapper for ‘vsn2’ to be used as a normalization method with the
 expresso function of the package affy. **The expresso function is
 deprecated, consider using ‘justvsn’ instead**. The
 normalize.AffyBatch.vsn can still be useful on its own, as it
 provides some additional control of the normalization process
 (fitting on subsets, alternate transform parameters).
ADD REPLY
1
Entering edit mode

The normalisation method is not too important just for the tutorial. Could you try:

x <- expresso(a, bgcorrect.method="rma", normalize.method="constant", pmcorrect.method="pmonly", summary.method="avgdiff")
ADD REPLY
0
Entering edit mode

Thank you very much, Kevin

ADD REPLY
0
Entering edit mode

Then watch out for the control probes step!

Use this instead of the original line:

main.effects <- main.effects[-which(main.effects %in% c("AFFX-CreX-3_at","AFFX-CreX-5_at"))]
ADD REPLY
0
Entering edit mode

Hi again Mike, I now remove those control probes just after normalisation. So, they shouldn't be a problem. I also added a note about VSN possibly not working.

Thanks for your feedback!

ADD REPLY
0
Entering edit mode

Hi, I have used it on my whole Trinity matrix file just for test and I received the following error:

Error in cor(t("/home/trans_counts.TMM.EXPR.matrix"),  : 
  'x' must be numeric

is it some separator issues?

ADD REPLY
1
Entering edit mode

You may have to read the file 'home/trans_counts.TMM.EXPR.matrix' into a data-frame? Something like

MyData <- read.table("/home/trans_counts.TMM.EXPR.matrix", sep="\t") #assuming tab-delimited

g <- graph.adjacency(
  as.matrix(as.dist(cor(t(MyData), method="pearson"))),
  mode="undirected",
  weighted=TRUE,
  diag=FALSE
)
ADD REPLY
0
Entering edit mode

Thank you, there is another "x" and "numeric" issue:

Error in cor(t(MyData), method = "pearson") : 'x' must be numeric
ADD REPLY
0
Entering edit mode

Try as.matrix(as.dist(cor(t(data.matrix(MyData)), method="pearson")))

ADD REPLY
0
Entering edit mode

Hi, it seems that it is a problem with huge matrix file so I did it simple and just put 100 lines in the matrix. I want to know by which command I can create the plot? is it just plot() ?(sorry for simple questions)


library(igraph)
MyData <- read.table('/home/emami/Desktop/A1.matrix', sep="\t") #assuming tab-delimited
g <- graph.adjacency(
  as.matrix(as.dist(cor(t(data.matrix(MyData)), method="pearson"))),
  mode="undirected",
  weighted=TRUE,
  diag=FALSE
)
plot (g)
ADD REPLY
0
Entering edit mode

Yes, it will be difficult to produce a network of a large data-matrix because it has to compute a squared distance/correlation matrix. For example, if you tried it on a transcriptome of 20,000 protein-coding genes, you would have to generate a distance matrix of 400,000,000 data points.

You can generate a quick plot with just the base R plot() function, which will interpret the graph object. plot.igraph should also work. However, between the graph adjacency object and actually plotting the graph, there are many other important steps (as you can see above in the tutorial)

ADD REPLY
0
Entering edit mode

What is the significance of the Modularity column in the final commSummary data frame? To my understanding, modularity is a singular value that speaks to the entire network - why is each node assigned its own value?

ADD REPLY
0
Entering edit mode

In this context, it relates to the community structure, not the entire graph. Also, based on the fact that I have used edge.betweenness.community() for the purpose of identifying community structure in this tutorial, the modularity score relates to the maximum score achieved for each node when identifying the community structures.

modularity

Logical constant, whether to calculate the maximum modularity score, considering all possibly community structures along the edge-betweenness based edge removals.

[source: http://igraph.org/r/doc/cluster_edge_betweenness.html]

Does that help? - hope so.

ADD REPLY
0
Entering edit mode

Thank you very much for the code!

I have two very basic questions. 1. Why do you use as input the four conditions in "estrogenMainEffects"? 2. Don´t you should use just one condition to find the "transcription modules" in that particular scenario?

Thank you again for your contribution! Regards, Luis

ADD REPLY
1
Entering edit mode

Hey Luis! This is really just a simple tutorial to show one way of constructing a network / minimum spanning tree. It would indeed be interesting to construct the network separately for each of the 4 conditions, and to then compare the networks between these. There are many diverse ways to conduct a network analysis.

ADD REPLY
0
Entering edit mode

Very awesome code and tutorial! I've been thinking how can I investigate the network overlap and/or uniqueness? For example, for time-course study, that'll be very interesting if looks at network "hand-shake" between two continuous time point, e.g Day3 has own network and Day4 has its own network, in the same plot, but also shows some interactions (maybe gene expressed in both days, or the subnetwork btw two days because some genes have correlations). Do you have any idea of how to achieve that? Thanks.

ADD REPLY
0
Entering edit mode

Hey Joseph. Thank you! I have not included that type of functionality in this tutorial, and other network analysis tools neither have sufficiently tackled this issue.

My recommendation would be to compare genes by the following metrics:

  • Hub score (higher score = more influence)
  • Vertex degree (higher score = more influence)
  • Community members, i.e., other genes that appear in the same community as your gene of interest
  • 1st neighbours of your gene of interest (see http://igraph.org/r/doc/neighbors.html )

You can also directly compare 2 networks by the 'variance of information' (vi) metric (not relevant here). Identical graphs will have a vi=0

compare(g1, g2, method="vi")
ADD REPLY
0
Entering edit mode

get error info as: Error in i_compare(comm1, comm2, method) : (list) object cannot be coerced to type 'double'

I don't understand what's the problem.

ADD REPLY
0
Entering edit mode

Oh, my apologies. The compare() function is to compare community structures, like:

# create minimum spanning tree (MST) for graph1
mst1 <- mst(g1, algorithm="prim")
# identify communities in MST for graph1
comm1 <- edge.betweenness.community(mst1, directed=FALSE)

# repeat for graph2
mst2 <- mst(g2, algorithm="prim")
comm2 <- edge.betweenness.community(mst2, directed=FALSE)

# compare
compare(comm1, comm2, method="vi")

This just outputs a single number, though. If the community structures are identical, the value is 0.0.

ADD REPLY
0
Entering edit mode

Thanks! That sounds only indicates either identical or discrepancy without any further details. I wonder, we can make to networks, g1 and g2, the genes A, B, and C appear in both networks. Can we let them "cross-talk" vis the genes A, B, and C in the Cytoscape? or how to make two networks connected? Any comment?

ADD REPLY
0
Entering edit mode

Cytoscape NetworkAnalyzer plugin allows you to do many extra things like that. I believe you can export your igraph objects to Cytoscape with toCytoscape(igraphobj), from cyREST package

ADD REPLY
0
Entering edit mode

Hi, I tried "Step 2 Create the graph and tree objects" with my own microarray data. But values such as edgeweights and vSizes are all NA, and therefore I cannot get a plot with the error message of "need finite 'xlim' values. My own data is 182 obs. of 661 variables. Is there anything that I should check to solve my problem? Thanks.

ADD REPLY
0
Entering edit mode

Was the graph.adjacency() function successfully run?

ADD REPLY
0
Entering edit mode

Thank you Kevin, when I run graph.adjacency(), there was no error message. Although I viewed the list "g", it was difficult for me to know whether it was successful or not. Could you please let me know how to check the output of graph.adjacency()?

ADD REPLY
0
Entering edit mode

Hey, if you just type g at the console, it should display some textual information about the graph object.

Can you show the exact command that you used for graph.adjacency() ?

Your input data may also contain NAs somewhere. What is the source of your data?

ADD REPLY
0
Entering edit mode

Hi Kevin, the command for graph.adjacency and the output of typing "g" at the console is as below. I found that there are 196 NAs in my data. My data is normalized microarray data of blood from 661 human subjects. When there are NAs, do I need to add specific argument? I also found that when I entered " g <- simplify(g, remove.multiple=TRUE, remove.loops=TRUE)", there was an error message with "Error in simplify(g, remove.multiple = TRUE, remove.loops = TRUE) : unused arguments (remove.multiple = TRUE, remove.loops = TRUE)"

Thank you so much.

g <- graph.adjacency( as.matrix(as.dist(cor(t(datExpr_LG5), method="pearson"))), mode="undirected", weighted=TRUE, diag=FALSE )

g IGRAPH 6940ee1 UNW- 182 16471 -- + attr: name (v/c), weight (e/n) + edges from 6940ee1 (vertex names): [1] ID3--ANGPTL1 ID3--FAM129C ID3--SETBP1 ID3--AEBP1 ID3--CXXC5 ID3--CYB561A3 ID3--PLEKHA2 ID3--TIMELESS
[9] ID3--PMEPA1 ID3--CD200 ID3--CD79A ID3--ZCCHC18 ID3--FCRL5 ID3--ADAM28 ID3--ID3.1 ID3--ID3.2
[17] ID3--ADAM28.1 ID3--SWAP70 ID3--PKIG ID3--BCL11A ID3--CXXC5.1 ID3--TCF4 ID3--BCL11A.1 ID3--SIPA1L3
[25] ID3--FAIM3 ID3--BANK1 ID3--PIK3C2B ID3--ST6GAL1 ID3--POU2AF1 ID3--PKIG.1 ID3--TTN ID3--FCRL1
[33] ID3--PPAPDC1B ID3--PTPRK ID3--RIC3 ID3--VAV2 ID3--RRAS2 ID3--RNFT2 ID3--CNTNAP2 ID3--RIC3.1
[41] ID3--PTPRK.1 ID3--SWAP70.1 ID3--EBF1 ID3--DBNDD1 ID3--PPAPDC1B.1 ID3--BCL7A ID3--IGHV5.78 ID3--ZNF285
[49] ID3--MOB3B ID3--SIK1 ID3--TLR10 ID3--COBLL1 ID3--RIC3.2 ID3--PCDH9 ID3--FAM129C.1 ID3--LCN10
[57] ID3--E2F5 ID3--SNX22 ID3--RNFT2.1 ID3--STAP1 ID3--QRSL1 ID3--PMEPA1.1 ID3--RIC3.3 ID3--CR2
+ ... omitted several edges

ADD REPLY
0
Entering edit mode

If there are NAs in the data, then I am not sure how it manages to successfully produce the g object. The cor() function (with default settings) applied to data with missing values will throw an error. Perhaps it is producing a corrupt g object.

The key may be in the 'use' parameter, which is passed to cor().

You could try

g <- graph.adjacency( as.matrix(as.dist(cor(t(datExpr_LG5), use="pairwise.complete.obs", method="pearson"))), mode="undirected", weighted=TRUE, diag=FALSE )
ADD REPLY
0
Entering edit mode

Thanks! I tried the command as you suggested, but in the next step of "g <- simplify(g, remove.multiple=TRUE, remove.loops=TRUE)", I still get the error message with "Error in simplify(g, remove.multiple = TRUE, remove.loops = TRUE) : unused arguments (remove.multiple = TRUE, remove.loops = TRUE)". Could you please suggest any other method? I appreciate your help.

ADD REPLY
0
Entering edit mode

I have never seen that error for that function in the past. If you skip this step, are you able to produce the graph?

Other things to try:

  • what happens when you just type simplify ?
  • can you try g <- igraph::simplify(g, remove.multiple=TRUE, remove.loops=TRUE)
ADD REPLY
0
Entering edit mode

It works with your suggestion "g <- igraph::simplify(g, remove.multiple=TRUE, remove.loops=TRUE)". Thanks! Before you suggested the new graph.adjacency command, all values of both edgeweights and vSizes were NA. After I used your new graph.adjacency command, edgeweights are not NA. But all vSizes are still NA. I cannot produce the graph. Could you please suggest any other method in this situation?

ADD REPLY
0
Entering edit mode

Okay, that indicates that there is a namespace issue, i.e., there is some other package that is loaded in your R session that shares the same function names as those functions used by igraph package. You may consider saving your R session via save.image("MySession.rdata"), restarting R, loading just igraph package, and then loading your data with load("MySession.rdata") (and then trying the commands again).

ADD REPLY
0
Entering edit mode

Yes, the original "simplify" command works well when I load igraph only. However, vSizes are still NA. Maybe it is caused by NA in my data, as you suggested earlier. Could you let me know the method to deal with NA problem? Thanks!

ADD REPLY
0
Entering edit mode

For vSizes, you have to specify your original dataframe into the command. So, yours would be:

vSizes <- (scale01(apply(datExpr_LG5, 1, mean)) + 1.0) * 10

You can also just skip this part and specify your vertex sizes manually when calling the plot() function, eg:

vertex.size=3.0
ADD REPLY
0
Entering edit mode

Sure, I specified my original dataframe into the command. But it does not work. However, as you suggested, when I skipped that part and specified vertex sizes manually, it worked! Thank you so much. By the way, two thirds of my nodes don't have any edges. Could you let me know how to delete nodes without edges?

ADD REPLY
0
Entering edit mode

Great!

The nodes / vertices that have no edges are the ones whose original edge values fell below your chosen threshold for correlation. You can remove them using this function: http://igraph.org/r/doc/delete_vertices.html

ADD REPLY
0
Entering edit mode

OK, thank you very much for your kind help! I wish you success and happiness.

ADD REPLY
0
Entering edit mode

And you too, of course!

ADD REPLY
0
Entering edit mode

Thank you so much for this tutorial. I have applied it to my data and it works fantastic. I would like to ask about something though. I followed your directions exactly in creating a pearson correlation distance:

g <- graph.adjacency(
  as.matrix(as.dist(cor(t(POS_NSp), method="pearson"))),
  mode="undirected",
  weighted=TRUE,    
)

and the rest of the tutorial; however, I run into issues when I get to this command:

mst <- mst(g, algorithm="prim") 
mst.communities <- edge.betweenness.community(mst, directed=FALSE)

It generates the following error:

In edge.betweenness.community(mst, directed = FALSE) : At community.c:467 :Modularity calculation with weighted edge betweenness community detection might not make sense -- modularity treats edge weights as similarities while edge betwenness treats them as distances

If I read this correctly, this is saying that this command is not appropriate for weighted edges bc the edges are weighted and the command treats them as distances instead, so the groups you end up with in your graph may not be biologically relevant or even correct.

I have found this link explaining the issue as well: https://github.com/igraph/igraph/issues/1040

Do you happen to know a work-around for this? I have tried to set the weighted = NULL but I don't end up with any edges, just a bunch of balls floating in space.

Thank you so very much!!

ADD REPLY
0
Entering edit mode

That is a conundrum indeed! I read through the GitHub issue and it seems that there was no consensus on how to deal with this situation where edges are weighted and a user attempts to identify communities via edge betweenness. Well, the consensus appears to be to just issue a warning, which was not issued in the previous version of igraph on which I had built the tutorial.

The way around it (to make it through this tutorial at least) is is to just specify weights=NULL, which you tried for your own data. If you ended up with just the vertices and no edges, you may try to reduce the threshold that you use for delete_edges(). Perhaps your network is simply not very well connected.

There are many other community identification metrics, though. One quick and easy one is 'fast greedy':

mst.communities <- cluster_fast_greedy(mst)

That, when run with the subsequent commands on the tutorial data, produces:

kkk

One would have to look through each method and decide whether it is suitable for the underlying data. Some of the community identification methods are explained here: https://stackoverflow.com/questions/9471906/what-are-the-differences-between-community-detection-algorithms-in-igraph

ADD REPLY
0
Entering edit mode

Hi Kevin,

If I understand this correctly. I take my counts data and normalize it first e.g use rlog of the counts. Then I need to find the genes that have a significant effect as the code below. However, I am bit confuse of what I should input in where is says: "y ~ estrogen *time.h" . Is this meant to be one of my conditions?

lm.coef <- function(y) lm(y ~ estrogen * time.h)$coefficients
eff <- esApply(x, 1, lm.coef)

Thanks

ADD REPLY
0
Entering edit mode

Oh, you just need to start at Step 2. From that step, your rlog data should slot into the code where estrogenMainEffects is used. For this network analysis, though, you will struggle to work with 1000s of genes - it is mostly designed for smaller datasets. With your RNA-seq data, you would have to filter down based on variance on differential expression.

ADD REPLY
0
Entering edit mode

Hi Kevin, I have a very fundamental question about how to pre-processing the time course data. In your demonstration, you have several time points, my study is also time-course data. I'm very seriously considering should we subtract the baseline (e.g. 0hr, Day0 et al) from subsequent time points prior to construct the network. The major differences are if we take give time point expression value, that value number maybe around 10+ (as example), but if we use the expression data of DayX-Day0, the intensity value may go to 0.1, 0.5 et al and even with negative values. I don't know how that can impact on the network construction. However, the baseline value certainly influenced the latter time points. The variance from baseline would mislead us in the understanding of the network?

ADD REPLY
0
Entering edit mode

My preference would be to perform unbiased / unsupervised network construction separately at each time point (if possible), and then compare hub scores, vertex degrees, and module (community) memberships. These will likely differ across the different time-points.

For now, I have not worked specifically on time-course data in the context of network analysis.

ADD REPLY
0
Entering edit mode

Thank you, Kevin. In order to gain separate networks at each time point, will you suggest to use DayX-Day0 as input?

ADD REPLY
0
Entering edit mode

Yes:

1st network = Day -5
2nd network  = Day -4
3rd network = Day -3
4th network = Day -2
5th network = Day -1
6th network = Day 0
ADD REPLY
0
Entering edit mode

When I run the code

commSummary <- data.frame( + mst.communities$names, + mst.communities$membership, + mst.communities$modularity + )

I get this error:

Error in data.frame(mst.communities$names, mst.communities$membership, : arguments imply differing number of rows: 290, 288

ADD REPLY
0
Entering edit mode

Great tutorial, Kevin, thank you. However, I also have the same problem as Gabriel: modularity comes up short by one row. Do you know what is causing this or a solution? Thanks!

ADD REPLY
0
Entering edit mode

Can you view the contents of mst.communities just to check what could be happening? - like, display it on your screen or even write it to disk and view in Excel.

I believe the issue occurs when there is still a vertex (gene or protein) in your graph that has no edges to any other vertex.

ADD REPLY
1
Entering edit mode

Thanks for this: I had indeed omitted the removal of edgeless vertices.

ADD REPLY
0
Entering edit mode

might not be solving the root problem- but to display which exactly is the NA value and to create a df with unequal columns: here's what i did

max_length <- max(c(length(mst.communities$names), length(mst.communities$membership),length(mst.communities$modularity)))

commdata<- data.frame(names = c(mst.communities$names,                 
                            rep(NA, max_length - length(mst.communities$names))),
                   membership = c(mst.communities$membership,
                            rep(NA, max_length - length(mst.communities$membership))),
                  modularity = c(mst.communities$modularity,
                            rep(NA, max_length - length(mst.communities$modularity)))
                  )
ADD REPLY
0
Entering edit mode

Hi Kevin, I am wondering how you would remove the genes with no edges such as 36585_at in your example graphs. From the g or mst object?

Thanks!

ADD REPLY
1
Entering edit mode

Hey, can you possibly try this function: https://igraph.org/r/doc/delete_vertices.html ?

ADD REPLY
0
Entering edit mode

Hi Kevin,

Do you think is possible to use igraph if I have as input data a matrix with my genes of interest and their correlation values? I just want to do the network plot. My input data looks like this:

SAG 0.859557506
GRTP1   0.858435001
CABLES1 0.85801727
OSBP2   0.857637266
TMEM138 0.856453354
RHO 0.85296387
FABP12  0.849786065
CASQ1   0.84969198
AIPL1   0.849459516
GNGT1   0.845381864
CNGA1   0.845309857
EXOC6   0.843643641
PIWIL1  0.842648567
KCNV2   0.841007691
RAX2    0.839487872
MIR4730 0.83853894
CYB561A3    0.83783918
MFGE8   0.83550131
IGSF9   0.834742397
CABP4   0.832407545
PSD 0.828862757
AGPAT3  0.828571089
VIPR1   0.826306378
KRI1    0.826002416

Thanks!

ADD REPLY
1
Entering edit mode

If you already have a correlation matrix, then proceed from Step 2, but your first command would be something like:

g <- graph.adjacency(
  as.matrix(as.dist(x)),
  mode="undirected",
  weighted=TRUE,
  diag=FALSE)
ADD REPLY
0
Entering edit mode

Hello Kevin. Thank you for your great tutorial. But I have one question. I have 6 samples (3 controls and 3 treatment samples). I would like to do a co-expression plot to know which genes have changed I have the reads for each sample and the foldchange, do you think this can be used as part of your tutorial?

Thanks in advance.

ADD REPLY
0
Entering edit mode

Hi Kevin, I was running your excellent, very informative code, but noticed at the end of step 2 a warning; specifically when plotting the tree object. I know a warning is not an error, yet I found it remarkable so I had a closer look at it.

The warning:

> plot(
+   mst,
+   layout=layout.fruchterman.reingold,
+   edge.curved=TRUE,
+   vertex.size=vSizes,
+   vertex.label.dist=-0.5,
+   vertex.label.color="black",
+   asp=FALSE,
+   vertex.label.cex=0.6,
+   edge.width=edgeweights,
+   edge.arrow.mode=0,
+   main="My first graph")
Warning messages:
1: In layout[, 1] + label.dist * cos(-label.degree) * (vertex.size +  :
  longer object length is not a multiple of shorter object length
2: In layout[, 2] + label.dist * sin(-label.degree) * (vertex.size +  :
  longer object length is not a multiple of shorter object length
>

It turns out that this is due to the fact that when edges are being removed [the step @ delete_edges()], this results in 2 vertices (nodes) that are being removed, so g consists of 48 vertices (and not 50 anymore). When a scaling vector (=vSizes) for the vertices is calculated (scaling is proportional to expression level), the length of vSizes is still 50 (because object estrogenMainEffects consists of 50 probesets).

When g and vSizes are then combined in the plotting command of the last step of part 2, this results in the above warning.

I was able to 'eliminate' this warning by simply creating/adding the vSizes slot to the object g, before the delete_edges() step. By doing so, all slots are properly being subset when filtering. However, be sure to use/change the argument to vertex.size=V(g)$vSizes (from vertex.size=vSizes) when using the plotting command.

Maybe a suggestion to move up this part of code in your tutorial as well?

Thanks for all your work! G

> #<<snip>>
> 
> g <- graph.adjacency(
+   as.matrix(as.dist(cor(t(estrogenMainEffects), method="pearson"))),
+   mode="undirected",
+   weighted=TRUE,
+   diag=FALSE
+ )
> 
> 
> g <- simplify(g, remove.multiple=TRUE, remove.loops=TRUE)
> 
> # Colour negative correlation edges as blue
> E(g)[which(E(g)$weight<0)]$color <- "darkblue"
> 
> # Colour positive correlation edges as red
> E(g)[which(E(g)$weight>0)]$color <- "darkred"
> 
> # Convert edge weights to absolute values
> E(g)$weight <- abs(E(g)$weight)
> 
> **# Now already add size of the vertices!**
> scale01 <- function(x){(x-min(x))/(max(x)-min(x))}
> V(g)$vSizes <- (scale01(apply(estrogenMainEffects, 1, mean)) + 1.0) * 10
> 
> # Continue with tutorial code
> # Remove edges below absolute Pearson correlation 0.8
> g <- delete_edges(g, E(g)[which(E(g)$weight<0.8)])
> 
> # Remove any vertices remaining that have no edges
> g <- delete.vertices(g, degree(g)==0)
> 
> #<<snip>>
> 
> # Convert the graph adjacency object into a minimum spanning tree based on Prim's algorithm
> mst <- mst(g, algorithm="prim")
> 
> # Plot the tree object
> # Note that the argument vertex.size now has to be changed to: vertex.size=V(g)$vSizes
> plot(
+   mst,
+   layout=layout.fruchterman.reingold,
+   edge.curved=TRUE,
+   vertex.size=V(g)$vSizes,
+   vertex.label.dist=-0.5,
+   vertex.label.color="black",
+   asp=FALSE,
+   vertex.label.cex=0.6,
+   edge.width=edgeweights,
+   edge.arrow.mode=0,
+   main="My first graph")
> 
> # No warnings!
ADD REPLY
0
Entering edit mode

Hi Kevin Blighe, thank you very much. It has been incredibly helpful. Could you kindly guide me on constructing an adjacency matrix using nearest neighbor distances from the nn2 function in the RANN package? I'd like to confirm if my approach is correct:

neighbors <- nn2(combined_data, combined_data, k = 10)
adjacency_matrix <- as.dist(neighbors$nn.dists)

# Build a graph using the adjacency matrix
graph <- graph.adjacency(adjacency_matrix, mode = "undirected")

Your insights are greatly appreciated.

ADD REPLY
8
Entering edit mode
5.4 years ago

Also, as a complement, check out this tutorial on network visualization with R.

ADD COMMENT

Login before adding your answer.

Traffic: 2698 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6