Hi, I'm working with meta-transcriptomic expression data from a symbiotic species (host and algal symbiont) from ten experimental cohorts (islands) each with ~9 biological replicates. I am trying to generate a set of consensus modules (shared across all ten islands) that I can than use to compare each individual island to in a manner analogous to Part II of the WGCNA tutorial "Consensus analysis of female and male liver expression data." My trouble, however, is that during cohort-wise selection of a soft power threshold, several of my island cohorts do not reach scale free topology values of ~0.9. Thus, I'm unable to select an appropriate soft power threshold for construction of the consensus gene network as per the tutorial instructions. I assume this problem has something to do with the normalization protocol I've applied to my RNA-seq data, but I'm not sure how best to proceed. Any suggestions would be greatly appreciated! I've attached my relevant code below:
# Step 1 - Load holobiont normalized expression quantitation data (TPM)
RNAseq_TPM <- read.table(argsL$readCount, sep="\t", h=T, stringsAsFactors = F, row.names=1)
# Step 2 - Preprocessing to remove genes with fewer than 10 counts overall and with counts in less than 20% of colonies.
RNAseq_highCount <- RNAseq_TPM[rowSums(RNAseq_TPM) >= 10, ]
RNAseq_filt <- RNAseq_highCount[apply(RNAseq_highCount,1,function(x) sum(x==0))<ncol(RNAseq_highCount)*0.8,]
# Step 3 - Normalize expression data using the voom methodology (Law et al. 2014).
library(edgeR)
# specify groups to be compared (in this case islands)
ile <- substr(colnames(RNAseq_filt), 1, 3)
# Specify the model to be fitted.
# We do this before using voom since voom uses variances of the model residuals.
d0 <- DGEList(RNAseq_filt, group=ile, remove.zeros=T)
d0 <- calcNormFactors(d0)
keep <- rowSums(cpm(d0)>10) >= 20
d <- d0[keep, , keep.lib.sizes=FALSE]
dim(d) # number of genes remaining
mm <- model.matrix(~0 + ile)
RNAseq_voom <- voom(d, mm, plot = T)$E
# Step 4 - Preparing data for downstream analysis
# Split data into separate dataframes by Island
I06_RNAseq_voom <- RNAseq_voom[,grep("I06", colnames(RNAseq_voom))]
I10_RNAseq_voom <- RNAseq_voom[,grep("I10", colnames(RNAseq_voom))]
I15_RNAseq_voom <- RNAseq_voom[,grep("I15", colnames(RNAseq_voom))]
I04_RNAseq_voom <- RNAseq_voom[,grep("I04", colnames(RNAseq_voom))]
I01_RNAseq_voom <- RNAseq_voom[,grep("I01", colnames(RNAseq_voom))]
I03_RNAseq_voom <- RNAseq_voom[,grep("I03", colnames(RNAseq_voom))]
I05_RNAseq_voom <- RNAseq_voom[,grep("I05", colnames(RNAseq_voom))]
I07_RNAseq_voom <- RNAseq_voom[,grep("I07", colnames(RNAseq_voom))]
I08_RNAseq_voom <- RNAseq_voom[,grep("I08", colnames(RNAseq_voom))]
I09_RNAseq_voom <- RNAseq_voom[,grep("I09", colnames(RNAseq_voom))]
# Specify the number of dataframes (i.e., sets of data) created above:
nSets = 10
# For easier labeling of plots, create a vector holding descriptive names of the two sets.
setLabels = c("I06_RNAseq_voom", "I10_RNAseq_voom","I15_RNAseq_voom","I04_RNAseq_voom","I01_RNAseq_voom","I03_RNAseq_voom","I05_RNAseq_voom","I07_RNAseq_voom","I08_RNAseq_voom","I09_RNAseq_voom")
shortLabels = c("I06", "I10","I15","I04","I01","I03","I04","I05","I07","I08","I09")
# Create a dataframe containing the multi-set data:
multiExpr = vector(mode = "list", length = nSets)
multiExpr[[1]] = list(data = as.data.frame(t(I06_RNAseq_voom)))
names(multiExpr[[1]]$data) = rownames(I06_RNAseq_voom)
rownames(multiExpr[[1]]$data) = colnames(I06_RNAseq_voom)
etc.......
After generating this multiExpr object, I then proceed to step Tutorial step 2.a.1 Choosing the soft-thresholding power as follows:
powers = c(c(1:10), seq(12,20, by=2));
# Initialize a list to hold the results of scale-free analysis
powerTables = vector(mode = "list", length = nSets);
# Call the network topology analysis function for each set in turn
for (set in 1:nSets){
powerTables[[set]] = list(data = pickSoftThreshold(multiExpr[[set]]$data, powerVector=powers,verbose = 2)[[2]])}
However, a plot of the results suggests that my normalization methodology worked well for only three cohorts (I06, I10, and I15) whereas the others do not seem to reach appropriate scale free topology values.
Thanks in advance! Eric
Doesn't the tutorial mention to use rlog or vst normalized values as input?
It does, but DESeq2 does not accept TPM data as an input becasue the TPM expression count values aren't integers. Thus, I decided to go the voom transformation route.