Question

Consensus WGCNA: soft-threshold power selection

0

Entering edit mode

4.1 years ago

armstrong • 0

Hi, I'm working with meta-transcriptomic expression data from a symbiotic species (host and algal symbiont) from ten experimental cohorts (islands) each with ~9 biological replicates. I am trying to generate a set of consensus modules (shared across all ten islands) that I can than use to compare each individual island to in a manner analogous to Part II of the WGCNA tutorial "Consensus analysis of female and male liver expression data." My trouble, however, is that during cohort-wise selection of a soft power threshold, several of my island cohorts do not reach scale free topology values of ~0.9. Thus, I'm unable to select an appropriate soft power threshold for construction of the consensus gene network as per the tutorial instructions. I assume this problem has something to do with the normalization protocol I've applied to my RNA-seq data, but I'm not sure how best to proceed. Any suggestions would be greatly appreciated! I've attached my relevant code below:

# Step 1 - Load holobiont normalized expression quantitation data (TPM)

RNAseq_TPM <- read.table(argsL$readCount, sep="\t",  h=T, stringsAsFactors = F, row.names=1)

# Step 2 - Preprocessing to remove genes with fewer than 10 counts overall and with counts in less than 20% of colonies.

RNAseq_highCount <- RNAseq_TPM[rowSums(RNAseq_TPM) >= 10, ]
RNAseq_filt <- RNAseq_highCount[apply(RNAseq_highCount,1,function(x) sum(x==0))<ncol(RNAseq_highCount)*0.8,]

# Step 3 - Normalize expression data using the voom methodology (Law et al. 2014).

library(edgeR)

# specify groups to be compared (in this case islands)
ile <- substr(colnames(RNAseq_filt), 1, 3)

# Specify the model to be fitted. 
# We do this before using voom since voom uses variances of the model residuals.

d0 <- DGEList(RNAseq_filt, group=ile, remove.zeros=T)
d0 <- calcNormFactors(d0)
keep <- rowSums(cpm(d0)>10) >= 20
d <- d0[keep, , keep.lib.sizes=FALSE]
dim(d) # number of genes remaining

mm <- model.matrix(~0 + ile)
RNAseq_voom <- voom(d, mm, plot = T)$E

# Step 4 - Preparing data for downstream analysis

# Split data into separate dataframes by Island

I06_RNAseq_voom <- RNAseq_voom[,grep("I06", colnames(RNAseq_voom))]
I10_RNAseq_voom <- RNAseq_voom[,grep("I10", colnames(RNAseq_voom))]
I15_RNAseq_voom <- RNAseq_voom[,grep("I15", colnames(RNAseq_voom))]
I04_RNAseq_voom <- RNAseq_voom[,grep("I04", colnames(RNAseq_voom))]
I01_RNAseq_voom <- RNAseq_voom[,grep("I01", colnames(RNAseq_voom))]
I03_RNAseq_voom <- RNAseq_voom[,grep("I03", colnames(RNAseq_voom))]
I05_RNAseq_voom <- RNAseq_voom[,grep("I05", colnames(RNAseq_voom))]
I07_RNAseq_voom <- RNAseq_voom[,grep("I07", colnames(RNAseq_voom))]
I08_RNAseq_voom <- RNAseq_voom[,grep("I08", colnames(RNAseq_voom))]
I09_RNAseq_voom <- RNAseq_voom[,grep("I09", colnames(RNAseq_voom))]

# Specify the number of dataframes (i.e., sets of data) created above:

nSets = 10

# For easier labeling of plots, create a vector holding descriptive names of the two sets.

setLabels = c("I06_RNAseq_voom", "I10_RNAseq_voom","I15_RNAseq_voom","I04_RNAseq_voom","I01_RNAseq_voom","I03_RNAseq_voom","I05_RNAseq_voom","I07_RNAseq_voom","I08_RNAseq_voom","I09_RNAseq_voom")
shortLabels = c("I06", "I10","I15","I04","I01","I03","I04","I05","I07","I08","I09")

# Create a dataframe containing the multi-set data:

multiExpr = vector(mode = "list", length = nSets)

multiExpr[[1]] = list(data = as.data.frame(t(I06_RNAseq_voom)))
names(multiExpr[[1]]$data) = rownames(I06_RNAseq_voom)
rownames(multiExpr[[1]]$data) = colnames(I06_RNAseq_voom)
etc.......

After generating this multiExpr object, I then proceed to step Tutorial step 2.a.1 Choosing the soft-thresholding power as follows:

powers = c(c(1:10), seq(12,20, by=2));
    # Initialize a list to hold the results of scale-free analysis
      powerTables = vector(mode = "list", length = nSets);
    # Call the network topology analysis function for each set in turn
      for (set in 1:nSets){
      powerTables[[set]] = list(data = pickSoftThreshold(multiExpr[[set]]$data, powerVector=powers,verbose = 2)[[2]])}

However, a plot of the results suggests that my normalization methodology worked well for only three cohorts (I06, I10, and I15) whereas the others do not seem to reach appropriate scale free topology values.

Scale Free Topology

Thanks in advance! Eric

wgcna • 1.3k views

ADD COMMENT • link 4.1 years ago by armstrong • 0

0

Entering edit mode

Doesn't the tutorial mention to use rlog or vst normalized values as input?

ADD REPLY • link 4.1 years ago by piyushjo ▴ 700

0

Entering edit mode

It does, but DESeq2 does not accept TPM data as an input becasue the TPM expression count values aren't integers. Thus, I decided to go the voom transformation route.

ADD REPLY • link 4.1 years ago by armstrong • 0

score 1 · Answer 1 · 2020-03-31

1

Entering edit mode

4.1 years ago

andres.firrincieli 3.6k

I do not get the filtering steps 2 and 3. First, unless you have the exact lenght of the transcripts I would not use TPM on meta-transcriptomic data. Second, your transcripts are already scaled by a factor of 1 million (TPM) but then you use cpm() on TPM normalized read count to filter out low expressed genes. This could completely mess up the normalization.

Personally, I would start with the raw read count data, then use cpm() to filter low count genes and then use the vst() - rlog() transformation

ADD COMMENT • link 4.1 years ago by andres.firrincieli 3.6k

0

Entering edit mode

I see your point about doubling down on the "per million" normalization step. However, I started with the TPM matrix because using raw counts greatly reduced the scale free topology values for all cohorts. Here's an example using raw counts and vsd:

<h6>Step 1 - Load holobiont expression data</h6>

RNAseq_raw <- read.table(argsL$readCount,  h=T, stringsAsFactors = F, row.names=1)

<h6>Step 2 - Prepare DESeq Object</h6>

library("DESeq2")

colData <- data.frame(Colony = substr(unique(colnames(RNAseq_raw)), 1, 9), 
                      Ile = substr(unique(colnames(RNAseq_raw)), 1, 3))

dds <- DESeqDataSetFromMatrix(countData = RNAseq_raw, colData = colData, 
                              design = ~ Ile)

<h6>Step 3 - Prefilter data to remove low count genes and perform DESeq</h6>

dds <- estimateSizeFactors(dds)
dds <- dds[rowSums(counts(dds, normalized=TRUE)) >= 10,]

dds <- DESeq(dds)

<h6>Step 4 - Normalize expression data</h6>

RNAseq_vsd <- assay(varianceStabilizingTransformation(dds, blind=FALSE))

Using this normalized count matrix as the input for the scale free topology analysis yields more consistent curves between the cohorts, but no data set reaches the recommend 0.9 cutoff. Scale Free Topology After VSD Transformation

ADD REPLY • link 4.1 years ago by armstrong • 0

2

Entering edit mode

In my experience the recommended cutoff depends upon the data and is not a standard cut off for all data. You would have to set it depending where you see the knee in the distribution, the value just below it.

ADD REPLY • link 4.1 years ago by piyushjo ▴ 700