Question

Differential Gene expression between low survival and high survival samples from RNAseq

2

Entering edit mode

6.1 years ago

lawarde.ankita1 ▴ 70

Hello,

I have one question regarding the approach of differential gene expression analysis from RNAseq data of cancer.

Approach:

1) To find out the Defferentially Expressed Genes between low survived and high survived samples of patients. (example: here in my case, patients survived less than 1 year and patients survived more than 3 years)

I am using GDC data of lung cancer (TCGA-LUAD) and using this above approach i want to find the DEGs. So for this analysis i am using and following TCGAbiolinks package and its tutorials. but even if the TCGAbiolinks analysis is giving the DEG number but the heatmap of those genes is not showing any differentiating pattern. So i did the same analysis using edgeR analysis and limma using the expression set obtained from summarizedExperiment for the DEG, but then this analysis doesn't give any DEGs between less than 1 year and more than 3 years samples of patients.

SO my question is,

1) Is this approach of finding DEGs between less than 1 year and more than 3 years is correct? or is there anything i am missing in the analysis?

Suggestions regarding this approach will be very helpful and it will surly help me to clear my confusions. I look forward to hear valuable suggestions.

Thank you in advance.

rna-seq R DEG Survival • 3.2k views

ADD COMMENT • link updated 2.8 years ago by Bagher • 0 • written 6.1 years ago by lawarde.ankita1 ▴ 70

0

Entering edit mode

Can you please detail further how you used edgeR? and the connection to limma..

ADD REPLY • link 6.1 years ago by roy.granit ▴ 880

0

Entering edit mode

Hello roy.granit,

here i provide with the R code i used for the DEG analysis usign edgeR, limma,

#RNAseq- analysis,
library(edgeR)
library(limma)
library(gplots)
library(org.Hs.eg.db) #for annotations
library(RColorBrewer)
library(TCGAbiolinks)
library(DT)

query.exp <- GDCquery(project = "TCGA-LUAD", 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      experimental.strategy = "RNA-Seq")

GDCdownload(query.exp)

luad.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "TCGA_LUAD_Exp.rda", 
                       summarizedExperiment = TRUE)

save(luad.exp,file = "luad_exp.RData")

load(file = "luad_exp.RData") #brca.exp
luad.exp

library(SummarizedExperiment)
# get expression matrix
data <- assay(luad.exp)

I have clinical info from the datatype follow_up and i have made two files containing patient clinical data for days to death more than three years and less than three years. the following code i used to match those barcodes of patients with the sample barcodes.

more3_years <- read.delim(file = "more_3_years.txt", sep = "\t", header = TRUE)

less1_year <- read.delim(file = "less_than_or_equal_1_year.txt", sep = "\t", header = TRUE)

data_days_to_death_3year <- subset(data, select = substr(colnames(data),1,12) %in% more3_years$bcr_patient_barcode)

data_days_to_death_1year <- subset(data, select = substr(colnames(data),1,12) %in% less1_year$bcr_patient_barcode)

data_1year_3year <- cbind(data_days_to_death_1year,data_days_to_death_3year)

Following is the RNA seq analysis, this im following according to this tutorial ( http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html )

#Filtering to remove lowly expressed genes
myCPM <- cpm(data_1year_3year)
head(myCPM)

# Which values in myCPM are greater than 0.5?
thresh <- myCPM > 0.5

head(thresh)

table(rowSums(thresh))
# we would like to keep genes that have at least 2 TRUES in each row of thresh
keep <- rowSums(thresh) >= 2

# Subset the rows of countdata to keep the more highly expressed genes
counts.keep <- data_1year_3year[keep,]

# DGElist object
yy <- DGEList(counts.keep)

#add annotations 
ensemble_id <- rownames(yy)

genes <-select(org.Hs.eg.db, keys=ensemble_id ,columns = "SYMBOL", keytype="ENSEMBL")

genes <- genes[!duplicated(genes$ENSEMBL),]
yy$genes <- genes

group <- c(rep("less_than_1",51),rep("more_than_3",40))
group <- factor(group)

#Normalisation for composition bias
# Apply normalisation to DGEList object
y <- calcNormFactors(yy)

#Differential expression with limma-voom

# Specify a design matrix without an intercept term
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
desig
#Voom transform the data
v <- voom(y,design,plot = TRUE)

#Testing for differential expression
# Fit the linear model

fit <- lmFit(v)
cont.matrix <- makeContrasts(diff= more_than_3-less_than_1, levels=design)
fit2 <- eBayes(contrasts.fit(fit, cont.matrix))
summa.fit <- decideTests(fit2)
summary(summa.fit)
dia_ven <- vennDiagram(summa.fit,include = c("up","down"))

tt_1 <- topTable(fit2, coef =   ,p.value = 0.05,lfc = 2, number = nrow(v$E))
dim(tt_1)

The topTable gives zero DEGs If this process to find DEG is wrong then please let me know that, it would really help me.

ADD REPLY • link 6.1 years ago by lawarde.ankita1 ▴ 70

0

Entering edit mode

Sorry to add this code as a separate comment, but the number of charcters in the post was exceeding than the 5000

Below is the code where i have followed TCGAbiolinks tutorial,

library(TCGAbiolinks)
library(DT)

query.exp <- GDCquery(project = "TCGA-LUAD", 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      experimental.strategy = "RNA-Seq")

GDCdownload(query.exp)

luad.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "TCGA_LUAD_Exp.rda", 
                       summarizedExperiment = TRUE)


#get clinical info

query <- GDCquery(project = "TCGA-LUAD", 
                  data.category = "Clinical")

GDCdownload(query)

clinical <- GDCprepare_clinic(query, clinical.info = "follow_up")

datatable(clinical, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)

names(clinical)

write.table(clinical, file = "clinical_followup.txt",sep = "\t")

more3_years <- read.delim(file = "more_3_years.txt", sep = "\t", header = TRUE)
colnames(more3_years)

less1_year <- read.delim(file = "less_than_or_equal_1_year.txt", sep = "\t", header = TRUE)
colnames(less1_year)

dataPrep <- TCGAanalyze_Preprocessing(object = luad.exp , cor.cut = 0.6, filename = "LUAD_correlation_sample.png")

dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = TCGAbiolinks::geneInfoHT,
                                      method = "gcContent")

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25) 

dataFilt_3year <- subset(dataFilt, select = substr(colnames(dataFilt),1,12) %in% more3_years$bcr_patient_barcode)
dataFilt_1year <- subset(dataFilt, select = substr(colnames(dataFilt),1,12) %in% less1_year$bcr_patient_barcode)
dim(dataFilt_1year)
dim(dataFilt_3year)

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt_3year,
                            mat2 = dataFilt_1year,
                            Cond1type = "more than 3 years",
                            Cond2type = "less than 1 year",
                            fdr.cut = 0.05 ,
                            logFC.cut = 2,
                            method = "glmLRT")

these two methods i used to find the DEG, if there is anything wrong with one or the method please let me know as this would help me clear my doubt. And please tell the approach I'm using to find the DEG is correct or not correct.

Thank you so much for the help.

ADD REPLY • link 6.1 years ago by lawarde.ankita1 ▴ 70

0

Entering edit mode

in this code :

#RNAseq- analysis,
library(edgeR)
library(limma)
library(gplots)
library(org.Hs.eg.db) #for annotations
library(RColorBrewer)
library(TCGAbiolinks)
library(DT)

query.exp <- GDCquery(project = "TCGA-LUAD", 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      experimental.strategy = "RNA-Seq")

GDCdownload(query.exp)

luad.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "TCGA_LUAD_Exp.rda",summarizedExperiment = TRUE)

save(luad.exp,file = "luad_exp.RData")

load(file = "luad_exp.RData") #brca.exp
luad.exp

library(SummarizedExperiment)
# get expression matrix
data <- assay(luad.exp)
  i have this error : 
> save(luad.exp,file = "luad_exp.RData")
Error in save(luad.exp, file = "luad_exp.RData") : 
  object ‘luad.exp’ not found

can anyone help me ?

ADD REPLY • link updated 2.8 years ago by GenoMax 141k • written 2.8 years ago by Bagher • 0

0

Entering edit mode

save(luad.exp,file = "luad_exp.RData") Error in save(luad.exp, file = "luad_exp.RData") : object ‘luad.exp’ not found ?????

ADD REPLY • link 2.8 years ago by Bagher • 0

0

Entering edit mode

That error says you dont have "luad.exp" object in your environment, so R complaining and says it can not save an object which does not exist.

Also since you specified "save.filename = "TCGA_LUAD_Exp.rda" in your command luad.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "TCGA_LUAD_Exp.rda",summarizedExperiment = TRUE), there is no point to do save(luad.exp,file = "luad_exp.RData") at least at this step.

ADD REPLY • link 2.8 years ago by Hamid Ghaedi 3.2k

score 0 · Answer 1 · 2021-06-17

0

Entering edit mode

2.8 years ago

Bagher • 0

Thanks a lot Mr.Hamid Ghaedi but there was another problem :

data <- assay(luad.exp)

Error in h(simpleError(msg, call)) :

error in evaluating the argument 'x' in selecting a method for function 'assay': object 'luad.exp' not found

ADD COMMENT • link 2.8 years ago by Bagher • 0