Question

strange output from edgeR when searching differential expressed genes

0

Entering edit mode

5.5 years ago

tujuchuanli ▴ 100

HI, I want to find differential expressed genes in TCGA datasets by edgeR. I used RNA-seq data measured by RPKM as input. The below code is basically followed edgeR manual section4.2 (https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) combined with a post on Biostars (https://www.biostars.org/p/212200/).

#reading files
file1="cancer.exprs.txt"

file2="normal.exprs.txt"

cancer.exprs = read.table(file = file1,header=T,sep="\t")

normal.exprs = read.table(file = file2,header=T,sep="\t")

#combine cancer and normal expression data and filter by median of expression value
exprs=cbind(cancer.exprs,normal.exprs)

rownames(exprs) = rownames(cancer.exprs)

median = apply(exprs,1,median)

exprs.filter = exprs[which(median >= 0.1),]

#put the expression data into edgeR object
y = DGEList(counts = exprs.filter, genes = rownames(exprs.filter))

#construct design model
Tissue = factor(c(rep("T",dim(cancer.exprs)[2]),rep("N",dim(normal.exprs)[2])))

data.frame(Sample = colnames(y),Tissue)

design = model.matrix(~0+Tissue)

#compute differential expressed genes

y = estimateDisp(y, design, robust=TRUE)

y$common.dispersion

fit = glmFit(y, design)

lrt = glmLRT(fit)

topTags(lrt,n=100)

cancer.exprs.txt and normal.exprs.txt are two files containing expression value measured by RPKM from cancer and normal samples. In these two files each column represents each sample and each row represents each gene. The first column and row are sample id and gene id. After running the code, I find that all genes are up-regulated genes with very huge logFC (I think it is log foldchange). Even I used cancer.exprs.txt as both cancer and normal input file, this code also consider all genes are up-regulated genes with very huge logFC. What is the wrong with me? Thanks

RNA-seq edgeR • 1.5k views

ADD COMMENT • link updated 5.5 years ago by Michael 54k • written 5.5 years ago by tujuchuanli ▴ 100

0

Entering edit mode

I will try raw counts. But I think it is unreasonable to ban to use RPKM as input. Under this situation, I can do nothing but use raw counts? Is there some ways to use RPKM as input? Thanks

ADD REPLY • link 5.5 years ago by tujuchuanli ▴ 100

0

Entering edit mode

RPKM has not been "banned" as input. edgeR uses a negative binomial distribution to model read counts, because read counts follow reasonably well the negative binomial distribution (whereas RPKMs do not) . So it is unreasonable to use data that do not follow the assumptions of the model, and expect sensible results.

ADD REPLY • link 5.5 years ago by h.mon 35k

0

Entering edit mode

Well, I tried raw counts as input. However the outcome still show that all genes are up-regulated genes. I think it is not because of RPKM (Although using raw counts is correct). It is probably because of my code. Is there something wrong about my code? Thanks.

ADD REPLY • link 5.5 years ago by tujuchuanli ▴ 100

score 2 · Answer 1 · 2018-10-06

cancer.exprs.txt and normal.exprs.txt are two files containing expression value measured by RPKM from cancer and normal samples.

That's your problem. The EdgeR documentation is clear that the software expects raw read counts.

edgeR works on a table of integer read counts, with rows corresponding to genes and columns to independent libraries