Differential Gene Expression analysis in Bulk RNA Seq - using Count Matrix as input
9 months ago
applepie • 0

Hello everyone, I am going to run a differential gene expression (DEG) analysis on bulk RNA-seq data. The samples are NAFLD samples downloaded from the NCBI Gene Expression Omnibus (GEO) (link to the dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135251). When I tried to download the dataset, I realized that many count matrix files are provided (see the attached photo). I have two questions:

1) Is it normal to have so many count matrices? 2) If yes, which count matrix should I use for downstream DEG analysis with DESeq2? Or should I use all of them?

Thank you!

[Image: screenshot of the count files listed for the dataset on GEO]

DESeq2 BulkRNASeq • 1.2k views
9 months ago
ATpoint 82k

Each file contains a single column with the counts for one sample. You can load them all into R and combine them into a single matrix of raw counts. To do this, download everything via Select All, which gives you a tarball (.tar). Unpack it with tar xf that.tar, then use this snippet in R:

# list all files from the tarball (unpack it first in bash with: tar xf tarball.tar)
listed <- list.files("/Users/atpoint/Downloads/data/", pattern="^GSM", full.names=TRUE)
listed <- grep("txt.gz$", listed, value=TRUE)

# load every file as a one-column data.frame with gene IDs as row names
raw.counts <- lapply(listed, function(x){

  r <- read.delim(x, header=FALSE, row.names=1)
  # name the column after the file being read
  colnames(r) <- gsub("\\.counts.*", "", basename(x))
  r

})

# combine into one table (genes in rows, samples in columns)
raw.counts <- do.call(cbind, raw.counts)

# preview the first three genes and samples (names of samples 2 and 3 elided here)
raw.counts[1:3,1:3]
                GSM3998167_017-Ann-Daly_S1 <sample 2> <sample 3>
ENSG00000000003                       2565       2400       2391
ENSG00000000005                          0         14          0
ENSG00000000419                        605        525        709

You can then use this matrix for DE analysis with DESeq2, edgeR, or limma.
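For the DESeq2 route, a minimal sketch might look like the following. Everything about the sample table is an assumption for illustration: the condition column, the "control"/"NAFLD" labels, and which sample belongs to which group are hypothetical and would have to come from the GEO sample annotation. To keep the sketch runnable on its own, simulated counts from makeExampleDESeqDataSet stand in for the combined matrix, and the low-count filter threshold of 10 is just a common choice, not something this dataset prescribes.

```r
library(DESeq2)

# Stand-in for the combined matrix (genes in rows, samples in columns).
# With the real data, raw.counts would be the matrix built from the GEO files.
set.seed(1)
example <- makeExampleDESeqDataSet(n = 1000, m = 6)
raw.counts <- counts(example)

# Hypothetical sample table: one row per column of raw.counts, same order.
coldata <- data.frame(
  condition = factor(rep(c("control", "NAFLD"), each = 3),
                     levels = c("control", "NAFLD")),
  row.names = colnames(raw.counts)
)

# Build the DESeq2 object from raw (unnormalized) integer counts.
dds <- DESeqDataSetFromMatrix(countData = as.matrix(raw.counts),
                              colData   = coldata,
                              design    = ~ condition)

# Optional pre-filter: drop genes with very few reads overall.
dds <- dds[rowSums(counts(dds)) >= 10, ]

# Fit the model and extract NAFLD vs control results.
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "NAFLD", "control"))

# Genes ordered by adjusted p-value.
head(res[order(res$padj), ])
```

The important constraint is that the row names of coldata match the column names of the count matrix, in the same order.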


This is a great way to extract a count matrix for RNA-seq.

Thanks!


It's custom to this dataset and cannot be applied generally, since GEO is not uniform in terms of what is supplied in the supplementary files.
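Because the layout varies between datasets, it may be worth checking the assumptions before binding columns. Below is a rough sketch of a more defensive loader. The helper name combine_counts is hypothetical, and the two toy files (made-up GSM names and counts) exist only so the example runs on its own. It assumes headerless, tab-separated files with gene IDs in the first column and counts in the second.

```r
# Hypothetical helper: read per-sample count files and bind them column-wise,
# stopping if a file has an unexpected shape or the gene IDs differ.
combine_counts <- function(files) {
  per_sample <- lapply(files, function(f) {
    r <- read.delim(f, header = FALSE, row.names = 1)
    if (ncol(r) != 1) stop("expected exactly one count column in ", basename(f))
    colnames(r) <- sub("\\.counts.*", "", basename(f))
    r
  })
  # cbind() matches rows by position, so the gene order must be identical
  ids <- rownames(per_sample[[1]])
  same <- vapply(per_sample, function(r) identical(rownames(r), ids), logical(1))
  if (!all(same)) {
    stop("gene IDs differ in: ", paste(basename(files[!same]), collapse = ", "))
  }
  as.matrix(do.call(cbind, per_sample))
}

# Toy demonstration: two made-up gzipped files in a temporary directory
tmp <- tempfile("counts_demo")
dir.create(tmp)
genes <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419")
write_toy <- function(name, values) {
  con <- gzfile(file.path(tmp, paste0(name, ".counts.txt.gz")), "w")
  writeLines(paste(genes, values, sep = "\t"), con)
  close(con)
}
write_toy("GSM0000001_toyA", c(10, 0, 5))
write_toy("GSM0000002_toyB", c(12, 3, 7))

files <- list.files(tmp, pattern = "^GSM.*txt\\.gz$", full.names = TRUE)
toy.counts <- combine_counts(files)
print(toy.counts)
```

If the check fails for some dataset, merging by gene ID instead of using cbind would be the safer fallback.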

9 months ago
Pei ▴ 170

I guess that each counts.txt.gz file is just one sample, so you will find a total of 216 counts.txt.gz files. Each file can be used as one column in your count matrix for the downstream DE analysis. Am I right?

9 months ago
Ayeh • 0

As I understand it, every counts txt.gz file is for one sample, and you can create the count matrix for DEG analysis by combining the txt files column-wise.
