Question

Differential analysis with microarray data without idat or cel files

1

Entering edit mode

5.4 years ago

Biologist ▴ 290

Hi,

I have downloaded LncRNA Array 3.0 (ArrayStar) microarray data from GEO. The data is in txt format. I don't see any idat or cel files for analysis with Limma or other Bioconductor packages.

The data looks like this.

ID_REF        GSM1485226    GSM1485227  GSM1485228  GSM1485229
ASHGA5P000001   10.142583   10.281497   10.830916   11.722805
ASHGA5P000003   5.500927    5.998252    5.542863    6.128017
ASHGA5P000005   4.150758    6.711603    8.305518    2.3658721
ASHGA5P000008   5.0700736   5.804977    6.241566    5.4667664
ASHGA5P000009   9.3204  8.438274    9.545782    8.75934
ASHGA5P000014   5.382101    6.172776    6.154235    2.355728
ASHGA5P000016   7.0331154   7.4848285   8.088155    6.629342
ASHGA5P000019   5.7742567   5.851971    6.6640825   5.9498987
ASHGA5P000020   5.7120953   6.4586296   6.5912247   5.906252
ASHGA5P000021   7.779541    7.5118065   7.502316    7.2784443
ASHGA5P000022   6.2820516   7.2403784   7.0809016   6.71879
ASHGA5P000023   6.919316    7.3297606   5.7048583   7.0136003
ASHGA5P000024   7.922426    8.809063    7.959467    8.797879
ASHGA5P000026   6.2206936   8.015113    8.997898    6.9229474
ASHGA5P000027   4.791071    5.154705    5.441963    5.3858256
ASHGA5P000028   9.352753    7.015601    9.240454    10.099891

Among those 4 samples GSM1485226 and GSM1485229 are Normals. GSM1485227 and GSM1485228 are tumor samples.

I have seen many tutorials for reading the data from cel or idat files and doing differential analysis. Need some help in reading the above data and doing differential analysis.

thanq

R microarray differential analysis gene expression • 3.4k views

ADD COMMENT • link updated 5.4 years ago by Biostar 20 • written 5.4 years ago by Biologist ▴ 290

0

Entering edit mode

These look like normalized data. You need raw data for any sensible differential analysis

ADD REPLY • link 5.4 years ago by Santosh Anand 5.7k

0

Entering edit mode

I don't see any cel or idat files. I see a Raw_tar file for each sample. For one of the sample it looks like below.

ProbeName   PositionX   PositionY   gProcessedSignal    gProcessedSigError  gMedianSignal   gBGMedianSignal gBGPixSDev  gBGSubSignal    gIsPosAndSignif gIsWellAboveBG  SpotExtentX gBGMeanSignal
ASHGA5P058197   789.345 359.397 4.38E+00    4.39E+00    60  32  9.74E+00    -4.88818    0   0   25.1049 32.9626
ASHGA5P007773   809.714 359.696 1.14E+01    4.54E+00    73  32  9.71E+00    11.4289 1   0   24.8756 33.066
ASHGA5P031162   830.85  359.433 4.73E+00    4.42E+00    68.5    33  9.61E+00    4.73484 1   0   25.7804 33.2279
ASHGA5P041796   852.214 359.143 3.90E+04    3.90E+03    38635.5 33  9.64E+00    38993.9 1   1   25.1049 33.0961
ASHGA5P006930   873.411 359.625 8.15E+01    9.26E+00    144 32  9.76E+00    81.5364 1   1   24.8756 32.9215
ASHGA5P031496   894.897 359.397 3.74E+02    3.76E+01    441 33  9.83E+00    373.5   1   1   25.7804 33.2151
ASHGA5P050699   916.068 359.271 1.49E+03    1.49E+02    1550    33  9.93E+00    1486.34 1   1   26.0017 33.2473
ASHGA5P035298   936.776 359.522 1.54E+01    4.66E+00    76  33  9.88E+00    15.3829 1   0   27.0811 32.9683
ASHGA5P014867   958 359.5   4.38E+00    4.39E+00    58  32  9.87E+00    -3.51977    0   0   30  32.6846
ASHGA5P008172   979.024 360.548 2.98E+01    5.31E+00    91  32  9.92E+00    29.8146 1   1   21.1402 32.5183

ADD REPLY • link 5.4 years ago by Biologist ▴ 290

0

Entering edit mode

This is Agilent Raw File. Have a look at aglip http://bioconductor.org/packages/release/bioc/html/agilp.html

ADD REPLY • link 5.4 years ago by Santosh Anand 5.7k

0

Entering edit mode

What they say here [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60689] is that data is processed by Arraystar Human LncRNA Array v3.0. I have seen all the links but nowhere I found that raw file for the data. The above data is what I have now.

ADD REPLY • link 5.4 years ago by Biologist ▴ 290

0

Entering edit mode

In the same link, you can see

Platforms (1)   
GPL16956    Agilent-045997 Arraystar human lncRNA microarray V3 (Probe Name Version)

This tells you that this is Agilent microarray, which is different from Illumina IDAT or Affymetrix CEL files.

Bottom down in the same page, GSE60689_RAW.tar is the RAW files.

ADD REPLY • link 5.4 years ago by Santosh Anand 5.7k

0

Entering edit mode

Yes, when I downloaded that GSE60689_RAW.tar file the above data is what I see, that is what I said in my previous comment.

ADD REPLY • link 5.4 years ago by Biologist ▴ 290

score 1 · Answer 1 · 2018-11-20

1

Entering edit mode

5.4 years ago

Kevin Blighe 87k

Depending on how you downloaded the data from GEO, the data may or may not be normalised - can you confirm how you obtained it? You can also check for normalisation via histograms and boxplots. If it has been normalised via RMA (thus, quantile normalisation), the boxplot will be the biggest teller as the distributions should line up fairly nice, like I show here: A: Hierarchical Clustering in single-channel agilent microarray experiment

Limma just fits a linear regression model independently for each gene based on whatever design formula you provide. So, you can either build your own linear regression models and derive p-values using lm(), or you can go ahead and use limma. For Limma, you just need any data-matrix of values. The design formula can be obtained in the same way as per the tutorials that you have seen.

Kevin

ADD COMMENT • link 5.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Kevin,

I log transformed the data and using quantile normalisation and made a box plot. I see that distribution looks fine.

data <- read.csv("MicroarrayData.csv")
data2 <- data.frame(data[,-1], row.names=data[,1])
data3 <- log2(as.matrix(data2))

data4 <- normalizeQuantiles(data3)

w <- ExpressionSet(data4)
png("boxplot.png", width = 20, height = 10, units='in',res=400)
boxplot(exprs(w), las=2)
dev.off()

Plot looks like thisboxplot

ADD REPLY • link 5.4 years ago by Biologist ▴ 290

1

Entering edit mode

That looks like it is quantile normalised indeed. If you go to GEO2R, you can also put your GEO accession number into the search box and then use the R code that is generated to also download the normalised data that way.

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k

0

Entering edit mode

You can also follow the ideas of Santosh Anand, too. You can usually download the raw data CEL files for a study. They may be stored in a tar.gz file, accessible from the main accession page.

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k

0

Entering edit mode

I didn't find any tar.gz file for that. I have seen all the links related to that data.

ADD REPLY • link 5.4 years ago by Biologist ▴ 290

0

Entering edit mode

Sure will have a look. Now with this data how can I go for differential analysis. Bit confused with the Lima user guide. can you please give some example code for the differential analysis with the above data. thanq

ADD REPLY • link 5.4 years ago by Biologist ▴ 290

1

Entering edit mode

Take a look at this reproducible example with a random dataset:

Create random data (20,000 genes x 20 samples), and metadata (group A and B)

options(scipen=10)
options(digits=6)

col <- 20
row <- 20000
mat <- matrix(
  rexp(col*row, rate = 0.1),
  ncol = col)
rownames(mat) <- paste0('gene', 1:nrow(mat))
colnames(mat) <- paste0('sample', 1:ncol(mat))

metadata <- data.frame(row.names = colnames(mat))
metadata$Group <- rep(NA, ncol(mat))
metadata$Group[seq(1,20,2)] <- "A"
metadata$Group[seq(2,20,2)] <- "B"

Now run limma:

require(limma)

#Set up the comparison matrix
design <- paste(metadata$Group, sep="")

design <- factor(design)
comparisonmodel <- model.matrix(~0+design)
colnames(comparisonmodel) <- levels(design)

#Create a limma object
project.fitmodel <- lmFit(mat, comparisonmodel)
project.fitmodel.eBayes <- eBayes(project.fitmodel)

names(project.fitmodel.eBayes)

Make individual contrasts, fit the model, and adjust by empirical Bayes:

#A Vs B
limAVsB <- makeContrasts(Q1="A-B", levels=comparisonmodel)
limAVsB.fitmodel <- contrasts.fit(project.fitmodel.eBayes, limAVsB)
limAVsB.fitmodel.eBayes <- eBayes(limAVsB.fitmodel)

Output all stats; no P value adjustment:

topTable(limAVsB.fitmodel.eBayes, coef="Q1", adjust="BH", number=10, p.value=1)
             logFC AveExpr        t      P.Value adj.P.Val        B
gene1540  -15.1577 10.8971 -4.63152 0.0000813433  0.958484 -4.59151
gene7095  -16.4871 14.6051 -3.89144 0.0005865091  0.958484 -4.59225
gene19392 -13.0810 10.4367 -3.81958 0.0007086020  0.958484 -4.59233
gene12477 -12.7225 11.4939 -3.78238 0.0007812551  0.958484 -4.59236
gene19525  12.7596 10.0269  3.77237 0.0008020231  0.958484 -4.59238
gene15958 -11.6651  7.9592 -3.76384 0.0008201435  0.958484 -4.59238
gene991   -19.3632 17.1643 -3.75441 0.0008406504  0.958484 -4.59239
gene3818   19.2614 15.5701  3.72067 0.0009181952  0.958484 -4.59243
gene12920 -21.1039 16.0520 -3.58393 0.0013104355  0.958484 -4.59258
gene806    13.6118 11.6507  3.57390 0.0013449112  0.958484 -4.59259

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you very much Kevin. A small help. In the above data in my question the how can I convert those probes to gene symbols?

ADD REPLY • link 5.4 years ago by Biologist ▴ 290