Series Matrix file

Question

Parsing values from GSE file


5.6 years ago

Natasha ▴ 40

I'm following the suggestions given here to parse the transcript array signal values, but I'm confused by the output of the command head(exprs(gse[[1]])).

Does it give the value in terms of log2 of the transcript array signal or the transcript array signal itself?

For example:

library(GEOquery)
gse <- getGEO("GSE30732", GSEMatrix=TRUE)

> head(exprs(gse[[1]]))
            GSM762810 GSM762811 GSM762812 GSM762813 GSM762814 GSM762815 GSM762816
    7892501  144.1970  154.9120  204.9800  170.7820  160.4320  177.0040  164.8360
    7892502   87.3393   66.1895  236.0300   67.5643  118.0320  110.5290  124.1790
    7892503   17.6620   25.2553   25.0095   35.4209   34.8682   16.8056   19.0721
    7892504  968.6140  552.2480  629.9800  877.5390 1119.0500  395.1390  657.9350
    7892505   48.9950   56.7481   42.8267   41.8391   42.8017   52.4028   40.8999
    7892506   17.0715   11.3087   87.3541   18.2260   33.5276   33.4526   36.1377
            GSM762817 GSM762818 GSM762819 GSM762820
    7892501  160.8070  192.1560   97.9872  256.8820
    7892502   93.8661  125.8370  109.0660  131.7860
    7892503   19.9689   24.9181   32.5494   31.5395
    7892504  934.4180  851.7840  803.8410  736.0790
    7892505   58.3426   46.8862   49.7869   50.8334
    7892506   29.2909   34.7242   35.1665   33.4917

Whereas for:

gs <- getGEO("GSE3984",GSEMatrix=TRUE)

> head(exprs(gs[[1]]))
          GSM90867 GSM90868 GSM90869 GSM90870 GSM90871 GSM90872
244901_at 6.653998 7.054338 6.806130 6.829245 7.027696 7.344291
244902_at 5.288584 6.662941 6.267735 5.781530 6.039393 6.848940
244903_at 6.400026 7.275735 7.289969 6.682002 7.000932 7.482286
244904_at 6.089982 6.441873 6.638312 5.969956 6.633428 6.733617
244905_at 6.187861 6.272919 6.517373 6.599013 6.209060 6.415216
244906_at 7.462396 7.822089 7.852994 7.329415 7.663500 8.192281

The output above is indexed by probe ID, and the values appear to be log2 values.

How can we parse the value of the transcript array signal and not the log2 values?

Many thanks

gene GEO Bioconductor • 3.4k views

ADD COMMENT • link updated 5.6 years ago by Kevin Blighe 87k • written 5.6 years ago by Natasha ▴ 40

Kevin Blighe · Answer 1 · 2018-09-01


5.6 years ago

Kevin Blighe 87k

Hey Natasha,

There was a similar question as recently as yesterday regarding this: A: How to get data from GEO

The values in exprs(gseEset[[1]]) are most likely normalised and log2-transformed. I say 'most likely' because GEO cannot absolutely guarantee that the user uploaded the data as specified in the documentation, although they do run simple checks on the data distribution when it is submitted.

You should generate a boxplot of exprs(gseEset[[1]]) - if the sample distributions line up, then you can assume that they are normalised.

Series Matrix file

Here, we obtain the Series Matrix file via GSEMatrix=TRUE, as you have done, which is usually the normalised, log2 data:

gseEset <- getGEO("GSE30732", GSEMatrix=TRUE)

boxplot(exprs(gseEset[[1]]), outline=FALSE)

So, you can have high confidence that this is indeed the normalised, log2 data.
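The boxplot tells you whether the samples are normalised; the scale of the values tells you whether they are logged. Here is a rough sketch of a scale check in base R. The function name looks_logged and the cutoff of 100 are my own assumptions, not anything from GEOquery; the reasoning is only that log2 microarray intensities normally sit somewhere between 0 and 20. Note that the GSE30732 values quoted in the question reach into the hundreds, which would be unusual for log2 data, so it is worth running a check like this before back-transforming anything.

```r
# Rough heuristic (hypothetical helper): log2 microarray intensities normally
# fall within about 0-20, so a high upper quantile suggests the values are
# still on the linear (unlogged) scale.
looks_logged <- function(mat, cutoff = 100) {
  q99 <- quantile(as.numeric(mat), probs = 0.99, na.rm = TRUE)
  unname(q99 < cutoff)
}

# First column of each output quoted in the question
gse30732 <- c(144.1970, 87.3393, 17.6620, 968.6140, 48.9950, 17.0715)
gse3984  <- c(6.653998, 5.288584, 6.400026, 6.089982, 6.187861, 7.462396)

looks_logged(gse30732)  # FALSE
looks_logged(gse3984)   # TRUE
```

On a real ExpressionSet you would pass exprs(gseEset[[1]]) instead of the toy vectors.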

SOFT-formatted file

Another file, the SOFT-formatted file, is a compilation of what is usually normalised, unlogged data together with all metadata. It can be obtained simply with GSEMatrix=FALSE:

getGEO("GSE30732", GSEMatrix=FALSE)

Raw data

If you want the true raw values, then you should download the tar file from the accession home-page: GSE30732_RAW.tar. That contains the CEL files, which you can normalise yourself.

Kevin

ADD COMMENT • link 4.8 years ago by Kevin Blighe 87k


Hi Kevin, many thanks for the response. Excuse me for the naive question; I'm a beginner in this field. Could you please explain what exactly it means when we say the data is "normalized"? Does an expression value given on the log2 scale mean the data is normalized?

gse <- getGEO("GSE30732", GSEMatrix=FALSE)

This syntax doesn't allow me to use a command like exprs(gse[[1]]). I obtain the error "this S4 class is not subsettable", the same error shown in the question posted in the link you shared. I understand this command can only be used for Series Matrix files.

exprs(gse[[1]]) works for parsing the log2 scale values from Series Matrix files, and Table(gds) works for parsing values from GDS files. How do we parse normalised (unlogged) data from SOFT files? Could you please provide the syntax that has to be used to get the unlogged data? I wish to obtain the gene names and cell type description too (for which you suggested pData(ESET[[1]]) in my previous post).

What is the difference between normalised and unnormalised data? To compare data from multiple studies, is it recommended to use the normalised data, or should one use unnormalised data?

Is there any tutorial on how to normalize data from CEL files?

I have had a chance to use GEO2R before. Since I am trying to analyze data from many different experimental studies, I am trying to write code to automate the analysis.

Many thanks

ADD REPLY • link 5.6 years ago by Natasha ▴ 40


Hey, oh, if you want a good introduction to microarrays, then this is a must read (the author is also a very decent guy): Microarray data normalization and transformation.

You could consider that there are 3 types of microarray expression values:

raw intensities

These are stored in CEL, TXT, or other (depending on the platform) raw data files and are literally measures of fluorescence. That is, on the actual microarray chip, fluorescently labelled target (derived from the sample's mRNA) hybridises to its probe; a scanner then excites the label, and the emitted light is picked up by a detector, which encodes this fluorescence numerically, giving the raw intensities.

Here is a chip image from one of my own previous experiments:

The part in the middle is the sample loading point. In the corners (and center, by the looks of it), there are control probes.

Chip designs and detector sensitivities change from platform / version to platform / version; thus, the range of raw data fluorescent intensities could differ a lot between, for example, an Agilent and Affymetrix microarray.

There is 1 sample per chip.

normalised intensities

In a typical experiment, we will have multiple samples, and thus there may exist experimental bias across all samples. On each individual chip, also, because we are measuring light, there will always exist background 'noise'. The normalisation process aims to tackle these types of biases. Probes also have biases in how they hybridise to their target cDNA due to the fact that AT and GC bases require different energy for binding, i.e., GC bias.

For microarrays, the most common form of normalisation is gcRMA (robust multiarray average with GC correction):

background correction
quantile normalisation
log2 transformation

Others include MAS5 and RMA.
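To make the quantile normalisation step in the list above less of a black box, here is a toy sketch in base R. This is an illustration only, not the gcRMA or RMA implementation (those also do background correction and probe-set summarisation), and the function name quantile_normalise is my own. For real data you would use an established implementation such as normalize.quantiles() from the preprocessCore package.

```r
# Toy quantile normalisation (hypothetical helper, base R only):
# every sample (column) is forced onto the same distribution, namely the
# row-wise mean of the column-sorted values.
quantile_normalise <- function(mat) {
  ranks  <- apply(mat, 2, rank, ties.method = "first")  # rank within each sample
  sorted <- apply(mat, 2, sort)                         # sort each sample
  target <- rowMeans(sorted)                            # shared target distribution
  out <- apply(ranks, 2, function(r) target[r])         # map ranks back to targets
  dimnames(out) <- dimnames(mat)
  out
}

# 4 probes x 3 samples of made-up raw intensities (filled column by column)
toy <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("probe", 1:4), c("s1", "s2", "s3")))

norm <- quantile_normalise(toy)
log2(norm)  # log2 transformation as the final step
```

After this, every column of norm contains exactly the same set of values, just in a different order, which is why the boxes line up in a boxplot of normalised data.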

So, the log2 transformation is typically the final part of the normalisation process. Technically, you just need to do 2^(exprs(gseEset[[1]])) in order to reverse the log2 transformation, which should then give you the normalised, unlogged data.
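As a quick sanity check of that back-transformation, here is a small sketch on a toy matrix built from a few of the GSE3984 values quoted in the question (no download needed). The object names logged and unlogged are just for illustration.

```r
# A 2 x 2 corner of the log2 matrix quoted in the question (GSE3984)
logged <- matrix(c(6.653998, 5.288584, 7.054338, 6.662941), nrow = 2,
                 dimnames = list(c("244901_at", "244902_at"),
                                 c("GSM90867", "GSM90868")))

# Reverse the log2 transformation: normalised, unlogged intensities
unlogged <- 2^logged
round(unlogged, 1)

# Round trip: logging again recovers the original values
isTRUE(all.equal(log2(unlogged), logged))
```

Only do this if the matrix really is on the log2 scale; applying 2^ to values that are already unlogged will give absurdly large numbers.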

------------------------------------------------------------

I don't actually know of anyone who routinely uses the SOFT files. You can get a better idea of what's in them by running:

gse <- getGEO("GSE30732", GSEMatrix=FALSE)
GSMList(gse)

I just checked for this particular experiment, and both the SOFT and the Series Matrix files contain the normalised, log2 data.
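If you do want to pull the values out of the SOFT object, each sample in GSMList(gse) carries its own data table, which you can get at with Table(). Here is a sketch of stitching those per-sample tables into a probe x sample matrix. I am assuming the tables have the usual GEO columns ID_REF and VALUE, and the helper name tables_to_matrix is hypothetical; the toy data frames stand in for real Table() output so that the example runs without a download.

```r
# Combine per-sample tables (columns ID_REF and VALUE) into one matrix.
# Probes are matched by ID, so row order may differ between samples.
tables_to_matrix <- function(tables) {
  probes <- as.character(tables[[1]]$ID_REF)
  mat <- sapply(tables, function(tab) {
    as.numeric(tab$VALUE[match(probes, as.character(tab$ID_REF))])
  })
  rownames(mat) <- probes
  mat
}

# Toy stand-ins for Table(gsm) output. With GEOquery the input would be:
#   gse    <- getGEO("GSE30732", GSEMatrix=FALSE)
#   tables <- lapply(GSMList(gse), Table)
toy_tables <- list(
  GSM_A = data.frame(ID_REF = c("p1", "p2", "p3"), VALUE = c(10.5, 20.1, 30.7)),
  GSM_B = data.frame(ID_REF = c("p3", "p1", "p2"), VALUE = c(31.2, 11.0, 19.8))
)

expr <- tables_to_matrix(toy_tables)
expr
```

Whether the VALUE column is logged or not depends entirely on what the submitter uploaded, so check the scale before using it.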

Is there any particular reason why you require the normalised, unlogged data? The issue with microarray analysis is that different platforms require different types of processing and may have different related R packages.

ADD REPLY • link 5.6 years ago by Kevin Blighe 87k


Hi Kevin, apologies for the delay in my response. It took me some time to understand what I was doing wrong.

Thanks a lot for the link to the reference.

I was looking at the normalized, unlogged data to compare the expression values of a list of 30 genes from 10 different experimental studies performed using the same platform, Affy U133 Plus 2. I was under the impression that I could take the normalized values from different studies and make a comparison.

After reading the reference you shared, I understand that the normalization procedure (e.g. mean/median) can vary.

I would like to ask you for a few clarifications, if you don't mind.

1. Are the normalized (unlogged or logged) expression values given in these matrices (gseEset <- getGEO("GSE30732", GSEMatrix=TRUE)) absolute values or ratios? (Is this the T_i value, a ratio, shown in the reference?)

2. In a given study with 20 samples (10 disease, 10 control), if one has to compute the average of the expression values reported in, say, the controls, would it be meaningful to compute the mean of the normalized values reported in each sample for a given probe ID? a) I am a little confused about how the normalization is actually done. Is the normalization factor (N_total) the same for all the samples in a study? (If my understanding is correct, it is the same.) Please correct me if I am wrong.

3. If one has to compare expression values across multiple studies on the same platform, is it possible to combine the raw values and normalize them together? Is there any tool that can be used for this purpose?

Many thanks for your time and attention. Excuse me for the naive questions.

ADD REPLY • link updated 5.6 years ago by Kevin Blighe 87k • written 5.6 years ago by Natasha ▴ 40


1. Are the normalized (unlogged or logged) expression values given in these matrices (gseEset <- getGEO("GSE30732", GSEMatrix=TRUE)) absolute values or ratios? (Is this the T_i value, a ratio, shown in the reference?)

This is actually a very good question and there is much confusion in the community, from what I can see at least. Some array platforms (e.g. 'two-colour' arrays) allow you to probe expression levels of 2 conditions / treatments on each physical microarray chip. In this situation, one colour of light represents one condition, and the other colour represents the other condition. The derived expression levels are normalised and logged (base 2) ratios of Condition A intensity / Condition B intensity for each gene target. So, in this situation, the values that we receive are already log2 ratios.

On single colour arrays, which appear to be more common these days, the levels of expression are representative of a single condition and are thus absolute values. The process that I mention in my previous comment was more related to these single-colour arrays. This is in part why one has to carefully navigate the methods for microarray analysis, i.e., because one has to know which platform was used.

This publication explains the fundamental difference between single and two-colour arrays in the opening paragraphs:

"Prediction of clinical endpoints from microarray-based gene-expression measurements can be performed using either of two experimental procedures: (1) a one-color approach, in which a single RNA sample is labeled with a fluorophore (such as phycoerythrin, cyanine-3 (Cy-3) or cyanine-5 (Cy-5)) and hybridized alone to a microarray, or (2) a two-color strategy, in which two samples (usually a sample and a reference) are labelled with different fluorophores (for example, Cy-3 and Cy-5) and are then hybridized together on a single microarray. The resulting data are fundamentally different: while two-color arrays yield ratios of fluorescence intensities (that is, sample fluorescence/reference fluorescence), one-color arrays result in absolute fluorescence intensities, which are assumed to be monotonically (if not linearly) related to the abundance of mRNA species complementary to the probes on the array."

[from: Comparison of performance of one-color and two-color gene-expression analyses in predicting clinical endpoints of neuroblastoma patients]

The array in your study is Affymetrix Human Gene 1.0 ST Array, which is a very popular single colour gene expression array.

2. In a given study with 20 samples (10 disease, 10 control), if one has to compute the average of the expression values reported in, say, the controls, would it be meaningful to compute the mean of the normalized values reported in each sample for a given probe ID? a) I am a little confused about how the normalization is actually done. Is the normalization factor (N_total) the same for all the samples in a study? (If my understanding is correct, it is the same.) Please correct me if I am wrong.

To better understand the normalisation method, just look up quantile normalisation, which is by far the most common for microarrays. The logic behind it is quite simple.

If we have single colour arrays (10 disease; 10 control), we determine the group means in order to compute the log2 ratio (disease versus control) on a per-gene basis. Also, what we do is fit a linear model on a per-gene basis, i.e., perform linear regression. From this linear model 'fit', we can derive a P value. Thus, we will have a log2 ratio and P value for each gene.
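Here is a minimal sketch of that per-gene model in base R on simulated log2 data (10 control, 10 disease). Everything in it is made up for illustration: the probe names, the simulated values, and the shift of 2 added to probe g1. In practice most people would use limma (lmFit() followed by eBayes()) rather than a plain lm() per gene, because it borrows information across genes.

```r
set.seed(1)
group <- factor(rep(c("control", "disease"), each = 10),
                levels = c("control", "disease"))

# Simulated log2 matrix: 3 probes x 20 samples; g1 is shifted up by 2 in disease
expr <- matrix(rnorm(60, mean = 8, sd = 0.3), nrow = 3,
               dimnames = list(c("g1", "g2", "g3"), paste0("s", 1:20)))
expr["g1", group == "disease"] <- expr["g1", group == "disease"] + 2

# Fit one linear model per probe; the 'groupdisease' coefficient is the
# difference in mean log2 expression, i.e. the log2 ratio disease / control
per_gene <- t(apply(expr, 1, function(y) {
  fit <- summary(lm(y ~ group))$coefficients
  c(log2FC  = fit["groupdisease", "Estimate"],
    p.value = fit["groupdisease", "Pr(>|t|)"])
}))
per_gene
```

Because the data are on the log2 scale, the log2 ratio is simply the mean of the disease samples minus the mean of the control samples for that probe.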

3. If one has to compare expression values across multiple studies on the same platform, is it possible to combine the raw values and normalize them together? Is there any tool that can be used for this purpose?

In this case, you can definitely just process the samples together; however, when performing the analysis, you should include a covariate relating to the different studies in the statistical model. Combining data from different platforms is obviously more difficult, but not impossible.
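To show what including a study covariate looks like, here is a sketch for a single probe with simulated log2 values from two studies. The study labels, effect sizes, and noise level are all invented for the example. The same idea carries over to limma, where the design would be built with model.matrix(~ study + group).

```r
set.seed(2)
study <- factor(rep(c("studyA", "studyB"), each = 10))
group <- factor(rep(rep(c("control", "disease"), each = 5), times = 2),
                levels = c("control", "disease"))

# One probe on the log2 scale: study B sits 1.5 units higher overall
# (a batch shift), and disease adds 1 unit in both studies
y <- 8 + 1.5 * (study == "studyB") + 1 * (group == "disease") +
  rnorm(20, sd = 0.2)

# Adjusting for study lets the model separate the batch shift
# from the disease effect
fit <- lm(y ~ study + group)
coef(summary(fit))["groupdisease", ]
```

Without the study term, the shift between studies would end up in the residual noise and weaken the disease comparison; with it, the disease effect is estimated within each study and then pooled.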

ADD REPLY • link 5.6 years ago by Kevin Blighe 87k